Support Vector Machine
Support vectors
- Support vectors (SVs) are the data points that would change the position of the decision boundary if they were removed.
- In other words, SVs are the critical data points that define the decision boundary.
Notes:
- Neural networks are similar in spirit to SVMs, but they behave more like a black box; SVMs are more interpretable and can be used to understand the data better.
- An SVM explicitly finds the maximum-margin hyperplane separating the two classes, whereas a neural network offers no such optimality guarantee.
- Orthogonal generalizes "perpendicular" to higher dimensions.
- Affine indicates that a subspace (such as the separating hyperplane) need not pass through the origin; it is a linear subspace shifted away from it.
Hinge Loss
- Hinge loss: the loss function used to train maximum-margin classifiers, most notably SVMs. Mathematically, it is defined as:
- $\ell(y) = \max(0, 1 - t \cdot y)$
- where $y$ is the raw output of the classifier and $t \in \{-1, +1\}$ is the target label.
- Hinge loss penalizes predictions with $t \cdot y < 1$, which corresponds to the notion of a margin in a support vector machine.
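As a quick illustration, here is a minimal NumPy sketch (the helper name `hinge_loss` is just illustrative): the loss is zero for confident correct predictions and grows linearly once $t \cdot y$ drops below 1.

```python
import numpy as np

def hinge_loss(y, t):
    """Hinge loss for raw classifier outputs y and targets t in {-1, +1}."""
    return np.maximum(0.0, 1.0 - t * y)

# Margin >= 1: zero loss; inside the margin or misclassified: linear penalty.
print(hinge_loss(np.array([2.0, 0.5, -1.0]), np.array([1, 1, 1])))
# -> [0.  0.5 2. ]
```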
Gradient of Hinge Loss
The gradients quoted here are for the regularized per-sample objective (hinge loss plus the L2 penalty, derived in full below); they depend on whether the sample is correctly classified with a margin of at least 1.
If $y_i (\mathbf{w} \cdot x_i + b) \geq 1$:
The hinge loss is zero, so only the regularization term contributes: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w}\) \(\frac{\partial J}{\partial b} = 0\)
If $y_i (\mathbf{w} \cdot x_i + b) < 1$:
The hinge loss is positive, and the gradient is: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w} - y_i x_i\) \(\frac{\partial J}{\partial b} = -y_i\)
Derivation of Gradient Descent for SVM
Let's derive the gradient descent updates for the SVM with hinge loss and L2 regularization.
Objective Function
The objective function for a linear SVM with hinge loss and L2 regularization is: \(J(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i (\mathbf{w} \cdot x_i + b)) + \lambda \|\mathbf{w}\|^2\) where:
- $ \mathbf{w} $ is the weight vector
- $b$ is the bias
- $y_i$ is the true label for the $i$-th sample
- $x_i$ is the feature vector for the $i$-th sample
- $\lambda$ is the regularization parameter
Hinge Loss Gradient
The hinge loss for a single sample is: \(L(y_i, f(x_i)) = \max(0, 1 - y_i (\mathbf{w} \cdot x_i + b))\)
The gradient of the hinge loss with respect to $ \mathbf{w} $ and $ b $ depends on whether the sample is correctly classified with a margin or not.
If $ y_i (\mathbf{w} \cdot x_i + b) \geq 1 $:
The hinge loss is zero, and the gradient is: \(\frac{\partial L}{\partial \mathbf{w}} = 0\) \(\frac{\partial L}{\partial b} = 0\)
If $ y_i (\mathbf{w} \cdot x_i + b) < 1 $:
The hinge loss is positive, and the gradient is: \(\frac{\partial L}{\partial \mathbf{w}} = -y_i x_i\) \(\frac{\partial L}{\partial b} = -y_i\)
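This case split translates directly into code. The sketch below (NumPy, with the hypothetical helper name `hinge_subgradient`) returns the (sub)gradient of the hinge loss for a single sample; strictly speaking the hinge is not differentiable at the kink $y_i (\mathbf{w} \cdot x_i + b) = 1$, where this returns the zero subgradient.

```python
import numpy as np

def hinge_subgradient(w, b, x_i, y_i):
    """(Sub)gradient of max(0, 1 - y_i * (w . x_i + b)) for one sample."""
    if y_i * (np.dot(w, x_i) + b) >= 1:
        # Margin satisfied: loss is zero, so the (sub)gradient is zero.
        return np.zeros_like(w), 0.0
    # Margin violated: the linear part of the hinge is active.
    return -y_i * x_i, -y_i
```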
L2 Regularization Gradient
The L2 regularization term is: \(\lambda \|\mathbf{w}\|^2 = \lambda \sum_{j=1}^{d} w_j^2\)
The gradient of the L2 regularization term with respect to $ \mathbf{w} $ is: \(\frac{\partial}{\partial \mathbf{w}} (\lambda \|\mathbf{w}\|^2) = 2 \lambda \mathbf{w}\)
Combined Gradient
Combining the gradients of the hinge loss and the L2 regularization, we get the total gradient for the objective function.
If $ y_i (\mathbf{w} \cdot x_i + b) \geq 1 $:
The total gradient is: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w}\) \(\frac{\partial J}{\partial b} = 0\)
If $ y_i (\mathbf{w} \cdot x_i + b) < 1 $:
The total gradient is: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w} - y_i x_i\) \(\frac{\partial J}{\partial b} = -y_i\)
Gradient Descent Update
Using gradient descent, the weights and bias are updated as follows: \(\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\partial J}{\partial \mathbf{w}}\) \(b \leftarrow b - \eta \frac{\partial J}{\partial b}\) where $ \eta $ is the learning rate.
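Putting the pieces together, here is a minimal sketch of the resulting training loop. It assumes per-sample (stochastic) updates so that each step applies exactly the combined gradients above; the function name, hyperparameter defaults, and the use of NumPy are illustrative choices, not a reference implementation.

```python
import numpy as np

def svm_sgd(X, y, lr=0.001, lam=0.01, epochs=1000):
    """Linear SVM trained by per-sample (sub)gradient descent.

    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) >= 1:
                # Correctly classified with margin: only the L2 term contributes.
                grad_w, grad_b = 2 * lam * w, 0.0
            else:
                # Inside the margin or misclassified: hinge term is active too.
                grad_w, grad_b = 2 * lam * w - y_i * x_i, -y_i
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b
```

Predictions are then $\operatorname{sign}(\mathbf{w} \cdot x + b)$; on a linearly separable toy dataset this loop settles on a separating hyperplane, with the regularization strength $\lambda$ controlling the trade-off between margin width and hinge violations.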