Support vectors

  • SVs are the data points that would change the position of the decision boundary if they were removed.
  • SVs are the critical data points that help define the decision boundary.

Notes:

  • Neural networks are similar in spirit to support vector machines, but they behave more like a black box; SVMs are more interpretable and can help in understanding the data.
  • SVMs explicitly find the maximum-margin hyperplane separating the two classes; standard neural networks do not optimize an explicit margin.
  • Orthogonal means perpendicular, generalized to higher dimensions.
  • Affine indicates that the subspace need not pass through the origin (it is a translated linear subspace).

Hinge Loss

  • Hinge loss: the loss function used for training maximum-margin classifiers, most notably SVMs. Mathematically, it is defined as (see the short code sketch after this list):
    • $\ell (y)=\max(0,1-t\cdot y)$
    • where $y$ is the raw output of the classifier and $t = \pm 1$ is the target label.
    • Hinge loss penalizes predictions with $t \cdot y < 1$, corresponding to the notion of a margin in a support vector machine.
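
A minimal NumPy sketch of this loss (the function name and the example values are illustrative, not part of the notes):

```python
import numpy as np

def hinge_loss(y, t):
    """Hinge loss max(0, 1 - t*y) for raw score y and target t in {-1, +1}."""
    return np.maximum(0.0, 1.0 - t * y)

# Correct predictions inside the margin are still penalized; confident correct ones are not.
print(hinge_loss(0.4, 1))   # 0.6  (correct sign, but margin < 1)
print(hinge_loss(2.0, 1))   # 0.0  (correct, margin >= 1)
print(hinge_loss(-0.5, 1))  # 1.5  (wrong side of the boundary)
```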

Gradient of the Regularized Hinge Loss

The gradient of the per-sample SVM objective (hinge loss plus L2 regularization, defined and derived in detail below) with respect to the weights ($\mathbf{w}$) and bias ($b$) depends on whether the sample is correctly classified with a margin of at least 1.

If $y_i (\mathbf{w} \cdot x_i + b) \geq 1$:

The hinge loss is zero, and only the regularization term contributes: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w}\) \(\frac{\partial J}{\partial b} = 0\)

If $y_i (\mathbf{w} \cdot x_i + b) < 1$:

The hinge loss is positive, and the gradient is: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w} - y_i x_i\) \(\frac{\partial J}{\partial b} = -y_i\)
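
As a quick worked example (the numbers are illustrative): take $\mathbf{w} = (1, 0)$, $b = 0$, $\lambda = 0.1$, and a sample $x_i = (0.5, 0)$ with $y_i = +1$. Then $y_i (\mathbf{w} \cdot x_i + b) = 0.5 < 1$, so the second case applies: $\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w} - y_i x_i = (0.2, 0) - (0.5, 0) = (-0.3, 0)$ and $\frac{\partial J}{\partial b} = -1$.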

Derivation of Gradient Descent for SVM

Let’s derive the gradient descent updates for the SVM with hinge loss and L2 regularization.

Objective Function

The objective function for a linear SVM with hinge loss and L2 regularization is: \(J(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i (\mathbf{w} \cdot x_i + b)) + \lambda \|\mathbf{w}\|^2\) where:

  • $ \mathbf{w} $ is the weight vector
  • $b$ is the bias
  • $y_i$ is the true label for the $i$-th sample
  • $x_i$ is the feature vector for the $i$-th sample
  • $\lambda$ is the regularization parameter
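
A minimal sketch of this objective in NumPy (the function name svm_objective and the argument names are assumptions for illustration):

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """J(w, b): mean hinge loss plus lam * ||w||^2 for a linear SVM.

    X: (n, d) feature matrix, y: (n,) labels in {-1, +1}, lam: regularization strength.
    """
    margins = y * (X @ w + b)                  # y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)     # per-sample hinge loss
    return hinge.mean() + lam * np.dot(w, w)   # average loss + L2 penalty
```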

Hinge Loss Gradient

The hinge loss for a single sample is: \(L(y_i, f(x_i)) = \max(0, 1 - y_i (\mathbf{w} \cdot x_i + b))\)

The gradient of the hinge loss with respect to $ \mathbf{w} $ and $ b $ depends on whether the sample is correctly classified with a margin of at least 1.

If $ y_i (\mathbf{w} \cdot x_i + b) \geq 1 $:

The hinge loss is zero, and the gradient is: \(\frac{\partial L}{\partial \mathbf{w}} = 0\) \(\frac{\partial L}{\partial b} = 0\)

If $ y_i (\mathbf{w} \cdot x_i + b) < 1 $:

The hinge loss is positive, and the gradient is: \(\frac{\partial L}{\partial \mathbf{w}} = -y_i x_i\) \(\frac{\partial L}{\partial b} = -y_i\)
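
A small sketch of this case split for a single sample (strictly speaking this is a subgradient, since the hinge is not differentiable at the kink; the function name is illustrative):

```python
import numpy as np

def hinge_gradient(w, b, x_i, y_i):
    """(Sub)gradient of the hinge loss for one sample, without regularization."""
    if y_i * (np.dot(w, x_i) + b) >= 1:
        return np.zeros_like(w), 0.0   # margin satisfied: zero gradient
    return -y_i * x_i, -y_i            # margin violated: (-y_i x_i, -y_i)
```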

L2 Regularization Gradient

The L2 regularization term is: \(\lambda \|\mathbf{w}\|^2 = \lambda \sum_{j=1}^{d} w_j^2\)

The gradient of the L2 regularization term with respect to $ \mathbf{w} $ is: \(\frac{\partial}{\partial \mathbf{w}} (\lambda \|\mathbf{w}\|^2) = 2 \lambda \mathbf{w}\)

Combined Gradient

Combining the gradient of the hinge loss for a single sample with the gradient of the L2 regularization gives the per-sample gradient of the objective function (the form used for stochastic, per-sample updates).

If $ y_i (\mathbf{w} \cdot x_i + b) \geq 1 $:

The total gradient is: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w}\) \(\frac{\partial J}{\partial b} = 0\)

If $ y_i (\mathbf{w} \cdot x_i + b) < 1 $:

The total gradient is: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w} - y_i x_i\) \(\frac{\partial J}{\partial b} = -y_i\)

Gradient Descent Update

Using gradient descent, the weights and bias are updated as follows: \(\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\partial J}{\partial \mathbf{w}}\) \(b \leftarrow b - \eta \frac{\partial J}{\partial b}\) where $ \eta $ is the learning rate.
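
Putting the pieces together, here is a minimal sketch of the training loop using per-sample (SGD-style) updates; the function name, hyperparameter defaults, and the toy dataset are assumptions for illustration, not from the notes:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """Train a linear SVM (hinge loss + L2 regularization) with per-sample gradient descent.

    X: (n, d) features, y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) >= 1:
                grad_w, grad_b = 2 * lam * w, 0.0                # only the L2 term
            else:
                grad_w, grad_b = 2 * lam * w - y_i * x_i, -y_i   # hinge term is active
            w -= eta * grad_w                                    # w <- w - eta * dJ/dw
            b -= eta * grad_b                                    # b <- b - eta * dJ/db
    return w, b

# Tiny usage example on made-up, linearly separable points.
X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # expected: [ 1.  1. -1. -1.]
```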