Support vectors

  • SVs are the data points that would change the position of the decision boundary if they were removed.
  • SVs are the critical data points that help define the decision boundary.

Notes:

  • Neural networks are similar in spirit to support vector machines, but they behave more like a black box; SVMs are more interpretable and can help in understanding the data.
  • SVMs explicitly find the maximum-margin hyperplane separating the two classes; standard neural networks do not optimize an explicit margin.
  • Orthogonal means perpendicular, generalized to higher dimensions.
  • Affine indicates that the subspace need not pass through the origin (it is a translated linear subspace).

Hinge Loss

  • Hinge loss: the loss function used for training maximum-margin classifiers, most notably SVMs. Mathematically, it is defined as (see the short code sketch after this list):
    • $\ell (y)=\max(0,1-t\cdot y)$
    • where $y$ is the raw output of the classifier and $t = \pm 1$ is the target label.
    • Hinge loss penalizes predictions with $t \cdot y < 1$, corresponding to the notion of a margin in a support vector machine.
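
A minimal NumPy sketch of this loss (the function name and the example values are illustrative, not part of the notes):

```python
import numpy as np

def hinge_loss(y, t):
    """Hinge loss max(0, 1 - t*y) for raw score y and target t in {-1, +1}."""
    return np.maximum(0.0, 1.0 - t * y)

# Correct predictions inside the margin are still penalized; confident correct ones are not.
print(hinge_loss(0.4, 1))   # 0.6  (correct sign, but margin < 1)
print(hinge_loss(2.0, 1))   # 0.0  (correct, margin >= 1)
print(hinge_loss(-0.5, 1))  # 1.5  (wrong side of the boundary)
```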

Gradient of the Regularized Hinge Loss

The gradient of the per-sample SVM objective (hinge loss plus L2 regularization, defined and derived in detail below) with respect to the weights ($\mathbf{w}$) and bias ($b$) depends on whether the sample is correctly classified with a margin of at least 1.

If $y_i (\mathbf{w} \cdot x_i + b) \geq 1$:

The hinge loss is zero, and only the regularization term contributes: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w}\) \(\frac{\partial J}{\partial b} = 0\)

If $y_i (\mathbf{w} \cdot x_i + b) < 1$:

The hinge loss is positive, and the gradient is: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w} - y_i x_i\) \(\frac{\partial J}{\partial b} = -y_i\)
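
As a quick worked example (the numbers are illustrative): take $\mathbf{w} = (1, 0)$, $b = 0$, $\lambda = 0.1$, and a sample $x_i = (0.5, 0)$ with $y_i = +1$. Then $y_i (\mathbf{w} \cdot x_i + b) = 0.5 < 1$, so the second case applies: $\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w} - y_i x_i = (0.2, 0) - (0.5, 0) = (-0.3, 0)$ and $\frac{\partial J}{\partial b} = -1$.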

Derivation of Gradient Descent for SVM

Let’s derive the gradient descent updates for the SVM with hinge loss and L2 regularization.

Objective Function

The objective function for a linear SVM with hinge loss and L2 regularization is: \(J(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i (\mathbf{w} \cdot x_i + b)) + \lambda \|\mathbf{w}\|^2\) where:

  • $ \mathbf{w} $ is the weight vector
  • $b$ is the bias
  • $y_i$ is the true label for the $i$-th sample
  • $x_i$ is the feature vector for the $i$-th sample
  • $\lambda$ is the regularization parameter
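
A minimal sketch of this objective in NumPy (the function name svm_objective and the argument names are assumptions for illustration):

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """J(w, b): mean hinge loss plus lam * ||w||^2 for a linear SVM.

    X: (n, d) feature matrix, y: (n,) labels in {-1, +1}, lam: regularization strength.
    """
    margins = y * (X @ w + b)                  # y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)     # per-sample hinge loss
    return hinge.mean() + lam * np.dot(w, w)   # average loss + L2 penalty
```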

Hinge Loss Gradient

The hinge loss for a single sample is: \(L(y_i, f(x_i)) = \max(0, 1 - y_i (\mathbf{w} \cdot x_i + b))\)

The gradient of the hinge loss with respect to $ \mathbf{w} $ and $ b $ depends on whether the sample is correctly classified with a margin of at least 1.

If $ y_i (\mathbf{w} \cdot x_i + b) \geq 1 $:

The hinge loss is zero, and the gradient is: \(\frac{\partial L}{\partial \mathbf{w}} = 0\) \(\frac{\partial L}{\partial b} = 0\)

If $ y_i (\mathbf{w} \cdot x_i + b) < 1 $:

The hinge loss is positive, and the gradient is: \(\frac{\partial L}{\partial \mathbf{w}} = -y_i x_i\) \(\frac{\partial L}{\partial b} = -y_i\)
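
A small sketch of this case split for a single sample (strictly speaking this is a subgradient, since the hinge is not differentiable at the kink; the function name is illustrative):

```python
import numpy as np

def hinge_gradient(w, b, x_i, y_i):
    """(Sub)gradient of the hinge loss for one sample, without regularization."""
    if y_i * (np.dot(w, x_i) + b) >= 1:
        return np.zeros_like(w), 0.0   # margin satisfied: zero gradient
    return -y_i * x_i, -y_i            # margin violated: (-y_i x_i, -y_i)
```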

L2 Regularization Gradient

The L2 regularization term is: \(\lambda \|\mathbf{w}\|^2 = \lambda \sum_{j=1}^{d} w_j^2\)

The gradient of the L2 regularization term with respect to $ \mathbf{w} $ is: \(\frac{\partial}{\partial \mathbf{w}} (\lambda \|\mathbf{w}\|^2) = 2 \lambda \mathbf{w}\)

Combined Gradient

Combining the gradient of the hinge loss for a single sample with the gradient of the L2 regularization gives the per-sample gradient of the objective function (the form used for stochastic, per-sample updates).

If $ y_i (\mathbf{w} \cdot x_i + b) \geq 1 $:

The total gradient is: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w}\) \(\frac{\partial J}{\partial b} = 0\)

If $ y_i (\mathbf{w} \cdot x_i + b) < 1 $:

The total gradient is: \(\frac{\partial J}{\partial \mathbf{w}} = 2 \lambda \mathbf{w} - y_i x_i\) \(\frac{\partial J}{\partial b} = -y_i\)

Gradient Descent Update

Using gradient descent, the weights and bias are updated as follows: \(\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\partial J}{\partial \mathbf{w}}\) \(b \leftarrow b - \eta \frac{\partial J}{\partial b}\) where $ \eta $ is the learning rate.
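
Putting the pieces together, here is a minimal sketch of the training loop using per-sample (SGD-style) updates; the function name, hyperparameter defaults, and the toy dataset are assumptions for illustration, not from the notes:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """Train a linear SVM (hinge loss + L2 regularization) with per-sample gradient descent.

    X: (n, d) features, y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) >= 1:
                grad_w, grad_b = 2 * lam * w, 0.0                # only the L2 term
            else:
                grad_w, grad_b = 2 * lam * w - y_i * x_i, -y_i   # hinge term is active
            w -= eta * grad_w                                    # w <- w - eta * dJ/dw
            b -= eta * grad_b                                    # b <- b - eta * dJ/db
    return w, b

# Tiny usage example on made-up, linearly separable points.
X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # expected: [ 1.  1. -1. -1.]
```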