Paper: Xtreme Margin: A Tunable Loss Function for Binary Classification Problems
This is a paper summary generated from summarizepaper.com, which I edited for better understanding.
How Xtreme Margin is different
Xtreme Margin offers greater flexibility with tunable hyperparameters λ1 and λ2
Hyperparameters allow users to adjust training based on desired outcomes (precision, AUC score, conditional accuracy)
Xtreme Margin is also non-convex and non-differentiable in certain cases, in particular when the model's prediction is incorrect.
Gradient-based optimization methods may therefore not be directly applicable, but alternative techniques such as subgradient optimization can be used.
The subgradient method is a simple algorithm for minimizing a nondifferentiable convex function. The method looks very much like the ordinary gradient method for differentiable functions, but with several notable exceptions. For example, the subgradient method uses step lengths that are fixed ahead of time, instead of an exact or approximate line search as in the gradient method. Unlike the ordinary gradient method, the subgradient method is not a descent method; the function value can (and often does) increase.
The subgradient method is far slower than Newton’s method, but is much simpler and can be applied to a far wider variety of problems. By combining the subgradient method with primal or dual decomposition techniques, it is sometimes possible to develop a simple distributed algorithm for a problem.
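To make this concrete, here is a minimal Python sketch of the subgradient method (my own illustration, not code from the paper), minimizing the non-differentiable convex function f(x) = |x|. Note the pre-specified step lengths and the need to track the best iterate, since the method is not a descent method.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps):
    """Minimize a (possibly non-differentiable) convex function f.

    f       : objective function
    subgrad : returns any subgradient of f at x
    x0      : starting point
    steps   : step lengths chosen ahead of time (no line search)
    """
    x = x0
    best_x, best_f = x0, f(x0)
    for step in steps:
        x = x - step * subgrad(x)      # step along the negative subgradient
        if f(x) < best_f:              # keep the best iterate seen so far:
            best_x, best_f = x, f(x)   # the function value can increase
    return best_x, best_f

# Example: f(x) = |x|, non-differentiable at 0.
f = lambda x: abs(x)
subgrad = lambda x: np.sign(x) if x != 0 else 0.0  # any value in [-1, 1] is valid at 0
steps = [1.0 / (k + 1) for k in range(100)]        # diminishing steps, fixed in advance
x_best, f_best = subgradient_method(f, subgrad, 0.8, steps)
print(x_best, f_best)
```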
Xtreme Margin is a promising alternative for binary classification problems.
Xtreme Margin loss function
\(L(y, y_{true}; \lambda_1, \lambda_2) = \frac{1}{1 + (\sigma(y, y_{true}) + \gamma)}\)
\(\gamma = 1_{[y_{true} = y_{pred} \ \& \ y_{true} = 0]} \cdot \lambda_1 (2y - 1)^2 + 1_{[y_{true} = y_{pred} \ \& \ y_{true} = 1]} \cdot \lambda_2 (2y - 1)^2\)
The \(\lambda_1 (2y - 1)^2\) term of the expression above is the extreme margin term. It is derived from the squared difference between the conditional probability prediction score of belonging to the default class and that of belonging to the non-default class, since \((2y - 1)^2 = (y - (1 - y))^2\).
\(1_A(x) := \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases}\)
\(\sigma(y, y_{true}) := \begin{cases} 0 & \text{if } |y - y_{true}| < 0.5 \\ \frac{1}{e^{|y_{true} - y|}} - 1 & \text{otherwise} \end{cases}\)
\(y_{pred} := \begin{cases} 1 & \text{if } y \ge 0.50 \\ 0 & \text{if } y < 0.50 \end{cases}\)
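Putting the definitions together, here is a minimal NumPy sketch of the loss for a single prediction (my own transcription of the formulas above, not code from the paper); y is the predicted probability of the positive class and y_true is the 0/1 label:

```python
import numpy as np

def xtreme_margin_loss(y, y_true, lam1=1.0, lam2=1.0):
    """Xtreme Margin loss for one prediction, following the formulas above.

    y      : predicted probability of the positive class, in [0, 1]
    y_true : true label, 0 or 1
    lam1   : weight on confident correct predictions of class 0
    lam2   : weight on confident correct predictions of class 1
    """
    y_pred = 1 if y >= 0.5 else 0

    # sigma term: 0 for a correct prediction, negative otherwise,
    # which pushes the loss above 1 when the prediction is wrong
    if abs(y - y_true) < 0.5:
        sigma = 0.0
    else:
        sigma = 1.0 / np.exp(abs(y_true - y)) - 1.0

    # gamma term: the extreme margin reward, active only on correct predictions
    gamma = 0.0
    if y_pred == y_true and y_true == 0:
        gamma = lam1 * (2 * y - 1) ** 2
    elif y_pred == y_true and y_true == 1:
        gamma = lam2 * (2 * y - 1) ** 2

    return 1.0 / (1.0 + sigma + gamma)
```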
On the Ionosphere dataset used in the experiments, even though the binary cross-entropy loss function achieved a higher mean cross-validation accuracy than the Xtreme Margin loss function, its conditional accuracy cannot be manually controlled, as it is determined implicitly by the loss function during training. In some situations, it suffers from low conditional accuracy for one or both classes.
The tunable component of Xtreme Margin enables practitioners to choose whether they want to maximize precision or recall. Since there is a tradeoff between precision and recall (as the precision increases, the recall decreases and vice versa), one has to place more emphasis on a particular metric depending on the use case.
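For intuition, a quick illustration using the xtreme_margin_loss sketch above (values and settings are my own, not from the paper): raising λ2 lowers the loss for confident, correct positive predictions, nudging training toward the positive class, while raising λ1 does the same for the negative class.

```python
# Confident, correct positive prediction (y = 0.95, y_true = 1)
print(xtreme_margin_loss(0.95, 1, lam1=1.0, lam2=1.0))  # baseline
print(xtreme_margin_loss(0.95, 1, lam1=1.0, lam2=5.0))  # larger lambda_2 -> smaller loss

# Confident, correct negative prediction (y = 0.05, y_true = 0)
print(xtreme_margin_loss(0.05, 0, lam1=1.0, lam2=1.0))  # baseline
print(xtreme_margin_loss(0.05, 0, lam1=5.0, lam2=1.0))  # larger lambda_1 -> smaller loss
```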
Written on July 16th, 2023 by Karthik