Tunable loss functions for binary classification problems

Paper: Xtreme Margin: A Tunable Loss Function for Binary Classification Problems

This is a paper summary generated by summarizepaper.com, which I edited for clarity.

Introduction

Loss functions are central to optimizing machine learning models: the choice of loss function shapes the training process and what the model learns. Binary classification is widely used across applications, and the traditional loss functions for it include binary cross-entropy and hinge loss.

How Xtreme Margin is different

Xtreme Margin offers greater flexibility through two tunable hyperparameters, λ1 and λ2.
These hyperparameters let users steer training toward a desired outcome (precision, AUC score, or conditional accuracy for a particular class).
Xtreme Margin is also non-convex, and it is non-differentiable in certain cases where the prediction is incorrect.

Gradient-based optimization methods may therefore not be directly applicable; alternative techniques such as subgradient optimization can be used instead.

The subgradient method is a simple algorithm for minimizing a nondifferentiable convex function. The method looks very much like the ordinary gradient method for differentiable functions, but with several notable exceptions. For example, the subgradient method uses step lengths that are fixed ahead of time, instead of an exact or approximate line search as in the gradient method. Unlike the ordinary gradient method, the subgradient method is not a descent method; the function value can (and often does) increase.
  
The subgradient method is far slower than Newton’s method, but is much simpler and can be applied to a far wider variety of problems. By combining the subgradient method with primal or dual decomposition techniques, it is sometimes possible to develop a simple distributed algorithm for a problem.
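As a minimal sketch of the subgradient method described above, the following Python snippet minimizes the nondifferentiable convex function f(x) = |x - 3| with step lengths fixed ahead of time (1/k at iteration k). The objective, the step-size rule, and all function names are illustrative assumptions and do not come from the paper. Because the method is not a descent method, the sketch keeps track of the best point seen so far.

```python
def f(x):
    # Toy objective (assumed for illustration): convex, with a kink at x = 3.
    return abs(x - 3.0)


def subgradient(x):
    # One valid subgradient of f at x. At the kink, any value in [-1, 1]
    # is a subgradient; 0 is chosen here.
    if x > 3.0:
        return 1.0
    if x < 3.0:
        return -1.0
    return 0.0


def subgradient_method(x0, num_steps=200):
    x = x0
    best_x, best_f = x, f(x)
    history = [best_f]  # objective value after each iteration
    for k in range(1, num_steps + 1):
        step = 1.0 / k  # step length fixed in advance, no line search
        x = x - step * subgradient(x)
        fx = f(x)
        history.append(fx)
        # The objective can go up between iterations, so remember the best.
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f, history


best_x, best_f, history = subgradient_method(x0=0.0)
print(best_x, best_f)
```

Inspecting `history` shows that the objective value is not monotone: once the iterate overshoots the minimizer, the next step can increase f before later, smaller steps bring it back down.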

Xtreme Margin is a promising alternative for binary classification problems.

Formula

Xtreme Margin loss function

\(L(y, y_{true}; \lambda_1, \lambda_2) = \frac{1}{1 + (\sigma(y, y_{true}) + \gamma)}\)

\(\gamma = 1_{[y_{true} = y_{pred} \ \& \ y_{true} = 0]} \cdot \lambda_1 (2y - 1)^2 + 1_{[y_{true} = y_{pred} \ \& \ y_{true} = 1]} \cdot \lambda_2 (2y - 1)^2\)

The \(\lambda_1 (2y - 1)^2\) term of the expression above (and likewise the \(\lambda_2 (2y - 1)^2\) term) is the extreme margin term. It is derived from the squared difference between the conditional probability prediction score of belonging to the default class and that of belonging to the non-default class: \((y - (1 - y))^2 = (2y - 1)^2\).

\(1_A(x) := \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases}\)

\(\sigma(y, y_{true}) := \begin{cases} 0 & \text{if } |y - y_{true}| < 0.5 \\ \frac{1}{e|y_{true} - y|} - 1 & \text{otherwise} \end{cases}\)

\(y_{pred} := \begin{cases} 1 & \text{if } y \ge 0.50 \\ 0 & \text{if } y < 0.50 \end{cases}\)
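To make the definitions above concrete, here is a plain-Python sketch that evaluates the loss for a single example. It is not the paper's code. It assumes that y is the predicted probability of class 1, that a prediction of exactly 0.5 is assigned to class 1, and that \(e|y_{true} - y|\) is read as the product of e and the absolute difference. The function name `xtreme_margin_loss` is hypothetical.

```python
import math


def xtreme_margin_loss(y, y_true, lambda_1, lambda_2):
    """Xtreme Margin loss for one example (illustrative sketch).

    y        -- predicted probability of class 1, in [0, 1] (assumption)
    y_true   -- true label, 0 or 1
    lambda_1 -- weight of the margin term for correct class-0 predictions
    lambda_2 -- weight of the margin term for correct class-1 predictions
    """
    y_pred = 1 if y >= 0.5 else 0  # ties at 0.5 go to class 1 (assumption)
    diff = abs(y - y_true)

    # sigma is 0 when the score is within 0.5 of the label; otherwise it is
    # 1 / (e * |y_true - y|) - 1, reading e|.| as a product (assumption).
    sigma = 0.0 if diff < 0.5 else 1.0 / (math.e * diff) - 1.0

    # gamma is nonzero only for correct predictions; the indicator picks
    # lambda_1 for true class 0 and lambda_2 for true class 1.
    gamma = 0.0
    if y_pred == y_true:
        weight = lambda_1 if y_true == 0 else lambda_2
        gamma = weight * (2.0 * y - 1.0) ** 2

    return 1.0 / (1.0 + sigma + gamma)


correct = xtreme_margin_loss(0.9, 1, lambda_1=1.0, lambda_2=2.0)
wrong = xtreme_margin_loss(0.1, 1, lambda_1=1.0, lambda_2=2.0)
print(correct, wrong)
```

Under these assumptions, a correct prediction gives a loss of at most 1, and a larger λ for that class lowers the loss of a confident correct prediction further. An incorrect prediction gives a loss above 1 that grows with the distance between y and the true label.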

TensorFlow implementation

Conclusion

In the paper's experiment on the Ionosphere dataset, binary cross-entropy achieved a higher mean cross-validation accuracy than Xtreme Margin. However, the conditional accuracy of binary cross-entropy cannot be controlled manually, because it is determined implicitly while training on the loss function. In some situations, it suffers from low conditional accuracy for one or both classes.

The tunable component of Xtreme Margin lets practitioners choose whether to maximize precision or recall. Because the two trade off against each other (as precision increases, recall tends to decrease, and vice versa), one has to place more emphasis on a particular metric depending on the use case.
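The metrics discussed above can be sketched in a few lines of Python. The labels and scores below are made-up toy data, and the function names are hypothetical. "Conditional accuracy" is taken here to mean accuracy within each true class, which is an assumption about the paper's definition. The sketch moves along the precision/recall tradeoff by changing a decision threshold, purely for illustration; it does not train a model or tune λ1 and λ2.

```python
def confusion_counts(labels, preds):
    """Return (tp, fp, fn, tn) for binary labels and predictions in {0, 1}."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(labels, scores, threshold):
    """Precision, recall, and per-class accuracy at a given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp, fp, fn, tn = confusion_counts(labels, preds)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Accuracy restricted to examples whose true class is 1 or 0.
        "class_1_accuracy": tp / (tp + fn) if tp + fn else 0.0,
        "class_0_accuracy": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy data, invented for this example only.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.40, 0.70, 0.55, 0.30, 0.10]

loose = summarize(labels, scores, threshold=0.5)
strict = summarize(labels, scores, threshold=0.8)
print(loose)
print(strict)
```

On this toy data the stricter threshold raises precision and class-0 accuracy while lowering recall, which is the kind of tradeoff a practitioner has to weigh for a given use case.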