Karthik Yearning Deep Learning

Statistical stopping of neural network training

Paper: Statistically Significant Stopping of Neural Network Training

GitHub: Code Repository



A neural network learns very little once its optimal values have been found, so stopping training at that point does not affect the final accuracy of the model.

From a runtime perspective, this is of great significance when numerous neural networks are trained simultaneously.

This paper introduces a statistical significance test to determine if a neural network has stopped learning.

Additionally, this method can be used as a new learning rate scheduler.



Currently, the optimal place to stop a neural network's training is taken to be the point where the test data error is at its minimum. Existing solutions such as early stopping look for


In the context of AutoML, two of the most popular conditions are used along with the conditions above.

This paper introduces a statistical significance test to determine if a neural network has stopped learning by looking only at the test set accuracy curve. The test used in this paper is an extension of the Shapiro-Wilk test and is named Augmented Shapiro-Wilk Stopping (ASWS).

This method stops in 77% or fewer steps than all popular conditions. Other methods stop too early, at the expense of 2-4% of final accuracy, even with tuned hyperparameters.



Some Background



The Shapiro-Wilk test checks whether a sample of data points is consistent with having been drawn from a normal distribution: a low p-value is evidence against normality. It is widely regarded as one of the most powerful normality tests.

The single-sample t-test checks whether the mean of the distribution a sample was drawn from differs from a specified value: a low p-value is evidence that it does.
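To make the two tests concrete, here is a minimal sketch using SciPy's `scipy.stats.shapiro` and `scipy.stats.ttest_1samp`. The helper name `normality_and_mean_pvalues` and its default `popmean=0.0` are assumptions made for illustration and do not come from the paper.

```python
import numpy as np
from scipy import stats


def normality_and_mean_pvalues(sample, popmean=0.0):
    """Return (p_normal, p_mean) for a 1-D sample.

    p_normal: Shapiro-Wilk p-value (low means "probably not normal").
    p_mean:   one-sample t-test p-value (low means "mean probably differs
              from popmean").
    The function name and defaults are illustrative, not from the paper.
    """
    sample = np.asarray(sample, dtype=float)
    _, p_normal = stats.shapiro(sample)
    _, p_mean = stats.ttest_1samp(sample, popmean=popmean)
    return float(p_normal), float(p_mean)
```

Both p-values are usually compared against a significance level such as 0.05. Failing to reject is not proof of normality or of a zero mean; it only means the sample gives no strong evidence against them.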

Clipped Exponential Smoothing is a method for smoothing time series data.
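The post does not spell out how the clipping works, so the following is only a sketch of one plausible reading: ordinary exponential smoothing in which each new observation is first clamped to within `clip` of the current smoothed value, so a single outlier cannot drag the curve far. The function name, the parameters `alpha` and `clip`, and the clipping rule itself are assumptions, and the paper's definition may differ.

```python
def clipped_exponential_smoothing(values, alpha=0.3, clip=0.05):
    """Exponentially smooth a sequence, clamping each new value first.

    ASSUMPTION: "clipped" is read here as limiting each observation to
    [level - clip, level + clip] before blending; this is an illustrative
    interpretation, not necessarily the paper's.
    """
    smoothed = []
    level = None
    for v in values:
        v = float(v)
        if level is None:
            # First observation initialises the smoothed level.
            level = v
        else:
            bounded = min(max(v, level - clip), level + clip)
            level = alpha * bounded + (1.0 - alpha) * level
        smoothed.append(level)
    return smoothed
```

A smaller `alpha` gives a smoother but slower-reacting curve, while a smaller `clip` suppresses spikes more aggressively.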



Augmented Shapiro Wilk Stopping

During training, accuracy on the test dataset generally increases, but with a high degree of noise from random sampling of the data and numeric errors, amongst other sources. When the variations in the test accuracy curve become purely noise and their mean is zero, you can be fairly confident that learning has stopped.

Per the central limit theorem, when these variations are random they will also be approximately normally distributed. The Shapiro-Wilk test can tell you whether the recent accuracy values are consistent with a normal distribution, and the single-sample t-test can tell you whether they are consistent with a zero mean.

The problem with this is the nature of the noise during training. If the noise of an error curve is too extreme, any real changes are washed out. Furthermore, the noise seen is very dependent on the neural network's loss landscape, meaning that the variations are not independent and identically distributed (i.i.d.) random variables on small time scales. These factors make any statistical analysis very challenging.

There are three mitigations to the problem of noise; used together, they still allow good results to be achieved.
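The idea of combining the two tests can be sketched as a simple stopping check. This is not the ASWS implementation from the paper: the function name `looks_converged`, the `window` and `alpha` parameters, and the choice to test step-to-step differences of the accuracy curve are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def looks_converged(test_acc, window=30, alpha=0.05):
    """Heuristic check: are recent accuracy changes zero-mean normal noise?

    ASSUMPTIONS (not from the paper): "variations" are taken to be
    successive differences over the last `window` steps, and both tests
    use the same significance level `alpha`.
    """
    acc = np.asarray(test_acc, dtype=float)
    if acc.size < window + 1:
        return False  # not enough history to test yet

    diffs = np.diff(acc[-(window + 1):])

    # Shapiro-Wilk: a low p-value means the changes do not look normal.
    _, p_normal = stats.shapiro(diffs)
    # One-sample t-test: a low p-value means the mean change is not zero,
    # i.e. the curve is still trending.
    _, p_zero_mean = stats.ttest_1samp(diffs, popmean=0.0)

    return bool(p_normal > alpha and p_zero_mean > alpha)
```

In a training loop, one would append the test accuracy after each evaluation and stop once `looks_converged(history)` returns `True`. This naive version treats the differences as independent samples, which the training noise does not guarantee.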



Results

The only stopping methods able to consistently achieve higher test accuracy than the ASWS method are the performance stopping method and the average difference stopping method. The difference in test accuracy is 0.5%.

This paper shows that the ASWS learning rate scheduler can achieve performance comparable to commonly used schedulers in fewer iterations. All of these advances are potentially of great use for more environmentally sustainable machine learning, faster prototyping, and far less computational expense during large AutoML training endeavours.


