Tao-Szemeredi, Significant Bits, and Under-Over Fitting

We introduce an idea about Tao-Szemeredi and its application to understanding under- and overfitting in the training of neural nets. Basically there is a certain entropy dimension which supports the most significant coarse bits. These significant coarse bits are the proper goal of training, whereas the backpropagation gradient step tends to approach and then blow past these significant coarse bits during optimization. Our hypothesis is that under- or overfitting occurs according to whether the neural net is supported below or above the dimension of the most significant bits. This allows, in our view, a priori estimates of the effectiveness of retraining and of the transferability of neural nets.
Category: complexity
Author: JHM
Published: May 25, 2024

We want to introduce an idea illustrating the Tao-Szemeredi Regularity Lemma. The idea is that the under/overfitting inflection point in supervised machine learning models can be understood as the dimension on which the most significant bits relating the input and output data are supported. First we review the basics of neural nets.

Backpropagation Gradient Step

Our reference for Machine Learning is (Chollet 2021). It’s well known that “the fundamental issue in machine learning is the tension between optimization and generalization.” Here optimization refers to adjusting models to get the best possible performance on the training data (the so-called “learning” aspect of ML), and generalization refers to the performance of the trained model on new data inputs. The goal of ML is to obtain good generalized performance while training the model on limited training data.

Neural networks (NNs) are models consisting of various layers. Each layer is a (nonlinear) transformation of the input data. NNs are compositions of layers of the form \(x\mapsto \mathrm{relu}(Wx+b)\) and \(x\mapsto Kx\). The parameters \(W, b, K\) are the weights of the layers. Training a NN consists of adjusting the weights such that the model is nearly optimal on the training data. The training loop is iterated as many times as necessary. To paraphrase from (Chollet 2021, 2.4), the training loop consists of the following steps (a minimal code sketch follows the list):

  1. Draw a batch of training samples \(x\) and corresponding targets \(y\).
  2. Run the model on \(x\) (a step called the forward pass) to obtain predictions \(y_{pred}\).
  3. Compute the loss of the model on the batch, i.e. measure the mismatch between \(y\) and \(y_{pred}\).
  4. Update all weights of the model in a way that slightly reduces the loss on this batch.
  5. Repeat as necessary.
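
Below is a minimal sketch of this loop for a toy one-hidden-layer network \(x\mapsto K\,\mathrm{relu}(Wx+b)\) with a mean-squared loss, written in plain numpy. The data and the hyperparameters (learning rate lr, step count n_steps, batch size) are illustrative assumptions, not taken from (Chollet 2021).

```python
# Minimal training-loop sketch for a toy network x -> K relu(Wx + b).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 4 input features, 1 target.
X = rng.normal(size=(200, 4))
y = np.sin(X @ np.array([1.0, -0.5, 0.25, 0.0]))[:, None]

# Weights of the two layers.
W = rng.normal(scale=0.5, size=(4, 8))
b = np.zeros(8)
K = rng.normal(scale=0.5, size=(8, 1))

lr, n_steps, batch_size = 0.05, 500, 32

for step in range(n_steps):
    # 1. Draw a batch of training samples and targets.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]

    # 2. Forward pass.
    h = np.maximum(xb @ W + b, 0.0)   # relu(Wx + b)
    y_pred = h @ K                    # Kh

    # 3. Loss: mean squared mismatch between y and y_pred.
    err = y_pred - yb
    loss = np.mean(err ** 2)

    # 4. Backpropagation: gradients of the loss w.r.t. W, b, K (chain rule),
    #    followed by a small negative gradient step.
    grad_y = 2 * err / batch_size
    grad_K = h.T @ grad_y
    grad_h = grad_y @ K.T
    grad_h[h <= 0] = 0.0              # derivative of relu
    grad_W = xb.T @ grad_h
    grad_b = grad_h.sum(axis=0)

    W -= lr * grad_W
    b -= lr * grad_b
    K -= lr * grad_K
```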

Step 4 is the so-called backpropagation step. Using the differentiability of the loss function, we can abstractly realize Step 4 via a backwards gradient flow: we differentiate the loss function with respect to the weights, and take a small negative gradient step to decrease the loss. In the literature this is called “stochastic gradient descent”, since the batch drawn in Step 1 is essentially random, and therefore the backwards gradient step is somewhat randomly chosen. In practice the loss function is a composition of many relu layers and tensor operations, and therefore the gradient of the loss is computed symbolically via the chain rule.

The backwards gradient step is somewhat underdetermined. Indeed the step size of the gradient flow is arbitrary, and likewise the number of gradient step iterations is left open. A priori we do not know which step sizes to use, nor how many iterations can be taken before the model becomes overfit.

Under/Overfitting, Entropy, and the Inflection Point.

When the model is beginning to be trained, there is typically some apparent correlation between optimization and generalization. As the model begins to perform better on training data, the generalized performance on new data also improves. At this stage the model is considered “underfit”: there are still correlations in the training data which the model is learning, and these correlations are relevant to the generalized data. But after a certain amount of training (i.e. backpropagation gradient steps), the generalized performance stops improving and begins to degrade. The model is “overfit” at this stage of training, and is understood to be learning patterns which are hyper-specific to the training data and irrelevant to the new data.
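
In practice this inflection point is usually detected empirically, by monitoring the loss on a held-out validation set and stopping once it stops improving. The following is a minimal sketch of that standard heuristic (not the method proposed in this post); the callables train_step and val_loss and the patience parameter are hypothetical placeholders.

```python
# Standard early-stopping heuristic: stop when the held-out validation loss has
# not improved for `patience` consecutive steps (a proxy for the inflection
# point where underfitting turns into overfitting).
def train_with_early_stopping(train_step, val_loss, n_steps, patience=10):
    """train_step() performs one gradient step; val_loss() evaluates held-out loss."""
    best, best_step, history = float("inf"), 0, []
    for step in range(n_steps):
        train_step()
        loss = val_loss()
        history.append(loss)
        if loss < best:
            best, best_step = loss, step
        elif step - best_step >= patience:
            break  # validation loss stalled: assume we are past the inflection point
    return best_step, history
```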

The broader question which we are studying is how to see entropy as a dimension. Given the problem of predicting correlations, we want to identify the dimension of the correlation support. Influential in our thinking is the Tao-Szemeredi Regularity Lemma (Tao 2005). The basic problem of under/overfitting is that the backpropagation gradient step disregards the dimension of the correlation, and the usual optimization step will approach and then “blow past” the dimension of the correlation. Our proposal is that the “inflection point” where underfitting becomes overfitting can be characterized as the dimension of the correlation.

Our critique of the backpropagation gradient step is that its variational principle is not strictly correct. Is the purpose of training to optimize the loss function on the training data? Or is the purpose of training to identify the significant coarse bits which support the majority of the correlation? We argue that generalization requires the significant coarse bits to be identified, especially since we are interested in new inputs beyond the training batch. The training should then emphasize the “performance loss” with respect to weights defined on the coarse parameters.

This raises the interesting question of deciding the dimension of various correlations. For example, probabilists can always argue via the Borel Isomorphism Theorem (Hatko 2013) that the naive topological dimension of a probability measure is not invariantly defined, every measure space being measure-isomorphic to a subset of the unit interval. But the entropy of the measure space is invariantly defined, and is essentially related to the minimal dimension by which the states of the measure are distinguished.
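
A small illustration of this invariance (our own example, not from (Hatko 2013)): the Shannon entropy of a discrete distribution is unchanged by any relabeling of its states, whereas any apparent “dimension” of how those states happen to be embedded is not.

```python
# Shannon entropy is invariant under relabeling (measure isomorphism) of states.
import numpy as np

def entropy_bits(p):
    """Shannon entropy, in bits, of a discrete probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

p = [0.5, 0.25, 0.125, 0.125]
print(entropy_bits(p))                         # 1.75 bits
print(entropy_bits(np.random.permutation(p)))  # still 1.75 bits after relabeling
```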

Transferability Problem.

Spotify wants to predict user preferences. But users come and go, they drop in and drop out, and the user distribution can sharply change over time. Yet the GNNs which are developed should transfer onto new users. Is it necessary to retrain the GNN from scratch? This is unreasonable. But can we predict a priori the performance of these pretrained GNNs on new user distributions? And if there is a sharper change in preferences, at what point does it become necessary to retrain the GNN?

In our approach, the transferability problem and the underfitting/overfitting problem are examined from the viewpoint of Tao-Szemeredi. The point of Szemeredi, in the Spotify setting, is that you do not need an algorithm trained on the total fine user bits. There is a maximal number of coarse bits which correlate, and everything is independent of all further bits. For example, highly detailed fine information about the users is not relevant to Spotify preferences. There are only a few coarse bits (age, gender, basics) which are relevant to predicting the events \(E\), and such that \(E\) is independent of any further bits.

Therefore we train models on given distributions, but we also look to train and decide the maximally correlated coarse variables. The transfer to new distributions then immediately knows which coarse variables are significantly correlated.
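
One crude way to decide which coarse variables carry the correlation is to rank them by empirical mutual information with the predicted events. The sketch below is our own illustration with hypothetical features (age_band, region, noise_id), not a description of Spotify’s pipeline.

```python
# Rank coarse variables by empirical mutual information with the target event.
import numpy as np
from collections import Counter

def mutual_information_bits(x, y):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

rng = np.random.default_rng(0)
age_band = rng.integers(0, 4, size=5000)    # coarse bits: age band
region = rng.integers(0, 3, size=5000)      # coarse bits: region
noise_id = rng.integers(0, 100, size=5000)  # fine bits: essentially a user ID
# The event depends (noisily) only on the coarse age band.
event = (age_band >= 2).astype(int) ^ (rng.random(5000) < 0.1)

for name, col in [("age_band", age_band), ("region", region), ("noise_id", noise_id)]:
    print(name, round(mutual_information_bits(col, event), 3), "bits")
# age_band carries roughly half a bit; the other variables carry essentially none.
```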

The inflection point for underfitting/overfitting is given by the maximal-entropy coarse bits. When the backpropagation (gradient step) is applied beyond the coarse bits, the model is being forced to gradient-step on bits which are independent of the events. This is where performance on generalized input will diminish. The model is training parameters based on irrelevant bits!

What happens to GNNs when they are trained on inputs which are independent of their outputs? A lot of energy is wasted and one obtains a uniform measure (i.e. maximal entropy).

Examples of sub-bits and coarse bits are provided by images of deterministic maps/reductions \(X \mapsto Z_1\) and \(Y_1 \mapsto Z_2\). This is what happens when a customer is asked to provide some “personal information”.
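
As a concrete (hypothetical) instance of such a deterministic reduction, a fine-grained variable like age in years can be coarsened into a small number of bands:

```python
# A deterministic coarse reduction Z = f(X): fine age in years -> coarse age band.
def age_band(age_years: int) -> int:
    """Coarse bits: 0 = under 18, 1 = 18-34, 2 = 35-54, 3 = 55 and over."""
    if age_years < 18:
        return 0
    if age_years < 35:
        return 1
    if age_years < 55:
        return 2
    return 3
```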

What is the answer to the under/over problem? Can we predict whether we are before or after the inflection point? Is it worth retraining, and can we have any expectation of performance improvement? This would depend on the current dimension of the correlation.

[to be continued, JHM]

References

Chollet, Francois. 2021. Deep Learning with Python. Simon & Schuster.
Hatko, Stan. 2013. “Borel Isomorphic Dimensionality Reduction of Data and Supervised Learning.” arXiv preprint arXiv:1307.8333.
Tao, Terence. 2005. “Szemeredi’s Regularity Lemma Revisited.” arXiv preprint math/0504472.