# Modelling with more variables than data points

It’s certainly possible to fit good models when there are more variables than data points, but this must be done with care.

When there are more variables than data points, the problem may not have a unique solution unless it’s further constrained. That is, there may be multiple (perhaps infinitely many) solutions that fit the data equally well. Such a problem is called ‘ill-posed’ or ‘underdetermined’. For example, when there are more variables than data points, standard least squares regression has infinitely many solutions that achieve zero error on the training data.
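As a minimal sketch of this non-uniqueness, the NumPy snippet below builds a small random regression problem with 5 data points and 20 variables, then constructs two different weight vectors that both fit the training data exactly. The data, the sizes, and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5, 20  # more variables than data points (illustrative sizes)
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# One exact solution: the minimum-norm least squares solution via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# The rows of Vt beyond the rank of X span the null space of X.
_, singular_values, Vt = np.linalg.svd(X)
rank = int(np.sum(singular_values > 1e-10))
null_basis = Vt[rank:]  # each row v satisfies X @ v ≈ 0

# Adding any null-space direction gives a different solution with the same fit.
w_other = w_min_norm + 3.0 * null_basis[0]

residual_min_norm = np.linalg.norm(X @ w_min_norm - y)
residual_other = np.linalg.norm(X @ w_other - y)
print(residual_min_norm, residual_other)  # both numerically zero
```

Any multiple of any null-space direction could be added in the same way, which is where the infinitely many solutions come from.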

Such a model will almost always overfit because it’s ‘too flexible’ for the amount of training data. As model flexibility increases (e.g. more variables in a regression model) and the amount of training data shrinks, it becomes increasingly likely that the model will be able to achieve a low error by fitting random fluctuations in the training data that don’t represent the true, underlying distribution. Performance will therefore be poor when the model is run on future data drawn from the same distribution.
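To illustrate fitting random fluctuations, here is a small sketch in which the targets are pure noise, independent of the inputs, so the best possible predictor is simply zero. The setup (Gaussian data, 5 points, 20 variables, the minimum-norm least squares solution) is an assumption chosen for illustration, not the only way this can play out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 5, 10_000, 20  # illustrative sizes

# Targets are pure noise: there is no real relationship to learn.
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.normal(size=n_test)

# Unregularized (minimum-norm) least squares fits the training noise exactly.
w = np.linalg.pinv(X_train) @ y_train

train_mse = np.mean((X_train @ w - y_train) ** 2)
test_mse = np.mean((X_test @ w - y_test) ** 2)
print(train_mse, test_mse)
```

The training error is numerically zero, while the error on fresh data from the same distribution is not: the model has memorized noise rather than learned structure.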

The problems of ill-posedness and overfitting can both be addressed by imposing constraints. This can take the form of explicit constraints on the parameters, a penalty/regularization term, or a Bayesian prior. Training then becomes a tradeoff between fitting the data well and satisfying the constraints. We mentioned two examples of this strategy for regression problems: 1) LASSO constrains or penalizes the $\ell_1$ norm of the weights, which is equivalent to imposing a Laplacian prior. 2) Ridge regression constrains or penalizes the $\ell_2$ norm of the weights, which is equivalent to imposing a Gaussian prior.
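For concreteness, the two penalized problems and their link to priors can be written out as follows. The notation is my own assumption: $X$ is the data matrix, $y$ the targets, $w$ the weights, $\lambda \ge 0$ the penalty strength, $\sigma^2$ the noise variance, $b$ the scale of the Laplacian prior, and $\tau^2$ the variance of the Gaussian prior. The equivalence is in the sense of maximum a posteriori (MAP) estimation under Gaussian noise.

```latex
% Penalized forms
\hat{w}_{\text{LASSO}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_1
\qquad
\hat{w}_{\text{ridge}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2

% MAP view, assuming y \mid X, w \sim \mathcal{N}(Xw, \sigma^2 I)
-\log p(w \mid X, y) = \frac{1}{2\sigma^2} \|y - Xw\|_2^2 - \log p(w) + \text{const}

% Laplacian prior: p(w) \propto \exp(-\|w\|_1 / b)
%   \Rightarrow  -\log p(w) = \tfrac{1}{b} \|w\|_1 + \text{const},  so  \lambda = 2\sigma^2 / b
% Gaussian prior: p(w) = \mathcal{N}(0, \tau^2 I)
%   \Rightarrow  -\log p(w) = \tfrac{1}{2\tau^2} \|w\|_2^2 + \text{const},  so  \lambda = \sigma^2 / \tau^2
```

Multiplying the negative log posterior by $2\sigma^2$ recovers the penalized objectives, which is why a tighter prior (smaller $b$ or $\tau^2$) corresponds to a stronger penalty.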

Constraints can yield a unique solution, which is desirable when we want to interpret the model to learn something about the process that generated the data. They can also yield better predictive performance by limiting the model’s flexibility, thereby reducing the tendency to overfit.
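A rough sketch of the uniqueness point, using ridge regression in its closed form: with more variables than data points, $X^T X$ is singular, but adding $\lambda I$ makes the system positive definite, so exactly one solution exists. The data and the value `lam = 1.0` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5, 20  # more variables than data points (illustrative sizes)
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

lam = 1.0  # hypothetical penalty strength; in practice chosen e.g. by cross validation

# X^T X is 20 x 20 but has rank at most 5, so the unpenalized normal equations are singular.
gram = X.T @ X
gram_rank = np.linalg.matrix_rank(gram)

# Adding lam * I shifts every eigenvalue up by lam, making the matrix positive definite.
ridge_matrix = gram + lam * np.eye(n_features)
min_eigenvalue = np.linalg.eigvalsh(ridge_matrix).min()

# The ridge solution is therefore unique.
w_ridge = np.linalg.solve(ridge_matrix, X.T @ y)

# Compare with the minimum-norm solution of the unpenalized problem.
w_min_norm = np.linalg.pinv(X) @ y
ridge_residual = np.linalg.norm(X @ w_ridge - y)
print(gram_rank, min_eigenvalue, ridge_residual)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_min_norm))
```

The ridge solution no longer fits the training data exactly and has a smaller norm than the unpenalized minimum-norm solution; this is the tradeoff between fit and constraint in action. Whether it predicts better depends on whether the penalty suits the problem.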

However, simply imposing constraints or guaranteeing that a unique solution exists doesn’t imply that the resulting solution will be good. Constraints will only produce good solutions when they’re actually suited to the problem.

A couple of miscellaneous points:

• The existence of multiple solutions isn’t necessarily problematic. For example, neural nets can have many possible solutions that are distinct from each other but nearly equally good.
• The existence of more variables than data points, the existence of multiple solutions, and overfitting often coincide. But these are distinct concepts; each can occur without the others.