The main approach that we are going to take here is to interpret the outputs of our model as probabilities.
We will optimize our parameters to produce probabilities that maximize the likelihood of
the observed data. Then, to generate predictions, we will set a threshold, for example, choosing
the label with the maximum predicted probability.
Put formally, we would like any output $\hat{y}_j$ to be interpreted as the probability that a given item
belongs to class $j$. Then we can choose the class with the largest output value as our prediction,
$\operatorname{argmax}_j \hat{y}_j$. For example, if $\hat{y}_1$, $\hat{y}_2$, and $\hat{y}_3$ are 0.1, 0.8, and 0.1, respectively, then we predict category 2.
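As a minimal sketch of this decision rule in plain Python (the helper name `predict_class` is ours for illustration, not from the text), the prediction is simply the index of the largest output. Python indexes from 0, so category 2 in the example corresponds to index 1.

```python
def predict_class(probs):
    """Return the 0-based index of the largest entry in probs."""
    return max(range(len(probs)), key=lambda j: probs[j])

# The three outputs from the example in the text.
y_hat = [0.1, 0.8, 0.1]

# Add 1 to convert to the 1-based category numbering used in the text.
category = predict_class(y_hat) + 1
print(category)
```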
You might be tempted to suggest that we interpret the logits $\mathbf{o}$ directly as our outputs of interest.
However, there are some problems with directly interpreting the output of the linear layer as a
probability. On one hand, nothing constrains these numbers to sum to 1. On the other hand,
depending on the inputs, they can take negative values. Both violate basic axioms of probability.
To interpret our outputs as probabilities, we must guarantee that (even on new data), they will be
nonnegative and sum up to 1. Moreover, we need a training objective that encourages the model
to estimate probabilities faithfully.
Of all instances when a classifier outputs 0.5, we hope that half of those examples will actually belong to the predicted class. This property is called calibration.
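One rough way to check this property on held-out data is to collect the predictions whose confidence is close to some target value and measure how often they were right. The sketch below assumes made-up confidences and correctness flags; the function name `empirical_accuracy_at` and the tolerance `tol` are hypothetical choices, not part of the text.

```python
def empirical_accuracy_at(confidences, correct, target, tol=0.05):
    """Fraction of correct predictions among those whose confidence
    lies within tol of target; None if no prediction falls in that range."""
    hits = [c for p, c in zip(confidences, correct) if abs(p - target) <= tol]
    if not hits:
        return None
    return sum(hits) / len(hits)

# Hypothetical data: predicted confidence and whether the prediction was right.
confidences = [0.5, 0.5, 0.5, 0.5, 0.9, 0.9]
correct = [1, 0, 1, 0, 1, 1]

# For a well-calibrated model this should be close to the target of 0.5.
print(empirical_accuracy_at(confidences, correct, 0.5))
```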
The softmax function, invented in 1959 by the social scientist R. Duncan Luce in the context of
choice models, does precisely this. To transform our logits such that they become nonnegative and
sum to 1, while requiring that the model remains differentiable, we first exponentiate each logit
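(ensuring non-negativity) and then divide by their sum (ensuring that they sum to 1):

$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where} \quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}.$$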
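The two steps, exponentiating and then normalizing, can be sketched in a few lines of plain Python. The logit values are illustrative only, and subtracting the maximum logit before exponentiating is a common numerical-stability measure that we add here as an assumption; it does not change the result.

```python
import math

def softmax(logits):
    """Exponentiate each logit, then divide by the sum of the exponentials."""
    # Shifting by the max avoids overflow in exp() and leaves the output
    # unchanged (an addition for stability, not described in the text).
    m = max(logits)
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits: one is negative and they do not sum to 1.
logits = [-1.0, 2.0, 0.5]
probs = softmax(logits)
print(probs)       # every entry is nonnegative
print(sum(probs))  # equals 1 up to floating-point rounding
```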
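The softmax operation does not change the ordering of the logits, since exponentiation is monotonically increasing and every entry is divided by the same positive sum.
Therefore, during prediction we can still pick out the most likely class by

$$\operatorname{argmax}_j \hat{y}_j = \operatorname{argmax}_j o_j.$$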
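A quick numerical check of this point: the index of the largest softmax output matches the index of the largest logit. The logits below are arbitrary illustrative values, and the helper `argmax` is ours.

```python
import math

def softmax(logits):
    """Plain softmax: exponentiate, then normalize by the sum."""
    exps = [math.exp(o) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the 0-based index of the largest entry."""
    return max(range(len(values)), key=lambda j: values[j])

logits = [0.3, -1.2, 2.5, 0.0]  # arbitrary example values

# Softmax preserves the ordering, so both argmaxes agree.
same = argmax(softmax(logits)) == argmax(logits)
print(same)
```

In practice this means the softmax step can be skipped at prediction time if only the predicted label is needed.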
Although softmax is a nonlinear function, the outputs of softmax regression are still determined by an affine transformation of input features; thus, softmax regression is a linear model.
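To make the "affine transformation followed by softmax" structure concrete, here is a small sketch. The names `W`, `b`, and `x` (weights, biases, and input features) and their values are assumptions chosen for illustration; they are not defined in this passage.

```python
import math

def affine(W, b, x):
    """Compute the logits o = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def softmax(o):
    """Plain softmax: exponentiate, then normalize by the sum."""
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters: 3 classes, 2 input features.
W = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
b = [0.0, 0.1, -0.2]
x = [1.0, 2.0]

o = affine(W, b, x)   # the only place where the input features enter
y_hat = softmax(o)    # nonlinear, but applied to the affine output alone
print(o, y_hat)
```

All dependence on the input runs through the affine map, which is why the model still counts as linear even though its outputs pass through a nonlinear normalization.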
Reference: Dive into Deep Learning Release 0.16.2 (Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola Mar 20,)