Linear regression may be both the simplest and the most popular of the standard tools for regression.
Dating back to the dawn of the 19th century, linear regression flows from a few simple
assumptions. First, we assume that the relationship between the independent variables x and the
dependent variable y is linear, i.e., that y can be expressed as a weighted sum of the elements
in x, given some noise on the observations. Second, we assume that any noise is well-behaved
(following a Gaussian distribution).
To motivate the approach, let us start with a running example. Suppose that we wish to estimate
the prices of houses (in dollars) based on their area (in square feet) and age (in years). To actually
develop a model for predicting house prices, we would need to get our hands on a dataset consisting
of sales for which we know the sale price, area, and age for each home. In the terminology of
machine learning, the dataset is called a training dataset or training set, and each row (here the data
corresponding to one sale) is called an example (or data point, data instance, sample). The thing we
are trying to predict (price) is called a label (or target). The independent variables (age and area)
upon which the predictions are based are called features (or covariates).
Typically, we will use n to denote the number of examples in our dataset. We index the data examples
by i, denoting each input as $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}]^\top$
and the corresponding label as $y^{(i)}$.
The linearity assumption just says that the target (price) can be expressed as a weighted sum of
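the features (area and age):
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.$$
Here, $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and b is called a bias (also called an offset or intercept).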
The weights determine the influence of each feature on our prediction and the bias just says what
value the predicted price should take when all of the features take value 0. Even if we will never
see any homes with zero area, or that are precisely zero years old, we still need the bias or else we
will limit the expressivity of our model. Strictly speaking, the model above is an affine transformation of input
features, which is characterized by a linear transformation of features via weighted sum, combined
with a translation via the added bias.
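
The weighted-sum model described above can be sketched in a few lines of Python. The function name predict_price and the weight and bias values are hypothetical, chosen only for illustration; they are not fitted to any real data.

```python
def predict_price(area, age, w_area, w_age, b):
    """Affine model: weighted sum of the features plus a bias."""
    return w_area * area + w_age * age + b


# Hypothetical parameter values, chosen only for illustration (not fitted to data).
w_area, w_age, b = 150.0, -1000.0, 50000.0

# A 2000 square foot home that is 10 years old.
price = predict_price(area=2000.0, age=10.0, w_area=w_area, w_age=w_age, b=b)
print(price)  # 150*2000 - 1000*10 + 50000 = 340000.0

# With all features at zero, the prediction equals the bias.
print(predict_price(0.0, 0.0, w_area, w_age, b))  # 50000.0
```

Dropping b would force the prediction to be 0 whenever both features are 0, which is the loss of expressivity mentioned above.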
Given a dataset, our goal is to choose the weights w and the bias b such that on average, the predictions made according to our model best fit the true prices observed in the data. Models whose
output prediction is determined by the affine transformation of input features are linear models,
where the affine transformation is specified by the chosen weights and bias.
In disciplines where it is common to focus on datasets with just a few features, explicitly expressing
models long-form like this is common. In machine learning, we usually work with high dimensional
datasets, so it is more convenient to employ linear algebra notation. When our inputs
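consist of d features, we express our prediction $\hat{y}$ (in general the “hat” symbol denotes estimates) as
$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b.$$
Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$
and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$,
we can express our model compactly using a dot product:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$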
In this expression, the vector x corresponds to the features of a single data example. We will often find it
convenient to refer to features of our entire dataset of n examples via the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$.
Here, X contains one row for every example and one column for every feature.
For a collection of features X, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$
can be expressed via the matrix-vector product:
$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b,$$
where broadcasting is applied during the summation (the scalar b is added to every entry of Xw). Given features of a training
dataset X and corresponding (known) labels y, the goal of linear regression is to find the weight
vector w and the bias term b such that, given features of a new data example sampled from the same
distribution as X, the new example's label will (in expectation) be predicted with the lowest error.
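
As a rough illustration of the vectorized form, the sketch below uses NumPy (an assumption; the text does not name a library) to compute a prediction for one example with a dot product and for all examples with a matrix-vector product. The feature values, weights, and bias are made up.

```python
import numpy as np

# Hypothetical design matrix: n = 3 examples (rows), d = 2 features (columns).
# Columns are area (square feet) and age (years); the values are made up.
X = np.array([[2000.0, 10.0],
              [1500.0, 30.0],
              [1000.0, 5.0]])
w = np.array([150.0, -1000.0])  # one weight per feature, shape (d,)
b = 50000.0                     # scalar bias

# Single example: dot product of weights and features, plus the bias.
y_hat_first = np.dot(w, X[0]) + b

# Whole dataset: matrix-vector product; the scalar b is broadcast to all n entries.
y_hat = X @ w + b

print(y_hat_first)  # 340000.0
print(y_hat)        # one prediction per row of X
```

The first entry of y_hat matches y_hat_first, since each row of X is treated exactly like a single feature vector x.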
Even if we believe that the best model for predicting y given x is linear, we would not expect to
find a real-world dataset of n examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$
for all $1 \leq i \leq n$. For example, whatever instruments we use to observe the features X and labels y might suffer a small amount of measurement error. Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to account for such errors.
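
To make the role of the noise term concrete, here is a small synthetic sketch, again assuming NumPy. The "true" weights, the bias, the feature ranges, and the noise standard deviation of 5000 dollars are all hypothetical choices, not values from any dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 5
# Made-up features: area in square feet, age in years.
X = np.column_stack([rng.uniform(500.0, 3000.0, size=n),
                     rng.uniform(0.0, 50.0, size=n)])
w_true = np.array([150.0, -1000.0])  # hypothetical "true" weights
b_true = 50000.0                     # hypothetical "true" bias

# Gaussian noise with an assumed standard deviation of 5000 dollars.
noise = rng.normal(loc=0.0, scale=5000.0, size=n)

y_clean = X @ w_true + b_true  # what a noise-free linear relationship would give
y = y_clean + noise            # what we would actually observe

# The residuals equal the noise (up to floating-point rounding), so even
# the true parameters do not reproduce the observed labels exactly.
print(y - y_clean)
```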
Before we can go about searching for the best parameters (or model parameters) w and b, which I will cover in my upcoming post, we will need two more things: (i) a quality measure for some given model; and (ii) a procedure for updating the model to improve its quality.
Reference: Taken from Dive into Deep Learning, Release 0.16.2 (Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, Mar 20,), for knowledge-sharing purposes only.