Linear regression may be both the simplest and most popular among the standard tools for regression.

Dating back to the dawn of the 19th century, linear regression flows from a few simple assumptions. First, we assume that the relationship between the independent variables **x** and the dependent variable **y** is linear, i.e., that **y** can be expressed as a weighted sum of the elements in **x**, given some noise on the observations. Second, we assume that any noise is well-behaved (following a Gaussian distribution).

To motivate the approach, let us start with a running example. Suppose that we wish to estimate the **prices of houses** (in dollars) based on their **area** (in square feet) and **age** (in years). To actually develop a model for predicting **house prices**, we would need to get our hands on a dataset consisting of sales for which we know the **sale price**, **area**, and **age** for each home. In the terminology of machine learning, the dataset is called a **training dataset** or **training set**, and each row (here the data corresponding to one sale) is called an **example** (or data point, data instance, sample). The thing we are trying to predict (price) is called a **label** (or target). The independent variables (age and area) upon which the predictions are based are called **features** (or covariates).
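As a concrete illustration of this terminology, such a training set might be stored as a feature matrix with one row per example and a label vector with one price per example. All the values below are invented for the sketch:

```python
import numpy as np

# A tiny hypothetical training set: each row is one example (one sale),
# each column is one feature (area in square feet, age in years).
features = np.array([[1500.0, 10.0],
                     [2000.0,  5.0],
                     [ 850.0, 40.0]])

# One label (sale price in dollars) per example; values are invented.
labels = np.array([175000.0, 230000.0, 98000.0])

n, d = features.shape
print(n, d)  # 3 examples, 2 features
```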

Typically, we will use $n$ to denote the number of examples in our dataset. We index the data examples by $i$, denoting each **input** as $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}]^\top$ and the corresponding **label** as $y^{(i)}$.

The linearity assumption just says that the **target** (**price**) can be expressed as a weighted sum of the **features** (**area** and **age**):

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.$$

Here, $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called **weights**, and $b$ is called a **bias** (also called an offset or intercept).
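To make the weighted sum concrete, here is a minimal sketch with invented weight and bias values; these numbers are purely illustrative, not fitted to any data:

```python
# Hypothetical parameters: dollars per square foot, dollars per year of age,
# and a base price; none of these are fitted values.
w_area, w_age, b = 90.0, -1000.0, 50000.0

# Predict the price of one (made-up) house from its two features.
area, age = 1500.0, 10.0
price = w_area * area + w_age * age + b
print(price)  # 90*1500 - 1000*10 + 50000 = 175000.0
```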

The weights determine the influence of each feature on our prediction, and the bias just says what value the predicted price should take when all of the features take value 0. Even if we will never see any homes with zero area, or that are precisely zero years old, we still need the bias or else we will limit the expressivity of our model. Strictly speaking, the model above is an **affine transformation** of the input features, which is characterized by a linear transformation of features via a weighted sum, combined with a translation via the added bias.

Given a dataset, our goal is to choose the weights **w** and the bias **b** such that, on average, the predictions made according to our model best fit the true prices observed in the data. Models whose output prediction is determined by the affine transformation of input features are linear models, where the affine transformation is specified by the chosen weights and bias.

In disciplines where it is common to focus on datasets with just a few features, explicitly expressing models long-form like this is common. In machine learning, we usually work with high-dimensional datasets, so it is more convenient to employ linear algebra notation. When our inputs consist of $d$ features, we express our prediction $\hat{y}$ (in general the "hat" symbol denotes estimates) as

$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b.$$

Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model compactly using a dot product:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$
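In NumPy, this dot-product form of the prediction might look like the following sketch; the feature and weight values are invented:

```python
import numpy as np

# One example with d = 3 illustrative features and matching weights.
x = np.array([2.0, -1.0, 0.5])
w = np.array([1.0, 3.0, 4.0])
b = 0.5

# w^T x + b: the compact form of w_1*x_1 + ... + w_d*x_d + b.
y_hat = np.dot(w, x) + b
print(y_hat)  # 2.0 - 3.0 + 2.0 + 0.5 = 1.5
```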

Here, the vector $\mathbf{x}$ corresponds to the features of a single data example. We will often find it convenient to refer to the features of our entire dataset of $n$ examples via the **design matrix** $\mathbf{X} \in \mathbb{R}^{n \times d}$, which contains one row for every example and one column for every feature.

For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix-vector product:

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b,$$

where broadcasting is applied during the summation. Given the features of a training dataset $\mathbf{X}$ and the corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given the features of a new data example sampled from the same distribution as $\mathbf{X}$, the new example's label will (in expectation) be predicted with the lowest error.
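As a sketch of this whole-dataset form, a single matrix-vector product produces all $n$ predictions at once, with the scalar bias broadcast across them. All values below are invented:

```python
import numpy as np

# Design matrix: n = 4 examples, d = 2 features (illustrative values).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])
w = np.array([0.5, -1.0])
b = 2.0

# X @ w computes n dot products; the scalar b is broadcast over all of them.
y_hat = X @ w + b
print(y_hat)  # [ 0.5 -0.5 -1.5 -2.5]
```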

Even if we believe that the best model for predicting **y** given **x** is linear, we would not expect to find a real-world dataset of $n$ examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$ for all $1 \leq i \leq n$. For example, whatever instruments we use to observe the features $\mathbf{X}$ and labels $\mathbf{y}$ might suffer a small amount of measurement error. Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to account for such errors.
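One way to see the role of the noise term is to generate synthetic data from a known linear model plus Gaussian noise. The parameter values below are chosen arbitrarily for the sketch:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Ground-truth parameters for the synthetic data (chosen arbitrarily).
true_w = np.array([2.0, -3.4])
true_b = 4.2

# n examples with d = 2 features, plus Gaussian observation noise.
n = 100
X = rng.normal(size=(n, 2))
noise = rng.normal(scale=0.01, size=n)
y = X @ true_w + true_b + noise

# The labels differ from the noiseless linear predictions only by the noise.
residual = y - (X @ true_w + true_b)
print(np.allclose(residual, noise))  # True
```

Because of the noise, no weight vector and bias can fit every label exactly; the best we can do is minimize the average error.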

In my upcoming post, I will cover how we can go about searching for the best parameters (or model parameters) **w** and **b**. For that, we will need two more things: (i) a quality measure for some given model; and (ii) a procedure for updating the model to improve its quality.

**Reference:** Taken from *Dive into Deep Learning*, Release 0.16.2 (Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, Mar 20), for knowledge-sharing purposes only.