Estimating and Interpreting the Coefficients

In this lesson we will show two different ways to estimate the slope and intercept coefficients, and then interpret them. The lesson includes some linear algebra; we will do our best to emphasize the key points of each new concept and go through it step by step, so you should be able to follow along even if you have not seen this type of math before.

Estimating the coefficients - using the SciKit Learn library

In the previous lesson we gave the slope and intercept of the best-fit-line as slope = 1.8 and intercept = -7.1. Let's see how we can estimate these values using a popular data science library called SciKit Learn.

In the following code block, we create a LinearRegression object and fit it to our data. The fit method takes two arguments: the first is the independent variable, x, and the second is the dependent variable, y. The coef_ attribute of the fitted object contains the estimated coefficients of the model. Since we have only one variable x on the right-hand side, we access its value by indexing the zeroth item. The intercept of the equation is given by the intercept_ attribute.
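The lesson's editor contains the actual code, but a minimal sketch of the fitting step is shown below. We assume the data is the classic iris dataset used in the earlier lessons, with sepal length as x and petal length as y; the loading step and variable names here are our own and may differ from the editor.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LinearRegression

    # Assumption: the lesson's data is the iris dataset
    iris = load_iris()
    x = iris.data[:, 0]  # sepal length (cm), the independent variable
    y = iris.data[:, 2]  # petal length (cm), the dependent variable

    model = LinearRegression()
    model.fit(x.reshape(-1, 1), y)  # scikit-learn expects a 2-D array of predictors

    print("slope =", model.coef_[0])        # coefficient on x
    print("intercept =", model.intercept_)  # intercept of the best-fit-line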

Running the code above, we see that we get the same values as before: slope = 1.8 and intercept = -7.1, just with additional decimal places.

Estimating the coefficients - by hand using linear algebra

In this section we will estimate the coefficients of the best-fit-line by hand using linear algebra. We will use the same data as before, but this time we will use the numpy library to create a matrix of the independent variable, x, and a vector of the dependent variable, y. We will then solve for m and b in the equation:

y = mx + b + \varepsilon

First, a note on notation: we will switch to a matrix-form representation. We collect all of the right-hand-side variables into a matrix, X, by letting the first column of X be a column of ones and the second column be the values of x.

X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \qquad \beta = \begin{pmatrix} b \\ m \end{pmatrix}

The coefficient on the first column is the intercept, b, and the coefficient on the second column is the slope, m. We collect these estimated coefficients into a vector, which we call \beta. Making these substitutions, we can rewrite the equation as:

y = X \beta + \varepsilon
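As a concrete illustration, the design matrix X can be built in numpy as in the sketch below. It reuses the x array from the scikit-learn example above; the name X is simply the matrix from the equation.

    import numpy as np

    # First column: ones (for the intercept), second column: the x values
    X = np.column_stack((np.ones_like(x), x))
    print(X.shape)  # (n, 2): one row per observation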

The next part gets a bit tricky. We want to solve for \beta, that is, to find an equation of the form \beta = something, but the error term, \varepsilon, gets in the way. We deal with it using two facts. First, since we want the best-fit-line, the line must minimize the sum of squared errors. Second, linear regression assumes there is no systematic relationship between the residuals and the independent variables. In mathematical notation, these two facts lead to the following two statements:

\sum_{i=1}^n \varepsilon_i = 0 \;\;\; \text{and} \;\;\; \sum_{i=1}^n x_i \varepsilon_i = 0

The above two statements reflect the fact that our estimates are the Best Linear Unbiased Estimators (BLUE); more formally, this is the Gauss-Markov theorem, which underlies the theory of linear regression. Because we include an intercept in the equation and placed a column of ones in the first column of X, multiplying the error vector by the transpose of X, which we denote X^T, gives zero.

X^T \varepsilon = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^n \varepsilon_i \\ \sum_{i=1}^n x_i \varepsilon_i \end{pmatrix} = 0
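We can also check this numerically. The sketch below reuses X from the construction above and the residuals from the scikit-learn fit; because of floating-point arithmetic the entries come out very close to, rather than exactly, zero.

    # Residuals: epsilon_i = y_i - predicted y_i, from the fitted model above
    residuals = y - model.predict(x.reshape(-1, 1))
    print(X.T @ residuals)  # both entries are approximately 0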

Using this fact, we can solve for \beta with the following steps.

\begin{aligned} y = X \beta + \varepsilon \; \; &(\text{start with our initial equation}) \\ \\ X^T y = X^T X \beta + X^T \varepsilon \; \; &(\text{pre-multiply both sides by } X^T) \\ X^T y = X^T X \beta \; \; &(X^T \varepsilon = 0, \text{ so remove it}) \\ (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T X \beta \; \; &(\text{pre-multiply both sides by } (X^T X)^{-1}) \\ (X^T X)^{-1} X^T y = \beta \; \; &((X^T X)^{-1} X^T X = I, \text{ the identity matrix, so remove it} ) \\ \end{aligned}

That was a lot of steps, but we got what we wanted: a closed-form solution for \beta. We can now use the numpy library to solve for \beta with this equation and compare the result to the values we got from the SciKit Learn library. In the code below we use the numpy.linalg.inv function, which we import under the alias inverse, to take the inverse of a matrix. We also use the T attribute of an array to take its transpose. Finally, we use the @ operator to multiply matrices together (we cannot use *, since that multiplies matrices element-wise instead of performing matrix multiplication).
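A sketch of this computation, reusing the X and y arrays built above, might look like the following; the lesson's editor may name things differently.

    from numpy.linalg import inv as inverse

    # beta = (X^T X)^{-1} X^T y
    beta = inverse(X.T @ X) @ X.T @ y
    print("intercept =", beta[0])  # coefficient on the column of ones
    print("slope =", beta[1])      # coefficient on x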

Running the code above, we again get the same values as before: slope = 1.8 and intercept = -7.1, just with additional decimal places.

Interpreting the coefficients

To conclude this lesson, let's go over how to interpret the coefficients. Starting with the intercept, its value is -7.1. This means that when sepal length equals zero, the predicted petal length is -7.1. This is not a very meaningful value on its own, since sepal length cannot be zero. However, the intercept is still necessary to include in the equation so that the line we estimate is the best-fit-line.

The slope of the line is 1.8. This means that for every one-unit increase in sepal length, the predicted petal length increases by 1.8. From a data scientist's perspective, since this coefficient is positive, we can infer that there is a positive correlation between sepal length and petal length. In other words, if we find a flower with a larger sepal length, we can predict that it will also have a larger petal length. This topic of prediction is the focus of the next lesson. See you there!