In this lesson we will show how the slope and intercept coefficients are estimated in two different ways, as well as how to interpret them. This lesson includes some linear algebra; we will do our best to emphasize the key points of the new concepts and go through them step by step, so you can follow along even if you have not seen this type of math before.
In the previous lesson we gave the slope and intercept of the best-fit line as slope = 1.8 and intercept = -7.1. Let's see how we can estimate these values using a popular data science library called scikit-learn.
In the following code block, we will create a `LinearRegression` object and fit it to our data. The `fit` method takes two arguments: the first is the independent variable, x, and the second is the dependent variable, y. The `coef_` attribute of the fitted object contains the estimated coefficients of the model. Since we have only one variable x on the right-hand side, we access its value by indexing the zero-th item. The intercept of the equation is given by the `intercept_` attribute.
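The lesson's interactive code editor isn't reproduced in this text, but a minimal sketch of the code might look like the following. Using the iris dataset (sepal length predicting petal length) is an assumption on our part, chosen because it matches the coefficients above:

```python
# A minimal sketch, assuming the lesson's data is the iris dataset
# (sepal length as x, petal length as y).
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

iris = load_iris(as_frame=True)
x = iris.data[["sepal length (cm)"]]  # independent variable (2-D, as fit() expects)
y = iris.data["petal length (cm)"]    # dependent variable

model = LinearRegression()
model.fit(x, y)

print("slope =", model.coef_[0])        # ~ 1.858
print("intercept =", model.intercept_)  # ~ -7.101
```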
Running the code in the above code editor we see we get the same values as before: slope = 1.8 and intercept = -7.1, just with additional decimal places.
In this section we will estimate the coefficients of the best-fit line by hand using linear algebra. We will use the same data as before, but this time we will use the `numpy` library to create a matrix of the independent variable, x, and a vector of the dependent variable, y. We will then solve for m and b in the equation:

$$y = mx + b$$
First, for notation, we will switch to a matrix-form representation. In this case, we will collect all of the right-hand-side variables into a matrix, X. We can do this by letting the first column of X be a column of ones, and the second column be the values of x. The first column will estimate the intercept, and the second will estimate the slope. These estimated coefficients we will also collect into a vector, which we will call $\hat{\beta}$.
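Written out for n data points, the notation just described looks like this (with $\hat{b}$ the estimated intercept and $\hat{m}$ the estimated slope):

$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad \hat{\beta} = \begin{bmatrix} \hat{b} \\ \hat{m} \end{bmatrix}$$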
The next part gets a bit tricky. We want to solve for $\hat{\beta}$ in the matrix form of the model,

$$y = X\hat{\beta} + e,$$

where e is the vector of residuals. We will rely on two statements about those residuals: the residuals average out to zero,

$$E[e] = 0,$$

and the residuals are uncorrelated with the variables in X,

$$\operatorname{Cov}(X, e) = 0.$$

The above two statements follow from our estimates being the Best Linear Unbiased Estimators (BLUE); more formally this is called the Gauss-Markov theorem that underlies the theory of linear regression. Therefore, since we are including an intercept in the equation and collected a vector of ones in the first column of X, the two statements combine into a single condition: the residuals are orthogonal to X,

$$X^T e = 0.$$

Using this fact we can solve for $\hat{\beta}$. Multiplying both sides of the model by $X^T$ gives

$$X^T y = X^T X \hat{\beta} + X^T e = X^T X \hat{\beta},$$

and multiplying both sides by $(X^T X)^{-1}$ leaves

$$\hat{\beta} = (X^T X)^{-1} X^T y.$$
That was a lot of steps, but we got what we want: a closed-form solution for $\hat{\beta}$. Now let's use the `numpy` library to solve for $\hat{\beta}$ in code. We use the `numpy.linalg.inv` function, which we import under the alias `inverse`, to take the inverse of a matrix. We also use the `T` attribute of a matrix to take its transpose. Finally, we use the `@` operator to multiply matrices together (we cannot use `*`, since that multiplies matrices element-wise instead of performing matrix multiplication).
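Again, the interactive editor itself isn't shown here; a sketch of the closed-form computation, under the same iris-data assumption as before, might be:

```python
# A minimal sketch of the closed-form solution.
# Assumption: the lesson's data is the iris dataset (sepal length -> petal length).
import numpy as np
from numpy.linalg import inv as inverse  # alias described in the lesson
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
x = iris.data["sepal length (cm)"].to_numpy()
y = iris.data["petal length (cm)"].to_numpy()

# Build X: a column of ones (for the intercept) next to the x values.
X = np.column_stack([np.ones(len(x)), x])

# beta_hat = (X^T X)^(-1) X^T y
beta_hat = inverse(X.T @ X) @ X.T @ y

print("intercept =", beta_hat[0])  # ~ -7.101
print("slope =", beta_hat[1])      # ~ 1.858
```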
Running the code in the above code editor we see we get the same values as before: slope = 1.8 and intercept = -7.1, just with additional decimal places.
To conclude this lesson, let's go over how to interpret the coefficients. Starting with the intercept, we can see that its value is -7.1. This means that when sepal length is equal to zero, the predicted value of petal length is -7.1. This is not a very meaningful value, since sepal length cannot be equal to zero. However, it is still necessary to include the intercept in the equation so that the line we estimate is the best-fit line.
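To see this concretely, plug sepal length x = 0 into the estimated line:

$$\hat{y} = 1.8 \cdot 0 + (-7.1) = -7.1.$$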
The slope of the line is 1.8. This means that for every unit increase in sepal length, the predicted value of petal length increases by 1.8. For example, moving from a sepal length of 5 to 6 raises the predicted petal length from $1.8 \cdot 5 - 7.1 = 1.9$ to $1.8 \cdot 6 - 7.1 = 3.7$. The meaning behind this, from a data scientist's perspective, is that since this coefficient is positive, we can infer that there is a positive correlation between sepal length and petal length. In other words, if we find a flower with a larger sepal length, we can predict that it will also have a larger petal length. This topic of prediction is the focus of the next lesson. So see you there!