Regression analysis is a statistical technique for studying relationships between variables. Linear regression studies the linear relationships.

Regression analysis is done for one of two purposes:

  1. To predict the dependent variable from the independent variables that have an impact on it.
  2. To estimate the effect of an explanatory variable on the dependent variable.


Linear regression is a machine learning algorithm that predicts a continuous output (the dependent variable) from the input (independent) features using the linear function y=b1x+b0, where b1 is the slope, b0 is the intercept, x is the independent variable and y is the dependent variable.

Simple Linear Regression: y=b0+b1*x (one independent variable)

Multiple Linear Regression: y=b0+b1*x1+b2*x2+...+bn*xn (more than one independent variable), where n is the number of independent features, x1, x2, ..., xn are the features and b0, b1, ..., bn are the parameters (coefficients) to be estimated.
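
As a minimal sketch, both equations above can be written as one small Python function. The function name predict and the example numbers are made up for illustration; they do not come from any dataset.

```python
def predict(b0, coefficients, features):
    """Return y = b0 + b1*x1 + ... + bn*xn for one observation."""
    if len(coefficients) != len(features):
        raise ValueError("need one coefficient per feature")
    return b0 + sum(b * x for b, x in zip(coefficients, features))


# Simple linear regression: one feature (illustrative numbers only).
simple_y = predict(1.0, [2.0], [3.0])

# Multiple linear regression: two features (illustrative numbers only).
multiple_y = predict(1.0, [2.0, 0.5], [3.0, 4.0])
```

With one feature the function is the simple regression line; with several it is the multiple regression equation.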

Variance (var(x)):

Variance is the average of the squared differences from the mean. It tells how spread out the values of a single variable (x) are around its mean.


Covariance (Cov(x1,x2)) and correlation(R):

Covariance measures how a change in one variable is associated with a change in another variable.

It describes the joint spread of the data (x1 and x2).

For example: suppose we want to know whether a person's weight is more strongly related to age or to height. We cannot compare the two covariances directly, because they are in different units (kg*years and kg*cm). This is where correlation comes in: it divides the covariance by the standard deviations of both variables, which removes the units and gives the strength of the linear association between two numeric variables on a scale from -1 to 1. If the weight-age correlation is 0.7 and the weight-height correlation is 0.88, we can say weight is more strongly associated with height than with age.
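
Here is a small sketch of variance, covariance and correlation in plain Python. It uses the "average of squared differences" (population) form described above; the sample form divides by n-1 instead of n. The height and weight numbers are made up for illustration.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def variance(x):
    """Average of the squared differences from the mean (population form)."""
    mx = mean(x)
    return sum((xi - mx) ** 2 for xi in x) / len(x)


def covariance(x1, x2):
    """Average product of the deviations of x1 and x2 from their means."""
    m1, m2 = mean(x1), mean(x2)
    return sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / len(x1)


def correlation(x1, x2):
    """Covariance rescaled by both standard deviations, so it has no units."""
    return covariance(x1, x2) / sqrt(variance(x1) * variance(x2))


# Made-up heights (cm) and weights (kg), for illustration only.
height = [150, 160, 170, 180, 190]
weight = [52, 58, 69, 77, 84]

r_height_weight = correlation(height, weight)

# Changing the unit (cm -> m) changes the covariance but not the correlation.
height_m = [h / 100 for h in height]
r_after_rescale = correlation(height_m, weight)
```

The last two lines show why correlation is the better tool for comparison: rescaling a variable changes its covariance with weight, but the correlation stays the same.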


STATSMODELS (ORDINARY LEAST SQUARES) - Reports hypothesis tests in its output, which gives a lot of understanding of how each parameter is related to the target.

SCIKIT-LEARN MODEL (SKLEARN) - Used mainly for prediction and for measuring model accuracy.


Simple Linear Regression — one independent and one dependent variable.

Linear Regression line:

y=b0+b1*x+ e

y- set of values taken by dependent variable/Target variable/Response variable

x- set of values taken by independent variable/predictor variable

e -Random error component

The error term, also called the residual, represents the distance of the observed value from the value predicted by the regression line.

Error term = Actual value - Predicted value

The linear regression line that best explains the trend in the data is the best fit line. The ordinary least squares (OLS) method is used to find it. OLS is a non-iterative method that fits a model such that the sum of squares of the differences between observed and predicted values is minimized.

In other words, it determines the values of b0 and b1 at which the sum of the squared error terms is minimum.


b1 (slope): It gives the amount of change in the response variable per unit change in the predictor variable.

b0 (intercept): It is the y-intercept, which means that when x=0, y is b0.

slope (b1) = Cov(x, y) / Var(x)
Intercept (b0) = mean(y) - b1 * mean(x)
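
A minimal sketch of how OLS computes the slope and intercept for simple linear regression, using the standard closed-form result b1 = Cov(x, y) / Var(x) and b0 = mean(y) - b1 * mean(x). The function name fit_simple_ols and the data points are made up for illustration.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) minimising the sum of squared errors for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up points that lie exactly on y = 1 + 2x, for illustration only.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
b0, b1 = fit_simple_ols(x, y)
```

Because the points lie exactly on a line, the fit recovers an intercept of 1 and a slope of 2. No iteration is needed, which is what "non-iterative" means for OLS.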

SST: Sum of the squared differences between each observation and the mean of the response. (TOTAL VARIATION)

SSR: Sum of the squared differences between each predicted value and the mean of the response. (EXPLAINED VARIATION)

SSE: Sum of the squared differences between each observed response and its predicted value. (UNEXPLAINED VARIATION)

Rsquared: Coefficient of determination, Rsquared = SSR/SST = 1 - SSE/SST.
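
The three sums of squares and Rsquared can be computed directly, as in this rough sketch. The data is made up, and the predictions come from the OLS line y = 2.2 + 0.6x fitted to that data. Note that SST = SSR + SSE holds for an OLS fit with an intercept, not for arbitrary predictions.

```python
def sums_of_squares(y, y_pred):
    """Return (SST, SSR, SSE) for observed y and predicted y_pred."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)                # total variation
    ssr = sum((pi - mean_y) ** 2 for pi in y_pred)           # explained
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))   # unexplained
    return sst, ssr, sse


def r_squared(y, y_pred):
    sst, _, sse = sums_of_squares(y, y_pred)
    return 1 - sse / sst


# Made-up data; y_pred comes from the OLS line y = 2.2 + 0.6x fitted to it.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_pred = [2.2 + 0.6 * xi for xi in x]

sst, ssr, sse = sums_of_squares(y, y_pred)
r2 = r_squared(y, y_pred)
```

Here SST is 6, SSR is 3.6 and SSE is 2.4 (up to floating point rounding), so the line explains 60% of the variation in y.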


DEPENDENT VARIABLE: The dependent variable should be numeric.

LINEARITY: There should be a linear relationship between the dependent and independent variables.

NO MULTICOLLINEARITY: The independent variables should not be highly correlated with each other.

NO AUTOCORRELATION: The observations (and so the error terms) should be independent of each other.

HOMOSCEDASTICITY: The error terms should have constant variance.

NORMALITY: The error terms should follow a normal distribution.

After checking that all the above conditions are satisfied, build an OLS model.
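
As one example of checking these conditions, independence of the error terms is often tested with the Durbin-Watson statistic. It lies between 0 and 4, and a value near 2 suggests little first-order autocorrelation. The sketch below computes it by hand on made-up residuals. This check is an illustration added here, not a step taken from the article's dataset.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Made-up residual sequences, for illustration only.
alternating = [1, -1, 1, -1]   # sign flips every step: negative autocorrelation
repeating = [1, 1, 1, 1]       # same value every step: positive autocorrelation

dw_alternating = durbin_watson(alternating)
dw_repeating = durbin_watson(repeating)
```

Values well above 2 point to negative autocorrelation and values well below 2 to positive autocorrelation.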

Let's understand this better by looking at an example. I have taken a simple dataset, the Advertising data:

import statsmodels.api as sm

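If the p-value of a variable is less than or equal to 0.05, the variable is significant for the target; otherwise it is not. (Newspaper has a p-value of 0.8, so it is not significant for sales and the column can be dropped.)

The TV coefficient (0.0458) is the average increase in sales per unit increase in TV advertising, with the other variables held fixed.

The intercept b0 (2.939) is the expected sales when no advertising is involved (x1, x2, x3 = 0).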


The cost function tells us how well the model performs in making predictions for a given set of parameters (b0 and b1).

Loss is the error between the predicted value and the actual value.


STEP 1: Write the cost function as the sum of squared errors, SSE = sum((y - (b0 + b1*x))^2).

STEP 2: Differentiate SSE with respect to b0 and b1 and equate both derivatives to zero.

STEP 3: Solving these two equations gives the best values of b0 and b1.

STEP 4: Use these values of b0 and b1 to draw the best fit line using OLS.
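
The steps above can be written out for the simple model y = b0 + b1*x. The symbol S for the sum of squared errors, the index i over the n data points, and the bars for the means of x and y are notation introduced here for the derivation.

```latex
\begin{aligned}
S(b_0, b_1) &= \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2 \\
\frac{\partial S}{\partial b_0} &= -2 \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right) = 0 \\
\frac{\partial S}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i \left(y_i - b_0 - b_1 x_i\right) = 0 \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad b_0 = \bar{y} - b_1 \bar{x}
\end{aligned}
```

The first zero condition says the residuals sum to zero, which gives b0 once b1 is known. Substituting that into the second condition gives b1.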


Rsquared is the coefficient of determination (R2). It gives the percentage of variation in the dependent variable that is explained by the independent variables.

In simple linear regression, taking the square root of Rsquared gives the magnitude of the correlation (R) between x and y.

Rsquared ranges between 0 and 1. A value near 1 indicates a good model.

Adjusted Rsquared gives the percentage of variation explained by only those independent variables that actually affect the dependent variable.

When a new variable is added, Rsquared does not check whether the variable is significant for the target. Its value increases (or at least never decreases) anyway, which is why the increase is called a statistical fluke. Rsquared alone is therefore not the proper metric for judging a model; rely on adjusted Rsquared as well. If a newly added variable adds too little explanatory power, adjusted Rsquared decreases, which signals that the variable has little relation with the target.
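
A small sketch of the usual adjusted Rsquared formula, 1 - (1 - Rsquared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of independent variables. The numbers are made up to show the penalty for an extra variable that barely helps.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Illustrative numbers only: adding a 4th variable nudges Rsquared up
# slightly, yet adjusted Rsquared goes down.
adj_3_vars = adjusted_r_squared(0.9000, n=30, p=3)
adj_4_vars = adjusted_r_squared(0.9005, n=30, p=4)
```

Rsquared rose from 0.9000 to 0.9005, but the adjusted value fell, which is the signal that the fourth variable is not pulling its weight.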

The F-statistic is used to check the overall significance of the regression model. It is similar to an ANOVA test.
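
The significance check described here is usually reported as the F-statistic in an OLS summary. As a rough sketch, it can be computed from the explained and unexplained sums of squares as F = (SSR / p) / (SSE / (n - p - 1)). The function name and the numbers below are made up for illustration, and the p-value (which needs the F distribution) is not computed here.

```python
def f_statistic(ssr, sse, n, p):
    """F = explained variation per variable divided by unexplained
    variation per residual degree of freedom (n observations, p variables)."""
    mean_square_regression = ssr / p
    mean_square_error = sse / (n - p - 1)
    return mean_square_regression / mean_square_error


# Illustrative numbers only.
f_value = f_statistic(ssr=80.0, sse=20.0, n=23, p=2)
```

A large F value means the variables together explain much more variation than is left unexplained per degree of freedom.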


For gradient descent the parameters are written as m (slope) and c (intercept). The cost function tells us how well the model performs in making predictions for a given set of parameters (m and c).

Loss is the error between the predicted value and the actual value. Our objective is to minimize this error by optimizing the values of m and c. We will use the mean squared error as the loss.

Three steps to compute the mean squared error:

  1. Find the difference between the actual value and the predicted value.
  2. Square the difference so that negative values do not cancel positive ones.
  3. Find the mean of the squared differences over every given value in x.
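
The three steps map directly onto a few lines of Python. This is a minimal sketch with made-up numbers.

```python
def mean_squared_error(y_true, y_pred):
    # 1. Difference between the actual and the predicted value.
    differences = [a - p for a, p in zip(y_true, y_pred)]
    # 2. Square each difference so negatives and positives do not cancel.
    squared = [d ** 2 for d in differences]
    # 3. Mean of the squared differences.
    return sum(squared) / len(squared)


# Illustrative numbers only: one prediction is off by 2, the others are exact.
mse = mean_squared_error([1, 2, 3], [1, 2, 5])
```

The single error of 2 is squared to 4 and averaged over three points, giving about 1.33.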


Gradient descent is used to obtain the model parameters (slope m and intercept c). Gradient means slope; descent means moving downwards.

Gradient descent is an iterative optimization algorithm that finds the parameters (m and c) at which the loss function is minimum.

The step size is controlled by a hyperparameter called the learning rate. If the learning rate is too high, the updates overshoot the minimum and keep oscillating (or even diverge), so the goal is never reached. If it is too small, it takes a lot of iterations to reach the goal. Thus it is important to choose an appropriate learning rate.

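STEP 1: Choose initial values for m and c. Here slope m = 0 and intercept c = 0, with a small learning rate L = 0.0001 so that each update is small and stable.

STEP 2: Compute the partial derivative of the loss function (mean squared error over n points) with respect to m:

D_m = (-2/n) * sum(x * (y - (m*x + c)))

and the partial derivative of the loss function with respect to c:

D_c = (-2/n) * sum(y - (m*x + c))

STEP 3: Update m and c using the following equations:

m = m - L * D_m
c = c - L * D_c

With each update of m and c the loss decreases and we move closer to the goal.

STEP 4: Repeat step 2 and step 3 until the loss function stops decreasing (ideally it reaches zero, although with noisy data the minimum is usually above zero).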
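
The four steps can be sketched in plain Python as below. The data points are made up and lie exactly on y = 2x + 1, so the loss really can reach zero. The learning rate of 0.01 (instead of 0.0001) is an assumption chosen so that this toy example converges in a modest number of iterations.

```python
def mse(x, y, m, c):
    """Mean squared error of the line y = m*x + c on the data."""
    return sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y)) / len(x)


def gradient_descent(x, y, learning_rate, iterations):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0  # STEP 1: initial values
    n = len(x)
    for _ in range(iterations):  # STEP 4: repeat steps 2 and 3
        # STEP 2: partial derivatives of the MSE with respect to m and c.
        errors = [yi - (m * xi + c) for xi, yi in zip(x, y)]
        d_m = (-2 / n) * sum(xi * e for xi, e in zip(x, errors))
        d_c = (-2 / n) * sum(errors)
        # STEP 3: move against the gradient, scaled by the learning rate.
        m -= learning_rate * d_m
        c -= learning_rate * d_c
    return m, c


# Made-up points that lie exactly on y = 2x + 1, for illustration only.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

m, c = gradient_descent(x, y, learning_rate=0.01, iterations=20000)

# A learning rate that is too large for this data overshoots and diverges.
m_bad, c_bad = gradient_descent(x, y, learning_rate=0.1, iterations=50)
loss_start = mse(x, y, 0.0, 0.0)
loss_bad = mse(x, y, m_bad, c_bad)
```

With the smaller learning rate, m and c settle very close to 2 and 1. With the larger one, the loss after 50 updates is worse than at the starting point, which is the overshooting behaviour described above.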

I hope you liked this article on linear regression using the OLS model, which you should know as a data scientist. Feel free to ask your questions in the comments section below.



Vignesh S

Aspiring data scientist, passionate about learning new technologies and sharing my thoughts with others.