# REGRESSION ANALYSIS

Regression analysis is a statistical technique for studying the relationship, most often a linear one, between a dependent variable and one or more independent variables.

## Purpose of Regression analysis

Regression analysis is done for one of two purposes:

- **To determine which independent variables have an impact on the dependent variable.**
- **To estimate the effect of an explanatory variable on the dependent variable.**

# LINEAR REGRESSION

*Linear regression is a machine learning algorithm used to predict a continuous output (the dependent variable) from input (independent) features using the linear function y = b1x + b0, where b1 is the slope, b0 is the intercept, x is the independent variable and y is the dependent variable.*

**Simple Linear Regression**: y = b0 + b1x (one independent variable)

**Multiple Linear Regression**: y = b0 + b1x1 + b2x2 + … + bnxn (more than one independent variable), where n is the number of independent features, x1, x2, …, xn are the features and b0, b1, …, bn are the parameters.

*Variance (var(x)):*

Variance is the average of the squared differences from the mean. It tells how widely the values of a variable x are spread around their mean.

*Covariance (Cov(x1,x2)) and correlation(R):*

Covariance measures how a change in one variable is associated with a change in another variable.

It describes the joint spread of the data (x1 and x2).

For example, suppose we want to know whether a person's weight is more strongly related to age or to height. The two covariances cannot be compared directly because they carry different units (kg·yrs and kg·cm). **Correlation** solves this: it divides the covariance by the standard deviations of both variables, which removes the units and gives the strength of the linear association between two numeric variables on a scale from -1 to 1. If the weight-age correlation is 0.7 and the weight-height correlation is 0.88, we can say that weight is more strongly associated with height than with age.
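The three quantities above can be computed by hand in a few lines. This is a minimal sketch: the weight and height values are made-up sample data, not the figures behind the 0.7 and 0.88 quoted above, and the function names are illustrative.

```python
# Hypothetical sample data (not from the article): weight in kg, height in cm.
weight = [55.0, 62.0, 70.0, 78.0, 85.0]
height = [155.0, 162.0, 170.0, 176.0, 182.0]


def mean(values):
    return sum(values) / len(values)


def variance(x):
    """Average of the squared differences from the mean (population variance)."""
    x_bar = mean(x)
    return sum((xi - x_bar) ** 2 for xi in x) / len(x)


def covariance(x, y):
    """Average product of the deviations of x and y from their means."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)


def correlation(x, y):
    """Covariance divided by both standard deviations, so the units cancel."""
    return covariance(x, y) / (variance(x) ** 0.5 * variance(y) ** 0.5)


print("var(weight):", variance(weight))
print("cov(weight, height):", covariance(weight, height))
print("corr(weight, height):", correlation(weight, height))
```

Changing the units of height (say, from cm to m) changes the covariance but leaves the correlation untouched, which is why correlation is the quantity to compare across variable pairs.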

## Two methods for linear regression:

STATSMODELS (**ORDINARY LEAST SQUARES**) - Reports a hypothesis test for each parameter, which gives a detailed picture of how the predictors are related to the target.

SCIKIT-LEARN (**SKLEARN**) - Focuses on prediction and reports the model accuracy.

# ORDINARY LEAST SQUARES

**Linear Regression line:**

y = b0 + b1x + e

y - the values taken by the dependent variable (target variable / response variable)

x - the values taken by the independent variable (predictor variable)

e - the random error component

## Random error component:

The error term, also called the **residual**, represents the distance of the observed value from the value predicted by the regression line.

Error term = Actual value - Predicted value

The linear regression line which explains the trend in the data is the *best fit line*. The **ordinary least squares method** is used to find the *best fit line* in the data. Ordinary least squares (**OLS**) is a *non-iterative* method that fits a model such that the sum of squares of the differences between observed and predicted values is minimized.

## OLS OBJECTIVE

This method aims at *minimizing the sum of squares of the error terms; that is, it determines the values of b0 and b1 at which the sum of squared errors is smallest.*
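For a single predictor, the b0 and b1 that minimize the sum of squared errors have a well-known closed form, so no iteration is needed. The sketch below applies it to a small made-up dataset; the data and variable names are assumptions for illustration only.

```python
# Hypothetical data (not from the article): x is the predictor, y the response.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS estimates for simple linear regression:
#   b1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual = actual value - predicted value
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
sse = sum(r ** 2 for r in residuals)

print("b0 =", b0, "b1 =", b1, "SSE =", sse)
```

Any other choice of b0 and b1 gives a larger sum of squared errors on the same data, which is what "best fit" means here.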

## INTERPRETATION OF BETA COEFFICIENTS (b0 and b1):

**b1 (slope)**: It gives the amount of change in the response variable per unit change in the predictor variable.

**b0 (intercept)**: It is the y-intercept, which means that when x = 0, y is b0.

## MEASURES OF VARIATION:

**SST**: Sum of the squared differences between each observation and the mean of the response variable. (TOTAL VARIATION)

**SSR**: Sum of the squared differences between each predicted value and the mean of the response variable. (EXPLAINED VARIATION)

**SSE**: Sum of the squared differences between each observed response value and its predicted value. (UNEXPLAINED VARIATION)

**Rsquared**: Coefficient of determination, the share of the total variation that the model explains: Rsquared = SSR / SST = 1 - SSE / SST.
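The relationship between these measures is easy to check numerically. The following sketch fits a simple OLS line to made-up data (an assumption for illustration) and confirms that the total variation splits into an explained and an unexplained part.

```python
# Hypothetical data (not from the article).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Fit the OLS line so that the decomposition SST = SSR + SSE holds.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                    # total variation
ssr = sum((pi - y_bar) ** 2 for pi in predicted)            # explained variation
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))   # unexplained variation
r_squared = ssr / sst

print("SST =", sst)
print("SSR + SSE =", ssr + sse)
print("Rsquared =", r_squared)
```

The identity SST = SSR + SSE relies on the line being the least squares fit with an intercept; for an arbitrary line the two sides generally differ.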

**ASSUMPTIONS OF LINEAR REGRESSION**

DEPENDENT VARIABLE: Must be numeric.

LINEARITY: There should be a linear relationship between the dependent and independent variables.

MULTICOLLINEARITY: There should be no high correlation between the independent variables.

AUTOCORRELATION: Observations should be independent of each other, so the error terms should not be autocorrelated.

HOMOSCEDASTICITY: Error terms should have constant variance (be homoscedastic).

NORMALITY: Error terms should follow a normal distribution.

After checking that the above assumptions hold, build an OLS model.

Let's understand this better with an example. I have taken a simple dataset, the Advertising data:

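## OLS MODEL CODE

```python
import statsmodels.api as sm

# x: the advertising predictor columns (e.g. TV, Newspaper); y: sales.
# Both are assumed to have been loaded from the Advertising data beforehand.
xc = sm.add_constant(x)     # add the intercept column (b0)
ols_model = sm.OLS(y, xc)   # set up the ordinary least squares model
ols = ols_model.fit()       # estimate the coefficients
print(ols.summary())        # coefficients, p-values, Rsquared, F-statistic
```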

*If a variable's p-value is less than or equal to 0.05, the variable is statistically significant for the target; otherwise it is not. (Newspaper has a p-value of 0.8, so it is not significant for sales and the column can be dropped.)*

*Explainability*


*On average, sales increase by 0.0458 for every one-unit increase in TV advertising, with the other variables held constant.*

*When no advertising is involved (x1, x2, x3 = 0), the expected sales are 2.939 (b0).*

**MATH BEHIND OLS**

## LOSS FUNCTION / COST FUNCTION/ERROR FUNCTION

*The cost function tells us how well the model performs in making predictions for a given set of parameters (b0 and b1).*

Loss is the error between the predicted value and the actual value.

STEP 1: Define the error function as the sum of the squared errors.

STEP 2: Differentiate the error function with respect to b0 and b1 and set both derivatives equal to zero.

STEP 3: Solving these two equations gives the best values of b0 and b1.

STEP 4: Use these values of b0 and b1 to draw the best fit line.
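For simple linear regression the four steps can be written out as follows. This is a sketch of the standard derivation; the symbols E (error function), n (number of observations) and the sample means x̄ and ȳ are notation assumed here rather than taken from the article.

```latex
\begin{align*}
\text{Step 1:}\quad & E(b_0, b_1) = \sum_{i=1}^{n} \bigl(y_i - b_0 - b_1 x_i\bigr)^2 \\
\text{Step 2:}\quad & \frac{\partial E}{\partial b_0} = -2\sum_{i=1}^{n} \bigl(y_i - b_0 - b_1 x_i\bigr) = 0, \qquad
                      \frac{\partial E}{\partial b_1} = -2\sum_{i=1}^{n} x_i \bigl(y_i - b_0 - b_1 x_i\bigr) = 0 \\
\text{Step 3:}\quad & b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
                      b_0 = \bar{y} - b_1 \bar{x} \\
\text{Step 4:}\quad & \hat{y} = b_0 + b_1 x
\end{align*}
```

The first equation in Step 2 says the residuals sum to zero, which is why the fitted line always passes through the point (x̄, ȳ).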

**MODEL EVALUATION METRICS**

## Rsquared and Adjusted Rsquared:

Rsquared is the coefficient of determination (R2). It gives the percentage of variation in the dependent variable that is explained by the independent variables.

In simple linear regression, taking the square root of Rsquared gives the absolute value of the correlation (R) between x and y.

Rsquared ranges from 0 to 1. A value near 1 indicates a good fit.

Adjusted Rsquared gives the percentage of variation explained by the independent variables after penalizing the model for the number of variables it uses.

When a new variable is added, Rsquared does not check whether that variable is significant for the target: its value never decreases, so an increase may be nothing more than a statistical fluke. For this reason Rsquared alone is not a reliable metric for judging a model, and adjusted Rsquared should be checked as well. If the new variable adds too little explanatory power, adjusted Rsquared decreases, which is a sign that the variable has little relation with the target.
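The penalty is visible in the usual formula, adjusted Rsquared = 1 - (1 - Rsquared)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. The sketch below uses made-up Rsquared values to show a case where adding a variable raises Rsquared slightly but lowers adjusted Rsquared; the function name and numbers are illustrative assumptions.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical numbers: a fourth predictor nudges Rsquared from 0.900 to 0.901.
before = adjusted_r_squared(0.900, n_samples=30, n_predictors=3)
after = adjusted_r_squared(0.901, n_samples=30, n_predictors=4)

print("adjusted Rsquared with 3 predictors:", before)
print("adjusted Rsquared with 4 predictors:", after)
```

Here the tiny gain in Rsquared does not make up for the extra parameter, so the adjusted value goes down.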

**F-STATISTIC**

It is used to check the significance of the regression model as a whole. It is similar to the ANOVA test.
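In terms of the measures of variation, the F-statistic compares the explained variation per predictor with the unexplained variation per residual degree of freedom. A minimal sketch, assuming a model with an intercept, k predictors and n observations; the function name and input numbers are hypothetical:

```python
def f_statistic(ssr, sse, n_samples, n_predictors):
    """F = (SSR / k) / (SSE / (n - k - 1))."""
    mean_square_regression = ssr / n_predictors
    mean_square_error = sse / (n_samples - n_predictors - 1)
    return mean_square_regression / mean_square_error


# Hypothetical sums of squares for a one-predictor model on five observations.
f_value = f_statistic(ssr=38.416, sse=0.044, n_samples=5, n_predictors=1)
print("F =", f_value)
```

A large F value means the model explains far more variation than would be expected by chance. The statsmodels summary reports this value together with its p-value, labelled "Prob (F-statistic)".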

**OPTIMIZATION-GRADIENT DESCENT**

## LOSS FUNCTION / COST FUNCTION/ERROR FUNCTION

*The cost function tells us how well the model performs in making predictions for a given set of parameters (m and c).*

Loss is the error between the predicted value and the actual value. Our objective is to minimize this error by optimizing the values of m and c. We will use the **mean squared error** to measure the loss.

## MEAN SQUARED ERROR FUNCTION

Three steps to find the mean squared error:

1. Find the difference between the actual value and the predicted value.
2. Square the difference so that negative and positive errors do not cancel out.
3. Find the mean of the squares over every given value of x.

MSE = (1/n) Σ (yi - (m·xi + c))², where n is the number of observations.
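The three steps translate directly into code. A minimal sketch, assuming the line y = m·x + c used in this section; the function name and the sample data are illustrative.

```python
def mean_squared_error(x, y, m, c):
    """Mean squared error of the line y = m * x + c on the data (x, y)."""
    total = 0.0
    for xi, yi in zip(x, y):
        predicted = m * xi + c
        error = yi - predicted   # step 1: actual value minus predicted value
        total += error ** 2      # step 2: square the difference
    return total / len(x)        # step 3: mean of the squares


# Hypothetical data lying exactly on the line y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

print(mean_squared_error(x, y, m=2.0, c=1.0))  # perfect fit
print(mean_squared_error(x, y, m=0.0, c=0.0))  # starting point used in gradient descent
```

The loss is zero only when the line passes through every point; the further m and c are from the best values, the larger the loss.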

## GRADIENT DESCENT ALGORITHM

Gradient descent is used to obtain the model parameters (slope m and intercept c). Gradient means slope, and descent means moving downwards.

*Gradient descent is an iterative optimization algorithm which finds the parameters (m and c) at which the loss function is at its minimum.*

**The step size is the learning rate.**

**Gradient descent has a hyperparameter called the learning rate. If the learning rate is too high, the updates overshoot the minimum and keep oscillating without reaching it. If it is too small, it takes a lot of iterations to reach the minimum. Thus it is important to choose an appropriate learning rate.**
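The effect of the learning rate is easiest to see on a toy problem. The sketch below is not the regression loss: it minimizes the one-parameter function f(w) = w², chosen here purely for illustration, whose gradient is 2w and whose minimum is at w = 0.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run plain gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w


too_high = descend(learning_rate=1.1)     # each step overshoots; |w| keeps growing
too_low = descend(learning_rate=0.0001)   # barely moves away from the start
reasonable = descend(learning_rate=0.1)   # steadily approaches the minimum at 0

print("too high:  ", too_high)
print("too low:   ", too_low)
print("reasonable:", reasonable)
```

With the high rate the parameter flips sign on every step and moves further away, with the low rate it has hardly moved after 20 steps, and with the moderate rate it is already close to the minimum.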

**STEP 1:** Choose starting values for m and c. Here the slope m = 0 and the intercept c = 0, with a small learning rate L = 0.0001 so that each update stays small and stable.

**STEP 2:** Compute the partial derivative of the loss function with respect to m: Dm = (-2/n) Σ xi (yi - (m·xi + c)).

Compute the partial derivative of the loss function with respect to c: Dc = (-2/n) Σ (yi - (m·xi + c)).

**STEP 3:** Update m and c using the following equations: m = m - L·Dm and c = c - L·Dc.

With each update of m and c, the loss decreases and the line moves closer to the best fit.

**STEP 4:** Repeat step 2 and step 3 until the loss function stops decreasing (ideally it is close to zero).
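Putting the four steps together gives a short loop. This is a minimal sketch that assumes the mean squared error as the loss and a fixed number of iterations in place of a convergence check; the function name, the `epochs` parameter and the data are illustrative assumptions.

```python
def gradient_descent(x, y, learning_rate=0.0001, epochs=1000):
    """Fit y = m * x + c by gradient descent on the mean squared error."""
    n = len(x)
    m, c = 0.0, 0.0  # step 1: start from m = 0 and c = 0
    for _ in range(epochs):  # step 4: repeat steps 2 and 3
        predicted = [m * xi + c for xi in x]
        # Step 2: partial derivatives of the MSE with respect to m and c.
        d_m = (-2 / n) * sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, predicted))
        d_c = (-2 / n) * sum(yi - pi for yi, pi in zip(y, predicted))
        # Step 3: move both parameters against the gradient.
        m -= learning_rate * d_m
        c -= learning_rate * d_c
    return m, c


# Hypothetical data lying exactly on the line y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

m, c = gradient_descent(x, y, learning_rate=0.01, epochs=10000)
print("m =", m, "c =", c)
```

The call uses a learning rate of 0.01 rather than 0.0001 because, on this small dataset, the smaller value would need far more iterations to get close to the true slope and intercept. In practice the loop is often stopped once the loss changes by less than a chosen tolerance.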

I hope you liked this article on linear regression using the OLS model, which you should know as a Data Scientist. Feel free to ask your questions in the comments section below.