Parametric vs. Nonparametric Model Selection for Regression and Classification Based on Statistical Tests

PARAMETRIC MODEL

In a parametric model, the number of parameters is fixed with respect to the sample size. The data must satisfy all the assumptions the model makes.

A learning model that summarizes the data with a fixed set of parameters (for example, one coefficient per independent feature) is called a parametric model; linear models are the typical example.

Benefits of Parametric Machine Learning Algorithms:

  • Explainability is good (easy to interpret for clients or stakeholders).
  • Well suited to simple data.
  • Parametric models learn from data very quickly.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0). For example, a linear regression with 10 independent variables has 11 parameters (10 slopes and one intercept).

NONPARAMETRIC MODEL

Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms. By not making assumptions, they are free to learn any functional form from the training data.

In a nonparametric model, the (effective) number of parameters can grow with the sample size.
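A rough way to see this growth is a k-nearest-neighbours regressor, which keeps the training data itself instead of a fixed set of coefficients. The class below is a tiny one-dimensional sketch written for this illustration (the name `KNNRegressor` and the data are assumptions, not a library API).

```python
class KNNRegressor:
    """Tiny 1-D k-nearest-neighbours regressor (illustrative, not optimized)."""

    def __init__(self, k: int = 3):
        self.k = k
        self.points = []  # stored (x, y) training pairs

    def fit(self, xs, ys):
        # "Training" only stores the data; nothing is compressed into fixed coefficients.
        self.points = list(zip(xs, ys))
        return self

    def predict(self, x):
        # Average the targets of the k training points closest to x.
        nearest = sorted(self.points, key=lambda p: abs(p[0] - x))[: self.k]
        return sum(y for _, y in nearest) / len(nearest)

    def stored_values(self) -> int:
        # What the model has to keep grows with the sample size.
        return 2 * len(self.points)

# Made-up data: y = x squared (illustrative only).
small = KNNRegressor(k=2).fit([1, 2, 3, 4], [1, 4, 9, 16])
large = KNNRegressor(k=2).fit(list(range(1, 101)), [v * v for v in range(1, 101)])

# The two nearest neighbours of 2.4 are x=2 and x=3, so the prediction is (4 + 9) / 2.
prediction = small.predict(2.4)
```

Here `small.stored_values()` and `large.stored_values()` differ only because the sample sizes differ, which is the sense in which the effective number of parameters grows.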

Benefits of Nonparametric Machine Learning Algorithms:

  • Well suited to complex data.
  • No assumptions (or weak assumptions) about the underlying function.
  • Can result in higher-performance models for prediction.

Hypothesis Test Generic Conditions:

Whether the relationship with the target is linear (or the classes are linearly separable) can be checked with a statistical test.

BASIC CONDITIONS FOR ALL STATISTICAL TESTS:

Hypothesis test:

H0 (Null hypothesis): Mean1 = Mean2 = Mean3 (no relation)

Ha (Alternate hypothesis): at least one mean differs from the others (relation)

TSTAT: The test statistic tells how far apart the group means are, measured relative to the variation within the groups (the standard error). It indicates whether the groups are clearly separated or overlapping. For a two-sided test at the 95% confidence level with a large sample, the t-stat is significant if its absolute value is greater than 1.96.
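As a sketch of what such a statistic looks like for two groups, the snippet below computes a Welch-style t-statistic by hand: the difference between the group means divided by its standard error. The BMI values are made up and the function name is hypothetical. With samples this small, the 1.96 cutoff is only an approximation; it is used here because it is the threshold quoted above.

```python
from math import sqrt
from statistics import mean, variance

def welch_t_statistic(group_a, group_b):
    """Difference between two group means divided by its standard error."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Made-up BMI values for two groups (illustrative only).
bmi_less = [21.0, 22.5, 23.0, 22.0, 21.5, 22.8]
bmi_severe = [30.5, 31.0, 29.5, 32.0, 30.0, 31.5]

t_stat = welch_t_statistic(bmi_severe, bmi_less)

# Large-sample 95% cutoff; the sign of t only reflects the order of the groups.
far_apart = abs(t_stat) > 1.96
```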

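PVALUE: The p-value is the probability of seeing a test statistic at least this extreme if the null hypothesis were true; heavily overlapping groups give a large p-value. If the p-value is less than or equal to 0.05, the null hypothesis is rejected and the alternate hypothesis holds (the variable has a significant relation with the target). If the p-value is greater than 0.05, the null hypothesis cannot be rejected (no evidence of a relation with the target).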
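The link between the 1.96 cutoff and the 0.05 threshold can be sketched with the standard library's `statistics.NormalDist`. This assumes a two-sided test and a large sample, so the test statistic is treated as approximately standard normal; the function names are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(test_statistic: float) -> float:
    """Two-sided p-value under a standard normal approximation (large samples)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(test_statistic)))

def reject_null(test_statistic: float, alpha: float = 0.05) -> bool:
    """True when the p-value is at or below the significance level."""
    return two_sided_p_value(test_statistic) <= alpha

p_at_cutoff = two_sided_p_value(1.96)    # very close to 0.05
p_small_effect = two_sided_p_value(0.5)  # well above 0.05
```

For small samples the t distribution has heavier tails than the normal, so a statistics package would report a somewhat larger p-value for the same statistic.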

REGRESSION:

Based on the data you are working with, you have to decide whether to go with a parametric model or a nonparametric model.

When the target variable is continuous (regression), check whether the following conditions are satisfied before building the model:

  1. The dependent variable must be numeric
  2. Independent variables do not show multicollinearity
  3. Linear relation between the dependent and independent variables
  4. Absence of autocorrelation
  5. Error terms must be homoscedastic
  6. Error terms must follow a normal distribution
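Two of these conditions can be sketched with plain Python. The snippet below checks condition 2 with a pairwise Pearson correlation between two independent variables, and condition 4 with the Durbin-Watson statistic on the residuals (values near 2 suggest no first-order autocorrelation). The data, variable names and helper functions are made up for illustration; the other conditions need their own checks.

```python
from math import sqrt
from statistics import mean

def pearson_r(u, v):
    """Pearson correlation between two equally long sequences."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))

def durbin_watson(residuals):
    """Durbin-Watson statistic: ranges from 0 to 4, near 2 means no autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    return num / sum(e ** 2 for e in residuals)

# Made-up values (illustrative only).
age = [25, 32, 41, 50, 58, 63]
years_employed = [2, 9, 18, 27, 35, 40]        # age minus 23, so perfectly collinear
residuals = [0.5, -0.4, 0.6, -0.5, 0.4, -0.6]  # alternating signs

collinearity = pearson_r(age, years_employed)  # close to 1: condition 2 is violated
dw = durbin_watson(residuals)                  # well above 2: negative autocorrelation
```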

Check the OLS summary. If any of the conditions is not satisfied (for example, suppose age is collinear with other independent variables and shows a p-value > 0.05), you can drop the variable so that the data satisfies the conditions for building a linear regression. If you are not comfortable dropping age, it is better to go for a nonparametric model such as a decision tree regressor or KNN regressor, because these can learn the relationship without those assumptions.

If you keep the age column despite its multicollinearity and build the linear regression model anyway, the coefficient estimates become unstable and the model can overfit.

Consider BMI as the target and diabetic case (less, medium, severe) as the independent variable to understand linear and non-linear dependency.

Linear dependency: the slope increases consistently and the p-value is < 0.05. As BMI increases, diabetic severity increases.
Non-linear dependency: even though the means are far apart, the slope is inconsistent, so linear regression finds the relationship difficult to capture, and the p-value is > 0.05.
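One simplified way to look at "consistent slope" is to compare the mean BMI of each diabetic case in order and see whether every step goes in the same direction. This is only a rough sketch with made-up values and hypothetical helper names, not a formal test.

```python
from statistics import mean

def mean_steps(groups):
    """Differences between successive group means, in the given order."""
    means = [mean(g) for g in groups]
    return [b - a for a, b in zip(means, means[1:])]

def slope_is_consistent(groups):
    """True when every step between successive group means has the same sign."""
    steps = mean_steps(groups)
    return all(s > 0 for s in steps) or all(s < 0 for s in steps)

# Made-up BMI values per diabetic case, ordered less, medium, severe (illustrative only).
linear_case = [[21, 22, 23], [25, 26, 27], [29, 30, 31]]
non_linear_case = [[21, 22, 23], [29, 30, 31], [24, 25, 26]]

linear_ok = slope_is_consistent(linear_case)          # means rise 22 -> 26 -> 30
non_linear_ok = slope_is_consistent(non_linear_case)  # means go 22 -> 30 -> 25
```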

Parametric models provide better explainability of the data; nonparametric models often provide better accuracy.

CLASSIFICATION

When the target variable is categorical (classification), check the statistical test before using parametric models such as Logistic Regression and Naive Bayes.

If the t-statistic is > 1.96 and the p-value is < 0.05, the means are far apart and the classes are linearly separable, so we can build a parametric model.

Consider diabetic case (less, medium, severe) as the target and BMI as the independent variable to understand linear and non-linear dependency.

Linear dependency: the relationship stays linear. As BMI increases, diabetes risk increases, and the slope is consistent.
Non-linear dependency: the relationship is not linear and the slope is inconsistent. Even though the groups are statistically significant, a linear model struggles to capture the relationship.

If the means are close together, the classes overlap and are not linearly separable. Build a nonparametric model for classification, because it is able to learn the overlapping regions.
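The rule of thumb for classification described above can be written down as a small function. This is only a sketch of the decision logic in this article; the function name and its inputs are hypothetical, and the 1.96 and 0.05 cutoffs assume a 95% confidence level.

```python
def choose_model_family(t_stat: float, p_value: float, slope_consistent: bool) -> str:
    """Pick a model family from the test result and the slope check."""
    # Means are considered far apart when the test is significant.
    means_apart = abs(t_stat) > 1.96 and p_value < 0.05
    if means_apart and slope_consistent:
        return "parametric"      # e.g. Logistic Regression, Naive Bayes
    return "nonparametric"       # e.g. decision tree or KNN classifiers

# Illustrative inputs only.
choice_separable = choose_model_family(t_stat=4.2, p_value=0.001, slope_consistent=True)
choice_inconsistent = choose_model_family(t_stat=4.2, p_value=0.001, slope_consistent=False)
choice_overlapping = choose_model_family(t_stat=0.8, p_value=0.42, slope_consistent=True)
```

In practice the stakeholders' preference for explainability or accuracy would also feed into this choice.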

Finally, it is in the hands of the stakeholders. If they want better explainability from the data, satisfy all the assumptions and build a linear model; if they focus on accuracy, build a non-linear model. For a non-linear model there is no assumption checking, so we can build the model directly.

Vignesh S

Aspiring data scientist, passionate about learning new technologies and sharing my thoughts with others.