TRAIN TEST SPLIT VS CROSS VALIDATION

A data science project with a machine learning use case runs through several pipeline stages: data gathering, feature engineering, feature selection, model creation, and model deployment.

TRAIN TEST SPLIT

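In a train test split, the dataset is divided into two parts, for example 70% train data and 30% test data, and the data points for each part are selected randomly.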

The random selection is controlled by the random_state parameter. You can choose random_state=10 or 100: a fixed value reproduces the same split on every run, but each different value gives a differently shuffled set of data points in the train and test data.

Because the selection is random, the accuracy keeps fluctuating. For example, with a 70% train and 30% test split the accuracy may come out as 85% for one random state, and if you change the random state, or change the ratio to 80% train and 20% test, the accuracy may change to 87%.

DRAWBACK

CODE:
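
A minimal sketch of the split described above, assuming scikit-learn is installed. The synthetic dataset from make_classification, the logistic regression model, and the variable names are illustrative assumptions rather than the article's original code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

accuracies = {}
for seed in (10, 100):
    # test_size=0.3 gives the 70% train / 30% test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    accuracies[seed] = accuracy_score(y_test, model.predict(X_test))
    print(f"random_state={seed}: train={len(X_train)}, test={len(X_test)}, "
          f"accuracy={accuracies[seed]:.3f}")
```

Running it with more seeds should show the accuracy moving from split to split, which is the fluctuation discussed in this section.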

CROSS VALIDATION

  1. Leave one out cross validation (LOOCV)
  2. K-Fold cross validation
  3. Stratified cross validation
  4. Time series cross validation

In this article we will look at leave one out cross validation and KFold cross validation.

LEAVE ONE OUT CROSS VALIDATION

For example, if you have 1000 data points (rows), the first iteration uses one data point as test data and the remaining 999 data points as train data. This is repeated for 1000 iterations, each time holding out a different data point as the test data.
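
The iteration scheme can be sketched with scikit-learn's LeaveOneOut splitter. This is a minimal illustration under assumptions: the dataset is synthetic and cut to 100 rows to keep the run short, and the logistic regression model is a placeholder.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# A small synthetic dataset keeps the run short (assumption for illustration);
# with 1000 rows the same code would fit 1000 models.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

loo = LeaveOneOut()
n_iterations = loo.get_n_splits(X)  # one iteration per data point

# Each score is 1.0 or 0.0: the single held-out point is predicted right or wrong.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

print(f"iterations: {n_iterations}")
print(f"LOOCV accuracy: {scores.mean():.3f}")
```

The mean of the per-point scores is the LOOCV accuracy estimate.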

Note: LeaveOneOut() is equivalent to KFold(n_splits=n) and LeavePOut(p=1), where n is the number of samples (i.e. 1000 in this example).
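
The equivalence in the note can be checked by comparing the test indices the three splitters produce. The sketch below assumes scikit-learn and uses only 10 placeholder samples so the splits are easy to inspect; the helper collect_test_indices is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

X = np.arange(10).reshape(-1, 1)  # 10 placeholder samples
n = len(X)

def collect_test_indices(splitter):
    # Gather the test indices of every split as a list of plain-int tuples.
    return [tuple(int(i) for i in test) for _, test in splitter.split(X)]

loo_tests = collect_test_indices(LeaveOneOut())
kfold_tests = collect_test_indices(KFold(n_splits=n))
lpo_tests = collect_test_indices(LeavePOut(p=1))

same = loo_tests == kfold_tests == lpo_tests
print("identical splits:", same)
```

Note that KFold is used here without shuffling; with shuffle=True the folds would still hold one point each, but in a different order.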

DRAWBACK

  1. The number of iterations is high, so the computation time is high.
  2. It can generalize poorly. Each model is trained on almost the entire dataset, so the accuracy estimate has low bias but high variance. If a new data point that is not part of this dataset is given in the production phase, the model may not predict it accurately, and the accuracy will be lower than estimated.

KFOLD CROSS VALIDATION

In KFold cross validation, setting n_splits=5 splits the data into 5 folds, which gives 5 iterations or experiments.

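If you have 1000 data points, each fold holds 1000/5 = 200 data points. The first iteration uses 200 data points as test data and the remaining 800 as train data, and the test data gives an accuracy, say 0.85. The same is repeated for the remaining 4 iterations, each with a different fold as test data, so we end up with 5 different accuracies. The data points are randomly shuffled into folds only when shuffle=True is set; by default KFold splits the rows in their existing order.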
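
The 800/200 arithmetic can be confirmed directly from the splitter. This is a minimal sketch assuming scikit-learn; the array of 1000 placeholder rows and the choice of random_state=10 are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1000).reshape(-1, 1)  # 1000 placeholder data points

# shuffle=True randomizes which rows land in each fold;
# random_state fixes that shuffle so it can be reproduced.
kf = KFold(n_splits=5, shuffle=True, random_state=10)

fold_sizes = []
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"iteration {i}: train={len(train_idx)}, test={len(test_idx)}")
```

Every data point appears in the test data exactly once across the 5 iterations.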

From these we can take the mean accuracy and report it to the stakeholders as the model accuracy. We can also report the minimum and maximum accuracy observed across the folds.

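The purpose of using random_state in KFold (it takes effect only with shuffle=True) is fair comparison. Suppose you use three models with different features: log_reg with x1, x2, x3; KNN with x1, x3, x4; and a decision tree with x2, x3, x5. By fixing the same random state, every model is evaluated on the same folds, so we can say that a change in the result is due to the change of variables and not due to a change of the data values in each fold.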
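
A sketch of that comparison, assuming scikit-learn. The synthetic five-column dataset stands in for x1 to x5, and the model settings and the experiments dictionary are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 5 columns standing in for x1..x5 (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Column indices are zero-based: x1 -> 0, x2 -> 1, and so on.
experiments = {
    "log_reg (x1, x2, x3)": (LogisticRegression(max_iter=1000), [0, 1, 2]),
    "KNN (x1, x3, x4)": (KNeighborsClassifier(), [0, 2, 3]),
    "decision tree (x2, x3, x5)": (DecisionTreeClassifier(random_state=0), [1, 2, 4]),
}

# One splitter with a fixed random_state, so every model sees identical folds.
cv = KFold(n_splits=5, shuffle=True, random_state=10)

results = {}
for name, (model, columns) in experiments.items():
    scores = cross_val_score(model, X[:, columns], y, cv=cv)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy = {results[name]:.3f}")
```

Because the folds are identical for all three runs, differences in the mean accuracy come from the model and feature choice rather than from which rows happened to land in the test data.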

DRAWBACK

CODE:
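
A minimal sketch of the KFold workflow from this section, assuming scikit-learn. The synthetic dataset, the logistic regression model, and random_state=10 are illustrative assumptions rather than the article's original code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 5 folds of 200 data points each; the fixed random_state makes the shuffle repeatable.
cv = KFold(n_splits=5, shuffle=True, random_state=10)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("fold accuracies:", scores.round(3))
print(f"mean={scores.mean():.3f}, min={scores.min():.3f}, max={scores.max():.3f}")
```

The mean is the figure to report as the model accuracy, with the min and max showing the range seen across the folds.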

Contact Links:

Linkedin: https://www.linkedin.com/in/22vignesh97/

Vignesh S

Aspiring data scientist, passionate about learning new technologies and sharing my thoughts with others.