TRAIN TEST SPLIT VS CROSS VALIDATION

Vignesh S
Aug 14, 2021

A data science project with a machine learning use case goes through several pipeline stages: data gathering, feature engineering, feature selection, model creation, and model deployment.

TRAIN TEST SPLIT

Before model creation we do a train test split on the entire dataset. Suppose you have 1000 data points. You might split them into 70% train and 30% test, or 80% train and 20% test, depending on how much data you have.

The data points that go into the 70% train set and the 30% test set are selected randomly.

The random selection is controlled by the random_state parameter. You can set random_state=10 or random_state=100, but each different value shuffles the data differently, so different data points end up in the train and test sets.

Because the selection is random, the accuracy keeps fluctuating. For example, with 70% train and 30% test one random state may give 85% accuracy, and after changing the random state (or changing the split to 80% train and 20% test) the accuracy may come out as 87%.

DRAWBACK

So we cannot give the stakeholders one exact accuracy figure. To address this we use cross validation.

CODE:
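A minimal sketch of the fluctuation described above. It assumes scikit-learn is installed, uses a synthetic dataset from make_classification in place of real data, and picks LogisticRegression only as an example model; none of these choices come from the original project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1000 data points (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

accuracies = {}
for seed in (10, 100):
    # 70% train / 30% test; a different random_state gives a different shuffle.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies[seed] = accuracy_score(y_test, model.predict(X_test))

for seed, acc in accuracies.items():
    print(f"random_state={seed}: accuracy={acc:.3f}")
```

The two printed accuracies will usually differ a little, even though the model and the data are the same. Only the rows that landed in the test set changed.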

CROSS VALIDATION

There are various types of cross validation:

  1. Leave one out cross validation (LOOCV)
  2. K-Fold cross validation
  3. Stratified cross validation
  4. Time series cross validation

In this post we will look at leave one out cross validation and KFold cross validation.

LEAVE ONE OUT CROSS VALIDATION

As the name suggests, in leave one out cross validation we leave out just one data point for testing and use all the remaining data points for training.

For example, with 1000 data points (rows), the first iteration uses one data point as test data and the remaining 999 as train data. This is repeated for 1000 iterations, each time with a different data point as the test data.

Note: LeaveOneOut() is equivalent to KFold(n_splits=n) and LeavePOut(p=1) where n is the number of samples (i.e. 1000)
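The note above can be checked with a small sketch. It uses a toy array of 10 rows instead of 1000 so the output stays readable; the array itself is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Ten rows keep the output small; the example in the text uses 1000 rows.
X = np.arange(20).reshape(10, 2)

loo_splits = list(LeaveOneOut().split(X))
kfold_splits = list(KFold(n_splits=len(X)).split(X))

print(len(loo_splits))  # one iteration per data point
for train_idx, test_idx in loo_splits[:3]:
    print("train:", train_idx, "test:", test_idx)

# Check that both splitters produce the same train/test indices.
same = all(
    np.array_equal(a_tr, b_tr) and np.array_equal(a_te, b_te)
    for (a_tr, a_te), (b_tr, b_te) in zip(loo_splits, kfold_splits)
)
print("LeaveOneOut matches KFold(n_splits=n):", same)
```

Each iteration holds out exactly one row, so the number of models to train equals the number of rows.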

DRAWBACK

  1. The number of iterations is large, so the computation time is high.
  2. The accuracy estimate tends to be unreliable. Each model is tested on a single data point and the training sets are almost identical, so the estimate tends to have high variance, and it may not reflect how accurately the model predicts a new data point in production that was not part of this dataset.

KFOLD CROSS VALIDATION

To avoid depending on a single train test split, you can use KFold cross validation.

In KFold cross validation, if you set n_splits=5 the data is split into 5 folds, which gives 5 iterations (experiments).

If you have 1000 data points, each fold has 1000/5 = 200 data points. In the first iteration 200 data points are used as test data and the remaining 800 as train data, and the test data gives an accuracy, say 0.85. The same is repeated for the remaining 4 iterations, each with a different fold as test data, so we end up with 5 accuracy values, one per fold. Note that the data points are shuffled before splitting only if you set shuffle=True.
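The fold arithmetic can be confirmed with a short sketch. The feature matrix here is a placeholder of zeros, and shuffle=True with random_state=10 is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((1000, 3))  # placeholder feature matrix with 1000 rows

# shuffle=True is an assumption here; by default KFold takes consecutive blocks.
kf = KFold(n_splits=5, shuffle=True, random_state=10)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]
print(fold_sizes)  # (train size, test size) for each of the 5 iterations
```

Every data point is used as test data exactly once across the 5 iterations.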

From these we can take the mean accuracy and report it to the stakeholders as the model accuracy. We can also report the minimum and maximum accuracy observed across the folds.

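The purpose of fixing random_state in KFold (it only has an effect when shuffle=True) is fair comparison. When you compare three models with different features (log_reg: x1, x2, x3; KNN: x1, x3, x4; Decision tree: x2, x3, x5), fixing the same random state gives every model the same folds, so we can say that a change in the result is due to the change of model and variables and not due to a different split of the data.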
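A sketch of that comparison, under stated assumptions: the data is synthetic, the five columns stand in for x1 to x5, and the model settings are defaults picked for illustration rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with five features standing in for x1..x5 (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Column indices (0-based) mirror the feature subsets named in the text.
candidates = {
    "log_reg": (LogisticRegression(max_iter=1000), [0, 1, 2]),              # x1, x2, x3
    "knn": (KNeighborsClassifier(), [0, 2, 3]),                             # x1, x3, x4
    "decision_tree": (DecisionTreeClassifier(random_state=0), [1, 2, 4]),   # x2, x3, x5
}

# The same shuffle and random_state give every model identical folds.
cv = KFold(n_splits=5, shuffle=True, random_state=10)

mean_scores = {}
for name, (model, cols) in candidates.items():
    scores = cross_val_score(model, X[:, cols], y, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

Because cv uses a fixed integer random_state, every call to cross_val_score sees the same train and test rows, so the scores are comparable across models.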

DRAWBACK

For classification, consider a binary problem with labels 0 and 1. If a test fold contains only 1s, or the train folds contain mostly 1s, the class distribution in the folds is imbalanced and the accuracy becomes misleading. To overcome this we use stratified cross validation.
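The imbalance problem can be seen in a small sketch. The labels are made up (900 ones followed by 100 zeros) and StratifiedKFold is scikit-learn's stratified splitter, used here only to compare the class share per test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced binary labels: 900 ones followed by 100 zeros (made-up data).
y = np.array([1] * 900 + [0] * 100)
X = np.zeros((len(y), 1))  # features are irrelevant for showing the split


def share_of_ones(splitter):
    """Return the fraction of label 1 in each test fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]


kfold_shares = share_of_ones(KFold(n_splits=5))
stratified_shares = share_of_ones(StratifiedKFold(n_splits=5))
print("KFold:          ", kfold_shares)
print("StratifiedKFold:", stratified_shares)
```

With plain KFold and no shuffling, the first four test folds contain only 1s, while StratifiedKFold keeps the 90% share of 1s in every fold.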

CODE:
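A minimal sketch of KFold cross validation with the mean, minimum and maximum accuracy. It assumes scikit-learn, a synthetic dataset, and LogisticRegression as a stand-in model; shuffle=True and random_state=10 are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the 1000 data points (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 5 folds: each iteration trains on 800 rows and tests on the other 200.
cv = KFold(n_splits=5, shuffle=True, random_state=10)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

print("fold accuracies:", [f"{s:.3f}" for s in scores])
print(f"mean accuracy: {scores.mean():.3f}")
print(f"min accuracy:  {scores.min():.3f}")
print(f"max accuracy:  {scores.max():.3f}")
```

The mean is the figure to report, and the min and max give a range around it.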

Contact Links:

Linkedin: https://www.linkedin.com/in/22vignesh97/

Vignesh S

Aspiring data scientist, passionate about learning new technologies and sharing my thoughts with others.