STATISTICS FOR DATA SCIENCE

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. It supports decision making by helping us understand the data.

Types of Statistics:

  1. Descriptive statistics - Understand the sample
  2. Inferential statistics - Understand the population

Descriptive statistics makes no predictions. It summarizes the data in the current sample using measures such as the mean, median, mode, IQR, and range.

Descriptive statistics answers the client's question "What happened?" by means of exploratory data analysis (EDA).

Inferential statistics comes after EDA: with the help of a model, it answers the client's question "What will happen?"

Descriptive Statistics:

Mean: The sum of all observations divided by the total number of observations.

Median: The middle value after sorting the data. For an even number of observations, it is the average of the two middle values.

Mode: The value with the highest frequency, i.e. the most repeated observation.
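As a quick illustration of these three measures, here is a minimal sketch using Python's built-in statistics module. The numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical observations, chosen only for illustration.
data = [2, 3, 3, 5, 7, 10]

data_mean = mean(data)        # sum of observations / number of observations
data_median = median(data)    # middle of the sorted data (average of the two middle values here)
data_modes = multimode(data)  # list of the most frequent value(s)

print(data_mean, data_median, data_modes)

# Adding one extreme value moves the mean much more than the median.
with_outlier = data + [100]
print(mean(with_outlier), median(with_outlier))
```

The second print shows why the median is often preferred when the data contains outliers: the mean jumps from 5 to about 18.6, while the median only moves from 4 to 5.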

Figure: Comparison of mean, median, and mode.

Range: The difference between the maximum and minimum values (Max - Min).

Variance: The arithmetic mean of the squared deviations from the mean.

Standard deviation: The square root of the variance.

Coefficient of variation (percent of variation): The ratio of the standard deviation to the mean, often expressed as a percentage.

For example, consider selecting a player for an IPL team.

Suppose player 1 scored (100, 50, 85, 63, 45) in his last 5 matches and player 2 scored (75, 83, 82, 78, 75). Player 2 will be selected because he is consistent across matches (low standard deviation) and his mean score is also higher. Even though player 1 scored 100 in one match, he will not be selected because he is inconsistent: he overperforms in some matches and underperforms in others.
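The comparison can be checked with a few lines of Python. This is a minimal sketch that uses the scores from the example and the "mean of squared deviations" definition of variance given above (the population variance, dividing by n). Many libraries default to the sample variance, which divides by n - 1 and gives slightly larger values. The helper names are my own.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def population_variance(values):
    # Arithmetic mean of the squared deviations from the mean.
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def coefficient_of_variation(values):
    # Standard deviation relative to the mean.
    return sqrt(population_variance(values)) / mean(values)

player_1 = [100, 50, 85, 63, 45]
player_2 = [75, 83, 82, 78, 75]

for name, scores in (("player 1", player_1), ("player 2", player_2)):
    std_dev = sqrt(population_variance(scores))
    cv = coefficient_of_variation(scores)
    print(f"{name}: mean={mean(scores):.1f}, std-dev={std_dev:.2f}, CV={cv:.1%}")
```

Player 1 has a mean of 68.6 with a standard deviation of about 20.9 (CV about 30%), while player 2 has a mean of 78.6 with a standard deviation of about 3.4 (CV about 4%). Player 2 is both better on average and far more consistent.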

IQR - Interquartile Range = (Q3 - Q1):

Quartiles are the points that divide the data set into four equal parts. Q1, Q2, and Q3 are the first, second, and third quartiles of the data set.

  • 25% of the data points lie below Q1 and 75% lie above it.
  • 50% of the data points lie below Q2 and 50% lie above it. Q2 is the median.
  • 75% of the data points lie below Q3 and 25% lie above it.
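A minimal sketch of the quartiles and the IQR, using statistics.quantiles from the Python standard library on hypothetical data. Note that there are several conventions for interpolating quartiles; method="inclusive" is one of them, and other tools may return slightly different values for the same data.

```python
from statistics import median, quantiles

# Hypothetical observations; quantiles() sorts them internally.
data = [7, 1, 3, 9, 5, 2, 8, 4, 6]

# n=4 splits the data into four equal parts and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                      # spread of the middle 50% of the data
data_range = max(data) - min(data)  # spread of all the data

print(q1, q2, q3)        # Q2 equals the median
print(iqr, data_range)
```

Unlike the range, the IQR ignores the lowest and highest quarters of the data, so a single extreme value does not change it much.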

Skewness: A measure of the asymmetry of a distribution.

As a rule of thumb: Mean = Median: symmetric; Mean > Median: right-skewed; Mean < Median: left-skewed.
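The rule of thumb can be compared with a numeric skewness value. The sketch below uses the moment-based definition (third central moment divided by the second central moment raised to the power 1.5), which is one common convention; the data sets are hypothetical.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (0 for symmetric data)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets, chosen only for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # long tail on the right
left_skewed = [-v for v in right_skewed]      # mirror image: long tail on the left

for name, values in (("symmetric", symmetric),
                     ("right-skewed", right_skewed),
                     ("left-skewed", left_skewed)):
    print(name, mean(values), median(values), round(skewness(values), 2))
```

Positive skewness goes with a mean above the median, and negative skewness with a mean below it.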

Kurtosis: Often described as a measure of peakedness; it mainly reflects how heavy the tails of a distribution are.

  • Mesokurtic: The excess kurtosis is zero (a kurtosis of 3), as for the normal distribution.
  • Leptokurtic: The tails of the distribution are heavy (outliers are more likely) and the kurtosis is higher than that of the normal distribution.
  • Platykurtic: The tails of the distribution are light (outliers are less likely) and the kurtosis is lower than that of the normal distribution.
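To make the three cases concrete, here is a minimal sketch that computes excess kurtosis (fourth central moment divided by the squared second central moment, minus 3, so that the normal distribution scores zero). The two data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data sets, chosen only for illustration.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]                    # evenly spread, no extreme values
heavy_tailed = [4, 5, 5, 5, 5, 5, 5, 5, 6, 25, -15]   # mostly near 5, with two outliers

print(round(excess_kurtosis(flat), 2))          # negative: platykurtic
print(round(excess_kurtosis(heavy_tailed), 2))  # positive: leptokurtic
```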

Covariance and correlation:

Covariance measures how a change in one variable is associated with a change in another variable.

It describes the joint spread of the two variables.

For example, suppose we want to know whether a person's weight is more strongly related to age or to height. The two covariances cannot be compared directly because they are in different units (kg·years and kg·cm). Correlation solves this by scaling the covariance so that the units cancel, giving a measure of linear association between two numeric variables that always lies between -1 and 1. If the weight-age correlation is 0.7 and the weight-height correlation is 0.88, we can say that weight is more strongly associated with height than with age.

COVARIANCE: TOGETHER SPREAD OF X AND Y
CORRELATION: COV(X, Y) / (SIGMA(X) * SIGMA(Y)), I.E. THE COVARIANCE OF THE Z-SCALED VARIABLES, SO IT DOES THE SCALING
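The effect of scaling can be seen in a small sketch. The heights and weights below are hypothetical; the point is that changing the unit of height changes the covariance but leaves the correlation untouched.

```python
from math import sqrt

def covariance(xs, ys):
    # Mean of the products of deviations (population form, dividing by n).
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    # Covariance divided by the product of the two standard deviations.
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements for five people.
weight_kg = [50, 56, 63, 70, 80]
height_cm = [150, 160, 165, 172, 180]
height_m = [h / 100 for h in height_cm]  # same heights, different unit

cov_cm = covariance(weight_kg, height_cm)
cov_m = covariance(weight_kg, height_m)
corr_cm = correlation(weight_kg, height_cm)
corr_m = correlation(weight_kg, height_m)

print(cov_cm, cov_m)    # differ by a factor of 100
print(corr_cm, corr_m)  # identical up to rounding
```

Note that the variance of a variable is simply its covariance with itself, which is why covariance(xs, xs) is used for the standard deviations here.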

INFERENTIAL STATISTICS:

Inferential statistics aims to take sample data from a population, run statistical tests on it, and draw conclusions about the population.

STATISTICAL TESTS: Top-Down Approach

1. Start with a hypothesis

2. Then check whether the data supports or contradicts the claim about the population
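As a rough sketch of the top-down approach, consider a one-sample t-test done by hand. The claim to be tested ("player 2's true average score is 70") is hypothetical and made up for this illustration; the scores are player 2's scores from the example earlier. The critical value 2.776 is the standard two-sided 5% cutoff of the t-distribution with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Step 1: start with a hypothesis (hypothetical claim, for illustration only).
claimed_mean = 70

# Step 2: check whether the sample data supports or contradicts the claim.
scores = [75, 83, 82, 78, 75]
n = len(scores)

standard_error = stdev(scores) / sqrt(n)  # stdev() is the sample standard deviation (n - 1)
t_stat = (mean(scores) - claimed_mean) / standard_error

critical_value = 2.776  # two-sided 5% level, n - 1 = 4 degrees of freedom
reject_claim = abs(t_stat) > critical_value

print(round(t_stat, 2), reject_claim)
```

Here the t-statistic is about 5.09, which is beyond the critical value, so the sample contradicts the claim at the 5% level. With only five observations this is an illustration of the procedure, not a serious analysis.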

Machine Learning: Bottom-Up Approach

  1. Start with data
  2. Arrive at insights by learning y = f(x) from the data, and find the hypothesis that way.
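The simplest possible instance of the bottom-up approach is fitting a straight line y = f(x) = slope * x + intercept by least squares. This is only a sketch: the data points are hypothetical, and a straight line is just one choice of f among many.

```python
# Step 1: start with data (hypothetical points, for illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Step 2: learn f from the data using ordinary least squares.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

def f(x):
    # The fitted relationship, which plays the role of the hypothesis.
    return slope * x + intercept

print(round(slope, 2), round(intercept, 2), round(f(6), 2))
```

No hypothesis was stated up front; the relationship (roughly y = 2x) came out of the data.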

Linkedin : https://www.linkedin.com/in/22vignesh97/

Aspiring data scientist, passionate about learning new technologies and sharing my thoughts with others.