For centuries, statisticians have been analyzing data. In 1970, John Tukey advanced the idea of exploring data to derive hypotheses that statistical testing could then check. In 1996, the International Federation of Classification Societies took up data science as a topic, alongside classification and related methods.
Statistics plays a very important role in understanding the complexities of data.
It provides us with the tools to understand the data structures and find existing patterns within them.
Data scientists have to churn through loads of data to make predictions, so understanding the data with the help of statistical methods is crucial.
Statistics Perceives Data In 2 Types Of Variables
- Discrete Variables: These variables can take only distinct, separate values, such as whole-number counts. Their values can be counted over a finite period.
- Continuous Variables: These variables are numeric and can take an infinite number of values between any 2 values. Their values cannot be counted over a finite time.
Next, we will learn about distribution and its types. Later, we will understand the features of distribution with the help of statistical measures.
Distribution describes all the possible values and their occurrence probabilities. It describes the likelihood of the occurrence of every event.
Types Of Distributions
- Bernoulli Distribution: It is a distribution with only two possible outcomes and a single trial. The random variable takes the value 1 with the probability of success p and the value 0 with the probability of failure 1 - p.
- Uniform Distribution: When every outcome in a set of possible outcomes is equally likely, the distribution is known as a uniform distribution.
- Binomial Distribution: It describes the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes and the probability of success is the same for all the trials.
- Normal Distribution: It is a distribution of values with the following characteristics:
  - The mean, median, and mode of the data are the same.
  - The curve of the distribution is bell-shaped and symmetric about the mean. Half of the values lie to the left of the mean, and the other half lie to the right.
- Poisson Distribution: It models the number of events in a fixed interval when the following assumptions hold:
  - One successful event does not affect the occurrence of another successful event.
  - The probability of success is the same for any two intervals of equal length.
  - The probability of a successful event approaches zero as the interval shrinks toward zero.
- Exponential Distribution: It is a distribution used to represent the time interval between consecutive events that occur independently at a constant average rate.
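The distributions listed above can be explored by drawing samples from each one and comparing the sample mean with the theoretical mean. The following is a minimal sketch using only Python's standard library. The parameter values, the seed, and the helper names (`bernoulli`, `binomial`, `poisson`) are illustrative assumptions, not part of the original article. The Poisson sampler is built by counting exponentially spaced events in one unit of time, which is one common way to relate the two distributions.

```python
import random
from statistics import fmean

rng = random.Random(42)  # fixed seed so the run is repeatable


def bernoulli(p):
    """One trial: 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


def binomial(n, p):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))


def poisson(rate):
    """Count events in one unit of time when gaps between events are exponential."""
    count, elapsed = 0, rng.expovariate(rate)
    while elapsed <= 1.0:
        count += 1
        elapsed += rng.expovariate(rate)
    return count


N = 20_000  # number of draws per distribution (illustrative)

samples = {
    "bernoulli(p=0.3)": [bernoulli(0.3) for _ in range(N)],
    "uniform(0, 10)": [rng.uniform(0, 10) for _ in range(N)],
    "binomial(n=10, p=0.3)": [binomial(10, 0.3) for _ in range(N)],
    "normal(mu=5, sigma=2)": [rng.gauss(5, 2) for _ in range(N)],
    "poisson(rate=4)": [poisson(4) for _ in range(N)],
    "exponential(rate=4)": [rng.expovariate(4) for _ in range(N)],
}

# Theoretical means: p, (a + b) / 2, n * p, mu, rate, 1 / rate.
expected_means = {
    "bernoulli(p=0.3)": 0.3,
    "uniform(0, 10)": 5.0,
    "binomial(n=10, p=0.3)": 3.0,
    "normal(mu=5, sigma=2)": 5.0,
    "poisson(rate=4)": 4.0,
    "exponential(rate=4)": 0.25,
}

for name, values in samples.items():
    print(f"{name:24s} sample mean = {fmean(values):.3f}   "
          f"theoretical mean = {expected_means[name]}")
```

With a large number of draws, each sample mean should land close to its theoretical value, which is a quick sanity check that a sampler matches the distribution it claims to represent.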
How To Find The Distribution Of Your Data?
- Plot the data using a histogram to see the curve of data spread.
- Use hypothesis tests (distribution tests) to check whether the sample data follows a hypothesized distribution. The null hypothesis is that the data follows the specified distribution, so a small p-value means we reject the null hypothesis and conclude that the data does not follow it.
- Probability plots also help in understanding the distribution. Reading them is informally known as the ‘fat pencil’ test: when the points fall close enough to a straight line that a fat pencil laid along it would cover them, your data follows the distribution.
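Two of the checks above, the histogram and the probability plot, can be sketched with the standard library alone. This is an illustrative sketch rather than a full workflow: the sample data is synthetic, the helper names (`text_histogram`, `normal_plot_correlation`) are hypothetical, and the correlation between the sorted data and the normal quantiles is used here as a numeric stand-in for judging the straightness of a probability plot by eye. A formal distribution test with a p-value would normally come from a statistics library and is not shown.

```python
import random
from statistics import NormalDist, fmean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical samples: one bell-shaped, one strongly right-skewed.
bell_data = [rng.gauss(50, 10) for _ in range(500)]
skewed_data = [rng.expovariate(0.1) for _ in range(500)]


def text_histogram(values, bins=10, width=40):
    """Print a rough text histogram and return the count in each bin."""
    low, high = min(values), max(values)
    step = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / step), bins - 1)  # maximum goes in last bin
        counts[index] += 1
    for i, count in enumerate(counts):
        bar = "#" * round(width * count / max(counts))
        print(f"{low + i * step:8.2f} | {bar}")
    return counts


def normal_plot_correlation(values):
    """Pearson correlation between the sorted data and normal quantiles.

    A value very close to 1 means the probability plot is nearly a straight line.
    """
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mean_x, mean_y = fmean(theoretical), fmean(ordered)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(theoretical, ordered))
    sxx = sum((x - mean_x) ** 2 for x in theoretical)
    syy = sum((y - mean_y) ** 2 for y in ordered)
    return sxy / (sxx * syy) ** 0.5


bell_counts = text_histogram(bell_data)
r_bell = normal_plot_correlation(bell_data)
r_skewed = normal_plot_correlation(skewed_data)
print(f"probability-plot correlation, bell-shaped sample: {r_bell:.4f}")
print(f"probability-plot correlation, skewed sample:      {r_skewed:.4f}")
```

The bell-shaped sample should give a correlation much closer to 1 than the skewed sample, mirroring what the straight-line check on a probability plot is meant to reveal.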
How Does Statistics Lay The Ground For Understanding The Data?
1. Descriptive Statistics
It is used to represent the data in a summarized fashion.
Measures Of Central Tendency: These measures give us a way to summarize the data through one value. They also provide the reference point around which the dispersion of the data is measured.
- Mean: It is the average of a set of values.
- Median: It is the middle value of an ordered set of values.
- Mode: It is the most common value in a set of values. Modality describes how many values tie for the highest frequency. Unimodal means one value, bimodal means two values, and multimodal means more than two values.
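The three measures above can be computed with Python's built-in `statistics` module. The sketch below uses a small made-up list of values, and the `modality` helper is a hypothetical name introduced only for illustration.

```python
from statistics import mean, median, multimode

values = [2, 3, 3, 5, 7, 7, 9]  # hypothetical data

average = mean(values)      # sum of the values divided by their count
middle = median(values)     # middle value of the sorted data
modes = multimode(values)   # every value that ties for the highest frequency


def modality(mode_values):
    """Label the data by how many values share the highest frequency."""
    return {1: "unimodal", 2: "bimodal"}.get(len(mode_values), "multimodal")


print(f"mean   = {average:.3f}")
print(f"median = {middle}")
print(f"modes  = {modes} ({modality(modes)})")
```

Here 3 and 7 each appear twice, so the data has two modes and counts as bimodal.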
Importance Of Measures Of Central Tendency
- It helps you understand where the data is centered. When the mean and median are close to each other, the distribution is roughly symmetric, and when they are far apart, the distribution is skewed.
- Comparing the mean and the median can help reveal outliers. Outliers pull the mean toward them but barely move the median, so a large gap between the two suggests that outliers are present.
- The Median represents the data better in a skewed distribution.
- It helps you fill in missing values. The mean is a reasonable fill value when the data distribution is roughly normal, and the median is the safer choice when the distribution is skewed.
- The mode is the only measure of central tendency that works for unordered categorical data, because a mean or median requires values that can be ordered. It shows us the most common category.
- It helps in determining the appropriate hypothesis test. Parametric tests typically compare means, and non-parametric tests typically compare medians.
- The median and mode are robust to outliers.
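The effect of an outlier on the mean and the median can be seen with a small example. The salary figures below are invented purely for illustration.

```python
from statistics import mean, median

# Hypothetical salaries in thousands; the last list adds one extreme value.
salaries = [30, 32, 35, 36, 38, 40, 41]
with_outlier = salaries + [400]

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"without outlier: mean = {mean_before}, median = {median_before}")
print(f"with outlier:    mean = {mean_after}, median = {median_after}")
print(f"shift in mean = {mean_after - mean_before}, "
      f"shift in median = {median_after - median_before}")
```

A single extreme value moves the mean by a large amount while the median barely changes, which is why a wide gap between the two hints at outliers or skew, and why the median is often preferred for skewed data.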
Measures of Dispersion: They describe the spread of data around the measures of central tendency (Mean, Median, Mode).
- Variance: It is the average of the squared distances between each point in the dataset and the mean. High variance means the points lie far from the mean and are spread widely. Low variance means the points cluster closely around the mean.
- Standard Deviation: It is the square root of variance.
- Range: It refers to the difference between maximum and minimum points in the dataset.
- Quartile: Quartiles are the three values that divide the ordered data into 4 equal segments.
- Skewness: It measures the asymmetry of the distribution. With positive skew, the mean is typically greater than the mode. Graphically, the tail on the right side of the curve is longer than that on the left side. With negative skew, the mode is typically greater than the mean. Graphically, the tail on the left side of the curve is longer than that on the right side. A symmetrical distribution has a skewness of 0.
- Kurtosis: It compares the tails of your data with those of a normal distribution. Light-tailed data has few outliers, and heavy-tailed data has more outliers. Mesokurtic means the excess kurtosis (kurtosis minus 3, the kurtosis of a normal distribution) is zero, so the tails are similar to those of the normal distribution. Leptokurtic means that the kurtosis is higher than that of the normal distribution, and the tails of the distribution are heavy. Platykurtic means that the kurtosis is lower than that of the normal distribution, and the tails of the distribution are light.
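The measures of dispersion above can be computed on a small dataset as follows. This is a sketch under a few assumptions: the data is made up, the population formulas are used for variance and standard deviation, skewness and kurtosis are computed from central moments (one of several conventions), and the quartiles use the default method of `statistics.quantiles`, which can differ slightly from other tools. The `central_moment` helper is a hypothetical name.

```python
from statistics import fmean, pstdev, pvariance, quantiles

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5


def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    centre = fmean(values)
    return fmean((v - centre) ** order for v in values)


variance = pvariance(data)               # average squared distance from the mean
std_dev = pstdev(data)                   # square root of the variance
value_range = max(data) - min(data)      # maximum minus minimum
q1, q2, q3 = quantiles(data, n=4)        # three cut points, four segments
iqr = q3 - q1                            # spread of the central half of the data

# Moment-based shape measures.
skewness = central_moment(data, 3) / central_moment(data, 2) ** 1.5
excess_kurtosis = central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

print(f"variance = {variance}, standard deviation = {std_dev}")
print(f"range = {value_range}, quartiles = {(q1, q2, q3)}, IQR = {iqr}")
print(f"skewness = {skewness:.5f}, excess kurtosis = {excess_kurtosis:.5f}")
```

For this data the skewness is positive, reflecting the longer right tail created by the value 9, and the excess kurtosis is slightly negative, which places it on the platykurtic side.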
Importance Of Measures Of Dispersion
- It helps in understanding the relationship between the points and the measures of central tendency, and hence in determining whether representing the data through the mean, median, or mode is appropriate.
- The range is useful for smaller sample sizes.
- The interquartile range helps in understanding the spread of data in the central region.
- Variance is helpful in inferential statistics.
- It helps you understand the distribution of data.
- It helps in comparing 2 samples of data.
2. Inferential Statistics
- It helps make inferences about the population from the sample.
- It also helps in finding how well samples represent the population.
- It underpins hypothesis testing.
- It supports feature engineering for a model.
- It enables comparisons of model performance.
Ways To Perform Inferential Statistics
Z-Score Statistic: It expresses how far a value lies from the population mean, measured as the number of standard deviations above or below it. Under a normal distribution, a z-score can then be converted into the probability of observing such a value. It is calculated as z = (x - μ) / σ, where x is the value for which we want to calculate the z-score, and μ and σ are the population mean and standard deviation, respectively.
Properties Of Z-score:
- When z-score is 0, it is the same as the mean.
- When z-score is positive, the value lies that many standard deviations above the mean.
- When z-score is negative, the value lies that many standard deviations below the mean.
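The z-score formula and its properties can be checked with a few lines of Python. The population mean of 70 and standard deviation of 10 are made-up figures, and converting a z-score to a probability with `NormalDist().cdf` assumes the population is normally distributed.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    return (x - mu) / sigma


MU, SIGMA = 70, 10  # hypothetical population mean and standard deviation

for x in (55, 70, 85):
    z = z_score(x, MU, SIGMA)
    # Share of a normal population expected to fall at or below x.
    share_below = NormalDist().cdf(z)
    print(f"x = {x}: z = {z:+.1f}, share of population at or below x = {share_below:.3f}")
```

The value equal to the mean gets a z-score of 0, values above the mean get positive z-scores, and values below the mean get negative ones.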
Central Limit Theorem: This theorem states that the distribution of sample means approaches a normal distribution as the sample size gets larger, regardless of the shape of the population distribution.
Properties Of Central limit Theorem:
- The mean of the sample means is approximately equal to the mean of the population.
- The standard deviation of the sample means, known as the standard error, is equal to the population standard deviation divided by the square root of the sample size. There is an inverse relationship between sample size and standard error, so a greater sample size gives a more accurate estimate of the population mean from the sample mean.
- The distribution of sample means is approximately normal for large samples, whatever the shape of the population distribution. This holds even when the population distribution is skewed or otherwise different from the normal distribution.
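These properties can be observed in a small simulation. The sketch below assumes a skewed population (an exponential distribution with mean 1 and standard deviation 1), and the sample sizes, the number of repeats, and the `sample_means` helper are illustrative choices.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

rng = random.Random(1)  # fixed seed so the run is repeatable

# Assumed population: exponential with rate 1, which is strongly right-skewed.
POPULATION_MEAN = 1.0
POPULATION_SD = 1.0


def sample_means(sample_size, repeats=2000):
    """Draw many samples of the given size and return the mean of each one."""
    return [
        fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(repeats)
    ]


results = {}
for n in (5, 30, 100):
    means = sample_means(n)
    observed_mean = fmean(means)                 # should be near the population mean
    observed_se = pstdev(means)                  # spread of the sample means
    predicted_se = POPULATION_SD / sqrt(n)       # sigma / sqrt(n)
    results[n] = (observed_mean, observed_se, predicted_se)
    print(f"n = {n:3d}: mean of sample means = {observed_mean:.3f}, "
          f"observed SE = {observed_se:.3f}, predicted SE = {predicted_se:.3f}")
```

As the sample size grows, the mean of the sample means stays near the population mean while the observed standard error shrinks in line with the predicted value.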
Confidence Intervals: We use them when we wish to estimate the population mean based on the sample mean. The problem at hand is that sample statistics may not represent the underlying population well. A confidence interval addresses this by giving a range of values that is likely to contain the population parameter.
2 Types Of Confidence Intervals
Two-Sided Confidence Interval: It bounds the population parameter from both above and below.
One-Sided Interval: It bounds the population parameter on one side only, either from above or from below.
Source: Luis Fok, Oregon State University
- The Margin Of Error: It is very important in confidence intervals. It is the distance from the sample mean to each end of a two-sided interval. At the chosen confidence level, the population mean is expected to lie within the margin of error of the sample mean.
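A confidence interval for the mean and its margin of error can be sketched as follows. This is a simplified illustration, not a definitive method: it assumes a reasonably large sample so that the normal (z) critical value is acceptable, whereas small samples usually call for a t-based interval. The sample data is synthetic, and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev

rng = random.Random(7)  # fixed seed so the run is repeatable
sample = [rng.gauss(100, 15) for _ in range(200)]  # hypothetical sample


def mean_confidence_interval(values, confidence=0.95):
    """Two-sided interval: sample mean plus or minus the margin of error."""
    centre = fmean(values)
    standard_error = stdev(values) / sqrt(len(values))
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = critical * standard_error
    return centre - margin, centre + margin, margin


def mean_upper_bound(values, confidence=0.95):
    """One-sided interval: bounds the population mean from above only."""
    standard_error = stdev(values) / sqrt(len(values))
    critical = NormalDist().inv_cdf(confidence)  # about 1.645 for 95%
    return fmean(values) + critical * standard_error


low, high, margin = mean_confidence_interval(sample)
upper = mean_upper_bound(sample)
print(f"sample mean = {fmean(sample):.2f}")
print(f"95% two-sided interval = ({low:.2f}, {high:.2f}), margin of error = {margin:.2f}")
print(f"95% one-sided upper bound = {upper:.2f}")
```

Raising the confidence level widens the interval, and increasing the sample size narrows it because the standard error shrinks.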
We read about the various statistical tools in this article and saw the importance of these statistical components. We now know that statistics can help you understand the data you will be working with. We saw how to work with a sampling distribution based on the population distribution and how to understand the difference between them. Our team has worked with various types of data and is well-versed in ways to understand data structures and patterns.