Population, Sample and Central Limit Theorem (CLT)
In this blog, we cover a core, fundamental concept that you need to know as a Data Science/Machine Learning practitioner.
So, let's get started!
Population and Sample
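The term ‘Population’ means the entire group we want to study. In our example, that is all the people in the world.
The term ‘Sample’ means a randomly selected subset of the population. In our example, that is a group of randomly chosen people from around the world.
Looks simple! Now, let me ask you a question.
Q: What is the average height of all humans in the world (the population)?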
I thought this could be done with an online survey where people enter their own height. But there are two limitations:
- There is a high chance that some people will enter the wrong height value.
- Even if everyone entered their height correctly, there are about 7.7 billion people in the world. The resulting file would be very large, and most systems cannot load that much data.
So, in practice, we can't measure the exact average height of every human in the world.
Instead, how about we randomly ask 1000 people around the world (a sample) and estimate it?
Yes, we can do that!
So, in the above example, we got a sample by randomly asking 1000 people. The sample size is therefore Ns = 1000.
To represent a mean, we typically put a bar over the symbol (for example, x̄ for the sample mean).
As the sample size increases, the sample mean tends to get closer to the population mean (THINK INTUITIVELY!)
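To make the population vs. sample idea concrete, here is a minimal sketch. The heights are synthetic: the population of 1,000,000 values, the normal shape, and the 170 cm / 10 cm parameters are all assumptions for illustration, not real survey data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 1,000,000 synthetic heights in cm (NOT real data).
population_heights = rng.normal(loc=170.0, scale=10.0, size=1_000_000)
population_mean = population_heights.mean()

# "Ask" Ns = 1000 randomly chosen people for their height.
Ns = 1000
sample = rng.choice(population_heights, size=Ns, replace=False)
sample_mean = sample.mean()

print(f"population mean: {population_mean:.2f} cm")
print(f"sample mean (Ns={Ns}): {sample_mean:.2f} cm")
print(f"absolute error: {abs(sample_mean - population_mean):.2f} cm")
```

Try changing Ns to 10 or 100,000 to see how the estimate typically gets worse or better.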
Central Limit Theorem (CLT)
Before we look at its definition, let's understand the concept. (Once you understand the concept, you will be able to define the theorem yourself.)
Let's consider that we have data (a random variable) which follows any distribution.
Then we randomly draw samples from it (let's say the number of samples = N), where each sample has size = Ns.
Then, for each sample, we calculate the mean/average.
After that, we plot the distribution of those sample means.
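From this observation, we find that the sample means follow an approximately Gaussian distribution, with mean of the sample means ≈ population mean, and variance of the sample means ≈ population variance / Ns.
So, if you have sample data (S) and want to estimate the mean and variance of the population, then population mean ≈ mean of the sample means, and population variance ≈ Ns × variance of the sample means.
Now, from this concept and the relations above, we can say that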
The CLT states that if we have a population with finite mean (mu) and finite variance (sigma^2), and we repeatedly draw samples of size ’n’ from it and calculate the mean/average of each sample, then the distribution of those sample means will be approximately Gaussian for large enough ’n’. The mean of that distribution equals the population mean (mu), and its variance equals the population variance divided by the sample size (sigma^2/n).
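For readers who prefer symbols, the same statement can be written as below. This is the classical form of the theorem, assuming the observations X_1, ..., X_n are independent and identically distributed with finite mean \mu and finite variance \sigma^2. The symbols are standard notation rather than the ones in the original figures, and n here plays the role of the sample size Ns.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr) \xrightarrow{d} \mathcal{N}(0, \sigma^2)
\quad \text{as } n \to \infty,
\qquad \text{so for large } n:\quad
\bar{X}_n \overset{\text{approx.}}{\sim} \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right).
```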
To illustrate this, let's go through some code, which should make it clearer in a practical way :D
Import Necessary Libraries
Create Population data with some random numbers (Number of population data: 100k, Range: 0 to 200)
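A minimal sketch of this step could look like the following. It assumes NumPy, integer values, a fixed seed of 42, and that the range 0 to 200 includes both ends. None of these details are confirmed by the original code, and the variable names are my own.

```python
import numpy as np

# Seeded generator so the run is reproducible (the seed is an arbitrary choice).
rng = np.random.default_rng(42)

# 100k random integers, uniformly drawn from 0 to 200 (both ends included).
population = rng.integers(0, 200, size=100_000, endpoint=True)

print(population[:10])   # peek at the first few values
print(population.shape)  # number of data points in the population
```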
Calculate Population mean and its variance
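As a sketch, the population statistics can be computed like this. The population is re-created under the same assumptions as before (uniform integers from 0 to 200, seed 42) so that the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.integers(0, 200, size=100_000, endpoint=True)

# Mean and variance over the WHOLE population.
# np.var uses ddof=0 by default, which is the population variance.
population_mean = population.mean()
population_variance = population.var()

print(f"population mean: {population_mean:.3f}")
print(f"population variance: {population_variance:.3f}")
```

For uniform integers from 0 to 200, the theoretical values are a mean of 100 and a variance of (201^2 - 1)/12 ≈ 3366.67, so the printed numbers should land close to those.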
Plot the Distribution of Population data
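One possible way to do this is sketched below. The bin count of 20 and the helper name `plot_population` are my own choices, and Matplotlib is assumed to be installed if you want the picture. The histogram counts themselves are computed with NumPy alone.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.integers(0, 200, size=100_000, endpoint=True)

# Bin the population into 20 equal-width bins over [0, 200].
counts, bin_edges = np.histogram(population, bins=20, range=(0, 200))


def plot_population(counts, bin_edges):
    """Draw the binned population as a bar chart and return the figure."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots()
    ax.bar(bin_edges[:-1], counts, width=np.diff(bin_edges), align="edge")
    ax.set_xlabel("Value")
    ax.set_ylabel("Count")
    ax.set_title("Distribution of population data")
    return fig


print(counts)  # roughly flat counts: the population is uniform, not Gaussian
```

Call `plot_population(counts, bin_edges)` and then `matplotlib.pyplot.show()` to display the chart. The last bin also contains the value 200, so it is slightly taller than the others.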
Let's take random sample data from the population data
Observation I (Number of Samples = 100, Each sample size=30)
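Here is a hedged sketch of this observation. It assumes sampling with replacement (so the draws are independent, as the CLT expects) and re-creates the population with the same assumed settings as before. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.integers(0, 200, size=100_000, endpoint=True)
population_mean = population.mean()
population_variance = population.var()

num_samples = 100  # N: how many samples we draw
sample_size = 30   # Ns: how many data points in each sample

# Each row is one sample of size Ns, drawn with replacement.
samples = rng.choice(population, size=(num_samples, sample_size), replace=True)
sample_means = samples.mean(axis=1)  # one mean per sample

mean_of_sample_means = sample_means.mean()
variance_of_sample_means = sample_means.var(ddof=1)
expected_variance = population_variance / sample_size  # sigma^2 / Ns

print(f"population mean:          {population_mean:.3f}")
print(f"mean of sample means:     {mean_of_sample_means:.3f}")
print(f"variance of sample means: {variance_of_sample_means:.3f}")
print(f"population variance / Ns: {expected_variance:.3f}")
```

The last two printed values should be of similar size, but not identical, because we only have 100 sample means to estimate the variance from.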
Note: Let's observe why we divide by the sample size
Dividing the population variance by the sample size gives a value close to the variance of the sample means. So the approximate equality of the variances (see above Fig: Mean and Variance are in Equally Approximation (NOT ACCURATE)) is satisfied. Please remember, the main message of the CLT is the Gaussian shape of the sample means. I only want you to see why we divide the variance by the sample size, so that we can see how the variance of the sample means relates to the population variance.
Observation II (Number of Samples = 100, Each sample size=10,000)
Observation III (Number of Samples = 100, Each sample size=20,000)
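The three observations can be compared side by side with a sketch like the one below. The helper `summarize` and its dictionary keys are hypothetical names of mine, and sampling is again assumed to be with replacement from the same assumed population.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.integers(0, 200, size=100_000, endpoint=True)
population_mean = population.mean()
population_variance = population.var()


def summarize(sample_size, num_samples=100):
    """Draw num_samples samples of the given size and summarize their means."""
    samples = rng.choice(population, size=(num_samples, sample_size), replace=True)
    sample_means = samples.mean(axis=1)
    return {
        "mean_of_means": sample_means.mean(),
        "abs_diff": abs(sample_means.mean() - population_mean),
        "var_of_means": sample_means.var(ddof=1),
        "pop_var_over_n": population_variance / sample_size,
    }


# Observation I, II and III: same number of samples, growing sample size.
results = {n: summarize(n) for n in (30, 10_000, 20_000)}

for n, r in results.items():
    print(
        f"Ns={n:>6}: |mean diff|={r['abs_diff']:.4f}, "
        f"var of means={r['var_of_means']:.4f}, "
        f"pop var / Ns={r['pop_var_over_n']:.4f}"
    )
```

In a single run the mean difference does not have to shrink at every step, since it is itself random. What shrinks reliably is the spread of the sample means, which follows population variance / Ns.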
From all three observations above, we found that as the sample size increases, the mean of the sample means tends to get closer to the population mean (that is, the difference between the sample mean and the population mean shrinks).
Now, why study the CLT?
By Deborah J. Rumsey
The normal distribution is used to help measure the accuracy of many statistics, including the sample mean, using an important result called the Central Limit Theorem. This theorem gives you the ability to measure how much the means of various samples will vary, without having to take any other sample means to compare it with. By taking this variability into account, you can use your data to answer questions about a population, such as “What’s the mean household income for the whole U.S.?”; or “This report said 75% of all gift cards go unused; is that really true?” (These two particular analyses are made possible by applications of the Central Limit Theorem called confidence intervals and hypothesis tests, respectively.)
The Central Limit Theorem (CLT for short) basically says that for non-normal data, the distribution of the sample means has an approximate normal distribution, no matter what the distribution of the original data looks like, as long as the sample size is large enough (usually at least 30) and all samples have the same size. And it doesn’t just apply to the sample mean; the CLT is also true for other sample statistics, such as the sample proportion. Because statisticians know so much about the normal distribution, these analyses are much easier.
I hope you enjoy reading it.
Keep Learning and Have Fun!