How do you make statistical inferences from data?


This post was originally published by Rakesh Chintha at Towards Data Science

A quick Google search reveals that the average age of a person in the US is 38. Have you ever wondered how statisticians at the Census Bureau came up with that number? Do you think they went and asked everyone in person or by mail? No, because that would be a waste of time, money, and resources just to find one statistic and put it up on their website, all bold and fancy.

So how do they do it? They use some basic principles of inferential statistics.

Alright, so in this article, we will be finding an answer to the following question using statistical inferences.

Are women paid less than men?

Let us scratch the surface of inferential statistics before diving into the case study.

Population: The set that contains all data points in our experimenting space. Population size is denoted by N.

Sample: It is a randomly selected subset from the population — the sample size is denoted by n.

Distribution: It describes the data/population/sample range and how data is spread in that range.

Mean: Average value of all data from your population or sample. This is denoted by µ for populations and x̄ for samples.

Standard Deviation: A measure of how spread out your population is. It is denoted by σ (sigma).

Normal Distribution: When your population is spread perfectly symmetrically around the mean value, with standard deviation σ, you get a bell-shaped curve.
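To make these terms concrete, here is a minimal sketch using only Python's standard library. The population is simulated, and the mean of 38 and standard deviation of 12 are made-up illustration values, not real census figures.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: N = 100,000 values drawn from a normal distribution.
# The mean (38) and standard deviation (12) are assumptions for illustration.
population = [random.gauss(38, 12) for _ in range(100_000)]

# A random sample of size n = 500, drawn without replacement.
sample = random.sample(population, 500)

mu = statistics.fmean(population)      # population mean (µ)
sigma = statistics.pstdev(population)  # population standard deviation (σ)
x_bar = statistics.fmean(sample)       # sample mean (x̄)
s = statistics.stdev(sample)           # sample standard deviation (n - 1 denominator)

print(f"N = {len(population)}, mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"n = {len(sample)}, x_bar = {x_bar:.2f}, s = {s:.2f}")
```

The sample statistics should land close to the population values even though the sample covers only a small fraction of the population.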

Central Limit Theorem

From Wikipedia:

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed.

The video below gives a very intuitive explanation of the Central Limit Theorem.

In other words, this theorem states that no matter what the shape of the initial population is, the sampling distribution of the sample mean approaches a normal distribution as the sample size grows.
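A small simulation can illustrate the theorem. The sketch below assumes a made-up, strongly right-skewed population (exponential) and uses a simple moment-based skewness measure, which is close to 0 for a symmetric, bell-shaped distribution. The helper names `sample_means` and `skewness` are hypothetical and exist only for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: strongly right-skewed (exponential with mean 1).
population = [random.expovariate(1.0) for _ in range(50_000)]

def sample_means(data, n, repeats=2_000):
    """Draw `repeats` random samples of size n (with replacement) and return their means."""
    return [statistics.fmean(random.choices(data, k=n)) for _ in range(repeats)]

def skewness(values):
    """Moment-based skewness; roughly 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

means = sample_means(population, n=50)

print(f"skewness of population:   {skewness(population):.2f}")
print(f"skewness of sample means: {skewness(means):.2f}")
```

The population itself is far from symmetric, while the distribution of the sample means should be much closer to symmetric and is centered near the population mean.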

The standard error is the measure of how much the sample mean deviates from the population mean.

Standard Error Formula: $SE = \sigma / \sqrt{n}$, where σ is the standard deviation and n is the sample size.

Sample size (n) is the number of observations in the sample. The plot below shows the relationship between sample size and standard error. As sample size increases, standard error decreases.
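The relationship can also be checked numerically. This is a minimal sketch using the usual formula SE = σ / √n. The value σ = 12 is a made-up assumption and `standard_error` is a hypothetical helper name.

```python
import math

sigma = 12.0  # hypothetical population standard deviation

def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)

# Standard error shrinks as the sample size grows.
for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: SE = {standard_error(sigma, n):.3f}")
```

Because of the square root, the gains diminish: quadrupling the sample size only halves the standard error.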

While a larger sample size reduces the standard error, collecting a very large sample is not feasible in most complex real-world problems. Hence an optimal sample size is needed.

Confidence intervals represent the range of values within which we are fairly sure that our population mean lies. In the image below, the lower limit and upper limit together define the confidence interval. The area between the two limits is called the acceptance region, while the area outside is called the rejection region.
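As a rough sketch of how such an interval might be computed, the example below builds a confidence interval for a mean using the normal (z) approximation, which assumes a reasonably large sample. The data is simulated and `mean_confidence_interval` is a hypothetical helper, not a library function.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical sample of 200 observations (made-up mean 50, standard deviation 10).
sample = [random.gauss(50, 10) for _ in range(200)]

def mean_confidence_interval(data, confidence=0.95):
    """Confidence interval for the mean, using the normal approximation."""
    n = len(data)
    x_bar = statistics.fmean(data)
    se = statistics.stdev(data) / math.sqrt(n)  # estimated standard error
    # z is the critical value, about 1.96 for 95% confidence
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return x_bar - z * se, x_bar + z * se

lower, upper = mean_confidence_interval(sample)
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

Asking for more confidence, for example 0.99 instead of 0.95, widens the interval. For small samples a t-distribution would normally replace the normal approximation used here.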

The p-value is the probability of observing a test result at least as extreme as the one we got, assuming the null hypothesis (that there is no real effect) is true. Informally, it tells us how likely such a result would be by chance alone. A lower p-value indicates stronger evidence against the null hypothesis.

The significance level (α) is the threshold p-value set to decide if the test results are statistically significant. The significance level is usually set to 0.05, 0.01, or 0.001. If the test result's p-value is less than the significance level (α), then we can conclude that the obtained test results are statistically significant and are unlikely to be due to random chance or noise.
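To show how a p-value and a significance level fit together, here is a minimal sketch of a two-sided, two-sample z-test for a difference in means. The incomes are simulated, made-up numbers (not GSS data), the test relies on a large-sample normal approximation, and `two_sample_z_test` is a hypothetical helper written for this example.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

# Simulated, made-up incomes for two groups (NOT GSS data).
group_a = [random.gauss(52_000, 15_000) for _ in range(1_000)]
group_b = [random.gauss(47_000, 15_000) for _ in range(1_000)]

def two_sample_z_test(a, b):
    """Two-sided z-test for a difference in means (large-sample approximation)."""
    # Standard error of the difference between the two sample means.
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    # Probability of a |z| at least this large if the true means were equal.
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p_value

alpha = 0.05  # significance level
z, p_value = two_sample_z_test(group_a, group_b)
significant = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4g}, significant at alpha={alpha}: {significant}")
```

If the p-value falls below α, the difference in means is called statistically significant at that level. Otherwise the data does not give enough evidence to reject the null hypothesis of equal means.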

For our analysis, we will use data collected by the General Social Survey (GSS), which has been surveying the general American public since 1972, mainly through face-to-face interviews. Below is the description from their website.

The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

The GSS sample is drawn using an area probability design that randomly selects respondents in households across the nation from a mix of urban, suburban, and rural geographic areas. Because random sampling was used, the data is representative of the US population as a whole.

Alright, so now that we have our data ready, let us dive into our case studies and find answers.
