Stochastic simulation helps you grasp concepts of statistics

This post was originally published by Tirthajyoti Sarkar at Towards Data Science

Grasping statistics-related concepts can be hard

Do you find grasping the concepts of statistical analysis — law of large numbers, expectation value, confidence interval, p-value — somewhat difficult and troublesome?

You are not alone.

Our human brain and psyche have not evolved to deal with rigorous statistical methods. In fact, a study of why people struggle to solve statistical problems reveals a preference for complicated rather than simpler, more intuitive solutions — which often leads to failure in solving the problem altogether.

As you might know from Nobel laureate Daniel Kahneman’s famous book “Thinking, Fast and Slow”, our intuition does not reside in the same system of the mind where our rationality does.

We are good at handling small sets of numbers. The short-term working memory of the human brain can hold only around 7–8 items at a time.

Therefore, whenever a process presents itself with a scale of thousands or millions, we tend to lose our grasp on the ‘inherent nature’ of that process. The laws and patterns, which are only manifested at the limit of large numbers, seem random and meaningless to us.

Statistics deals with large numbers, and almost all theories and results in statistical modeling and analysis are valid only in the limit of large numbers.

Data science/Machine learning is rooted in statistics — what to do?

In this era of data science and machine learning, where knowledge of the core statistical concepts is considered essential for success, this can be worrisome for practitioners and for folks who are on their journey to learn the trade.

But do not despair. There is a surprisingly easy way to tackle this. And it is called ‘simulation’. In particular — discrete, stochastic, event-based simulation.

The expected value of a dice throw

Suppose we are throwing a fair dice with 6 possible faces, 1 to 6. The event of the dice face taking up a value from the set {1,2,3,4,5,6} is represented by a random variable. In a formal setting, the so-called ‘expected value’ (denoted by E[X]) of any random variable X is given by,

E[X] = Σ x · f(x)  (summed over all possible values x; for a continuous X, an integral replaces the sum)

where f(x) is the probability density function (PDF) or probability mass function (PMF) for X, i.e. the mathematical function that describes the distribution of the possible values that X can assume.

For a dice throwing situation, the random variable X is discrete in nature, i.e. it can assume only discrete values, so it has a PMF (and not a PDF). And it is a very simple PMF,

f(x) = 1/6 for every x in {1, 2, 3, 4, 5, 6}

This is because the random variable has a ‘uniform probability distribution’ over the sample space {1,2,3,4,5,6}, i.e. any dice throw can result in any one of these values, completely at random and without any bias towards a particular value. Therefore, the expected value is,

E[X] = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 21/6 = 3.5

So, as per theory, 3.5 is the expected value of the dice throwing process.

Is it the most probable value? No. Because a dice does not even have a face with 3.5! So, what’s the meaning of this quantity?

Is it some kind of probability? No. Because the value is clearly greater than 1 and probability values are always between 0 and 1.

Does it mean we can expect the face to turn up either 3 or 4 most times (3.5 is the average of 3 and 4)? No. Because the PMF tells us that all the faces are equally likely to turn up.

Fortunately, the answer is provided by a fundamental tenet of statistics, the Law of Large Numbers, which says that, in the long run, the average of the values that the random variable takes converges to its expected value.

Notice the phrase “in the long run”. How do we verify this? Can we simulate such a scenario?

Sure we can. Simple Python code can help us simulate the scenario and verify the Law of Large Numbers.

Python to the rescue

Define an array with dice faces and a function to simulate a single throw.
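
Something like this minimal sketch, assuming NumPy, captures the idea:

```python
import numpy as np

# The six faces of the dice
dice = np.array([1, 2, 3, 4, 5, 6])

def dice_throw():
    """Simulate a single throw by picking one face uniformly at random."""
    return np.random.choice(dice)
```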

Throw the dice a few times,
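
For example, a quick sketch:

```python
# Throw the dice 10 times and look at the outcomes
throws = [dice_throw() for _ in range(10)]
print(throws)
```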

As you might have noticed, for every invocation of dice_throw(), I am using the np.random.choice() function to pick a single random item out of the dice array. If you run this code, you will get a completely different sequence on your machine.

We have left the statistics behind; we are in a simulation zone

Take a pause and realize what is happening.

We are not dealing anymore with formal probabilities and definitions. We are simulating a random event — dice throw — just like in real life. This is the lure of simulation. It constructs a replica of real life on your computing hardware 🙂

We could leave all the coding behind and just do that — throw a dice, note down the face, rinse and repeat — for real. But it will take a whole lot of time to verify the Law of Large Numbers following that route.

That is what we have the computer and the Python programming language for, isn’t it?

So, we just simulate it for a sufficiently long time, keep a running average, and plot the result.
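
A minimal sketch of that experiment, assuming Matplotlib for the plot, could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

dice = np.array([1, 2, 3, 4, 5, 6])
n_throws = 10000  # a 'sufficiently long' run; the exact number is an arbitrary choice

# Equivalent to calling dice_throw() n_throws times, just vectorized
results = np.random.choice(dice, size=n_throws)

# Running average after each successive throw
running_avg = np.cumsum(results) / np.arange(1, n_throws + 1)

plt.plot(running_avg)
plt.axhline(y=3.5, color='red', linestyle='--', label='Expected value (3.5)')
plt.xlabel('Number of throws')
plt.ylabel('Running average')
plt.legend()
plt.show()
```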

Initially, the running average is pretty wild and moves around. As we increase the number of simulations, the average converges to 3.5, as expected from the theory.

This way, we come back to statistics again, with the help of simulation. The Law of Large Numbers can be verified by repeated simulations of a random event, with a minimal amount of programming.

Some essential definitions

Population: The whole collection of which we want to measure some property. We can (almost) never get enough data about the whole population. Therefore, we can never know the true values of population properties.

Sample: A fraction (subset) of data from the population, which we can gather, and which helps us estimate the properties of the population. Because we cannot measure the true values of the population properties, we can only estimate them. This is the central job of statisticians.

Statistic: A statistic is a function of a sample. It is a random variable, because every time you take a new sample (from the same population) you will get a new value for the statistic. Examples are the sample mean and the sample variance. These are good (unbiased) estimators of the corresponding population properties.

Confidence interval: A range/bound around the statistic (of our choice). We need this min/max bound to quantify the uncertainty of the random nature of our sampling. Let’s clarify this further with the example of the confidence interval for the mean.

Depending on where and how we are drawing the sample, we may get a good representation of the population or not. So, if we repeat the process of drawing the sample many times, in some cases the sample will contain the true mean of the population, and in other cases, it will miss it.

Can we say anything about the proportion of our success in drawing a sample which contains the true mean?

The answer to this question is found in the confidence interval. If certain assumptions are met, we can calculate a confidence interval that will contain the true mean in a certain fraction of cases (if we were to repeat the sampling a large number of times).

The necessary formula is given below: for a sample of size n with sample mean x̄ and sample standard deviation s, the confidence interval for the population mean is x̄ ± t(α/2, n−1) · s/√n, where t(α/2, n−1) is the critical value of the t-distribution with n−1 degrees of freedom. We won’t get into details about why this particular t-distribution is used in the formula. Readers can refer to any undergraduate level stats text or excellent online resources to understand the rationale.

What is the practical utility?

Pay close attention to the definition and the process in order to understand the true practical utility of the confidence interval.

When you calculate a 95% confidence interval for the mean, you are not calculating any probability (0.95 or otherwise). You are calculating two specific numbers (min and max bounds around the sample mean) which create a range of values, and that range would contain the unknown true population mean in 95% of cases if we were to repeat the whole process many times.

Here lies the practical utility. We are not repeating the process. We are just drawing the sample once and constructing this range.

If we could repeat the process a million times, we would be able to verify the claim that the true mean lies inside this range in 95% of cases.

But sampling a million times can be quite expensive and downright impossible in real life. So, the theoretical calculation of the confidence interval provides us with the min/max range, just from one draw of the sample. This is amazing, isn’t it?

But in the simulation, we can experiment a million times!

Yes, simulation is fantastic. We can repeat the sampling process a million times and verify the claim that our theoretical confidence interval truly contains the population mean approximately 95% of the time.

Let’s verify this using a real-life example of factory production. Say that in a factory, a certain machine produces 20 tons of product per week on average, with a standard deviation of 5 tons. These are the true population mean and standard deviation. So, we can write simple Python code to generate a typical production run over a year (52 weeks) and plot it.
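
A possible sketch, assuming the weekly output follows a Normal distribution with the stated mean and standard deviation (the distribution shape is an assumption made for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

true_mean, true_std = 20, 5   # population parameters (tons per week)
weeks = 52

# One simulated year of weekly production
production = np.random.normal(loc=true_mean, scale=true_std, size=weeks)

plt.plot(production, marker='o')
plt.xlabel('Week')
plt.ylabel('Production (tons)')
plt.title('Simulated production run over a year')
plt.show()
```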

Then, we can write the following function to simulate the process an arbitrary number of times and count how many times the confidence interval truly contained the population mean. Remember that we know the population mean for this case: it is 20.
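
One way to sketch such a function, using scipy.stats for the t-based interval (the function name and defaults below are illustrative, not the author's original code):

```python
import numpy as np
from scipy import stats

def ci_contains_true_mean(true_mean=20, true_std=5, sample_size=52, confidence=0.90):
    """Draw one sample, build a t-based confidence interval for the mean,
    and report whether it contains the (known) true population mean."""
    sample = np.random.normal(loc=true_mean, scale=true_std, size=sample_size)
    sample_mean = sample.mean()
    # Standard error of the mean, using the sample standard deviation
    sem = sample.std(ddof=1) / np.sqrt(sample_size)
    low, high = stats.t.interval(confidence, sample_size - 1,
                                 loc=sample_mean, scale=sem)
    return low <= true_mean <= high
```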

If we run this function 10,000 times, each time checking whether the C.I. contained the true mean or not, and then compute the frequency/ratio of successes, here is what we find.
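
A sketch of that experiment, reusing the illustrative helper above and checking both a 90% and a 99% interval:

```python
n_repeats = 10_000

for conf_level in (0.90, 0.99):
    # Count how often the C.I. captured the true mean of 20
    hits = sum(ci_contains_true_mean(confidence=conf_level) for _ in range(n_repeats))
    print(f"{int(conf_level * 100)}% C.I. contained the true mean "
          f"in {hits / n_repeats:.3f} of the runs")
```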

The ratios come amazingly close to the theoretical values of 0.9 (for the 90% C.I.) and 0.99 (for the 99% C.I.), don’t they?

Simulation is a powerful tool for large-scale data science

In the example above, we talked about the C.I. of the mean. But we can construct the C.I. around any other statistic like variance or even quantiles. We can even construct C.I. of the difference of means between two experiments. The exact formula and calculations may be slightly different in each case but the idea remains the same.
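
As a sketch of the same idea applied to the variance, here is an illustrative coverage check that uses the standard chi-square interval for the variance of a Normal population (the helper name and settings are assumptions, not the author's code):

```python
import numpy as np
from scipy import stats

def variance_ci_contains_true_var(true_mean=20, true_std=5,
                                  sample_size=52, confidence=0.95):
    """Chi-square based C.I. for the variance of a Normal population;
    returns True if the interval contains the true variance."""
    alpha = 1 - confidence
    sample = np.random.normal(true_mean, true_std, sample_size)
    s2 = sample.var(ddof=1)          # sample variance
    df = sample_size - 1
    low = df * s2 / stats.chi2.ppf(1 - alpha / 2, df)
    high = df * s2 / stats.chi2.ppf(alpha / 2, df)
    return low <= true_std**2 <= high

coverage = np.mean([variance_ci_contains_true_var() for _ in range(10_000)])
print(f"Empirical coverage of the 95% variance C.I.: {coverage:.3f}")
```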

As the process complexity increases and we deal with not one but a multitude of interconnected processes, calculating simple summary statistics may not always be possible in practice. We must master the art of stochastic simulation to deal with such situations for large data science and analytics tasks.

In this article, we demonstrated the power of simulation to understand concepts of statistical estimation like expected value and confidence interval. In reality, we do not get the chance to repeat a statistical experiment thousands of times, but we can simulate the process on a computer, which helps us to distill down these concepts in a clear and intuitive manner.

Once you master the art of simulating a stochastic event, you can investigate the properties of the random variables and the esoteric statistical theory behind them, with a new weapon of analysis.

For example, you can investigate, using stochastic simulation,

  • The convergence of the mean of many stochastic events to a Normal distribution, verifying the Central Limit Theorem by numerical experiment (see the sketch after this list)
  • What happens when you mix or transform many statistical distributions in this way or that? What kinds of resulting distributions do you get?
  • If a stochastic event does not follow the theoretical assumptions, what kind of aberrant behavior can you get in the result? In this case, simulation could be your only friend, because the standard theory fails when its assumptions are not met.
  • What kinds of statistical properties emerge from the operation of a Deep Learning network?
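
As an illustration of the first bullet, here is a minimal sketch that piles up the means of many simulated dice-throw samples and plots their histogram, which should look approximately Normal:

```python
import numpy as np
import matplotlib.pyplot as plt

sample_size = 50       # throws averaged per sample (an arbitrary choice)
n_samples = 10_000     # number of repeated samples

dice = np.array([1, 2, 3, 4, 5, 6])
sample_means = [np.random.choice(dice, size=sample_size).mean()
                for _ in range(n_samples)]

plt.hist(sample_means, bins=40)
plt.xlabel('Sample mean of 50 dice throws')
plt.ylabel('Frequency')
plt.title('Distribution of sample means (approximately Normal)')
plt.show()
```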

For learning the foundational principles of data science and machine learning, the importance of these kinds of exercises cannot be emphasized enough.
