Statistical inference through confidence interval estimation


This post was originally published by ARIMITRA MAITI at Towards Data Science

The balance between ignorance and confidence

All you need in this life is ignorance and confidence, and then success is sure. ~Mark Twain


It is intriguing that when a contestant on MasterChef presents a dish, the judges do not eat all of it. Rather, they take a small bite elegantly, their deliberate chewing builds suspense, and finally their considered judgment gives insight into the quality of the food. That judgment is trusted by both the audience and the contestants, and it serves as a precise, representative guide for anyone hoping to achieve similar results in cooking that genre of dish.

Equally intriguing is the practice of rough approximation, which draws interest across age groups as a way to sharpen thinking skills. One way or another, I feel it gives tremendous delight to discover how close our guess came to the exact value, without knowing in advance how wide the range of possibilities was.

In this article (a continuation of my previous one) we will take careful steps to explore some interesting facts behind the ideas introduced in the paragraphs above.

Population, Sample, and Non-sampling error

Target Population vs Sample (Simpson Image Source)

A population is the set of all members about which a study intends to make inferences (Albright & Winston et al.).

A list of all members of a population is known as a frame, and each member on that list is a sampling unit. The dispersion and spatial distribution of a population are the only attributes we need to carry forward in this discussion.

A sample is a small portion of the population that is easier to handle, yet it should be feasible to draw and must retain the genuine characteristics of the population.

If we make a habit of writing or saying “Target Population” instead of plain vanilla “Population”, then I strongly believe we narrow down our intent to a purpose, rather than going on a blind date with the English dictionary. This pairing of two terms says a thousand words, removing the need for much reading or explanation.

The process of picking a sample, or a set of such samples, is known as sampling, and data science has been blessed with various kinds of sampling techniques. It would be a diversion to discuss each of them now; however, it is worth mentioning that every technique has its own design and reasoning. Which technique to choose to reach the desired guesstimate depends on the person concerned, and no rule is set in stone. That being said, I would like to share my own opinion of the common techniques with my readers. Each of them has been rated on four characteristics (i.e. implementation, representativeness, cost, and randomness). The dispersion and spatial distribution of the population are responsible, to an extent, for the variation in these four characteristics.

Tableau Graph — Quadrant of Sampling techniques (Source: Self)

The further we move from left to right, the easier the technique is to implement. The further we move upward, the less precisely the technique represents all traits of a population. The size of a circle represents the cost of execution: the bigger the circle, the greater the cost. Finally, red marks the techniques that involve a degree of probability or randomness in selecting the samples, whereas blue marks non-probabilistic or judgmental techniques in which no randomness is involved. We will deal with only one technique, Simple Random Sampling, from the second quadrant. I reiterate that the above view is based purely on my own understanding rather than prejudice. I do not claim that it carries the endorsement of any national or international statistical body, so do keep this in mind before accepting it blindly. It may or may not match your version of the story.
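Since Simple Random Sampling is the technique carried forward here, a minimal sketch in Python may help make it concrete. This is not from the original post: the frame of 1,000 ID numbers, the sample size of 50, the seed, and the function name `simple_random_sample` are all assumptions chosen for illustration. The idea is that units are drawn without replacement, so that every subset of the chosen size is equally likely to be picked.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame, without replacement.

    Every subset of size n has the same chance of being selected,
    which is the defining property of simple random sampling.
    """
    rng = random.Random(seed)  # a seed only makes the draw reproducible
    return rng.sample(frame, n)


# Hypothetical frame: ID numbers for a target population of 1,000 units.
frame = list(range(1, 1001))

# Hypothetical sample size of 50.
sample = simple_random_sample(frame, n=50, seed=42)
print(len(sample), "units drawn, all distinct:", len(set(sample)) == len(sample))
```

The standard library call `random.Random.sample` is used because it already samples without replacement; drawing with replacement would be a different design.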

Now allow me to clarify why I put “ignorance” and “balance” in front of you right in the subtitle. Firstly, despite the artistry with which the judge takes a bite to evaluate the contestant, there will be some margin of error, because finishing the entire dish is impossible in the allotted time and is always left out. Secondly, setting that error aside, and despite the judge’s expertise, the contestant may have missed ingredients or used techniques that do not fit the theme or objective of the competition. The judge, too, may err in interpreting the contestant’s submission. The second kind of error may be referred to as a Non-sampling error, whereas the first is a sampling error. Irrespective of both types of error, a balance is needed to keep the show going and to allow more and more participants to go through the process. It is impossible to answer a guesstimate question without a small margin of error. I may or may not win a prize, but the process has to survive by tolerating the impact of errors.

Examples of Non-sampling error

To keep you from dozing off, instead of going through the definitions of the different kinds of Non-sampling error, let us look at a few examples of how such errors can arise.

Tableau Graph — Non-response bias (Source: Self)

The above is a classic case of Non-response error, which is a common kind of Non-sampling error. The higher percentage of non-voters than voters in the 18–29 age group may raise questions about awareness of voting rights from the early high school days, which in turn may affect the voting patterns of taxpayers.

Tableau Graph — Non-truthful bias (Source: Self)

The above graph shows the responses from all data analysts employed by an organization named “Life of Data Science”. The responses came from a survey conducted by the CTO to gauge the scale of the planned VBA training. Everything looks broadly fine, except that a good 5% of data analysts say they don’t use Excel at all in their job role. This sounds potentially untruthful, since the organization knows very well that each data analyst has renewed his/her annual Excel license.

Tableau Graph — Measurement Bias (Source: Self)

The above graph shows the responses of applicants to a job post from a company that explicitly requires a Programmer. Options like Usually, Sometimes or Rarely do not give a clear qualifying picture to the recruiter, who needs to know simply whether a candidate programs or not. Therefore the wording of the question may not truly capture the recruiter’s intention, causing a measurement bias.

Tableau Graph — Voluntary response bias (Source: Self)

Students who graduated last year had always wanted a hangout area inside the campus where students who are relatively new to the college could spend time with seniors to discuss the course curriculum and extracurricular activities. Many presentations and much convincing later, they got permission from management to conduct a survey, based on which the establishment cost would be decided. Second-year students did everything to keep the survey fair but missed the critical need to promote this objective among the current 1st-year students. The result says that 90% of the 1st-year students who voted do not want a night-time cafe, even though 1st-year students were the original target audience last year. The current 1st-year students who chose to respond may therefore differ in some respects from the potential respondents as a whole, causing a potential Voluntary response bias.

Population parameter, Sample Statistic

Population vs Sample Traits (Simpson Image Source)

We don’t have the foggiest idea which stage of human evolution each trait of the Simpson child came from; nevertheless, those traits allow us to infer certain traits of mankind.

A characteristic of a population is described by a measure known as a Population parameter. This value always exists but usually remains unknown in real-life scenarios, except in some experimental cases. A characteristic of a sample is described by a measure known as a Sample statistic, which later qualifies as a Point estimate. We can use the sample statistic to draw certain conclusions about the population. The mean and the standard deviation are two such measures common to both Population parameters and Sample statistics. Let us consider another example to discuss the above a little more.
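Before that, a minimal sketch in Python (not from the original post) may help fix the distinction between a Population parameter and a Sample statistic. Everything here is an assumption for illustration: the population is simulated as 10,000 values with a mean near 170 and a standard deviation near 10, and the sample size is 100. In a real study the population values would not be available, which is exactly why the sample statistic serves as a point estimate.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated measurements.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Population parameters: normally unknown, computable here only
# because the whole population was simulated.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population standard deviation

# Simple random sample of 100 units, drawn without replacement.
sample = rng.sample(population, 100)

# Sample statistics: point estimates of the parameters above.
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)

# Sampling error: the gap between the point estimate and the parameter.
sampling_error = x_bar - mu

print(f"population mean = {mu:.2f}, sample mean = {x_bar:.2f}")
print(f"population sd   = {sigma:.2f}, sample sd   = {s:.2f}")
print(f"sampling error of the mean = {sampling_error:.2f}")
```

Note the two different standard deviation functions: `statistics.pstdev` divides by the population size, while `statistics.stdev` divides by one less than the sample size, which is the usual choice when a sample is used to estimate the population value.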
