Hello. In this article, I will talk about how statistics is based on partial (incomplete) information. The statistical concept of sample size will be used.
What is Statistics?
Before we talk about information and (statistical) sample sizes, we need to understand what statistics is. The American Statistical Association (ASA) defines statistics as “the science of learning from data, and of measuring, controlling, and communicating uncertainty”.
Statistics deals mostly with learning from data, modeling data (through probability distributions and regressions for example), and prediction given limited data. The tools required for these tasks include (programming) software such as R, Python, SQL, etc.
“Lies, damned lies, and statistics”
The famous quote above, popularized by Mark Twain (who attributed it to Benjamin Disraeli), suggests that statistics do not contain truth and are often misleading.
I would agree that statistics are sometimes misleading, but I’d also say that they can be truthful. It is mostly about interpretation and belief, which makes statistics a subjective field at times.
When we want information about a large population in the millions, it is too slow and expensive to survey everyone in a country. To get an idea about the citizens of a country or region, we randomly select people (through a survey or questionnaire, for example) and gather data about them. This is a random sample. Once the data is collected, it is analyzed, conclusions are drawn about the sample, and those results may suggest something about the larger region or population.
In (frequentist) statistics, a larger sample size (surveying more people, for example) is considered “better” in theory. A larger random sample gives us more accurate information about the population it was drawn from than a smaller one does.
A certain city has about 24 182 households. You want to get an idea of the average household income in this city. However, it is expensive and slow to survey every household, so you go with a sample of 1000. A sample of 1000 gives us more reliable data than a sample of 10 households. Say the average household income in that sample of 1000 is $66 000 CDN per year. Does this mean that:
- the “true” average household income is $66 000 CDN?
- every household in that city (or in the world) makes $66 000 CDN?
The answer is: not necessarily. Notice I did not say no, since either of those outcomes could occur, but only with a VERY LOW probability. Some households make more than $66 000 and others make less. The region sampled here may also have a higher average household income than, say, a region B whose average household income is $45 000 CDN based on 1000 households.
The point is that you can make judgments about the households you actually sampled in a given region. It is risky to extend those judgments to the whole city of 24 182 households as if they were certain when you did not survey everyone.
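To make the effect of sample size concrete, here is a minimal simulation sketch. The city's real incomes are unknown, so the population below is invented: 24 182 hypothetical incomes drawn from a log-normal distribution whose parameters are my own assumption. The sketch repeatedly draws random samples of 10 and of 1000 households and measures how much the sample average moves around from one sample to the next.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 24,182 household incomes from a log-normal
# distribution. The parameters (11.0, 0.5) are invented for illustration.
POPULATION_SIZE = 24_182
population = [random.lognormvariate(11.0, 0.5) for _ in range(POPULATION_SIZE)]
true_mean = statistics.fmean(population)


def spread_of_sample_means(sample_size, repetitions=500):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)


spread_small = spread_of_sample_means(10)
spread_large = spread_of_sample_means(1000)

print(f"Average income in the made-up population: {true_mean:,.0f}")
print(f"Spread of the sample average, n = 10:     {spread_small:,.0f}")
print(f"Spread of the sample average, n = 1000:   {spread_large:,.0f}")
```

In this made-up setting the average from 1000 households varies far less between samples than the average from 10 households, but it still varies. Even the larger sample gives an estimate of the city-wide average, not the exact figure.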
A Simpler Example
Let’s say you purchase a bag of potato chips which has around 100 chips in it. You try the first 8 chips as a sample. Of those 8, 5 tasted bad and the other 3 were just okay. Does that mean the whole bag of chips is bad based on 5 bad chips out of a sample of 8? Should you throw the bag of chips away?
The answer is unknown until all 100 chips in the bag have been observed. The whole bag may be bad or good. But based on the sample of 8, it is likely that the bag is more bad than good. If you had tried 24 chips (3 times 8) and 15 of them were bad and 9 were just okay, the proportion of bad chips is the same (62.5%), but the larger sample is stronger evidence that the bag is bad, and you may want to throw it away.
Note that because you did not try all 100 chips in that bag, you cannot be certain how many bad chips there are.
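One way to put numbers on this uncertainty is a confidence interval for the share of bad chips. The sketch below uses the Wilson score interval, which is my choice and not something the example requires. It also assumes the chips you tried behave like a random sample from a large batch, which is a simplification for a bag of only 100 chips where you ate the first 8 off the top.

```python
import math


def wilson_interval(bad, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion of bad chips."""
    p_hat = bad / n
    denominator = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denominator
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denominator
    )
    return center - half_width, center + half_width


low8, high8 = wilson_interval(5, 8)      # 5 bad chips out of 8
low24, high24 = wilson_interval(15, 24)  # 15 bad chips out of 24

print(f"8 chips tried:  between {low8:.0%} and {high8:.0%} of the bag may be bad")
print(f"24 chips tried: between {low24:.0%} and {high24:.0%} of the bag may be bad")
```

Both samples have the same share of bad chips, yet the interval from 24 chips is narrower than the one from 8. Neither interval shrinks to a single number, which is the point: without trying every chip, a range is the best you can report.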
Statistics in News
I usually watch the news with the family around dinner to eat and converse about daily world news and such. From time to time you get economics news with headlines such as “Ten percent of Canadians make less than $24 000 CDN a year!”. I hate this type of headline. Why?
I dislike it when the media treat statements like the one above as fact without mentioning the sample size or how many people were surveyed. It is statements like this that support Mark Twain’s “Lies, damned lies, and statistics” quote. They are misleading and may not be true.
If the headline read “Based on a poll of 1000 Canadians, ten percent make less than $24 000 CDN a year.”, it would be a bit more honest. However, a sample of 1000 Canadians is far smaller than the population of Canada, which is at least 30 million. In addition, it is not known which provinces and/or cities these 1000 Canadians come from. They could be from Toronto, Montreal, Vancouver, Halifax, Alberta, etc. The survey method may not be specified either.
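As a rough sketch of what the missing fine print could look like, the code below computes an approximate 95% margin of error for “ten percent” from a poll of 1000 out of 30 million people. It assumes a simple random sample of the whole country, which is exactly what the headline does not tell us. The function name and the comparison with a poll of 100 are my own additions for illustration.

```python
import math


def margin_of_error(p_hat, n, population=None, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample. If a population size is given,
    the finite population correction is applied.
    """
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    if population is not None:
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error


moe_1000 = margin_of_error(0.10, 1000, population=30_000_000)
moe_100 = margin_of_error(0.10, 100, population=30_000_000)

print(f"Poll of 1000: 10% plus or minus {moe_1000 * 100:.1f} percentage points")
print(f"Poll of 100:  10% plus or minus {moe_100 * 100:.1f} percentage points")
```

Under that assumption, the figure from 1000 people carries an uncertainty of roughly two percentage points, and a poll of 100 is much less precise. If the 1000 people were not chosen at random from across the country, this calculation does not apply, so the survey method matters as much as the sample size.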
A Political Poll Example
I dislike how polls are used in political campaigns. Too much importance is placed on them. The sample sizes are hard to see on TV because they are in small print, and they are usually in the low thousands, which is far less than the population of a country (Canada or the USA, for example).
The news would say candidate A is in the lead with 37% and candidate B is in second with 28%, and sometimes conclude that candidate A is winning or (worse yet, in my opinion) that candidate A will win the election, based on a small sample size.
The candidate who wins the election is the one who comes out ahead when the thousands or millions of votes are counted, not the one who wins the polls. Polls give a (very) small idea of who is “winning” or has “momentum” and are not perfect predictors of future events.
- Be aware of partial and incomplete information.
- One must be careful in making judgments given limited data (a small sample).
- The sample size matters. A larger sample size gives you a better idea about the sample and the population.
- There are times when you do not have the full truth but only a part of it.
The image is taken from http://data.library.virginia.edu/statlab/.