I ran into the central limit theorem many times while studying probability and statistics without quite understanding it, and as a result I had a fundamental lack of clarity when it came to hypothesis testing. Why are we using the normal distribution to talk about the average number of heads in a series of coin tosses? What is so ‘normal’ about tossing a coin? What about those light bulb failure rates? Why are they so faulty, and how do I know they all fall on a bell curve? Maybe the distribution of time to failure looks like a dinosaur tail, so why a bell curve? Maybe I should just get a beer.

So today we’ll understand a few things about the central limit theorem, twiddle around with it with our own hands, and as a result understand a thing or two about hypothesis testing. There are many versions of this theorem, but I will restrict this discussion to the classical central limit theorem, which talks about the mean of independent and identically distributed random variables with finite variance. For a large enough number of such random variables, the distribution of their mean approaches a normal distribution.

Before talking about what the parameters of that distribution would be, I’ll talk about the beauty of this result, which makes it applicable to such a wide range of problems. Remember the dinosaur-tail-looking distribution of time to failure for light bulbs? That may actually be so! But if I sample enough such light bulbs, the mean of their failure times will follow an approximately normal distribution. The same goes for the average number of heads in a sample of coin tosses. You can see how the convergence of all these distributions to the normal distribution is at once frightfully wonderful and useful.

To be a little more specific: if we sample from any probability distribution with mean $\mu$ and finite variance $\sigma^2$, then as the sample size $n$ increases, the distribution of the sample mean tends to a normal distribution with mean $\mu$ and variance $\sigma^2/n$.
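Here is a minimal sketch of that statement, using only Python's standard library. The choice of an exponential distribution with rate 1 as the "dinosaur tail" for light bulb lifetimes is my assumption for illustration (its mean and variance are both 1), and the names `sample_means` and `draw_lifetime` are made up for this example. If the theorem holds, the sample means should cluster around the distribution's mean with a variance close to the distribution's variance divided by the sample size.

```python
import random
import statistics


def sample_means(draw, n, trials, seed=0):
    """Return `trials` sample means, each computed from `n` draws of `draw(rng)`."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]


def draw_lifetime(rng):
    """One skewed 'time to failure': exponential with rate 1 (mean 1, variance 1)."""
    return rng.expovariate(1.0)


n = 50  # bulbs per sample (an arbitrary choice for this sketch)
means = sample_means(draw_lifetime, n=n, trials=10000)

# The individual lifetimes are heavily skewed, but the sample means are not.
print("mean of sample means:    ", round(statistics.fmean(means), 3))
print("variance of sample means:", round(statistics.pvariance(means), 4))
print("variance / n for rate 1: ", 1.0 / n)
```

Try changing `n` to 2 or 5 and plotting a histogram of `means`: the skew of the original distribution is still visible for small samples and fades as `n` grows.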

So we already get an idea of how this may be useful in testing hypotheses, given that the normal distribution is well understood (as compared to dino tails). But before delving into that, let us play around with what we know. Observe, tinker, be silly. The Jupyter notebook in the link below allows you to simulate the toss of a coin and observe how, for larger sample sizes, the distribution of the number of heads in a sample approximates the well-known bell curve. (The number of heads in a sample approaches a normal distribution because that sum is just a constant, the sample size, times the mean. This concept, called the normal approximation to the binomial distribution, can be explored in detail in the sources below.)

Press the play button on the left of the notebook cell to run the tool and observe the animation.

(Opens in a new tab, give it a bit to load the environment)
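If the notebook will not load, the same experiment can be roughly sketched offline. This is not the notebook's code, just a small stand-in under my own assumptions: a fair coin, 100 tosses per sample, and 5000 samples. It compares the simulated share of samples with at most 55 heads against the normal approximation, with the usual half-unit continuity correction.

```python
import math
import random
from statistics import NormalDist


def heads_counts(n_tosses, trials, p=0.5, seed=0):
    """Simulate `trials` samples of `n_tosses` coin tosses; return heads per sample."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n_tosses)) for _ in range(trials)]


n_tosses, p = 100, 0.5
counts = heads_counts(n_tosses, trials=5000, p=p)

# Normal approximation to the binomial: mean n*p, variance n*p*(1-p).
mu = n_tosses * p
sigma = math.sqrt(n_tosses * p * (1 - p))

threshold = 55
empirical = sum(c <= threshold for c in counts) / len(counts)
# Continuity correction: the discrete event "at most 55" maps to "below 55.5".
approx = NormalDist(mu, sigma).cdf(threshold + 0.5)

print("simulated P(heads <= 55):", round(empirical, 3))
print("normal approximation:    ", round(approx, 3))
```

With a small `n_tosses` such as 5, the two numbers drift apart, which is the "large enough sample" caveat in action.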

(Alberto Cairo’s paper ‘Graphics Lies, Misleading Visuals: Reflections on the Challenges and Pitfalls of Evidence-Driven Visual Communication’ guided the analysis below.)

Humans love visual representations of data. A computer may look at long rows of data, or even unstructured data, and draw insights from it. For us humans, though, that information needs to be presented as graphics we can understand, often with various shapes and colors added to drive home a key point. While I’m all for making information and trends visually insightful to humans, we must proceed with caution, as such representations can often be misleading or downright dishonest. I highly recommend reading Cairo’s paper to gain a deeper understanding of this problem.

Here, I’d like to provide a quick analysis of a graph I saw in a Medium article titled ‘Why We Need to Recognize and Consider Organic Foods’ [1].

[1] https://medium.com/@mcmahonadam2/why-we-need-to-recognize-and-consider-organic-foods-f127f69261df

I’m leaving aside the statistical information at the top of the graph, including debates on the relevance of p-values and R-squared goodness-of-fit values, and even the fact that correlation doesn’t imply causation, to focus simply on the visual deception of the graphic.

The deceptive tricks used fall into two categories:

- Representing too much data, which obscures reality
- Using graphic forms in inappropriate ways

The graph claims to plot two different correlations:

- between glyphosate usage and death rates from end stage renal disease
- between the percentage of US corn and soy crops that are GE and death rates from end stage renal disease

What does it show in reality, though? Three time series superimposed on each other in the same chart.

Note that the x axis is time, meaning the graph doesn’t show the correlation between any two series. Instead, it simply shows how three different series of data each change over time!

Need I point out that the series all start at different points in time? For example, death rates from renal disease are plotted from 1985 to 1991 even though nothing is plotted for those years about the supposedly causal glyphosate usage or the percentage of soy and corn crops that are GE.

Now look at the Y axes.

For one, they are both truncated. Also, why are there two axes? Is there a third axis for the % GE soy and corn series? (By the way, how does the same percentage apply to both soy and corn?)

Truncating the Y axis helps to magnify and hence distort the magnitude of change in a series.

For a series (40, 50), say, if the y axis is truncated at 40, the point with value 50 would look like infinite growth from the previous point!
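The arithmetic behind that is simple enough to sketch. The helper below is hypothetical (the name `apparent_ratio` is mine); it computes how much taller the second point is drawn than the first, given where the y axis starts.

```python
def apparent_ratio(first, second, baseline=0.0):
    """Ratio of the drawn heights of two values when the y axis starts at `baseline`."""
    if baseline > first or baseline > second:
        raise ValueError("baseline must not exceed either value")
    first_height = first - baseline
    second_height = second - baseline
    if first_height == 0:
        # The first point sits on the axis, so any rise looks like unbounded growth.
        return float("inf")
    return second_height / first_height


print(apparent_ratio(40, 50))      # axis starts at 0: 1.25, the honest 25% increase
print(apparent_ratio(40, 50, 30))  # axis starts at 30: 2.0, looks like a doubling
print(apparent_ratio(40, 50, 40))  # axis starts at 40: inf
```

The data never changes across the three calls; only the axis origin does.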

Including multiple y axes in a graph is a way to suggest correlations or superimpositions in values that don’t really exist. If I’m allowed to change the scale of the y axis and its origin, I can make almost any two series look like they correlate.
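To see why, here is a small sketch of what giving each series its own truncated axis amounts to: every series gets stretched to fill the full height of the plot, whatever its actual units or magnitude. The two lists are made-up numbers, and `rescale_to_own_axis` is a hypothetical helper that assumes the series is not constant.

```python
def rescale_to_own_axis(series):
    """Map a series onto 0..1, as if its y axis ran exactly from its min to its max."""
    lowest, highest = min(series), max(series)
    return [(value - lowest) / (highest - lowest) for value in series]


# Made-up series with very different magnitudes and relative changes.
small = [3, 4, 6, 7, 9]                  # triples over the period
large = [1200, 1210, 1225, 1240, 1250]   # grows by about 4%

print([round(v, 2) for v in rescale_to_own_axis(small)])
print([round(v, 2) for v in rescale_to_own_axis(large)])
```

Both outputs run from 0.0 to 1.0, so on a dual-axis chart the two lines would sit almost on top of each other, even though one series tripled and the other barely moved.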

To illustrate, I constructed two series of numbers, Random1 and Random2, with one data point per year from 1991 to 2009. Each series is the sum of a random number and a linear time trend.

In the above figure, the two series are plotted against time, with a common Y axis starting at the origin 0.

Above, I’ve included two y axes with truncated origins.

I also hid some of the values of Random1 above, suggesting to a reader at first glance that the sudden appearance of the blue line caused the changes in the orange line.
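For anyone who wants to rebuild two series like these, here is a rough sketch. The slopes, noise ranges and seed are my own assumptions, not the values behind the figures. It also checks the earlier point about time on the x axis: two unrelated series that both trend upward correlate strongly in their levels, so it is worth comparing that number with the correlation of their year-on-year changes, where the shared trend largely drops out.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)  # fixed seed so the sketch is repeatable
years = list(range(1991, 2010))  # one data point per year, 1991 to 2009

# Each series = linear time trend + independent random number (assumed values).
random1 = [2.0 * (year - 1991) + rng.uniform(0, 10) for year in years]
random2 = [0.5 * (year - 1991) + rng.uniform(0, 3) for year in years]

# Year-on-year changes remove most of the shared time trend.
changes1 = [later - earlier for earlier, later in zip(random1, random1[1:])]
changes2 = [later - earlier for earlier, later in zip(random2, random2[1:])]

print("correlation of levels: ", round(pearson(random1, random2), 2))
print("correlation of changes:", round(pearson(changes1, changes2), 2))
```

The random parts of the two series are generated independently, so whatever correlation shows up in the levels comes from both series simply drifting upward over the same years.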

So, in conclusion, graphs are great, but they are worth pondering beyond the initial ‘aha’ moment they might create in us.


I should make a longer post about this one. I’ve had trouble distinguishing between ‘love as a friend’ and ‘something more’. Well, what I have been sure of is the intensity of the love and the limited number of commitments I am capable of, given the laws of physics, the limitations of biology, etc.

People often tell me, “think about the children!” when I talk about progressive attitudes towards love and sex. Well. I’m pretty sure the children have other things concerning them.


Detoxing is nonsensical and sometimes dangerous pseudoscience.

A big thanks to Prof Timothy Caulfield for his series ‘A User’s Guide to Cheating Death’, which debunks health myths. It is available on Netflix in North America, but I hear it is not available in India. Also, check out the links below.
