Can you taste the difference?

The Tea Lady Experiment - Numberphile

Estimated read time: 1:20

Summary

The Numberphile video discusses hypothesis testing through a unique blend of storytelling and scientific exploration, focusing on a classic experiment: the Lady Tasting Tea. The video explains how hypothesis testing works using an illustrative example of determining if someone can tell whether milk or tea was poured first just by tasting. This leads to a broader discussion on statistical significance, random guessing, and permutation tests applied in research. With engaging narration, the video enhances understanding of complex statistical concepts through real-world applications and historical experiments.

Highlights

The video opens with hypothesis testing, motivated by a research project on modeling infectious disease spread on networks. 🦠
Discussions lead to the Lady Tasting Tea experiment, a classic tale of hypothesis testing in action. 🍵
Hypothesis testing helps determine if what you observe (like guessing tea orders correctly) is just random or significant. 📈
A test statistic helps decide whether to reject a null hypothesis based on observed data and its distribution. ⚖️
The concept of statistical significance (like a 5% cutoff) guides decisions in scientific testing. 📊

Key Takeaways

Hypothesis testing is a method to determine if results are significant or due to random chance. 📊
The Lady Tasting Tea experiment illustrates how we test a claim scientifically. 🍵
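A hypothesis test either rejects the null hypothesis or fails to reject it; failing to reject is not proof that the null is true. 🔍
The significance level is often set at a small value like 5%: the null is rejected only if a result at least as extreme as the observed one would occur with at most that probability under the null. 📉
Permutation tests build a reference distribution by enumerating the outcomes that are equally likely under the null hypothesis and computing the test statistic for each one. 🔄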
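The tea experiment described in the transcript can be reproduced by brute force. The following is a minimal sketch, not Fisher's own procedure as written: the cup labelling (cups 0 to 3 poured milk-first) and the function names are assumptions made for illustration. It lists every way of calling four of the eight cups "milk first", scores each against the truth, and sums the tail at or above the observed score.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

CUPS = range(8)
# Hypothetical labelling: cups 0-3 really had milk poured first.
TRUTH = frozenset({0, 1, 2, 3})


def num_correct(guess):
    """Number of cups classified correctly when `guess` holds the four cups called milk-first."""
    hits = len(TRUTH & set(guess))
    # Each milk-first cup identified correctly leaves one tea-first cup correct as well.
    return 2 * hits


# Under the null (random guessing) all 8-choose-4 = 70 guesses are equally likely.
null_counts = Counter(num_correct(g) for g in combinations(CUPS, 4))


def p_value(observed):
    """Probability under the null of a score at least as large as `observed`."""
    total = sum(null_counts.values())
    extreme = sum(n for score, n in null_counts.items() if score >= observed)
    return Fraction(extreme, total)


if __name__ == "__main__":
    print(sorted(null_counts.items()))
    print(p_value(6), float(p_value(6)))
    print(p_value(8), float(p_value(8)))
```

With a 5% cutoff, six correct out of eight (17/70, about 24%) would not lead to rejecting the null, while eight out of eight (1/70, about 1.4%) would.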

Overview

In this intriguing dive into the world of hypothesis testing, Numberphile unravels the complexities of determining statistical significance using real-world examples. The video delves into a fascinating case study involving infectious disease spread models and explores how these models are tested against hypotheses. 📊

Moving into more familiar terrain, the video revisits the famous Lady Tasting Tea experiment by Sir Ronald Fisher. This whimsical yet scientifically profound tale illustrates the principles of hypothesis testing: determining if perceived abilities or observations are valid or just flukes. 🍵

With a mix of historical context and mathematical explanation, the video navigates through the intricacies of permutation tests and statistical thresholds. Whether in the field of epidemiology or enjoying a cup of tea, understanding how hypothesis testing works is made both engaging and educational. 📚
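The network version of the permutation test can be sketched the same way. The transcript does not give the actual graph, so the edge list `ALT_EDGES`, the node numbering, and the function names below are hypothetical. The test statistic is the one described in the transcript: the number of edges of the alternative-hypothesis graph that join two infected nodes. Under the null (a fully symmetric, completely connected graph) every set of three infected nodes out of five is treated as equally likely.

```python
from fractions import Fraction
from itertools import combinations

NODES = range(5)
# Hypothetical structure for the alternative hypothesis; not taken from the video.
ALT_EDGES = [(0, 1), (1, 2), (3, 4)]


def edges_within(infected):
    """Count alternative-graph edges whose endpoints are both infected."""
    infected = set(infected)
    return sum(1 for u, v in ALT_EDGES if u in infected and v in infected)


def permutation_p_value(observed_infected):
    """Return the observed statistic and the share of equally sized infected sets scoring at least as high."""
    observed = edges_within(observed_infected)
    stats = [edges_within(s) for s in combinations(NODES, len(observed_infected))]
    extreme = sum(1 for s in stats if s >= observed)
    return observed, Fraction(extreme, len(stats))


if __name__ == "__main__":
    stat, p = permutation_p_value({0, 1, 2})
    print(stat, p)
```

With only five nodes the reference distribution has just ten outcomes, which is why the transcript notes that this small case "might not be super interesting"; the same enumeration applies to larger graphs, although the number of subsets grows quickly.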

Chapters

00:00 - 02:00: Introduction to Hypothesis Testing The chapter introduces the concept of hypothesis testing, which involves comparing a null hypothesis with an alternative hypothesis. It is illustrated with a research project example on modeling infectious disease spread on networks. The explanation begins by describing the nature of null and alternative hypotheses, where the null hypothesis is considered the default or less exciting option, while the alternative represents the outcome of interest. The chapter also touches upon the typical framing involving data samples drawn from a distribution, providing the foundational context for hypothesis testing.
02:00 - 04:00: Null and Alternative Hypotheses The chapter discusses the concept of null and alternative hypotheses in statistical testing. It starts by describing a normally distributed dataset with an unknown mean (mu) and a known variance of one. The importance of hypothesis testing is highlighted, particularly setting the null hypothesis, which assumes the mean mu is less than zero, against the alternative hypothesis, which posits that mu is greater than zero. Emphasis is placed on the significance of these hypotheses in understanding the central tendency and testing assumptions about the dataset's distribution.
04:00 - 06:00: Test Statistic and Asymptotic Properties The chapter discusses the concept of hypothesis testing, using the example of determining the average height of four-year-olds in the country. It illustrates how a null hypothesis is set up: here the null is the "boring" claim that four-year-olds are short (average height at most 36 inches), and the alternative is that the average is greater than 36 inches. The discussion includes the motivation behind hypothesis testing, such as challenging a popular belief, in this case the belief that four-year-olds are short.
06:00 - 08:00: Understanding Permutation Test The chapter titled 'Understanding Permutation Test' explains the concept of using permutation tests in hypothesis testing. It starts by setting up the null hypothesis, often representing a 'boring' assumption such as a particular group being short. In contrast, the alternative hypothesis suggests a different outcome, such as the group being tall. The test statistic is introduced, which in this example involves the average height of a specific age group (e.g., four-year-olds). The decision to reject the null hypothesis is based on a certain rule, which utilizes properties of the test statistic. The approach described leans on asymptotic properties for making these decisions.
08:00 - 15:00: Fisher's Lady Tasting Tea Experiment The chapter discusses Fisher's 'Lady Tasting Tea' experiment, which illustrates the concept of statistical distributions and hypothesis testing. It explains the term 'asymptotic': as the sample size (n) grows, the test statistic tends to a known distribution. The text highlights that if the data were exactly normally distributed, there would be no need to appeal to asymptotic properties, because the test statistic would already have the required distribution. In reality, a variable like human height cannot be exactly normal, because a normal distribution assigns a small probability to negative values. Some other distribution is therefore more realistic, and the hypothesis being tested is whether its mean is above or below a certain threshold. The narrative also touches on calculating test statistics in this context.
15:00 - 18:00: Statistical Significance and Permutation Testing in Research This chapter discusses the general asymptotic properties that apply to any distribution, focusing on hypothesis testing. The process involves choosing a large sample size 'n', which ensures that the test statistic behaves like a scaled normal distribution. The decision to reject the null hypothesis (h0) is based on comparing the test statistic value (t of x) against a threshold value (c) chosen beforehand. If t of x exceeds this threshold, it indicates rejection of the null hypothesis. This approach is linked to statistical significance and permutation testing in research.
18:00 - 19:30: Acknowledgment and Conclusion The chapter "Acknowledgment and Conclusion" reflects on the standard methods of hypothesis testing, specifically the handling of average heights as evidence. It explores the conditions under which the null hypothesis (H0) is rejected or not, based on a threshold value (C). Furthermore, the chapter touches upon a research project with a PhD student concerning disease spread over networks, conducted prior to the COVID-19 pandemic.
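The threshold rule summarized in these chapters (reject H0 when T(x) exceeds a cutoff c) can be sketched for the height example. This is an illustration under stated assumptions rather than anything shown in the video: the standard deviation `sigma` is assumed known, the sample values are made up, and the function name `one_sided_z_test` is hypothetical. The cutoff c comes from the normal quantile that a lookup table would supply.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sided_z_test(samples, mu0, sigma, alpha=0.05):
    """Test H0: mean <= mu0 against H1: mean > mu0, assuming a known standard deviation."""
    n = len(samples)
    t = mean(samples)  # test statistic T(x): the sample average
    # Threshold c: for large n the sample mean is roughly normal with sd sigma / sqrt(n).
    c = mu0 + NormalDist().inv_cdf(1 - alpha) * sigma / sqrt(n)
    return t, c, t > c


if __name__ == "__main__":
    # Made-up heights in inches, for illustration only.
    heights = [38.0, 39.5, 37.0, 40.0]
    t, c, reject = one_sided_z_test(heights, mu0=36.0, sigma=2.0)
    print(t, round(c, 3), reject)
```

If T(x) is less than or equal to c, the null hypothesis is not rejected, matching the rule stated in the transcript.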

The Tea Lady Experiment - Numberphile Transcription

00:00 - 00:30 so I wanted to talk about hypothesis testing so maybe I'll talk about a research project I did several years ago that had to do with sort of modeling infectious disease spread on networks so hypothesis testing right you have a null hypothesis and you have an alternative hypothesis is it more exciting if the alternative comes true or Yeah null is more like boring the way that this is usually framed it's like you have you have data right maybe you have some some data samples x1 up to xn and they come from some distribution and maybe your
00:30 - 01:00 distribution is normal with mean mu and variance one so what that looks like is you know if you if you draw the density function of your data from which it's sampled then the the distribution is centered around mu but you don't know what that is so normally you would want to do some kind of hypothesis test that maybe it says well the null hypothesis is that mu is less than zero and the alternative is that mu is bigger than zero so what could that be that like
01:00 - 01:30 yeah so this could be like the average height of um four year olds in the country right and maybe it's not that mu is less than zero um maybe the the null hypothesis is that oh I'm American so I want to say like 36 in so let's say 36 in and the null is that it's bigger than 36 in okay and and maybe you want to do this because well someone tells you okay I think four-year-olds are short and you you say "Well I want to prove to you that four-year-olds are actually quite
01:30 - 02:00 tall right?" So that sets up what's the null so the null is like the boring okay they're they're short and then the alternative is that they're they're tall so then your your test statistic so let's say you get data from n people and and four year olds and four year olds exactly and and what you would do is you would say well I'm going to take the average of their height right so this is the test statistic you might say it's t of x right and then there will be some rule for for whether or not you reject the null hypothesis the idea is to use some properties of some asymptotic
02:00 - 02:30 properties asymptotic just means that well if n is large then it has a certain distribution if your data were normally distributed then it you wouldn't need to worry about asymptotics it would exactly have the distribution you want um a normal distribution might not be the best distribution for people's heights because there's some small chance that it's negative right so in reality there's some other common distribution of which you want to test is the mean smaller than something or bigger than something um and then you would calculate the test statistic and using
02:30 - 03:00 these general asymptotic properties that hold for any distribution you might choose right then you will reject h0 let's say by looking up a normal distribution table so the idea is that for large n t of n will will behave like a scaled normal distribution and then you look up the table and it says well if if t of x is bigger than let's say some some c you choose some some threshold c and if t of x is bigger than c then that means that you reject h0
03:00 - 03:30 because you have evidence that your average of the heights was quite large so you you thought that it was not true and if T of X is less than or equal to C then you then you do not reject H0 so that that's sort of the the standard way that hypothesis testing is is studied in school in some work I did um with my first PhD student about eight years ago I think we were interested in understanding disease spread um over a network this is before COVID but and and
03:30 - 04:00 I I want to qualify this by saying that we're theoreticians so the models as you'll see are quite simplistic right and so we have a null hypothesis and we have an alternative hypothesis but the data that we observe well the null hypothesis the boring hypothesis that you know I have these individuals and they're all sort of symmetrically related to each other in graph theory you would call this a completely connected graph and your alternative hypothesis is that there's something interesting that's happening and with our theory you know we can't
04:00 - 04:30 really cover all alternatives but maybe we have a certain structure that we're trying to test whether that was the case and for whatever reason we want to to have some evidence for saying do I like do I think that this configuration was more likely than this one based on the data so then you have to ask well what is your data the way that we set it up the data is just one observation of for these five individuals who was infected so maybe it's a a group of classmates at school and you know that these three
04:30 - 05:00 were infected and you want to understand do I have some statistically rigorous conclusion for choosing one hypothesis or the other so the issue here is that unlike the standard statistical hypothesis testing you don't have n observations I mean you sort of do but the n people are are different people in your in your class um and you you also can't really talk about asymptotic properties necessarily because maybe your graph only has five people and that's not very large so what we ended up doing was some notion of what's
05:00 - 05:30 called a permutation test this is one version of um resampling method and I'll I'll sort of explain to you what we did um in order to come up with some version of a test statistic and some rule for whether or not we want to um reject or not the null hypothesis is the problem here that you just you don't have lots of information you haven't got lots of data is that the problem you're trying to deal with yeah I guess the problem is that it's it's just not really clear what you should do but yes if you had many copies of the same thing then you
05:30 - 06:00 would probably calculate something and average it out right but basically you just don't have much data you don't you don't you kind of need to come up with some new creative method so so what I'm going to do is I'm going to back up a bit and talk about where permutation testing um first arose in statistics all right so so this is sometimes called Fisher's lady tasting tea experiment and the idea was that there was some lady who claimed that
06:00 - 06:30 they could tell whether the milk was added before the tea or the tea was added before the milk just by tasting it you know I I'd heard of this when I was in America I thought it would all make sense when I moved to Britain and I still don't really get it but all right so so so Fisher wanted to do an experiment to to check this out do you drink tea yeah but I mean I Do you do the milk first or the tea you like we're having I do the milk second but I mean I I would be fine if if you put the milk first I don't think it would bother me and I don't know that I would notice so but this but this person
06:30 - 07:00 said they could right right right so So he set it up as a hypothesis test right so the null hypothesis Well so they were going to test whether she was telling the truth or not yeah yeah whether she was telling the truth and she had some special abilities right so the null hypothesis is like no special powers right and the the alternative hypothesis is that um she has some special powers okay but the question is you know how how are we going to do this and and one thing he could do for instance is he could say well let's just you know give her lots
07:00 - 07:30 and lots of cups of tea and maybe you can check like so you say you have n cups of tea and you want to estimate P which is equal to the probability of guessing right and so then you would say well okay the null hypothesis maybe is that um P is equal to 1/2 or less than or equal to 1/2 let's say and then because you should get it right half the time right right and then the the alternative hypothesis is like P is bigger than 1/2 but that's not what he
07:30 - 08:00 did so you know if you set it up in this way then um you can model the number of correct cups by what's called a binomial distribution you can look up a table and and and check it out so what he did instead was he said "Well let's just take eight cups cuz we don't want to do this forever." And he said that you know four of them have milk first and four of them have tea first so for milk four tea we know which ones they are she doesn't and then she went through and she identified which ones she thought were milk and which ones she thought were tea first so maybe what that looks like is
08:00 - 08:30 maybe she just got one milk and one tea wrong so she she switched those around right because she knows that they're four of each so you can check that um the correct number is either going to be all eight or six or four or two or zero so then what he did was he said "Well now we have our test statistic right so our our test statistic that would just be the number correct that's going to be either zero or two or four or six or eight and then he wants to do some kind of hypothesis test right so he set up
08:30 - 09:00 that null hypothesis was just random guessing and then the alternative hypothesis is that um something that's so better than random guessing okay so you see that the way I've set it up I'm not saying we want to pinpoint what P is right we just want to to do this hypothesis test or do I have evidence to reject the hypothesis that it's completely random so so let's say for for our example that the test statistic that was calculated so since I said T of X before let's just say it's T of X and
09:00 - 09:30 let's say that that's equal to six so she got six right out of eight and if it's eight you would probably say well she must be she must have special powers right but with six without doing a mathematical calculation you don't know right is that good enough or not so then what he said was well what if you know we sort of simulated things and we said well if you if you had this configuration of four M's and four T's and you looked at every possible assignment of four M's and four T's then what does the distribution look like in terms of how many you get right the
09:30 - 10:00 first thing you need to do is you need to think of how many sequences are there of four M's and four Ts right so you could start listing them down and say like m M t then maybe m t and so on right and if you if you do it you'll find that the total number is what we would write in combinatorics is 8 choose 4 so you have eight things and you choose four of them to be the m and that ends up being 8 * 7 * 6 * 5 / 4 * 3
10:00 - 10:30 * 2 * 1 um so these guys cancel out this is a two and you end up with 70 and what what you can do mathematically is you can then sort of calculate you know if so in in the first case right what would the the number correct be well that would be all eight in the second case there were six correct and you can go and enumerate everything I mean there are there are smarter ways to do this than listing everything out you can write down various binomial coefficients
10:30 - 11:00 so then you can draw yourself a little histogram and you'll find that well not surprisingly things are symmetric so there's the zero 2 4 6 and 8 her score her correctness score when she did her taste test yeah the number of of ways to get exactly eight right so not surprisingly that's one out of the 70 possibilities right and similarly um the number of possibilities for for zero right is also 1 out of 70 because if you
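[Editor's sketch, not part of the spoken transcript] One way to "write down various binomial coefficients" instead of listing all 70 sequences is shown below. It assumes the design described above (four milk-first and four tea-first cups, with the taster knowing there are four of each); the function name `ways` is hypothetical. If she picks k of the four true milk-first cups, she must also pick 4 - k of the tea-first cups, which gives 2k cups correct in total.

```python
from math import comb


def ways(correct, cups_each=4):
    """Number of guesses scoring exactly `correct` cups right, with `cups_each` cups of each kind."""
    k, remainder = divmod(correct, 2)
    if remainder or not 0 <= k <= cups_each:
        return 0  # odd scores (and scores out of range) cannot occur in this design
    # Choose k true milk-first cups and fill the rest of the guess from the tea-first cups.
    return comb(cups_each, k) * comb(cups_each, cups_each - k)


if __name__ == "__main__":
    counts = {score: ways(score) for score in (0, 2, 4, 6, 8)}
    print(counts, sum(counts.values()), comb(8, 4))
```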
11:00 - 11:30 get zero right you just flip the t's and the m's and I think if you you add it up it's something like uh you'll get 36 in the middle and you get um 16 here and 16 here okay so so there's a way to to mathematically calculate this which you might have as an exercise in in an in an elementary probability theory class so now the point is that you take what you actually observe and there's her six that she got there's her six that she got right and then you you think about this as your distribution now it's it's a discrete distribution maybe if you
11:30 - 12:00 study hypothesis testing you want me to to draw sort of a curve over it and I can do that right but it's it's like a histogram it's discrete and in hypothesis testing I guess I didn't write this in my earlier picture but the way these tables come from that tell you whether to reject or not the null hypothesis it has to do with um what are called quantiles or percentiles of a distribution and that's like if I chop off my distribution somewhere then how much density is on one side so that's exactly what we'll do we'll say that um what I observed is six right so that's
12:00 - 12:30 what I chop off and and then I want to calculate how much mass is off to the right okay so so in other words well sort of probabilistically you would say so probability um under the null hypothesis that the test statistic is bigger than or equal to 6 is equal to and then you write it down and that's this 16 over 70 + 1 over 70 so mathematically it says if in fact she was completely randomly guessing what's
12:30 - 13:00 the probability that she would get six or more correct and that's kind of the way we do hypothesis testing it's like you say something that or better something that or more extreme and then you would calculate this and this is 17 over 70 which I think is something like 24% what do these percentages mean so usually in statistics you would say that you have evidence to reject a hypothesis if the percentages are something like 5% um something small 5% 2.5% it depends on what you're trying to do if you're trying to detect a rare disease you know maybe you want it to be something like
13:00 - 13:30 2.5% to be very very accurate all right so the conclusion is if it's six or more correct then you probably don't want to really believe the null hypothesis on the other hand you could do the same calculation and and look at t of x bigger than or equal to 8 and that will just be 1 over 70 and that's something very small so with eight cups Mhm the only way you were ever going to believe she had the power is if she got them all right right right right that that was her only chance yeah because
13:30 - 14:00 these were discrete right you know she couldn't there was there was nowhere else she could see right right so if there were more cups then this distribution would get filled in more and you would end up with you know potentially there would be um more little histogram bumps and if you were trying to say like I'm going to chop it off at a certain statistic and and sum up all the little histogram bumps then you would potentially be able to get closer to something like 5% if 5% was your cutoff so there would have been more pillars she could have sat on there right right right what was the number
14:00 - 14:30 you said you need before you accept the hypothesis was it it's usually something so they call it a level of significance and usually it's something like 5% but it depends on on the situation you're in so this is sort of this illustration of in some sense like you only have one observation Fisher's idea was to create more data right so you sort of create more data by saying well well under this null hypothesis of completely random guessing like these are the possible outcomes there could be and then they they kind of categorize themselves into
14:30 - 15:00 different test statistics and then you look at how significant that is so if we we sort of go back to this permutation testing what we did was we said well the first difficulty is to come up with a test statistic right so we want to based on what we saw we want to conclude some number we said that our our test statistic is equal to the number of edges between infected nodes according to the graph in the alternative hypothesis I mean if we sort of
15:00 - 15:30 superimpose this graph on this thing then we're counting these two edges so in this case it's equal to two and the idea is to try to parallel what Fisher did and say well if it was completely random guessing I can create more data sets right I can say on these same five nodes if three were infected then which three were there and the three you know it might be this one this one and this one and in that case if I were to superimpose this graph over here I would find that it's it's a value of zero this case might not be super interesting
15:30 - 16:00 because either it's zero or or I have something in the middle and it's two but in generality right you would end up with some sort of a distribution and it would again be some kind of a discrete distribution you know it might not be a monotone distribution it's just something right then you you sort of decide well where is my test statistic and do I have um enough uh conclusive evidence to to decide in favor of it or not Po-Ling Loh's research is supported by the Leverhulme Trust through one of its prestigious Philip Leverhulme Prizes now one of the
16:00 - 16:30 trust's main focuses is supporting blue-skies research hoping to generate new maybe unexpected breakthroughs which could ultimately benefit society it's an independent charity that's been running for over a hundred years to find out more about what it does how it works who it supports I'd recommend you check out their website which I'll link to in all the usual places our thanks to the Leverhulme Trust for
16:30 - 17:00 what they're doing for all sorts of research and of course for helping make this video possible this would be fair that is if I wanted to generate a number between 1 and 8 if I rolled the octahedron each of these faces is equally likely to come out somehow intuitively we know that you know that's a fair dice which has eight sides and a kid raises his hand in class and said "I have a 30-sided die." You might be justified saying 30 as your guess by the way uh there is actually a justification for that because it
17:00 - 17:30 actually maximizes the probability what's the probability of picking these numbers