Exploring Null Hypothesis Significance Testing (NHST)

NHST1

Summary

In this lecture, Erin Heerey introduces null hypothesis significance testing (NHST). The talk covers both statistical and methodological issues and presents NHST as a core part of the frequentist approach in psychological research. It traces the history of NHST, including Fisher's 'Lady Tasting Tea' experiment, which illustrates how p-values quantify statistical surprise and support decisions in scientific experiments. Using coin flips and the question of whether a coin is biased, the talk explains the probabilistic foundation of NHST: decisions rest on how probable the observed data would be if chance alone were at work.

Highlights

Dive into the basics and applications of NHST, a fundamental statistical method in psychology and beyond 📘.
Explore historical anecdotes like the 'Lady Tasting Tea' experiment to understand p-values better ☕.
Discover the debate and methodologies surrounding the p-value and its role in scientific accuracy and credibility ✅.
Understand how surprise and probability intertwine to drive sound scientific conclusions 🔍.
See how past thinkers laid the groundwork for our modern statistical interpretation and hypothesis testing 📚.

Key Takeaways

NHST is a pillar of frequentist statistics used to determine the significance of experimental results 📊.
The famous 'Lady Tasting Tea' experiment by Fisher illustrates how very low p-values can indicate non-random phenomena ☕.
Key statistical concepts such as distributions, estimators, and probability come together in NHST to test hypotheses 🎲.
Despite other methods like Bayesian analysis gaining popularity, NHST remains a critical approach in psychological studies 📚.
Understanding and quantifying 'surprise' through probabilities is central to NHST and decision-making in research 🤔.

Overview

Erin Heerey's lecture on NHST opens with an exploration of both statistical and methodological issues, detailing NHST as a core frequentist method used in psychology. It explains the importance of this traditional method in making statistical decisions about experimental outcomes.

The lecture delves into the p-value, a crucial component of NHST, with historical anecdotes such as the 'Lady Tasting Tea' experiment that explain how it helps quantify statistical surprise and significance. With vivid examples and historical context, Heerey highlights how NHST helps determine whether observed phenomena are due to chance or a specific cause.

Understanding NHST involves weaving together various statistical elements like estimators, probability, and distributions. Despite the rise of methods like Bayesian analysis, NHST remains essential, providing a framework for hypothesis testing and understanding the probability of results occurring due to random chance.

Chapters

00:00 - 00:30: Introduction to Null Hypothesis Significance Testing (NHST) The chapter introduces Null Hypothesis Significance Testing (NHST), focusing on both statistical and methodological issues. Statistically, it covers distributions, statistics, and conducting tests of the null hypothesis. Methodologically, it delves into the theoretical background relevant to these topics.
00:30 - 01:00: Statistical and Methodological Issues in NHST The chapter focuses on statistical and methodological issues within Null Hypothesis Significance Testing (NHST), a frequentist statistical method. It addresses fundamental concepts such as posing hypotheses and study design. NHST is widely used in psychology to make statistical decisions about experimental results, and it underpins most of the psychological literature.
01:00 - 01:30: Frequentist Statistics and Decision Making The chapter discusses the prevalent use of frequentist statistics in decision making, particularly in hypothesis testing. While Bayesian analysis is gaining ground, the frequentist paradigm remains dominant. The focus is on using data to address research questions and test hypotheses, with an emphasis on decision-making processes in statistical analysis.
01:30 - 02:00: Foundations of NHST: Descriptive Statistics and Probability The chapter covers the foundational concepts for Null Hypothesis Significance Testing (NHST). Key topics include descriptive statistics such as the mean, standard deviation, and variance. It discusses the qualities of good estimators and how estimates derived from sample data can be confidently applied to populations. The chapter also covers various probability distributions with a focus on the normal distribution, alongside a general discussion on probability.
02:00 - 02:30: Bringing the Pieces Together The chapter recaps probability as a measure of how certain we are that an event will occur or that a decision is right. It then turns to sampling, descriptive statistics, the probability of drawing a particular sample, and estimators, and notes that NHST is where all of these elements come together.
02:30 - 03:00: Quantifying Surprise with Coin Flips The chapter describes the earlier topics as foundation blocks for null hypothesis significance testing, which occupies the rest of the course. It introduces the idea of quantifying 'surprise' with coin flips: a coin landing heads twice in a row has a probability of 25%.
03:00 - 03:30: Runs of Heads and Falling Probabilities The chapter discusses how surprising consecutive heads are when flipping a coin. Two heads in a row, with a probability of 0.25, is not surprising. As the run gets longer the probability falls and the outcome becomes more surprising: five heads in a row has a probability of about 3%.
03:30 - 04:00: When a Run of Heads Suggests a Biased Coin The chapter continues the coin example. Ten heads in a row has a probability of roughly 0.001, and 20 heads in a row is so improbable that we would start to suspect the coin is biased, for example that it has two heads. A fair coin could still produce such a run, but only very rarely.
04:00 - 04:30: Surprise Under the Null Hypothesis The chapter explains that a long run of heads raises suspicion that the coin flip is rigged, and that null hypothesis significance testing quantifies this kind of surprise. It introduces the question of how surprising the results are if the null hypothesis is true, with a promise to explain what a null hypothesis is.
04:30 - 05:00: The Logic of NHST The chapter sets out the logic of null hypothesis significance testing. It focuses on how surprising the observed data would be if the null hypothesis were true, that is, on the probability that the outcome was obtained by pure random chance.
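The coin-flip figures quoted in the chapters above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a fair coin and independent flips; the function name `prob_all_heads` is a hypothetical helper introduced for illustration and does not come from the lecture.

```python
from fractions import Fraction


def prob_all_heads(n_flips: int) -> Fraction:
    """Probability that a fair coin lands heads on every one of n_flips flips.

    Assumes independent flips with P(heads) = 1/2 (an assumption of this sketch).
    """
    return Fraction(1, 2) ** n_flips


# Run lengths mentioned in the lecture: 2, 3, 5, 10 and 20 heads in a row.
for n in (2, 3, 5, 10, 20):
    p = prob_all_heads(n)
    print(f"{n:>2} heads in a row: p = {float(p):.7f} (1 in {p.denominator:,})")
```

The probability halves with every extra flip, which is why a short run of heads is unremarkable while a run of 20 (about one in a million) makes a rigged coin the more plausible explanation.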

NHST1 Transcription

00:00 - 00:30 so the next lecture we're going to be talking about null hypothesis significance testing so we're going to be talking both about statistical issues and about methodological issues the statistical issues will include distributions about statistics we'll be talking about how to make tests of the null hypothesis and our methodological issues will relate to things like the theoretical background
00:30 - 01:00 that we're thinking about we'll be talking about posing hypotheses and we'll be talking about study design so what is null hypothesis significance testing which you'll see abbreviated in this lecture as NHST so null hypothesis significance testing is a form of frequentist statistics and it's a traditional method for understanding and making statistical decisions about experimental results it's a formula we often use in psychology and in fact most of the literature that we read will be based on
01:00 - 01:30 null hypothesis significance testing although there are also new methods that are starting to take root for example many more people are using Bayesian analysis than they used to we're still a long way from supplanting the frequentist paradigm so decision making in statistics is about how we use data to answer a research question and test a hypothesis and what we're going to be doing when we think about decision making and statistics is thinking about all the
01:30 - 02:00 pieces we've covered so far so we've talked about descriptive statistics like the mean and the standard deviation and the variance we've talked about estimators what makes a good estimator how we make estimates that we can be confident in using our sample data to make estimates of the population we've talked about distributions we've talked a lot about the normal distribution but we've also talked about some other ones as well we've talked about probability how
02:00 - 02:30 certain we are that an event is likely to occur or how certain we are in the decision that we're making and we've also talked about specific sampling so when we're thinking about sampling and we're thinking about descriptives and we're thinking about the probability with which we drew a particular sample we're thinking about the estimators that we calculate this is the part where all of those pieces all of those elements
02:30 - 03:00 come together and each one of these is a foundation block for null hypothesis significance testing which is what we are going to be doing for the rest of this course so one of the things that we're doing in null hypothesis significance testing is we're quantifying the idea of surprise so think about a coin flip if you had a coin flip that landed heads and then it landed heads again the probability of that is 25 percent you
03:00 - 03:30 wouldn't be very surprised if you flipped a coin and got heads you flipped the same coin again and it also landed heads it wouldn't be surprising three in a row the probability of that is 0.125 what about five in a row the probability of getting five heads in a row starts to go down it starts to become surprising it happens only about three times out of a hundred um if we're thinking about 10 times in a
03:30 - 04:00 row the probability there is 0.0009766 so this is a really low probability and if we wanted to consider the probability of getting 20 heads in a row we would probably start to think maybe there's something the matter with this coin maybe this coin has two head sides and zero tail sides um if all we were getting was heads in our coin flips there's still a chance that if it's a fine and fair coin an unmanipulated coin there's still a
04:00 - 04:30 chance that we could get all those heads but most of the time we would be very surprised at that by the time we got to this point where we were getting this many heads in a row or even this point where we were getting this many heads in a row we would start to wonder whether that coin flip was rigged and what we're doing when we do null hypothesis significance testing is we're sort of quantifying this idea of surprise how surprised are we by these results if the null hypothesis is true and we'll talk about what that null hypothesis is in a minute but if there's
04:30 - 05:00 nothing going on if our theory is false how surprised would we be to get the data that we got that's what we're doing when we do null hypothesis significance testing is quantifying that surprise so the logic of null hypothesis significance testing looks like this it relies on the probability that an outcome has been obtained by pure random chance if an experimental or study condition is so unusual that it is unlikely to have
05:00 - 05:30 occurred by chance then we begin to suspect that it was caused by something about the experimental or study context so essentially what we do is we compare an observed result to the distribution of values we would see if there was nothing going on if our you know manipulated independent variable had no effect if no intervention had occurred if there was no relationship in the data and so what we're asking is how odd or unusual or surprising is the
05:30 - 06:00 statistic that we found given a world in which that effect or theory or whatever it is that we're testing is false the history of p-value testing is a very long one p-value computations date back to the 1700s John Arbuthnot and Pierre-Simon Laplace studied birth records in London England from 1629 to 1710 and they thought that
06:00 - 06:30 the ratio of male to female births should be about 50 50 right give or take a little bit and what they found was that there were substantially more male babies born in London than female babies and they calculated that there was a one in four uh very large number chance that that ratio could have occurred by chance
06:30 - 07:00 and they of course suggested that Divine Providence was responsible because the ratio of male to female births could not be due to mere chance right this is a really surprising number it's like flipping 20 heads in a row so if there's only a one in four I don't even know what that number is but it's very big four gajillion if there's only a one in four gajillion chance that something could occur if there was
07:00 - 07:30 nothing going on if there was no difference between the ratio of male to female births then what they were suggesting here is it was so unusual that it couldn't have occurred by mere chance now there's another explanation here it's probably not necessarily Divine Providence causing this it might be how people register different kinds of births and at that time it was more likely that male births would be registered than female births so this is really a registration problem and not a
07:30 - 08:00 birth problem um but they also suggested that this is why uh men become so powerful that of course you know God likes boy babies better than girl babies but probably not now the p-value was formally introduced to statistics by a man named Karl Pearson who is the inventor of the chi-square test in 1914 so he studied heredity he studied
08:00 - 08:30 sort of how people inherit traits or how plants inherit traits and he studied he invented a test called chi-square which looks like this we'll talk about this in a future lecture and it's used to evaluate sets of categorical data and it evaluates How likely it is that any observed difference between the sets arose by chance so what you get here is a frequency distribution of certain events observed within a sample and what you're asking is is that observed sample consistent with a
08:30 - 09:00 specific theoretical distribution so what Pearson did was differentiate the test statistic from its distribution and he published a very large book that contained p-values for the chi-square distribution for both different values of chi-squared and different sample sizes and so what you could do is you could calculate your chi-squared and you could look up both your chi-squared and
09:00 - 09:30 your sample size how many individuals were tested in your sample so he was interested in the number of individuals in a particular sample that might carry one genetic trait versus another genetic trait that's coded on the same gene so he was interested in things like that and that's perfectly categorical data so he published great big long tables in which you could look up every
09:30 - 10:00 single value of chi-squared at every single sample size which allowed this idea of a p-value actually to be useful to researchers the p-value itself wasn't popularized until uh another gentleman who we'll get to in a minute so one of the things we want to know in
10:00 - 10:30 Psychology or in any kind of statistical testing is what p-value is significant so we often report P values that's a useful tool for communicating certainty in science it's also a highly misused tool it's a good idea to look for converging and independent evidence there um but when we were thinking about when early psychologists were thinking about what a p-value was and why it might be important um and and how much uncertainty we were willing to take
10:30 - 11:00 you've all heard of this Benchmark right here P equals 0.05 and if P equals 0.05 then that idea what what you're quantifying is the likelihood of an effect being due to chance if there isn't really an effect out there in the existing population so that's what the p-value quantifies so what we're quantifying here is this idea
11:00 - 11:30 of how certain do we need to be so in our lecture on confidence intervals we talked about the 95 percent confidence interval and we talked about that matching up with this p-value of 0.05 so that idea the p-value of 0.05 was popularized by this gentleman Ronald Fisher who suggested that a one in 20 chance P equals 0.05 of a critical value
11:30 - 12:00 being exceeded by chance is a reasonable criterion for inference so 1 in 20 five times out of a hundred we would have a result that was as extreme as or more extreme than a particular critical value given that there was no effect nothing going on in the population given that your hypothesis is false so Fisher's insight was that instead of
12:00 - 12:30 calculating p-values for different values of chi-squared and N he computed values of chi-squared that yielded specific p-values for different Ns so he sort of flipped the tables around so that they became substantially more useful and that allowed computed values of a statistic from samples to be compared against cutoff or critical values against threshold values so we talked about in the confidence interval lecture this kind of idea of an upper and lower boundary the p-value identifies that upper and lower boundary
12:30 - 13:00 or rather it essentially allows us to identify that upper and lower boundary those critical values and if the score you obtain is more extreme than the score of your test statistic where P equals exactly 0.05 if your score is more extreme than that so further out into the tails of the distribution either tail the positive or
13:00 - 13:30 the negative one depending on what you're testing um it's statistically significant and that idea was popularized in a canonical experiment that Fisher called the lady tasting tea so this is the archetypal example of the use of p-values so the woman you see in this rather grainy photo here is Dr Muriel Bristol and she worked in the same laboratory that Fisher worked in and interestingly uh so the
13:30 - 14:00 apocryphal story about this is that one day Fisher was making himself a tea and she also walked into the sort of lunch room and Fisher offered to make her a tea as well and she said you don't make a good cup of tea I'll make my own thanks and Fisher did not think that was a good idea he didn't think she was being very polite and so they had a long-standing discussion about how to make a cup of tea um in Canada we don't do this very well
14:00 - 14:30 um in Britain um the English like their tea a specific way and there are different people who like tea prepared differently so Bristol claims that a good cup of tea is one where you put the milk in first and then you add the tea the hot water versus you make the tea and then you add the milk so that was the debate so Fisher was doing it the
14:30 - 15:00 wrong way Bristol said you gotta do it the other way because it's better that way and Fisher said I bet you can't tell and Bristol said I bet I can and so this little experiment this break room experiment was done it was recorded um and what happened in the break room experiment was that Bristol tasted eight cups of tea all of which had been prepared by Fisher he prepared four of them with first the tea and then
15:00 - 15:30 the milk he made the other four cups of tea with the milk first and then the tea and the order of those cups was randomized and so they were sequentially presented to Dr Bristol in a randomized order after tasting a cup she stated which preparation method he had used was it tea first and then milk or milk first and then tea and interestingly she correctly classified all eight cups of tea now the chances of a perfect classification with eight cups of tea
15:30 - 16:00 are one out of 70 so the p-value was equal to 0.014 if you take one then you divide 70 into it this is the number you get and that allowed Fisher to reject the hypothesis that Dr Bristol has no special tea tasting abilities I'm sure she was very gratified at this result so Fisher emphasized this interpretation of a p-value as denoting the long run proportion of values that are as extreme or more extreme than the observed value if chance was the only
16:00 - 16:30 factor at play so if there's nothing going on so what he could conclude from his experiment was that Dr Bristol does indeed have some ability to differentiate how a tea is made based on her tasting of that tea um and to this day that idea carries on through the null hypothesis significance testing that we do in psychology and we'll continue in the next segment of the video
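The "one out of 70" figure in the tea experiment can be reproduced by counting. The following is a minimal sketch, assuming the usual description of the design in which the taster knows that exactly four of the eight cups were milk-first, so that guessing amounts to picking four cups out of eight. The function name `p_value_all_correct` and the constant `ALPHA` are hypothetical names introduced for illustration.

```python
from math import comb

ALPHA = 0.05  # Fisher's conventional 1-in-20 benchmark mentioned in the lecture


def p_value_all_correct(n_cups: int = 8, n_milk_first: int = 4) -> float:
    """Chance of labelling every cup correctly by pure guessing.

    Assumes the taster knows how many cups are milk-first, so a guess is a
    choice of n_milk_first cups out of n_cups, and only one choice is right.
    """
    n_possible_choices = comb(n_cups, n_milk_first)  # 70 for 8 cups, 4 milk-first
    return 1 / n_possible_choices


p = p_value_all_correct()
print(f"possible choices: {comb(8, 4)}")
print(f"p-value for a perfect score: {p:.3f}")
print(f"below the {ALPHA} benchmark: {p < ALPHA}")
```

Because a perfect score is the most extreme outcome possible, its probability under guessing is also the p-value: no result is "more extreme" than eight out of eight.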