Hypothesis Testing & Randomization

NHST5

Summary

In this lecture, Erin Heerey explains hypothesis testing with randomization methods. The core idea is that if the null hypothesis is true, participants could just as well have landed in either experimental group, so randomly reassigning them to groups and recalculating the test statistic shows what results the null hypothesis would produce. Heerey works through the mechanics using a gender discrimination example, in which the gender on each file is reshuffled across bank managers while their promotion decisions stay fixed. Permutation tests, Monte Carlo simulations, bootstrapping, and the meaning of P-values are covered along the way.

Highlights

The lecture emphasizes the efficiency of randomization methods in hypothesis testing 🎙️.
Gender discrimination hypothesis serves as a practical example in understanding random assignments 🚻.
The null hypothesis is always tested, not the research hypothesis 🔍.
Random reassignment simulates redoing the experiment many times under the null hypothesis 🔄.
P-values indicate the extremity of observed results under the null hypothesis 🎯.
Permutation tests, while comprehensive, can be time-consuming compared to randomization 🔄.
Fisher's arbitrary threshold of P=0.05 for hypothesis testing raises questions about its adequacy 📏.

Key Takeaways

Randomization is key to testing hypotheses by simulating various outcomes while assuming the null is true 🔄.
The null hypothesis posits no effect or difference, and experiments try to refute it with evidence 📊.
Understanding the P-value is crucial; it's the probability of obtaining a statistic at least as extreme as the observed one, assuming the null is true 🎯.
Permutation tests are exhaustive but randomization is often preferred due to time constraints ⏳.
Recalculating test statistics with new random assignments helps verify experimental results effectively 🧪.
Monte Carlo simulations and bootstrapping are useful methods alongside randomization for hypothesis testing 🎲.
A P-value of 0.02 means a 2% chance of observing a result at least as extreme as the one obtained if the null is true, which informs whether we reject or fail to reject the null hypothesis 🎰.

Overview

Erin Heerey's lecture on hypothesis testing introduces randomization as a tool for examining hypotheses such as gender discrimination. It recommends completing the bootstrap confidence intervals section first, since that makes the randomization method easier to follow. By simulating different random assignments of participants, researchers can test the null hypothesis, which assumes the independent variable has no effect.

The focus then shifts to reshuffling participants and recalculating the test statistic. Heerey explains that under the null hypothesis, a variable such as gender should not influence decisions like promotions. Statistics recalculated over many random reassignments therefore form a distribution of results expected under the null hypothesis, against which the observed result can be judged.
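
A single reshuffle of this kind can be sketched in Python. This is a minimal illustration, not the lab's code. The counts of 21 promoted men and 14 promoted women are an assumption inferred from figures in the transcript (48 managers, 35 promotions in total, a 29.2 percent observed difference), and the names `decisions`, `gender`, and `promotion_rate_difference` are made up for this example.

```python
import random

# Assumed reconstruction of the data: 48 bank managers, 24 given a file with
# a man's name and 24 a file with a woman's name; 21 men and 14 women promoted.
gender = ["male"] * 24 + ["female"] * 24
decisions = ["promote"] * 21 + ["deny"] * 3 + ["promote"] * 14 + ["deny"] * 10


def promotion_rate_difference(gender_labels, decision_list):
    """Return P(promote | male) - P(promote | female)."""
    promoted = {"male": 0, "female": 0}
    totals = {"male": 0, "female": 0}
    for g, d in zip(gender_labels, decision_list):
        totals[g] += 1
        if d == "promote":
            promoted[g] += 1
    return promoted["male"] / totals["male"] - promoted["female"] / totals["female"]


observed = promotion_rate_difference(gender, decisions)

# One reshuffle: every manager keeps their decision; only the gender labels move.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
shuffled_gender = gender[:]
rng.shuffle(shuffled_gender)
reshuffled = promotion_rate_difference(shuffled_gender, decisions)

print(f"observed difference:   {observed:.3f}")
print(f"reshuffled difference: {reshuffled:.3f}")
```

Shuffling the labels rather than the decisions mirrors the lecture's point that each manager's decision stays fixed while the gender they were assigned changes.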

Finally, P-values come into play: they show how likely it is to obtain a result at least as extreme as the observed one if the null hypothesis were true. Heerey concludes by noting that the P=0.05 threshold set by the statistician Fisher is arbitrary, and asks whether it is sufficient for deciding whether to reject a null hypothesis.
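
In symbols, the one-sided P-value from a randomization test can be written as follows. The letters follow the lecture's pseudocode (y for the observed statistic, x for the K reshuffled statistics); the indicator notation is added here for illustration and does not appear in the lecture.

```latex
p = \frac{1}{K} \sum_{k=1}^{K} \mathbf{1}\left[ x_k \ge y \right],
\qquad \text{for example } p = \frac{2}{100} = 0.02
```

The worked value is the one from the lecture's plot: 2 of the 100 simulated differences were at least as extreme as the observed one.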

Chapters

00:00 - 00:30: Introduction to Randomization Method in Hypothesis Testing This chapter serves as an introduction to the randomization method used in hypothesis testing. It begins by recommending that readers complete the previous section on bootstrap confidence intervals for better understanding. The chapter then transitions into explaining how to apply the randomization technique in hypothesis testing, building on the foundational knowledge of bootstrap intervals. The goal is to facilitate a smoother comprehension of randomization by leveraging prior concepts.
00:30 - 01:00: Testing Gender Discrimination Hypothesis The chapter introduces a method to test hypotheses, specifically focusing on the gender discrimination hypothesis. It explains the process of randomization used in hypothesis testing. An important aspect of this process is assuming that the null hypothesis is true, meaning that there is no effect or relationship present in the data concerning gender discrimination.
01:00 - 01:30: Random Assignment in Hypothesis Testing The chapter discusses the concept of random assignment in hypothesis testing, emphasizing its importance in ensuring that participants have equal chances of being in the test or control groups. This is crucial to the logic of randomization tests. The discussion highlights that the testing process is focused on the null hypothesis rather than the research hypothesis, and explains what it implies if the null hypothesis holds true.
01:30 - 02:00: Randomization and Permutation Testing Explained This chapter discusses the concept of randomization in experiments, focusing on the method of permutation testing. It explains how participants can be reassigned to different conditions randomly, without altering the outcomes or the relationship between scores and participants. This technique helps in understanding the effect of different conditions while keeping the outcomes constant.
02:00 - 02:30: Building Distribution of Test Statistics The chapter focuses on building the distribution of test statistics using randomization methods. It explains the process of randomly reassigning participants to different experimental groups and recalculating the statistic, similarly to bootstrapping. The sample and sample size stay the same; only the random assignment is redone. Of the several randomization methods available, the chapter highlights randomization testing, which involves repeatedly re-performing the experiment to build the test statistic's distribution.
02:30 - 04:00: Randomization Test Example with Bank Managers In this chapter, the concept of a randomization test is explored, specifically in the context of testing with bank managers. The process involves repeating an experiment multiple times (K times) by randomly reassigning participants to different groups and recalculating the hypothesis test each time. Additionally, the chapter introduces the idea of a permutation check, where all possible participant assignments are tested. For instance, a participant in one group could be reassigned to another, and by adjusting the assignment of just one group member, different scenarios are tested.
04:00 - 05:00: Interpreting P-values in Randomization Test In the chapter titled 'Interpreting P-values in Randomization Test,' the discussion focuses on different methods for statistical analysis, particularly comparing permutation tests, randomization, Monte Carlo simulations, and bootstrapping. The transcript highlights that permutation tests involve checking all possible ways of random assignment but can be time-consuming with a large number of subjects. Consequently, randomization is more commonly used. Monte Carlo simulations and bootstrapping are also mentioned as methods to generate new data or observe a sample, using models for the null hypothesis. These techniques are essential for understanding and interpreting P-values in statistical tests.
05:00 - 06:00: Statistical Decision Making and P-value Thresholds The chapter discusses statistical decision-making focusing on the role and calculation of P-value thresholds. It covers the use of randomization methods to draw new samples and examines experimental design. Participants are randomly assigned to either a control or experimental group to measure the dependent variable. A test statistic is calculated to conduct a hypothesis test.
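
To give a sense of why the exhaustive permutation test mentioned in the chapters above is rarely run, the sketch below counts every possible assignment for the bank manager example. Rather than looping over all of them, it counts them combinatorially, which is possible here only because the outcome is a simple promote/deny decision. The figure of 21 promoted men is an assumption inferred from the transcript (35 promotions in total and a 29.2 percent difference), and all variable names are hypothetical.

```python
from math import comb

n_managers = 48              # from the lecture
n_male = 24                  # managers who received a file with a man's name
n_promoted = 35              # total promotions (17 + 18 in the reshuffled table)
observed_male_promoted = 21  # assumption inferred from the 29.2 percent difference

# Number of distinct ways to choose which 24 managers receive the male files.
n_assignments = comb(n_managers, n_male)


def count_assignments(male_promoted):
    """Count assignments in which exactly `male_promoted` male files were promoted."""
    male_denied = n_male - male_promoted
    n_denied = n_managers - n_promoted
    return comb(n_promoted, male_promoted) * comb(n_denied, male_denied)


# Assignments at least as extreme as the observed one (one-sided, men favoured).
max_male_promoted = min(n_male, n_promoted)
extreme = sum(
    count_assignments(k)
    for k in range(observed_male_promoted, max_male_promoted + 1)
)
exact_p = extreme / n_assignments

print(f"possible assignments: {n_assignments:,}")
print(f"exact one-sided p:    {exact_p:.4f}")
```

With 48 managers the number of assignments is already in the tens of trillions, so enumerating them one by one would be slow, which is why the lecture falls back on a random subset of K reshuffles.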

NHST5 Transcription

00:00 - 00:30 all right so let's continue by having a look at hypothesis testing using a method we call randomization if you haven't yet done the first part of the lab that includes bootstrap confidence intervals you may want to do that before we get to this section of the lab it will make understanding the randomization method a little bit smoother I think so you've calculated bootstrap
00:30 - 01:00 confidence intervals and now we're going to do another version of a similar kind of process that allows us to test our hypothesis so we're going to test the gender discrimination hypothesis so what we do when we do randomization for hypothesis testing is we make a little assumption the assumption we make is that the null hypothesis is true now if the null hypothesis is really and truly true there's nothing going on in
01:00 - 01:30 the data then participants are as likely to be in our test group as our control group based on their scores there's no difference and that's a critical element of the logic of these randomization tests so what that means is if the null hypothesis is true remember we always test the null hypothesis we never test the research hypothesis if the null hypothesis is true we can
01:30 - 02:00 redo our randomization by randomly assigning participants to conditions afresh keeping the same outcomes so we're not going to change the outcomes we're not going to change the link between your score right and you what we are going to change is which condition you're in just randomly so randomization methods allow us to
02:00 - 02:30 randomly reassign participants to experiments and recalculate our statistic so this is a bit like bootstrapping we're using the same sample that we had it's a sample of the same size what we're doing in randomization methods is we're redoing our random assignment and there are a bunch of different randomization methods we could use the one we will focus on in this class is randomization testing where we basically re-perform our experiment a
02:30 - 03:00 bunch of different times K times in fact K is the number we use to describe how many times we want to redo the experiment and we recalculate the hypothesis test based on randomly reassigning people to groups we can also do what we call a permutation check which is testing all the possible ways that participants could be assigned so you could be assigned in group one and everybody else that was in group one except for one could have been assigned to group two so we could flip just one person back and
03:00 - 03:30 forth and so forth we check all the possible ways of random assignment that's a permutation test we don't tend to do those depending on the number of people you have they can take an awfully long time to do so instead we typically use randomization instead of permutation you can also use Monte Carlo simulations to generate new data using a model for the null hypothesis we can also use bootstrapping so we can observe a sample as a model for the
03:30 - 04:00 effect of the population we can draw K new samples of size n so these are randomization methods we can use so let's look at how they work so once again you've seen this this little diagram before we have our experimental design and we randomly assign people to the experimental group or the control group we measure the dependent variable and what we do now is we calculate a test statistic so we calculate our hypothesis test now
04:00 - 04:30 that could be the proportion of males to females who were promoted it could be a test statistic like t it could be a chi-squared it could be whatever test statistic we're interested in and our null hypothesis is that the treatment or independent variable has no effect or a negative effect on the DV whether or not somebody gets promoted in this case a research hypothesis is that the treatment or IV
04:30 - 05:00 the name of the person's gender on the resume has a positive effect on the DV right so it has some kind of effect this is a directional hypothesis remember because we're specifying a direction for the effect so the treatment has an effect a positive effect on the DV so whether a CV is male or female in particular here male that's going to make promotion decisions more likely so what we do is we redo the random
05:00 - 05:30 assignment basically we take people and we shuffle them into different groups and we calculate our test statistic now under our null hypothesis the gender on the CV has no influence on the promotion decision so assuming the null hypothesis is correct we can obtain a distribution of likely results simply by redoing the random assignment and then recomputing the test statistic now this works because according to the null hypothesis the independent variable has no
05:30 - 06:00 influence on the dependent variable and that makes the random assignment what we call exchangeable we can then check how probable the experimental result we actually observed was assuming that the null hypothesis is true so what we're doing when we calculate our p-values we're looking at the observed statistic how often the observed statistic is greater than or equal to our expectation given that the null hypothesis is true
06:00 - 06:30 so we have a critical value that we're interested in our threshold value and we're checking how often the observed statistic is greater or more extreme than this particular critical value given that the null hypothesis is true so we know when we shuffle people into groups that any dependency there might be between one group and another will be broken so gender discrimination might be a thing in this sample
06:30 - 07:00 however if we take these genders and randomly assign them to different bank managers what we might end up seeing is we might end up seeing different statistics in fact we would certainly end up seeing different statistics right keep the same number of promotions and non-promotions but instead what we're doing is we're randomly assigning new names essentially to the same bank managers so
07:00 - 07:30 if I'm bank manager number one and I decide to promote my person and bank manager number two decides not to promote their person my decision is always going to be the same so if I'm a cranky bank manager and say no promotion I'm always going to say no promotion no matter who the person is that gets assigned to me if I'm always a nice bank manager I'm always going to say promote no matter whether the person assigned to me is male or female so if the scores are shuffled into groups we
07:30 - 08:00 know that the null hypothesis is true we know that the scores are exchangeable so now what we can do is we can actually check the distribution of the statistic under the null hypothesis by building up a new distribution just like when we did bootstrapping we built up a distribution of scores of means and in fact we built a distribution of sample means as well when we did our Monte Carlo simulations in a previous week
08:00 - 08:30 now here we're going to also build up a distribution of test statistics that we're going to calculate so let's return to our data this is the same data set I showed you earlier the difference between promotions of men and promotions of women is 29.2 percent so we already know this so what exactly does the randomization test do so here's the original data we have a bank manager ID
08:30 - 09:00 these run all the way to 48 and we have some promotion decisions promote promote deny promote promote all the way down and each of these bank managers is a different person and what happened originally is the first let's say bank managers 1 through 24 were all given female CVs or resumes rather with a woman's name as the first name of the applicant and bank managers
09:00 - 09:30 25 through 48 were given the name of a man and then we can ask what they said whether they would promote or deny so what we're going to shuffle is this gender column here I'm going to leave everything else the same because bank manager one is always going to promote and bank manager 2 is always going to deny so that won't change what will change however is what gender they're randomly assigned to so if we
09:30 - 10:00 randomly reassign gender distribution then we can recalculate our tables and that will change the frequency with which men and women are promoted so if we have a new distribution here maybe this time 17 men were promoted and seven were not promoted 18 women were promoted and six were not promoted so the probability of being promoted given that you're male is 0.71 versus 0.75 our
10:00 - 10:30 difference is minus 0.04 now we can do that again we can get a new contingency table here notice the numbers have changed our difference here is 12. and what we can then do is we can build up a plot that shows the distribution of simulated differences in promotion rates this plot here is based on 100 simulations and if you count these little blue dots you will see that there
10:30 - 11:00 are 100 simulations in this data set and we can look at the difference in promotion rates remember the one that we really did was this one here so it was a probability of 0.29 so our observed result was about a 29 percent difference in promotion rates and the probability associated with that so there are two values that are as extreme or more extreme than that in our data
11:00 - 11:30 set remember we're interested in the likelihood that men are promoted more than women so if you count all the dots below here there will be 98 of them and there were two results that are as extreme or more extreme than the one we received or the one we obtained under the null hypothesis which we know because we randomly assign participants to groups so there was only a two percent chance to get the observed
11:30 - 12:00 result or a more dramatic effect if the null hypothesis is true and that is what our p-value is the probability of observing a result at least as extreme as the one you got given that the null hypothesis is true so in the lab slash homework you'll be doing a randomization test this week and there's some pseudocode for it I've left this in pseudocode format because it's important to kind of
12:00 - 12:30 be able to think about these things in a non-code way but also the way variables work so what we want to do is we want to get a probability and we will write a function called randomization test that will take some data so the goal with the pseudocode is to calculate the original test statistic from the data first we're going to assign that to a variable y and then what we're going to do is we're going to iterate over some number of iterations maybe a thousand maybe five thousand
12:30 - 13:00 we will reassign the independent variable randomly across the data set and then we'll calculate the test statistic from our edited data set so this data star here means that we've edited our data set and we'll record our test statistics so in the case of this particular example we're recording a percentage difference between men and women we then plot the histogram of our differences here this array X that we have and what we
13:00 - 13:30 want to know is where does y fall in the context of X so our probability then is the proportion of X that is greater than or equal to Y and this is our p-value that gets returned from our randomization test and we'll talk about how to do this I'll give you an example in the lab this week and we can talk about what this looks like then based on our randomization test we have to make a statistical decision so given that we have a p-value of 2 in
13:30 - 14:00 100 or P equals 0.02 the observed results could have occurred by chance with a probability of two percent even if there is no discrimination against women even if our null hypothesis is true so that's what we're considering when we think about making these decisions now the question is do we reject or do we fail to reject our null hypothesis do we call our defendant guilty or do we call them not guilty is two percent beyond a reasonable doubt
14:00 - 14:30 now according to the psychologist Fisher or the statistician really he wasn't a psychologist he was a statistician according to Fisher two percent is beyond a reasonable doubt is two percent enough to send someone to prison would you want to send someone to prison on the back of such evidence that would depend so the threshold for evidence in statistics tends to be P equals 0.05 so P equals 0.05 is considered
14:30 - 15:00 sufficient evidence to reject a null hypothesis in psychology but that number you need to remember is arbitrary it was chosen by Fisher because it seemed good enough at the time is it good enough that's a little bit unclear and what we're going to take up next is the topic of what happens when we make a mistake in our decisions and we'll take that up in the next section of the lecture
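
The pseudocode described in the transcript (roughly 12:00 to 13:30) can be sketched in Python as follows. This is one possible reading of the steps as spoken, not the lab's reference solution. The function signature, the `statistic` callback, and the helper `rate_difference` are assumptions made for this example, and the counts of 21 promoted men and 14 promoted women are inferred from the lecture's figures rather than stated in it. The histogram step is left out to keep the sketch free of plotting dependencies.

```python
import random


def randomization_test(groups, outcomes, statistic, iterations=1000, seed=None):
    """Return (p_value, y, x) for a one-sided randomization test.

    groups    -- the independent variable, one label per participant
    outcomes  -- the dependent variable, in the same order (never shuffled)
    statistic -- function(groups, outcomes) -> number
    """
    rng = random.Random(seed)
    y = statistic(groups, outcomes)      # original test statistic
    x = []                               # statistics from the reshuffled data
    shuffled = list(groups)
    for _ in range(iterations):
        rng.shuffle(shuffled)            # reassign the IV at random
        x.append(statistic(shuffled, outcomes))
    # Proportion of reshuffled statistics greater than or equal to y.
    p_value = sum(1 for value in x if value >= y) / iterations
    return p_value, y, x


def rate_difference(groups, outcomes):
    """Promotion rate for 'male' files minus the rate for 'female' files."""
    male = [o for g, o in zip(groups, outcomes) if g == "male"]
    female = [o for g, o in zip(groups, outcomes) if g == "female"]
    return sum(male) / len(male) - sum(female) / len(female)


# Assumed counts (see the note above): 1 = promote, 0 = deny.
groups = ["male"] * 24 + ["female"] * 24
outcomes = [1] * 21 + [0] * 3 + [1] * 14 + [0] * 10

p_value, y, x = randomization_test(
    groups, outcomes, rate_difference, iterations=5000, seed=1
)
print(f"observed difference y = {y:.3f}, p = {p_value:.3f}")
```

Passing the statistic in as a function keeps the loop reusable for other statistics the lecture mentions, such as t or chi-squared, as long as larger values count as more extreme.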