Mastering Hypothesis Tests

NHST6


    Summary

    In this lecture, Erin Heerey delves into the intricacies of null hypothesis significance testing (NHST), focusing on the potential mistakes that can occur and how they are quantified. She explains the importance of balancing type 1 and type 2 errors when making decisions based on statistics. Heerey illustrates these errors through real-world analogies, such as court trials, and discusses the significance of p-values in evaluating statistical hypotheses. She also introduces the concept of effect size and statistical power, highlighting their roles in improving the accuracy of hypothesis testing.

      Highlights

      • Hypothesis tests are not flawless; mistakes happen, similar to errors in legal judgments. ⚖️
      • Type 1 error is when the null hypothesis is rejected despite being true, likened to convicting an innocent person. 🚨
      • Type 2 error is when we fail to reject a false null hypothesis, similar to letting the guilty go free. 🚪
      • P-values signify the likelihood of a type 1 error given that the null hypothesis is true, crucial in decision-making during hypothesis testing. 🧐
      • Effect size affects the likelihood of type 1 errors, with stronger effects leading to lower error chances. 🌟
      • Statistical power measures the chance of detecting true effects, reducing type 2 errors and improving test reliability. 🔍
      • Significance levels balance type 1 and type 2 error probabilities, tailored to the context of research errors and costs. ⚖️
      • P-value interpretation is pivotal, indicating the error probability in rejecting a true hypothesis. Critical for exams! 📚

      Key Takeaways

      • Hypothesis testing can lead to mistakes, like convicting an innocent person or freeing a guilty one, similar to type 1 and type 2 errors in statistics. ⚖️
      • Type 1 error occurs when the null hypothesis is wrongly rejected, while type 2 error happens when the null hypothesis is wrongly retained (we fail to reject it even though the research hypothesis is true). 🤔
      • P-values help indicate the likelihood of a type 1 error, with lower values suggesting a lower chance of mistakenly rejecting the null hypothesis. 📉
      • Statistical power is key to detecting true effects, reducing the chance of type 2 errors. It highlights the test's ability to identify genuine relationships. 💪
      • Choosing significance levels in tests depends on the cost of errors: a smaller threshold (such as 0.01) suits costly type 1 errors, while a larger one (such as 0.1) suits costly type 2 errors.
      • Understanding p-values is crucial, as they represent the probability of error in rejecting a true null hypothesis, impacting decision-making in studies. 💡
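
The link between the significance threshold and false positives can be checked with a small simulation. The sketch below is illustrative only and is not taken from the lecture: it assumes a two-sided z-test with a known standard deviation of 1, and the function name `false_positive_rate`, the sample size, and the seed are all hypothetical choices. Every simulated experiment is drawn from a world where the null hypothesis is true, so each rejection is a type 1 error.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def false_positive_rate(alpha=0.05, n=25, n_experiments=4000, seed=1):
    """Fraction of simulated experiments that reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_experiments):
        # The null is true here: the population mean really is 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) * sqrt(n)  # z statistic, assuming known sigma = 1
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p_value < alpha:
            rejections += 1  # rejecting a true null is a type 1 error
    return rejections / n_experiments


for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: observed type 1 error rate = {false_positive_rate(alpha):.3f}")
```

Under these assumptions the observed rate should sit close to whichever threshold is chosen, which is why a smaller threshold means fewer false positives.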

      Overview

      In this captivating lecture by Erin Heerey, we dive into the world of null hypothesis significance testing (NHST), exploring the potential pitfalls and mistakes such as type 1 and type 2 errors. She uses engaging analogies, likening these errors to courtroom scenarios to simplify understanding, and stresses the importance of interpreting p-values in hypothesis testing.

        Heerey introduces the concepts of effect size and statistical power, painting a vivid picture of how they contribute to the accuracy of statistical tests. Effect size refers to the strength of an observed effect, with stronger effects reducing the likelihood of type 1 errors. Statistical power, on the other hand, is all about the probability of finding a true effect, thereby mitigating type 2 errors.
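
How effect size feeds into the chance of reaching the critical threshold can be sketched numerically. The example below is a minimal illustration, not the lecture's own calculation: it assumes a two-sided one-sample z-test with known variance, and the function name `power_two_sided_z`, the sample size of 30, and the effect sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided_z(effect_size, n, alpha=0.05):
    """Probability of rejecting the null in a two-sided one-sample z-test.

    effect_size is the true mean shift in standard-deviation units
    (an assumption of this sketch, not a quantity given in the lecture).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value, about 1.96 for alpha = 0.05
    shift = effect_size * sqrt(n)  # how far the test statistic's mean moves
    # Probability that the statistic lands beyond either critical value.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


for d in (0.0, 0.2, 0.5, 0.8):
    print(f"effect size {d:.1f}: P(reject null) = {power_two_sided_z(d, n=30):.3f}")
```

In this sketch the rejection probability equals alpha when the effect size is zero and climbs toward 1 as the effect gets stronger, so a larger true effect leaves less room for a type 2 error.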

          To wrap things up, Heerey discusses how setting appropriate significance levels can make or break your statistical testing approach. She emphasizes the need to weigh the costs of type 1 and type 2 errors in context, using significance levels that best suit the gravity of potential mistakes in scientific and real-world scenarios. Understanding these statistical nuances is key to mastering hypothesis tests and making informed decisions based on data.
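
The trade-off between the two error types can be made concrete with a short calculation. This is a sketch under stated assumptions rather than anything computed in the lecture: it uses a two-sided one-sample z-test with known variance, and the function name `type_2_error_rate`, the effect size of 0.5, and the sample size of 30 are hypothetical values picked for illustration.

```python
from math import sqrt
from statistics import NormalDist


def type_2_error_rate(alpha, effect_size=0.5, n=30):
    """Probability of failing to reject the null when a real effect exists.

    Assumes a two-sided one-sample z-test; effect_size is in
    standard-deviation units (a hypothetical choice for this sketch).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n)
    power = (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)
    return 1 - power  # beta, the type 2 error probability


for alpha in (0.01, 0.05, 0.10):
    beta = type_2_error_rate(alpha)
    print(f"type 1 rate (alpha) = {alpha:.2f} -> type 2 rate (beta) = {beta:.3f}")
```

With the effect size and sample size held fixed, tightening alpha from 0.10 to 0.01 raises beta, which is the linkage between the two probabilities that the threshold choice has to balance.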

            Chapters

            • 00:00 - 03:00: Mistakes in Hypothesis Testing: In this section, the focus is on mistakes in hypothesis testing. It highlights the imperfection of hypothesis tests, drawing parallels to errors in legal judgments where innocent individuals might be wrongly convicted, or guilty ones set free. It emphasizes that statistical hypothesis testing can experience similar types of mistakes. However, unlike in the legal system, statistics provides tools to quantify such errors.
            • 03:00 - 06:00: True State and Decision Making: The chapter discusses the concept of hypothesis testing, highlighting the presence of two competing hypotheses: the null hypothesis and the research (alternative) hypothesis. These hypotheses are complementary, completing the sample space. The core of hypothesis testing is deciding which of these hypotheses is more likely true, involving understanding and managing errors in decision-making.
            • 06:00 - 09:00: Type 1 and Type 2 Errors: This chapter discusses the concepts of Type 1 and Type 2 errors in hypothesis testing. It emphasizes the uncertainty of knowing the true state of the world and the potential for mistakes when making decisions based on data. The null hypothesis might be true, or the research hypothesis could be true, but we can never be certain. The chapter explains that we use collected data to decide whether to reject the null hypothesis or fail to reject it, acknowledging the inherent risks of errors in these decisions.
            • 09:00 - 12:00: Statistical Power and Decision Thresholds: The chapter discusses the balance needed between the decision we make based on a hypothesis test and the true state of the world, which is unknown and described as a 'black box'. The true state is not visible to us, and if the null hypothesis is true and we fail to reject it, we retain the null hypothesis. The chapter emphasizes understanding the implications of the decisions made in the context of statistical analysis.
            • 12:00 - 15:00: Significance Levels and Consequential Errors: In this chapter, the concept of significance levels and the types of errors associated with hypothesis testing are discussed. A correct decision is made when the null hypothesis is correctly rejected in favor of the research hypothesis. Errors arise when the null hypothesis is rejected without justification, despite it being true in reality. The correct and incorrect decisions are examined in the context of their consequences on the hypothesis testing process.
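
The four combinations of true state and decision described in these chapters can be laid out as a small lookup. The sketch below is only an illustration; the function name `classify_decision` and its boolean arguments are hypothetical, and in practice the true state is never observed.

```python
def classify_decision(null_is_true: bool, reject_null: bool) -> str:
    """Label one cell of the 2x2 table of true state versus decision."""
    if null_is_true and reject_null:
        return "type 1 error (false positive)"
    if not null_is_true and not reject_null:
        return "type 2 error (false negative)"
    return "correct decision"


# Walk through all four cells of the table.
for null_is_true in (True, False):
    for reject_null in (True, False):
        state = "null true" if null_is_true else "research hypothesis true"
        decision = "reject null" if reject_null else "fail to reject null"
        outcome = classify_decision(null_is_true, reject_null)
        print(f"{state:>24} | {decision:<19} -> {outcome}")
```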

            NHST6 Transcription

            • 00:00 - 00:30 in this last section of the lecture for null hypothesis significance testing we need to talk about what happens when we make mistakes hypothesis tests are not flawless innocent people can be wrongly convicted guilty people can be wrongly set free and the same kinds of errors occur in statistical hypothesis testing as well the difference is that in statistics we have the tools we need to quantify some
            • 00:30 - 01:00 of these errors and we're going to talk about them moving forward so when we have hypothesis testing we have two competing hypotheses we have a null hypothesis we have a research hypothesis and they are complementary to one another they fully complete our sample space now in a hypothesis test essentially what we're doing is deciding which one of those hypotheses is more likely to be true
            • 01:00 - 01:30 and of course sometimes we make mistakes so what we'll be looking for is we'll be looking for we're balancing the true state of the world which we never know so in the true state of the world the null hypothesis could be true or the research hypothesis could be true but we don't know which one it is really and then we have some data we collect and we're going to use the data we collect to make a decision in which we either reject the null hypothesis or we fail to reject the null hypothesis
            • 01:30 - 02:00 so that is the balance that we need to achieve so this balance between the decision we make and what is really and truly true out there in the world now remember this is a black box this element here is invisible to us we don't know what the true state of the world is now in the true state of the world if the null hypothesis is really true and we fail to reject it i.e. we retain our null hypothesis
            • 02:00 - 02:30 then we have made a correct decision if in the true state of the world the research hypothesis is true and we reject our null hypothesis in favor of the research hypothesis we've also made a correct decision good for us errors occur in the other locations in this box when we reject the null hypothesis so we make the decision to reject the null hypothesis and the null hypothesis is really and truly true we make what's
            • 02:30 - 03:00 called a type 1 error when we fail to reject the null hypothesis so we say oh yeah we have to retain we have to we cannot reject the null hypothesis we fail to reject the null hypothesis but the research hypothesis is actually true then what we've done is we've made a type 2 error a type 1 error is like sending someone to prison even though they're innocent and a type 2 error is a bit like letting a guilty person go free if we're kind of
            • 03:00 - 03:30 returning to our trial metaphor probability of making an error given that the null hypothesis is true so the p-value describes the likelihood of making a type 1 error so a type 1 error to reiterate is rejecting the null hypothesis when the null hypothesis is true
            • 03:30 - 04:00 the likelihood of making a type 1 error is signified by the p-value it's it's the p-value is the probability of making a type 1 error given that the null hypothesis is true it's the probability that our outcome could be greater than some threshold that we're interested in or some threshold that where P equals 0.05 given the null hypothesis is true a type 2 error is failing to reject the null hypothesis when the when the research hypothesis is actually true
            • 04:00 - 04:30 p-value doesn't tell us anything about this likelihood even though these events are complementary how do we set our thresholds well we have a two percent chance of claiming that gender discrimination exists even if there really is no gender discrimination in our sample based on the statistics we got above when we did our randomization test is that good enough well that depends one of the things we
            • 04:30 - 05:00 need to be careful of is this notion of effect size now we talked about this briefly at the very end of the last lecture we talked about the idea of an effect size as being the strength of an effect the effect that we're looking at outside the confines of the laboratory the real world magnitude of the effect or relationship that we're dealing with as the strength of an effect increases the likelihood of making a type 1 error decreases and that makes sense as the relationships between variables get stronger our statistical tests are more
            • 05:00 - 05:30 likely to reach that critical threshold they're more likely to be as extreme or more extreme than our critical value remember this is the place in the distribution where 95 percent of the scores fall within a sort of close to the mean window and we only
            • 05:30 - 06:00 have a small likelihood of getting scores that are more extreme than that 95 percent of the distribution so strong effects are less likely to give us false positives and making a type 1 error is a false positive where we wrongly claim that we have an effect there's another element that we need to consider as well it's an element called statistical power and you'll learn more about this in fact
            • 06:00 - 06:30 you learn a lot more about this in class next semester if you continue on in the series statistical power is the likelihood of finding an effect if there's an effect out there to find power of a study or experiment to detect a relationship is its probability of rejecting the null hypothesis if the null hypothesis is false the tricky bit is we never know what the likelihood is of the null hypothesis being false as power increases the
            • 06:30 - 07:00 likelihood of making a type 2 error declines so with low power we have greater likelihood of a false negative so a false negative would be saying oh we don't have an effect when in fact there really is one out there so this is something that we need to balance we need to balance the probability of making a type 1 error with the probability of making a type 2 error so that we can increase our decision accuracy so what's our decision threshold so the thresholds we set are somewhat arbitrary so how do we decide
            • 07:00 - 07:30 what we choose is designed to balance the probability of a type 1 error with the probability of a type 2 error if we reduce the chance of making a type 1 error that increases the chance of a type 2 error we don't necessarily know by how much but those two probabilities are linked to one another how we use them to make decisions might actually depend on which error is more harmful in the legal system putting an
            • 07:30 - 08:00 innocent person in jail is often considered the greater harm than letting a guilty person go free in science perhaps there's a cost to not approving an effective treatment in which case we might want to err on the side of leniency with our decision tests with respect to our example if we claim that the status quo is acceptable when in fact discrimination is occurring maybe that outcome is worse than the
            • 08:00 - 08:30 other way around we typically use a significance level of P equals 0.05 0.05 is traditional and frequently used based on Fisher's popularization of that you know 1 in 20 is good enough or five times in 100 is good enough for government work however a specific context might necessitate a higher or lower level of significance depending on how consequential errors are
            • 08:30 - 09:00 if making a type 1 error is especially dangerous or costly maybe we want to choose a smaller probability maybe P equals 0.01 so a one percent chance of a false positive and if we do that what we're going to be doing is demanding very strong evidence in favor of the research hypothesis before we reject the null hypothesis if a type 2 error is especially costly we might want to choose a higher
            • 09:00 - 09:30 threshold so P equals 0.1 instead of 0.05 to reduce the chance of that type of error occurring and I'll leave you with a practice question what does a p-value of 0.02 tell you does it tell you and I want you to answer this yourself you should pause and answer it before you hear the answer that I tell you does it tell you that the null hypothesis is true with only a probability of two percent that the
            • 09:30 - 10:00 research hypothesis is false with a probability of two percent that we have a two percent chance of making an error if we reject the null hypothesis or that we have a two percent chance of erroneously rejecting the null hypothesis even though it is true so go ahead and pause now and answer this question for yourself before I tell you what the answer is the answer is D we have a two percent chance of erroneously rejecting the null hypothesis even though the null
            • 10:00 - 10:30 hypothesis is true remember that the p-value is the conditional probability of making an error given that the null hypothesis is true so make sure you understand that because that is going to be there will be a test question on that either on the midterm or on the final and we'll leave it there for today
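
The transcript refers to a randomization test from an earlier part of the lecture without showing its numbers. The sketch below is a hypothetical reconstruction of how such a test can be run, not the lecture's actual analysis: the promotion counts, the function name `randomization_p_value`, the number of shuffles, and the seed are all assumptions made for illustration.

```python
import random


def randomization_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """One-sided p-value for mean(group_a) - mean(group_b) under random relabelling."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # If the null is true, group labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Hypothetical data (NOT from the lecture): 1 = promoted, 0 = not promoted.
male = [1] * 21 + [0] * 3
female = [1] * 14 + [0] * 10
p_value = randomization_p_value(male, female)
print(f"estimated p-value: {p_value:.3f}")
```

The returned value estimates how often a difference at least as large as the observed one would appear if the null hypothesis were true, which is the conditional probability that the practice question asks about.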