Exploring the Intricacies of Correlation vs. Causation
Correlation&Regression2
Estimated read time: 1:20
Summary
In this lecture, Erin Heerey examines correlation as a statistical relationship and the nuances of determining causation. The lecture covers statistical significance, hypothesis testing (including bootstrapping and permutation tests), and how sample size affects the critical value in a correlation test. Heerey also discusses the assumptions behind the theoretical, t-based method, such as normally distributed data, and the need to deal with outliers. Finally, she explains the difference between a correlation observed in a study and causation, using amusing real-world examples like chocolate consumption and Nobel laureates to highlight the importance of experimental design.
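The permutation approach mentioned in the summary can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's lab code: the function name `permutation_p_value`, the number of permutations, and the seed are all hypothetical choices. Shuffling one variable breaks the pairing between X and Y, so the shuffled correlations approximate the distribution of r under the null hypothesis.

```python
import numpy as np

def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Two-tailed permutation p-value for Pearson's r (illustrative helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_observed = np.corrcoef(x, y)[0, 1]

    # Shuffle y to break the pairing, then recompute r each time.
    null_rs = np.array([
        np.corrcoef(x, rng.permutation(y))[0, 1]
        for _ in range(n_permutations)
    ])

    # Two-tailed: count shuffled |r| values at least as extreme as observed.
    # Adding 1 to numerator and denominator avoids reporting p = 0.
    extreme = np.sum(np.abs(null_rs) >= abs(r_observed))
    p_value = (extreme + 1) / (n_permutations + 1)
    return r_observed, p_value
```

The p-value is the share of shuffled correlations at least as extreme as the observed one, so it depends on the specific data rather than on a theoretical distribution.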
Highlights
- Recap of hypothesis testing and its importance in statistical analysis.
- Discussion on statistical significance and when to reject the null hypothesis.
- Exploring the rejection region through one-tailed or two-tailed tests.
- Understanding how sample size influences the critical value in a correlation test.
- Dive into rank order correlation for ordinal data types.
- Significance of dealing with outliers to avoid false correlation outcomes.
- Fun example: Chocolate consumption correlating with Nobel laureates, but not causation!
- Clarification of causation conditions in experimental designs.
- Use of Python libraries (SciPy, pandas, NumPy) for handling correlation calculations in data analysis.
Key Takeaways
- A correlation doesn't imply causation, no matter how strong it might seem!
- In a two-tailed test, the error rate (for example 0.05) is split across the two rejection regions, one in each tail.
- Beware of outliers as they can massively skew your correlation results!
- Understanding whether a dataset fulfills the assumptions is crucial before diving into the tests.
- Statistics can sometimes lead us to amusing and misleading conclusions!
- Causation requires an experimental design: the cause must precede the outcome, the variables must be empirically related, and the relationship must not be explained by a third variable or by coincidence.
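The takeaway about outliers can be illustrated with a small simulation. This is a sketch on made-up data, not data from the lecture: two unrelated variables are generated, a single extreme point at (10, 10) is appended, and Pearson's r is compared with Spearman's rank order correlation using `scipy.stats.pearsonr` and `scipy.stats.spearmanr`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two unrelated variables (simulated, so the true correlation is zero).
x = rng.normal(size=30)
y = rng.normal(size=30)

# Append one extreme point far away from the rest of the data.
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)

r_clean, _ = stats.pearsonr(x, y)
r_outlier, _ = stats.pearsonr(x_out, y_out)
rho_outlier, _ = stats.spearmanr(x_out, y_out)

print(f"Pearson r without the outlier: {r_clean:.2f}")
print(f"Pearson r with the outlier:    {r_outlier:.2f}")
print(f"Spearman rho with the outlier: {rho_outlier:.2f}")
```

Because the rank order correlation only sees the outlier as the highest rank on each variable, it is affected far less than Pearson's r, which is inflated by the single point.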
Overview
In this engaging lecture, Erin Heerey explores statistical relationships and their complexities. Erin takes us on a journey through hypothesis testing, beginning with examining our sample size, determining critical values, and understanding both one-tailed and two-tailed tests. The journey continues with insights on how outliers can distort our results and when to use Spearman's rank correlation for specific data types.
Heerey then moves into a captivating discussion on the critical distinction between correlation and causation. With amusing examples, such as the correlation between chocolate consumption and the number of Nobel laureates, Erin emphasizes that statistical correlation doesn't automatically equate to causation. Instead, she underscores the importance of a controlled experimental design to ethically and accurately deduce causal relationships.
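How a third variable can produce a correlation without any causal link between the two measured variables can be sketched with simulated data. Everything here is an assumption made for illustration: the confounder `z`, the noise levels, and the helper `residuals` are hypothetical and are not taken from the lecture or from the chocolate example. In the simulation, `x` and `y` never influence each other; both simply depend on `z`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical confounder that drives both observed variables.
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)  # x depends on z only
y = z + rng.normal(scale=0.5, size=n)  # y depends on z only

r_xy = np.corrcoef(x, y)[0, 1]

def residuals(v, z):
    """Remove the straight-line effect of z from v (illustrative helper)."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

# Correlation between x and y once the effect of z has been removed.
r_after = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f"r between x and y:          {r_xy:.2f}")
print(f"r after accounting for z:   {r_after:.2f}")
```

The strong raw correlation largely disappears once `z` is accounted for, which is why an observed correlation alone cannot rule out a confounding variable.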
Wrapping up, Erin talks about how to utilize Python's powerful libraries to perform correlation tests, setting the stage for deeper learning about regression in future discussions. The lecture combines theoretical knowledge with practical applications, making the complex topic both accessible and intriguing for learners keen on understanding the storyline behind the stats.
Chapters
- 00:00 - 00:30: Introduction to Hypothesis Testing The chapter 'Introduction to Hypothesis Testing' discusses the process of evaluating hypotheses in data analysis. It focuses on determining the likelihood of a statistically significant relationship in a given dataset. Specifically, it involves testing the null hypothesis - that a certain parameter, such as rho, is not different from zero. The chapter sets the stage for learning how to determine statistical significance by exploring the basic concepts and reasoning behind hypothesis testing.
- 00:30 - 01:00: Two-Tailed Test and Critical Value The chapter discusses the concept of a two-tailed test and critical values in statistical analysis. It mentions the use of critical values to decide whether to reject the null hypothesis based on the p-value. Specifically, if the p-value drops below 0.05, the null hypothesis may be rejected. The chapter also considers a sample size of 100 as a practical example and emphasizes the importance of identifying the rejection region when conducting a two-tailed test.
- 01:00 - 02:00: Empirical and Theoretical Distributions The chapter discusses the relationship between sample size and critical value in correlation analysis. It explains that, for a sample size of about 100 and a two-tailed test at the 0.05 level, a correlation larger than approximately 0.2 will exceed the critical value. The critical value decreases as the sample size increases, so a larger correlation is needed to reject the null hypothesis at smaller sample sizes. The concept of an empirical distribution obtained through methods like bootstrap or permutation testing is also touched upon.
- 02:00 - 03:00: Testing Hypotheses and P-value Calculation This chapter discusses hypothesis testing and the calculation of p-values. It contrasts the empirical approach used in previous labs with the theoretical distribution of r under the null hypothesis, which is very similar to the t distribution with n - 2 degrees of freedom. The chapter introduces a formula that converts the correlation coefficient r into a t statistic; the t-test itself has not previously been covered in the course.
- 03:00 - 04:00: Assumptions for Theoretical Methods This chapter covers the process of identifying the critical value of the t distribution at which p equals 0.05. It explains the differences in approach when conducting one-tailed versus two-tailed tests, highlighting the need to split the error across two rejection regions, one on either side of the distribution, for two-tailed tests.
- 04:00 - 05:00: Dealing with Outliers and Their Effects This chapter discusses the concept of outliers and their effects on statistical analysis. It explains the process of hypothesis testing, particularly focusing on identifying rejection regions. The chapter notes that SciPy, a Python library, often (though not always) uses bootstrapping or permutation testing to obtain an exact p-value for a specific dataset. The use of these tests in that week's lab is also addressed.
- 05:00 - 06:30: Correlation vs Causation This chapter delves into the distinction between correlation and causation, emphasizing the importance of understanding when two variables might be correlated but not causally linked. It discusses the theoretical method of converting r to a t-test and the assumptions this requires, starting with the need for interval or ratio data. For cases involving ordinal data, a rank order correlation is recommended. This method ranks the data and then uses those ranks to compute the correlation coefficient.
- 06:30 - 08:00: Establishing Causation The chapter discusses the concept of establishing causation, focusing on statistical methods like Spearman's rank order correlation. It emphasizes the need for related pairs in data (each X and Y value needs to be related) for accurate correlation calculation. The data should exhibit a roughly linear relationship, though a perfect linear relationship is not necessary.
- 08:00 - 09:30: Python Libraries for Correlations This chapter discusses the use of Python libraries for calculating correlations, specifically focusing on when to use Pearson's correlation coefficient. Pearson's coefficient is suitable for roughly linear relationships between variables. If the data form a non-linear pattern, such as a U-shape, Pearson's coefficient is not appropriate and a different kind of correlation should be used, although these are not covered in this class. Additionally, the chapter emphasizes that both variables should be reasonably normally distributed before applying Pearson's method.
- 09:30 - 10:00: Conclusion and Transition to Regression This chapter discusses the properties of correlation coefficients, emphasizing their sensitivity to outliers. The presence of outliers in data sets can significantly affect the correlation coefficient, potentially leading to inflated values. It suggests considering alternatives such as rank order correlation methods when dealing with numerous outliers. The chapter serves as a transition to further topics on regression.
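The link between sample size, the t distribution, and the critical value of r described in these chapters can be sketched in Python. This is a minimal illustration rather than the lecture's own code: the function names `critical_r` and `r_to_t` are hypothetical, and the conversion used is the standard t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom, which the lecture refers to but does not spell out.

```python
import math
from scipy import stats

def r_to_t(r, n):
    """Convert a correlation r from n pairs into a t statistic (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

def critical_r(n, alpha=0.05, two_tailed=True):
    """Smallest |r| that reaches significance at level alpha for n pairs."""
    df = n - 2
    # A two-tailed test splits alpha across the two rejection regions.
    tail_alpha = alpha / 2 if two_tailed else alpha
    t_crit = stats.t.ppf(1 - tail_alpha, df)
    # Invert the r-to-t conversion to express the cutoff as a correlation.
    return t_crit / math.sqrt(t_crit**2 + df)

for n in (10, 30, 100, 1000):
    print(f"n = {n:4d}  critical r (two-tailed) = {critical_r(n):.3f}")
```

For n = 100 the two-tailed cutoff comes out just under 0.2, matching the rule of thumb in the lecture, and the cutoff shrinks as n grows.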
Correlation&Regression2 Transcription
- 00:00 - 00:30 all right so if we want to test our hypothesis so we want to ask what's the likelihood that we have a significant relationship a statistically significant relationship in our data um if we're thinking about testing the null hypothesis that rho is let's just use a non-directional prediction here that rho is not different from zero so in order to test that hypothesis what we're looking for is whether our
- 00:30 - 01:00 calculated r value falls into the rejection region for either a one-tailed or two-tailed test depending so we said we were doing a non-directional which is a two-tailed test so this red line here um and if it does if p drops to below 0.05 then we can reject our null hypothesis so pretty much if we think about let's say let's think about a sample size of a hundred which is not an unreasonable sample size any correlation
- 01:00 - 01:30 that's more than about 0.2 and that'll be give or take a tiny little bit any correlation bigger than that will be bigger than our critical value and so as you can see here our critical value will go down as the sample size goes up and it will go up as the sample size goes down so we'll need a larger correlation at a smaller sample size in order to reject the null hypothesis um now that's an empirical distribution that we can obtain by bootstrap or permutation testing just like we did
- 01:30 - 02:00 with um just like we've done in the previous labs we can also look at a theoretical distribution so it turns out that the distribution of r under the null hypothesis is actually very similar to the t distribution with n minus 2 degrees of freedom and so there's a formula that we can use to translate r into a different kind of test called a t-test which you've probably heard about but we haven't talked about yet and we can actually convert to a t
- 02:00 - 02:30 distribution and we can look at these t distributions and we can find the critical value where p is equal to 0.05 here so here and here depending on if we're doing one or two-tailed test if we're doing a two-tailed test we have to split the error across the two ends of our spectrum there's two rejection regions one on either side so we would be looking right
- 02:30 - 03:00 down here about the halfway point down here between 0 and 0.05 and that would be our rejection region for R so when we're testing hypotheses we can there is a formal process for doing that um you will be doing that this week in lab you will be using scipy for that and scipy often does these things using bootstrapping or permutation testing to get an exact p-value associated with
- 03:00 - 03:30 your specific data not always but sometimes so there are assumptions to use this theoretical method where it's converted to a t-test the assumptions are that you have to have interval or ratio data so for ordinal data we need to use a different kind of correlation called a rank order correlation the rank order correlation basically ranks the data and uses those values as the elements of the correlation coefficient
- 03:30 - 04:00 um there's a rank order correlation called Spearman's rho that's quite popular you have to have related pairs so each participant needs to include a pair of related values so each score on the X variable and each score on the Y variable those values need to be related to one another otherwise we can't actually calculate a good correlation and they should be roughly linearly related it doesn't need to be a perfect linear relationship but actually they
- 04:00 - 04:30 should have some element of a linear relationship if your relationship looks like a perfect nice little U you're probably not going to want to use Pearson's correlation coefficient to calculate that relationship you'd use a different kind of correlation which is beyond the scope of this class we're not going to talk about non-linear relationships in this class and finally both of your variables should be reasonably normally distributed so they should both have
- 04:30 - 05:00 roughly normal distributions and I will also tell you that the correlation coefficient is highly sensitive to outliers so outliers within your data sets you need to really carefully deal with them if you have a lot of them you may actually want to switch to a rank order correlation like Spearman's rho um because outliers in certain directions can really inflate the coefficient the r
- 05:00 - 05:30 that you obtain and can cause false results so that way you can get false positive results really easily with outliers in your data set likewise you can have outliers that completely wash your correlation away so they can get rid of your correlation even though there really is one there you have one outlier that stops that correlation from becoming a statistically significant correlation and of course we need to remember that correlation is not just the test statistic but the
- 05:30 - 06:00 design correlation is both a test statistic and a design so correlation can be reported and if it's being reported in an observational study we cannot establish causation we've talked about this before this is a fantastic spurious correlation looking at chocolate consumption kilograms per year per capita so how many kilograms of chocolate is consumed per year per person
- 06:00 - 06:30 it's 15 kilograms a year that's quite a lot um and that turns out to correlate with the number of Nobel laureates per 10 million people in a country and you can sort of see there's a really nice correlation here it's a correlation of 0.79 that's a very strong positive correlation its p-value is less than 0.0001 and so you can see that the more chocolate consumed within a particular
- 06:30 - 07:00 country the more Nobel laureates they have hmm interesting or at least per head of population so you can look at a country like Switzerland they have a lot of Nobel laureates and they eat a lot of chocolate um and there's a little cluster of countries right around here they have more Nobel laureates and they eat a lot of chocolate eating chocolate probably does not cause people to receive Nobel Prizes
- 07:00 - 07:30 um so we need to be careful there and one of the conditions for establishing causation is that your cause precedes your outcome in time so this is why we've talked about manipulations and experimental contexts where we manipulate an independent variable and we measure the dependent variable notice how I said that we manipulate the independent variable and then we measure the dependent variable so there's a temporal order there a time order in how things are measured
- 07:30 - 08:00 and that's really important there needs to be an empirical relationship between the variables right so the variables need to be related to one another they need to be empirically correlated again I can calculate a correlation in an experimental design and demonstrate causation so it's really important to know that it's not the test statistic you're calculating that matters it's the experimental design so in order to establish causation the
- 08:00 - 08:30 cause needs to precede the outcome they need to be empirically related to one another and the observed relationship between the variables cannot be explained by the presence of a third variable so a confounding variable which is another variable that affects both the explanatory and response variables in that relationship and it should also not be a coincidental statistical relationship like how much chocolate is consumed in a
- 08:30 - 09:00 particular country and what their proportion of Nobel Prize winners is so that's a coincidental statistical relationship we've looked at a couple of these in class so far this year so these are the conditions for establishing causation observational designs cannot establish causation now there are people out there who will tell you we did some fancy statistics on our observational or correlational design and so we can establish causation no you
- 09:00 - 09:30 really can't you have to have these other variables in place you can speculate about causation but you can't show it without an experiment how we're going to deal with correlations in Python well we're going to use either scipy or pandas or numpy all of these libraries that you've used or that you will use all of them allow us to compute correlation coefficients and I'll leave it there for now we're going to be talking about regression in the next lecture
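The three libraries named at the end of the lecture can each compute a correlation coefficient. The following is a small sketch on made-up numbers, not data from the lecture; it uses `numpy.corrcoef`, `pandas.Series.corr`, `scipy.stats.pearsonr`, and `scipy.stats.spearmanr`. Only the SciPy functions also return a p-value.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up paired scores for six participants (illustrative only).
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

# NumPy returns the full correlation matrix; take the off-diagonal entry.
r_numpy = np.corrcoef(x, y)[0, 1]

# pandas computes Pearson's r by default; method="spearman" uses ranks.
r_pandas = pd.Series(x).corr(pd.Series(y))
rho_pandas = pd.Series(x).corr(pd.Series(y), method="spearman")

# SciPy returns the coefficient together with a p-value.
r_scipy, p_pearson = stats.pearsonr(x, y)
rho_scipy, p_spearman = stats.spearmanr(x, y)

print(f"NumPy r:  {r_numpy:.3f}")
print(f"pandas r: {r_pandas:.3f}  (Spearman: {rho_pandas:.3f})")
print(f"SciPy r:  {r_scipy:.3f}, p = {p_pearson:.3f}")
print(f"SciPy Spearman rho: {rho_scipy:.3f}, p = {p_spearman:.3f}")
```

All three libraries agree on the coefficient for the same data. Here Pearson's r and Spearman's rho coincide because the made-up scores are already ranks from 1 to 6.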