Understanding Correlation and Regression

Correlation&Regression 1

Estimated read time: 1:20

Summary

In this video, Erin Heerey outlines tests of association for continuous data, focusing on correlation and regression. Correlation deals with the strength and direction of relationships between pairs of continuous variables, measured by the correlation coefficient ranging between -1 and +1. Pearson's correlation coefficient (r) describes linear relationships, but may not capture non-linear associations such as U-shaped relations. Regression, which is not detailed in this segment, goes beyond mere association to model dependencies. Various examples, including correlations between poverty levels and high school graduation rates, are provided. Heerey further elaborates on hypothesis testing, discussing directional and non-directional hypotheses concerning correlation coefficients.

Highlights

Correlation and regression are key for understanding relationships between continuous variables. 📚
Correlation strength is measured by the Pearson coefficient, with -1 and +1 being perfect correlations. ➕➖
Positive correlations mean both variables increase together, while negative correlations mean one increases as the other decreases. ⬆️⬇️
Real-world data seldom shows perfect correlations, highlighting complexities in data relationships. 🌐
Testing hypotheses often involves determining whether there is a significant correlation different from zero. ❓
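
The hypothesis pairs described in the video can be written compactly. The following is a sketch in LaTeX notation, where \rho is the population correlation coefficient; the labels H_0 (null hypothesis) and H_1 (research hypothesis) are conventional notation assumed here, not symbols taken from the video.

```latex
\[
\begin{array}{lll}
\text{Non-directional}        & H_0:\ \rho = 0   & H_1:\ \rho \neq 0 \\
\text{Directional (positive)} & H_0:\ \rho \le 0 & H_1:\ \rho > 0 \\
\text{Directional (negative)} & H_0:\ \rho \ge 0 & H_1:\ \rho < 0
\end{array}
\]
```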

Key Takeaways

Correlation measures how variables move together, ranging from -1 to +1. 📊
Positive correlation means both variables increase together; negative means one goes up, the other goes down. 📈📉
Perfect correlations (exactly -1 or +1) are rare in real-world data. 🌍
Pearson's correlation is not suitable for non-linear relationships. 🚫📈
Regression is covered separately and models relationships in more detail.

Overview

Erin Heerey's video delves into the core aspects of analyzing continuous data through correlation and regression. By understanding these tools, one can assess how variables relate to each other, which is vital for data-driven decision-making and research.

Correlation, the focus of this segment, is a statistical measure that captures the strength and direction of a linear relationship between two continuous variables. Like the chi-squared test, it is a test of association, but where chi-squared applies to categorical data, correlation requires continuous variables. Heerey illustrates the concept with relatable examples, such as the association between state poverty rates and high school graduation rates.

While this discussion primes viewers on correlation, it hints at regression's broader capabilities, which model dependencies in detail. The takeaway: while correlation identifies association, regression provides a deeper dive into understanding how each variable might influence another, setting the stage for part two of Heerey's lecture.

Chapters

00:00 - 00:30: Introduction to Correlation and Regression The chapter introduces tests of association for continuous data, specifically focusing on correlation and regression. Both of these statistical tests are used to examine relationships between multiple variables. Unlike the chi-squared test, correlation and regression require continuous data. The chapter begins with an explanation of correlation, which is described as the more basic test of association.
00:30 - 01:00: Understanding Correlations The chapter 'Understanding Correlations' explains the concept of correlations, elucidating how they indicate the strength and direction of relationships between pairs of variables. It focuses on the examination of associations or relationships specifically between continuous variables. The narrative introduces the correlation coefficients which range from -1 to +1, representing the degree of strength or weakness of the relationship. This foundational understanding aids in analyzing data effectively.
01:00 - 01:30: Perfect and Strong Correlations The chapter discusses perfect correlations, those of -1 and +1. A perfect correlation means that all data points fall on a single line, a scenario that almost never occurs in research, particularly in data collected in psychology.
01:30 - 02:00: Direction of Relationship This chapter discusses the concept of correlation between two variables, X and Y. It explains that a perfect correlation, either +1 or -1, represents the strongest possible relationship, with the sign indicating the direction. For example, +1 indicates a perfect positive correlation, while -1 indicates a perfect negative correlation. It also covers examples of other correlation coefficients such as -0.75, +0.75, -0.5, and 0, emphasizing that a correlation of zero represents no linear relationship.
02:00 - 02:30: Pearson Correlation Coefficient The chapter discusses the Pearson Correlation Coefficient, focusing primarily on how data points relate to each other in terms of their correlation strength. It illustrates this concept with visual plots, demonstrating that as the correlation strength decreases, the data points become less clustered around a single line. Specifically, it notes that with a correlation of -1, all points are perfectly aligned on a specific line, illustrating a perfect negative correlation. In contrast, a correlation of -0.75, though still strong, indicates less clustering compared to the perfect correlation scenario.
02:30 - 03:00: Limitations of Pearson's r Pearson's r is a statistical measure used to assess the strength and direction of a linear relationship between two variables, X and Y. This chapter notes that with a strong but imperfect correlation, a general linear trend is still visible by eye even though the points no longer fall exactly on a line. A correlation of zero indicates no linear relationship: people who score high on one variable are not especially likely to score either high or low on the other.
03:00 - 03:30: Bivariate Correlation The chapter 'Bivariate Correlation' discusses the concept of correlation between two variables, specifically focusing on the direction of relationships in correlations. It explains positive and negative correlations, illustrating that in a positive correlation, as values of one variable increase, the values of the other variable also increase.
03:30 - 04:00: Example: Poverty and Graduation Rates The chapter 'Poverty and Graduation Rates' explains negative correlation, in which the values of one variable decrease as the values of the other increase. It introduces the Pearson correlation coefficient, which measures both the strength and direction of a relationship between two variables. This statistical measure was proposed and popularized by Karl Pearson. The chapter provides a foundation for understanding how these correlations can be applied to real-world scenarios, such as examining the relationship between poverty and graduation rates.
04:00 - 04:30: Defining Hypotheses in Correlations The Pearson correlation coefficient, known as 'r', is introduced. It is formally called the Pearson product-moment coefficient of correlation, commonly referred to as the correlation coefficient. It ranges between -1 and +1 and indicates both the strength and direction of a correlation.
04:30 - 05:00: Formula for Correlation Coefficient The chapter explains that Pearson's correlation coefficient measures the strength and direction of a linear relationship between two variables. It emphasizes that Pearson's r applies only to linear relationships and may not capture non-linear ones, such as U-shaped relationships. In a U-shaped relationship, scores at both extremes of one variable are associated with high scores on the other variable.
05:00 - 05:30: Breaking Down the Formula This chapter, titled 'Breaking Down the Formula', focuses on the understanding of the Pearson correlation coefficient in statistics. It explains how this coefficient is used to assess the linear relationship between variables. The chapter emphasizes that while Pearson's correlation is effective for linear relationships, it may not adequately reflect non-linear relationships, such as U-shaped relationships, between variables. This limitation is highlighted to illustrate why certain relationships might result in a low Pearson correlation value despite the existence of a relationship. Additionally, the concept of bivariate correlation is introduced as a related topic within the discussion.
05:30 - 06:00: Example Calculation The chapter titled 'Example Calculation' delves into the relationship between two continuous variables, referred to as X and Y. The discussion focuses on their linear relationship and the strength of said relationship, as determined by the correlation coefficient. A key point is that the value of the correlation coefficient indicates the strength of the relationship: the further it is from zero, whether positive or negative, the stronger the relationship. Strong relationships are visually represented by data points that are tightly clustered, while weaker relationships exhibit a looser clustering of points. The chapter also presents examples comparing two different correlations, including one that is considered high.
06:00 - 06:30: Testing Hypotheses in Correlations This chapter discusses different levels of positive correlation, exemplified by correlations of 0.9 and 0.5. The chapter suggests using visual aids for understanding, such as drawing lines from the upper to lower bounds of data points on graphs. Another focus is on measuring variation at specific points along the x-axis and examining its correspondence with variation in the y-axis.

Correlation&Regression 1 Transcription

00:00 - 00:30 so we'll now move on to tests of association for continuous data and there are two tests that we're going to talk about today those are correlation and regression so both of these deal with multiple variables that are related to one another so tests of association just like with the chi-squared but here what we need to have are continuous data so let's start with correlation which is the more basic test of association
00:30 - 01:00 so we've talked already about correlations a little bit here and there and correlations tell us about the strength and direction of relationships between pairs of variables so they allow us to examine associations or relationships between continuous variables correlation coefficients range between -1 and plus one and they tell us about the strength of a
01:00 - 01:30 relationship as well as its direction so a relationship that is extremely strong has a correlation of -1 or a correlation of plus one what you can see is that all these dots are lined up on a single line this is a perfect correlation I don't really see this too much in research not certainly in the data we collect in psychology there's almost never a perfect correlation between things but this is a correlation between
01:30 - 02:00 an X variable and a Y variable where the data are perfectly correlated this is the strongest correlation you can get both the minus one and the plus one are equally strong correlations that go in different directions here are correlations of minus 0.75 and plus 0.75 minus 0.5 and plus 0.5 here and a correlation of zero and one of the things I'm going to ask you to do
02:00 - 02:30 is look at these plots and observe how the data relate to one another so the as the correlation strength decreases these scores become less tightly clustered around a single line so here for the correlation of minus one you can see that the points are totally aligned on a very specific line and here is a correlation of minus 0.75 that's still a very strong correlation
02:30 - 03:00 but it's not perfect anymore so the points generally align on a line you can kind of kind of see where that line would be um just with your eyeballs but it's not a perfect correlation when the correlation is equal to zero there is absolutely no relationship between these data points or between X and Y rather um people who score high on one variable are not likely to score either high or low on the other one there's no relationship between the X variable and
03:00 - 03:30 the Y variable so let's unpack the direction of relationship a bit too so we have correlations that are positive and correlations that are negative a positive correlation is a correlation in which as values of one variable increase values of the other variable also increase so as scores of our X variable here increase in strength or in value rather scores on
03:30 - 04:00 the Y variable increase a negative correlation goes in the opposite direction so here as scores on our X variable increase scores on our y variable decrease and there are all kinds of both positive and negative correlations out in the world so the correlation coefficient describes both the strength and direction of a relationship the Pearson correlation coefficient which was sort of proposed and popularized by Karl Pearson who was also
04:00 - 04:30 the same person who um popularized the chi-square the Pearson correlation coefficient is known as or is drawn as a lowercase r and it's called the Pearson product moment coefficient of correlation which is a mouthful we just call it the correlation coefficient so when you hear people talking about the correlation coefficient or the Pearson correlation this is what they mean r ranges between -1 and plus one and it tells us both the strength and direction
04:30 - 05:00 of linear relationships so it tells us about how well one variable tracks another as long as there is a linear relationship between them there are other kinds of relationships that wouldn't measure very well with Pearson's r Pearson's correlation coefficient so if for example there's a U-shaped relationship between two variables so very extreme scores on one variable might be associated with high scores on both ends and the scores that are kind
05:00 - 05:30 of in the middle on one variable are associated with low scores on the other that's a U-shaped relationship sometimes it's right side up sometimes it's upside down depending on the strength of the relationship but that would give you a very low Pearson correlation coefficient even though there was clearly a relationship so the this is really only about linear relationships Pearson's correlation coefficient tells you how much linear relationship there is between the variables and we also have bivariate correlation
05:30 - 06:00 so these are two continuous variables an X variable and a y variable and they both need to be related to one another in a linear fashion so in terms of the strength of the relationship the further the correlation coefficient is from zero in either direction the stronger its relationship and stronger relationships are represented by points that are more tightly clustered and weaker relationships cluster more loosely so here are two correlations this is a high
06:00 - 06:30 positive correlation of 0.9 here's a low positive correlation of 0.5 and if we think about drawing a little line from the sort of you know upper bounds of the data points to the lower bounds of the data points we can do that on both graphs and what we can do is take a measure of at a particular point on this um on this x-axis how much variation is there in y and
06:30 - 07:00 the greater the amount of variation the weaker the relationship the smaller the more tightly clustered these points are the stronger the relationship so that's the range of X when Y is fixed so Y is given a particular variable we could pick one out of a hat that ranges from its minimum to its maximum value and we could ask what is the range of y when X is fixed to a particular given point and what you can very clearly see is that the range of Y here's our y
07:00 - 07:30 variable on the y-axis the range of Y is greater when the correlation is smaller so here's an example correlation this is the relationship between high school graduation rate in all 50 US states and the percent of residents who live below the poverty line now this is an old graph that was made in 2012 the income the poverty line income variable was $23,050 for a family of four
07:30 - 08:00 so this is in a particular region the proportion of people in poverty and the proportion of high school graduation rates in a particular state so the greater the number of people in poverty within a given region the less likely people are to graduate from high school interesting here our independent or predictor variable I'm going to ask you what it is you should think about that first
08:00 - 08:30 is the proportion of people in poverty and our dependent or criterion variable is the proportion of people who graduate from high school so the high school graduation rate the relationship well there's certainly a negative relationship it's moderately strong and it looks reasonably linear we can kind of imagine there that there's a line through those points that shows that there is you know that there's certainly a relationship between these two variables that high school graduation
08:30 - 09:00 rates are certainly negatively associated with the proportion of people in a region who live in who live below the poverty line so let's talk about how we define hypotheses in correlations so the population correlation coefficient is a variable called rho it's written like this little funny it looks a bit like a p actually but it's the letter the Greek letter rho in the sample it's r so the sample
09:00 - 09:30 correlation coefficient is denoted as r and what we often want to know is whether there is a relationship between the predictor and criterion variables within a population whether that differs from zero and so we might have a null hypothesis in which rho does not differ from zero meaning there's no relationship between the variables we might have a research hypothesis that rho does differ from zero meaning there is a relationship between the
09:30 - 10:00 variables this is a non-directional test but we also can make directional hypotheses as well so we can say there is a positive relationship between the variables here rho will be greater than zero or we can say the relationship between the variables will not be positive so rho is less than or equal to zero we can do the same relationship on the negative side we could say the variable the relationship between the variables will
10:00 - 10:30 not be negative or there will be a negative relationship between the two variables so it's possible to make directional and non-directional predictions as well in the context of correlations so the formula for r within a sample looks absolutely awful but it's not as bad as it looks so the correlation between an X and a Y that's the correlation coefficient here and what we're doing when we calculate that correlation coefficient is we're we
10:30 - 11:00 have this sum of operator you've seen this many times before it's in the formula for the mean it's in the formula for uh the variance and here what we're doing is we're not squaring anymore but what we are doing instead is we are multiplying two quantities and by multiplying those two quantities that is going to lift our deviations we're still taking deviations of scores from means but by multiplying two quantities we are lifting them off of
11:00 - 11:30 that Mark so when they add together they don't give us an automatic answer of zero so you saw when you in the very first lab in this course when you took the deviations and you summed them up they gave you an answer that was zero or at least close enough that it was you know within rounding error so we're doing the same thing here but instead of squaring these quantities we're multiplying them together um and what that's doing is it's
11:30 - 12:00 allowing the correlation to be negative so it allows us to express a negative relationship um and it also keeps those deviations from summing to zero so we have person one's X score minus the mean of the X's times person one's y score minus the mean of the Y scores um so we're taking those two those two subtracted scores and we're multiplying them together and then we're adding that up for
12:00 - 12:30 everybody in the sample and then what we're doing down here is we're dividing by the sum here's our sum operator again of person one's X variable minus the mean of the X squared and then we're taking the square root and we are first we're adding everything
12:30 - 13:00 up and then we're taking the square root so these are sum of squared deviations now and the each person's y score minus the mean of the Y squared squared summed and then you take the square root so in general these numbers are not quite as unfamiliar as they might be otherwise so even though this formula looks terrible it's not probably as bad as it looks and so up here we have just a little example we have an X and a Y and each
13:00 - 13:30 person has contributed both of those scores so this might be height and weight it might be um you know test score and how many hours you spent studying there are any number of things that could be associated with that could be associated where you would give us scores on multiple variables the mean of the X distribution is 8.67 the mean of the Y's here is 52.5 and so what we can then do is we could say 5 minus 8.67
13:30 - 14:00 times 25 minus 52.5 we can multiply those variables out that would give us our and then of course add them up that would give us our numerator we do the same thing we could just take the sum of squared deviations for X the sum of squared deviations for y we could take their square roots and then we could multiply them and that would give us our correlation so in order to test hypotheses in correlations we need to do some conversion and I'll talk about that in the next section of the video
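
The calculation described in the transcript can be sketched in a few lines of Python. In words, r is the sum of the products of each person's X and Y deviations from their means, divided by the product of the square roots of the two sums of squared deviations: r = Σ(x_i − x̄)(y_i − ȳ) / ( √Σ(x_i − x̄)² · √Σ(y_i − ȳ)² ). The data below are hypothetical, since the lecture's example table is not reproduced in the transcript, and the function name pearson_r is an assumption made for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed directly from deviations about the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: multiply each person's X deviation by their Y deviation, then add up.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of each sum of squared deviations, multiplied together.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = math.sqrt(ss_x) * math.sqrt(ss_y)
    return numerator / denominator

# Hypothetical scores for five people (not the lecture's data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

Here the deviations multiply out to a numerator of 6, and the sums of squared deviations are 10 for X and 6 for Y, so r = 6 / (√10 · √6), about 0.77. Because the numerator uses products rather than squares, flipping the sign of every Y score flips the sign of r.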