Understanding Regression: From Concepts to Applications

Correlation&Regression3


Summary

In this lecture segment, Erin Heerey explores the concept of regression, extending beyond simple correlation to the prediction of outcomes. Regression involves understanding the relationship between two variables, and using one variable's data to predict another's outcome. The discussion includes examples from psychology and social studies, such as incarceration effects on crime rates in Baltimore and using antidepressant doses to predict depression changes, highlighting the application of regression analysis. Heerey elucidates linear regression, explaining key components like predictors (independent variables) and response variables (dependent variables), the line of best fit, and how regression models are formulated with error terms. The lecture also touches on multiple regression, regression equations for predictions, and the importance of assumptions in validating regression models in research applications.

Highlights

Regression helps predict outcomes using independent and dependent variables, improving data analysis. 📈
A study in Baltimore linked higher incarceration of black fathers with increased crime, illustrating real-world regression use. 📊
Understanding regression components like the line of best fit aids in comprehending data relationships. 🎯
Graphically demonstrating regression lines clarifies interpretations of positive and negative relationships. 🖉
Regression assumptions, such as homoscedasticity, underpin the reliability of predictive models. 📏

Key Takeaways

Regression extends the concept of correlation to predict outcomes, offering deeper insights into data relationships. 🔍
Understanding the regression model involves knowing predictor and response variables, linear functions, and error terms. 🚀
Baltimore's crime-incarceration study shows that a positive relationship can describe a negative outcome; "positive" refers to the direction of the relationship, not its desirability. 🚨
Regression's predictive power shines in areas like psychology, forecasting medication impacts on mental health. 💊
The importance of assumptions in regression lies in ensuring valid and reliable prediction models. 🤔

Overview

Erin Heerey's lecture dives into the world of regression—a step beyond mere correlation. While correlation describes relationships, regression predicts outcomes using variables. This analytical tool is not only crucial for calculating relationships but also for making informed predictions about future data trends.

The lecture explores various applications of regression, like estimating how the incarceration rate of black fathers in Baltimore relates to crime rates a year later, and predicting depression treatment outcomes based on medication dosages. These examples illustrate how regression can be used to analyze complex social issues and psychological phenomena, reinforcing its importance in the research toolkit.

Heerey concludes by explaining the intricacies of regression models, from linear regression equations to error terms, and the assumptions needed to validate these models. Understanding these components aids researchers in creating robust predictive analyses, allowing them to extrapolate insights and implications from their data effectively.
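For reference, the equations the lecture contrasts can be written out as follows. This is a sketch in standard notation rather than a copy of the slides: the symbol \varepsilon for the error term (spoken as "e" in the lecture) and the sum-of-squares subscripts are assumed here.

```latex
\begin{align*}
  % Regression model: exact for every score, with unknown population parameters
  y &= \beta_0 + \beta_1 x_1 + \varepsilon \\
  % Regression (prediction) equation: sample statistics, no error term
  \hat{y} &= b_0 + b_1 x_1 \\
  % Coefficient of determination; equals r^2 in simple bivariate regression
  R^2 &= \frac{SS_{\text{explained}}}{SS_{\text{total}}} = r^2
\end{align*}
```

Here b_0 (the y-intercept) and b_1 (the slope) are the sample-based estimates of \beta_0 and \beta_1.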

Chapters

00:00 - 00:30: Introduction to Regression This chapter provides an introduction to the concept of regression. It explains that regression builds upon the idea of correlation by not just describing relationships between variables but also using these relationships to predict outcomes. The focus remains on the relationship between two variables and utilizing scores from one variable to make predictions about another.
00:30 - 01:00: Understanding Regression and Correlation This chapter focuses on the concepts of regression and correlation, which are fundamental in statistical analysis. It explains that correlation measures the strength and direction of a linear relationship between variables. Regression, on the other hand, is used to predict the value of a dependent variable based on one or more independent variables. This chapter is essential for those who aim to understand and apply regression analysis in research, especially when a range of values is set for independent variables.
01:00 - 01:30: Predicting Outcomes with Regression This chapter discusses the concept of predicting outcomes using regression analysis, specifically focusing on how changes in independent variables can affect dependent variables. It introduces linear regression or bivariate regression, which involves two variables, showcasing its use in examining these changes.
01:30 - 02:00: Linear Regression and Bivariate Regression The chapter discusses linear (bivariate) regression, which predicts the effect of one explanatory variable on one response variable. It briefly mentions that several explanatory variables can be used together, a technique known as multiple regression, which is common in psychology because psychological phenomena are complex. The chapter notes that further study in statistics will cover multiple regression in more detail, then turns to what regression is used for.
02:00 - 03:30: Example: Incarceration and Crime in Baltimore The chapter discusses a study on the relationship between incarceration rates of black fathers and crime rates in Baltimore, Maryland. Researchers aimed to estimate the impact of a 1% increase in the incarceration rate of black fathers on crime rates in the city a year later. Baltimore is characterized as a challenging environment with high crime rates.
03:30 - 05:00: Positive and Negative Relationships in Regression The chapter explores how drug-related crimes disproportionately affect black individuals, particularly with arrests and prison sentences. It highlights the racial disparities in the criminal justice system, noting research findings that suggest an increase in incarceration rates among black fathers.
05:00 - 06:00: Predicting Responses with Regression The chapter titled 'Predicting Responses with Regression' discusses the relationship between crime and incarceration in Baltimore, Maryland, with a focus on the incarceration of black fathers. It highlights a positive relationship, meaning that as the proportion of incarcerated black fathers increases, it is likely to correlate with an increase in crime rates a year later in the city. The use of the term 'positive' here refers to the direction of the relationship, not its desirability or benefits.
06:00 - 07:30: Regression Models and Error Term The chapter titled 'Regression Models and Error Term' discusses the concept of positive and negative relationships in the context of correlations, emphasizing that these terms refer to the nature of the relationship rather than the evaluation of an outcome. It also touches upon the use of regression for predicting responses, suggesting that understanding the strength of relationships can aid in making predictions.
07:30 - 09:00: Making Predictions Using Regression The chapter titled "Making Predictions Using Regression" discusses the application of regression analysis in predicting outcomes based on relationships between variables. Specifically, it exemplifies using regression to predict changes in depression levels in response to antidepressant dosages. This method allows for estimating how a person's depression might change when prescribed an antidepressant, thereby providing an insight into their potential future mental state.
09:00 - 10:00: Significance Testing in Regression This chapter discusses the concept of significance testing in regression analysis, using the example of antidepressant dosage and levels of depression. It explores the potential negative relationship between the dosage and depression levels, emphasizing that the focus is on understanding the type of relationship rather than the specific outcome. The chapter also highlights how this data can be utilized for prediction purposes.
10:00 - 11:00: Categories of X Variable in Regression This chapter focuses on the 'Categories of X Variable in Regression' within the context of linear regression models. It explores the relationship between predictor (X) and response (Y) variables, emphasizing that for a valid linear regression, these variables must exhibit a linear relationship. The objective is to examine the association between these variables to understand responses in outcome variables and comprehend the workings of complex systems.
11:00 - 13:00: Assumptions of Regression The chapter discusses the basic assumptions required for a simple linear regression analysis, which involves using one predictor variable (X) to determine its effect on a dependent variable (Y). The relationship between these variables is described by a linear function, with changes in Y assumed to be related to changes in X. The text emphasizes the traditional statistical notations used in this context.
13:00 - 15:00: Conclusion of Regression Introduction The conclusion emphasizes caution when interpreting changes in the response variable (Y) as being related to changes in the predictor variable (X) in regression analysis. It highlights that without an experimental design, it is difficult to establish causality between X and Y, even when X is treated as the independent variable. Linear regression is also referred to as Ordinary Least Squares (OLS) regression.
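The line-of-best-fit and prediction steps outlined in the chapters above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the lecture says SciPy is used in practice, while this sketch sticks to the standard library. The function names `fit_line` and `predict` are hypothetical, and the dose and depression numbers are made up for the example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products and sum of squared deviations of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope: change in predicted y per one-unit increase in x
    b0 = y_bar - b1 * x_bar   # intercept: predicted y when x equals zero
    return b0, b1


def predict(b0, b1, x):
    """Predicted response (y-hat) for a given value of x."""
    return b0 + b1 * x


# Made-up illustration data: antidepressant dose (x) and later depression score (y)
doses = [0, 1, 2, 3, 4, 5]
scores = [30, 27, 25, 21, 20, 15]

b0, b1 = fit_line(doses, scores)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
print(f"predicted score at dose 5: {predict(b0, b1, 5):.2f}")
```

With these invented numbers the slope comes out negative, which matches the lecture's point that a negative relationship (higher dose, lower depression) describes direction and not whether the outcome is good or bad.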

Correlation&Regression3 Transcription

00:00 - 00:30 so in this next section of the lecture we're going to be talking about regression and regression is not dissimilar to correlation it's sort of taking correlation and moving it One Step Beyond from merely describing a relationship to predicting outcomes so when we think about regression we are also we're still thinking about two variables and the relationship between them but now we're asking can we use people's scores on one variable and something about knowing
00:30 - 01:00 something about the relationship between them to predict an outcome variable so we know the correlation tells us about the strength and direction of a linear relationship but sometimes we want to take that idea a step further so we might want to predict the value of a dependent variable based on the value of one and sometimes more independent variables this is especially true if a researcher has selected or set a range of values that an independent variable can take
01:00 - 01:30 where you might want to say okay how does moving one step further down the range of independent variable values how does that change values on the dependent variable we also might want to examine how changes in an independent variable might affect scores on a dependent variable and the tool we use to do these things is called regression so we will look at linear regression or bivariate regression bivariate means two variables
01:30 - 02:00 in which we are predicting the effect of one explanatory variable on one response variable we could do this using more variables we could use multiple explanatory variables to predict an outcome of interest and often in psychology we do use multiple variables because psychology is a very complicated science that's known as multiple regression if you continue on in your statistics journey you will learn more about that as you continue so we use regression to do a bunch of
02:00 - 02:30 different things one of which is to understand how a particular system might work so let's think about the system of incarceration and crime in Baltimore Maryland and this is a study that was done a bunch of years ago now in which the researchers were interested in estimating how a one percent increase in the proportion of black fathers that were incarcerated enhanced crime rates a year later now Baltimore I don't know if you know anything about the city but it's relatively speaking a bad neighborhood there's a lot of crime and
02:30 - 03:00 in fact there's there's a lot of drug crime and black people seem to be disproportionately affected by arrests and by custodial or prison sentences relative to other relative to people of other races or relative to white men or Asian men and it looks like according to research that as more black fathers become incarcerated
03:00 - 03:30 that actually makes the crime rates worse so there's a positive relationship between crime and incarceration in Baltimore Maryland especially the incarceration of black fathers and I want to call your attention to my use of the word positive here I don't mean an evaluation of this relationship I don't mean that crime is going down and that's a good thing what I mean is there is a positive relationship so as the proportion of incarcerated black fathers increases it is likely to enhance crime a year later in the city
03:30 - 04:00 um so which is not a good outcome right that's a a negative outcome even though it's a positive relationship so remember that we're talking about positive and negative here and we're using those words in the sense that we talk about relationships in correlations we can think about positive or negative relationships it has nothing to do with how we evaluate an outcome we can also use regression to predict responses so if we know something about the strength and
04:00 - 04:30 direction of a relationship between for example a depression level and an antidepressant dose that might allow us to predict how much a particular person taking that drug will change so if we were to prescribe an antidepressant to someone who was depressed we would have an idea we could use regression to have an idea about what their likely outcome is how much more depressed or how much less depressed they will be down the road so
04:30 - 05:00 we might expect that as dosage of a particular antidepressant increases depression will decrease more so there might be a negative relationship between levels of depression and the dose you take of an antidepressant where higher doses lead to less depression later right so remember again we're not evaluating the type of outcome here we're evaluating the type of relationship so there might be a negative relationship there but we can use these kinds of data to predict
05:00 - 05:30 responses in outcome variables and to understand how various really complicated systems work so the linear regression models that we're going to talk about today examine the association between a predictor variable and a response variable the response variable is often described by the letter Y and it needs to be related it needs to have a linear function so the X and Y need to be linearly related to one another in order for
05:30 - 06:00 regression to work so here we're going to talk about regression with only one independent or predictor variable and we will call our predictor variable X which is traditional in statistics we will be looking at the effect of X on one dependent or response variable which we often denote as y so the relationship between the variables is described by a linear function changes in y are assumed to be related to changes in X we need to be
06:00 - 06:30 really careful there changes in y are assumed to be related to changes in X because we've suggested that X is our independent or predictor variable whereas Y is our response variable however without an experimental design we have a hard time showing that so remember that just because we think about them that way does not mean that we have a causal relationship linear regression is also known by a couple of other names it's called ordinary least squares regression it's
06:30 - 07:00 called bivariate regression it's called simple regression so that's what we're going to be doing here today simple regression so what does linear regression do so linear regression finds what we call the line of best fit which minimizes the sum of the squared errors between the line and the individual data points so this is a correlation graph that you've seen before and what you could do is you can imagine a line on that graph that minimizes the difference between that line and all the data points there so
07:00 - 07:30 remember how the mean minimizes the difference between itself and all of the other scores in our data set so the deviations from the mean sum to zero well here what we're doing is we're looking at a line now that's not a sort of straight line across the bottom of this data set but that is a line that best fits whatever the value of this correlation is here across the data points it is an estimator of the linear relationship between the explanatory and
07:30 - 08:00 response variables so when we talk about regression models regression models describe the relationship between X and Y and they also give us a term we call error it's represented as this little guy right here in the equation so linear regression is represented by y equals beta0 plus beta1 times X1 plus e or
08:00 - 08:30 error so beta0 and beta1 are model parameters we don't know what they are they're mysteries to us they're actually calculated by the regression you won't do regression by hand but we'll use NumPy or actually we use SciPy for this it's a Python library so we'll use SciPy to calculate our
08:30 - 09:00 regression model and it will give us a line of best fit it will give us it will return to us what these parameters are the error is a random variable um called error and it's related to this sampling error in our sample so let's unpack this a little bit because this is pretty complicated so we have a predictor variable X we have a response variable Y and we can see with just your eyeballs is that there seems to be some kind of a relationship between them and in fact it seems to be a reasonably strong positive relationship as we
09:00 - 09:30 increase on our X variable it looks like scores on our y variable also increase so that's a positive relationship and our line of best fit looks like this and you can see it completely and fully intersects a couple of these points so there's no deviation of those points from the line but there are deviations for others of these data points and those deviations can be measured so the sum of the difference between
09:30 - 10:00 this score and the line and this score and the line and all these other scores and the line will sum to zero so what do you think we're going to be doing here when we do regression yeah we're going to be looking at squared deviations aren't we so the line of best fit will be given to us by these parameters these parameters will tell us how to draw that line so we're going to keep going so we also have a regression equation
10:00 - 10:30 that allows us to make predictions so the regression model is the exact model for every single score in the data set but the prediction model the regression equation that allows us to make predictions has a few different variables in it they're slightly different notice this first one we call this guy y-hat because he has this little um kind of caret character over the top of his head he looks like he's wearing a hat so y-hat which is the predicted
10:30 - 11:00 value of y is equal to b0 notice we've changed our beta to B so we've changed from our Greek letter which is a parameter to our Roman letter which is a sample statistic b0 plus B1 times X1 and b0 is the average response when X has a value of zero it's what we call the y-intercept so when our X variable here equals zero
11:00 - 11:30 where does our line cross it so b0 is this number down here B1 is the slope of the line so B1 is the change in the average response variable y as X increases by one unit so it's the slope so if we know a value of x that we're putting into this equation maybe x
11:30 - 12:00 equals five if we multiply 5 by our value of B1 and then add it to b0 what we will end up with is a predicted y I'll show you more about that on the next slide maybe choose the let's so when we're thinking about representing these model terms of course we have our predicted y on the y-axis and our X on the x-axis and the regression line is the line of
12:00 - 12:30 best fit so our regression line will be somewhere then its slope and positioning will depend on the data its intercept is where the regression line crosses the y-axis the slope B1 here's a positive slope so this is a positive relationship as values of X increase predicted values of Y will also increase so that's our regression line or the line of best fit here is
12:30 - 13:00 a different kind of intercept this intercept is a different value as you can see this one's down here this one's up here so we have our b0 we still have our line of best fit but in this case the slope B1 is negative so as values of X increase predicted values of Y decrease so this is what these different elements look like on a graph and what we can then do is we can use them to actually make predictions about outcomes in the
13:00 - 13:30 data so when we're thinking about predictions what we're asking is for each unit of increase in X what is the expected or predicted change in y so we can use these two values which we get from our regression analysis these two parameter estimates that are based on our sample data along with a value of x to give us a
13:30 - 14:00 prediction about y so let's put some bars on this graph and let's look at some what some of those predictions might look like so each one of these little lines here is a one unit of increase in x there's our intercept B zero so let's say we have an X of 1 here what's our value of y what's our predicted value of y well if we have an
14:00 - 14:30 X of 1 and we know where our line of best fit is what we can then do is we can trace up from one to our line of best fit to where we get this intersection here and we can then from there Trace over to the y-axis and we can get a prediction for what our y-hat should be so it looks like for each increase in unit of X we get maybe a one and a tiny little bit more maybe a tenth maybe a 15th of a unit
14:30 - 15:00 increase in y so we can do that kind of across the sample and we can come up with our predictions now the other thing we know if we compare this equation to this equation you can sort of see where the similarities are right you can look at so we have a y hat which is not too dissimilar from a y except there's a little hat over this is a prediction this is an exact y score we have b0 we know that these are both
15:00 - 15:30 that beta0 is our parameter the true intercept in the population and b0 is our sample based statistic that estimates the parameter and we have B1 and X we know what these are we know that beta1 is the slope it's estimated by this slope right here we have X1 being our first value of x or the next value of x and here we
15:30 - 16:00 have this error so what's he doing well remember this is a prediction and we know that there's sampling error that occurs so just because the X and the Y variables are related doesn't mean that every single person in your sample will have exactly the same response some people have a slightly greater response some people have a slightly lower response and that
16:00 - 16:30 sampling error is really the difference between this predicted y here and the error here so when we're thinking about the error estimates we know that some of our scores fall directly on our regression line but lots of them don't so and we said that they were off we we calculated some or we didn't calculate we drew in some errors and we could do that here with this sample as well so the dots are not all exactly on the line so for example this person here who has
16:30 - 17:00 a value of one two three four on the X axis his y score is substantially lower than what we would predict it would be and that is our error estimate that error estimate tells us something about how good a prediction is in general how good a prediction is is denoted by a
17:00 - 17:30 special statistic it's represented as capital r squared it's called the coefficient of determination and it's the proportion of variance in y that can be explained By changes in x another way of thinking about that is how well does our regression model fit the actual data how much error variance is there in that line of best fit and it's calculated as the sum of squares explained by the relationship by the X Y relationship divided by the
17:30 - 18:00 total sum of squares you don't need to know this I'm not going to ask you about that on a test you might need to know how to do that prediction but for right now you don't need to know exactly what this quantity is in a simple bivariate regression equation r squared is equal to the correlation coefficient squared so r times itself the larger the r squared the better the regression model explains the data so let's say we have our y variable and our X variable what you can see is there's some relationship between these
18:00 - 18:30 two variables because there's some overlap in their distributions here I'm using Venn diagrams to kind of think about distributions if our variables look like this that would be a much stronger relationship because there's much more overlap between the X and the Y variables in the two circles on the bottom than there is in the circles on the top and our coefficient of determination describes how much overlap is there how much
18:30 - 19:00 covariation is there between our X and our y variables so the bigger the r squared the better your regression model explains your data now we can do statistical significance testing on these correlation coefficients on our estimated B's the b0 and the B1 we care more about the B1 by the way so when we conduct significance testing on a regression model we often try to tell whether the
19:00 - 19:30 value of the slope is different from zero so our null hypothesis might be in a regression equation that there is no relationship between the scores that the slope of our line of best fit will be essentially equal to zero and as usual if our test statistic falls into the rejection region we reject h0 otherwise we retain it and there are two families of tests that we often talk about when we're thinking about significance testing in regression
19:30 - 20:00 these are t-tests and f-tests both of those will come up in fact t-tests will come up in the next week of class in one of the next lectures and f-tests will come up two weeks after that in two weeks so when do we use regression well our y variable our dependent variable must be a continuous variable our X variable might be continuous but it could also be categorical so here now we have an
20:00 - 20:30 extension another extension of correlations you'll remember that in correlation both the X and the Y need to be continuous variables but in regression our X variable could actually be categorical so we could ask about you know we could look at the relationship between the number of hours that students spend studying a week um which might be a continuous y variable and we could look at what year in university they were so maybe it's the case that students in the first year
20:30 - 21:00 of university don't spend quite as much time studying as students in year two and that university gets harder as you progress through it so maybe your third year students spend more time studying than second year and so forth so we could actually use year in university as a categorical predictor variable and that's one of the especially nice things about regression is that it allows us to escape from this requirement that our X
21:00 - 21:30 variable needs to be continuous so that's one of the really nice things about and really nice areas in which we can use regression there are of course some assumptions and interestingly the assumptions look similar to other assumptions we've talked about so far in this class but the interesting thing that's different about the assumptions in regression is that the assumptions we make in regression apply to the errors and not to the scores themselves so the
21:30 - 22:00 error now is a random variable with a mean of zero it is normally distributed across each value of x so if we think about how much error variance there is if you are a first year versus a second year versus a third year university student and so forth that error variance needs to be normally distributed at each one of those values the variance of the errors which is often
22:00 - 22:30 denoted as sigma squared is the same for all values of X this is a term called homoscedasticity and the other assumption is that the values of the error term are independent so when we're thinking about the assumptions of regression those assumptions that we make apply to the error terms rather than to the basic data so apply to the errors and this is the thing that allows us to jump from
22:30 - 23:00 requiring us to have a continuous dependent variable or independent variable and allows us to shift here into a different kind of statistical model and that's all I'll tell you about regression for now it's a brief introduction to the topic we'll pick up with the next statistical test z-tests and t-tests in the next lecture
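Two claims from the lecture can be checked numerically: the deviations of the scores from the line of best fit sum to zero, and in simple bivariate regression r squared equals the correlation coefficient multiplied by itself. The sketch below is an illustration under stated assumptions, not the lecture's code: the data are invented, the variable names are hypothetical, and the fit uses the standard closed-form least-squares formulas with only the Python standard library.

```python
from math import sqrt
from statistics import mean

# Invented illustration data with a roughly linear positive relationship
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)                        # sum of squares of x
syy = sum((y - y_bar) ** 2 for y in ys)                        # total sum of squares of y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # sum of cross-products

# Least-squares estimates of the slope and intercept
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residuals: observed y minus predicted y (y-hat) for each data point
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Coefficient of determination: explained sum of squares over total sum of squares
ss_error = sum(e ** 2 for e in residuals)
ss_total = syy
r_squared = (ss_total - ss_error) / ss_total

# Pearson correlation coefficient, for comparison with r_squared
r = sxy / sqrt(sxx * syy)

print(f"sum of residuals: {sum(residuals):.2e}")
print(f"R squared: {r_squared:.4f}, r times r: {r * r:.4f}")
```

The raw residuals cancel out, which is why the line of best fit is defined by minimizing squared deviations. Plotting `residuals` against `xs` is one informal way to look for unequal spread, which would suggest the homoscedasticity assumption is in doubt.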