Chi Sq 2
Summary
In this lecture, Erin Heerey delves into the intricacies of the Chi-Square test, a statistical method used for analyzing categorical data. The lecture begins with a clear distinction between categorical and continuous data, emphasizing the qualitative nature of categorical data that cannot be ordered. It then explains discrete data with real-world examples, leading into a detailed discussion of the Chi-Square test and its two types: goodness of fit and independence. Through practical examples, the lecture illustrates how these tests help determine relationships between categorical variables.
Highlights
- Chi-Square tests require categorical data, which fall into non-orderable categories
- Discrete data are finite and countable, like counting cars or votes
- Continuous data have infinitely many values within a range, like measuring temperature
- Chi-Square tests can analyze categorical data to check for goodness of fit or independence
- Contingency tables are essential for organizing categorical data in Chi-Square tests
- Understanding probability and independence is crucial when interpreting Chi-Square results
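As a rough illustration of how a contingency table organizes categorical data, the sketch below tallies raw (grade level, goal) observations into a table of counts. The observations and the helper name `contingency_table` are made up for this example and do not come from the lecture.

```python
from collections import Counter

# Hypothetical raw observations: one (grade level, stated goal) pair per pupil.
observations = [
    ("4th", "grades"), ("4th", "popular"), ("4th", "grades"),
    ("5th", "sports"), ("5th", "grades"), ("5th", "popular"),
    ("6th", "popular"), ("6th", "popular"), ("6th", "sports"),
]

def contingency_table(pairs):
    """Count how many observations fall into each (row, column) cell."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for cells that never occurred.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = contingency_table(observations)
for label, row in zip(rows, table):
    print(label, dict(zip(cols, row)))
```

Each cell is a frequency count, which is the kind of discrete, countable data a Chi-Square test works with.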
Key Takeaways
- Understanding categorical vs. continuous data is key when working with Chi-Square tests
- Chi-Square tests are used for analyzing categorical data and determining relationships between variables
- There are two types of Chi-Square tests: goodness of fit and test for independence
- Categorical data, which are qualitative and not orderable, are fundamental for Chi-Square tests
- Discrete data are finite and countable, while continuous data have infinitely many values between any two points
- Chi-Square tests require a reasonable sample size to ensure reliable outcomes
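The goodness-of-fit idea (does a sample match an expected distribution?) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the token counts are invented, the null hypothesis of equal shares is chosen only for illustration, and `chi_square_gof` is a hypothetical helper name. It uses the standard Pearson statistic, the sum of (observed - expected)^2 / expected over the categories.

```python
def chi_square_gof(observed, expected_props):
    """Return the Pearson chi-square statistic and degrees of freedom.

    observed: list of counts per category.
    expected_props: expected proportion per category (should sum to 1).
    """
    n = sum(observed)
    expected = [n * p for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df

# Hypothetical token counts in three charity bins (not data from the lecture).
observed = [50, 30, 40]
# Assumed null hypothesis: each charity is equally popular.
stat, df = chi_square_gof(observed, [1 / 3, 1 / 3, 1 / 3])
print(round(stat, 2), df)
```

With these made-up counts every expected count is 40, so the statistic works out to 5 on 2 degrees of freedom.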
Overview
In this educational video, Erin Heerey breaks down the essential components of the Chi-Square test, a fundamental tool for analyzing categorical data. By distinguishing between categorical and continuous data, Heerey sets the stage for understanding how these different forms of data can be used in statistical analysis. The focus is primarily on categorical data, which involve non-orderable categories, making them perfect candidates for Chi-Square tests.
Heerey further illustrates the concept of discrete data through practical, real-world examples such as counting the number of cars at an intersection or tallying votes for a popular charity. This segues into an overview of continuous data, which can be broken down into infinite values within any given range. These distinctions are pivotal for applying the Chi-Square test effectively, whether checking for goodness of fit or testing for independence between variables.
The lecture concludes with a deep dive into contingency tables and the importance of understanding probability and independence when interpreting the results of a Chi-Square test. Through examples such as analyzing student preferences based on grade level, Heerey provides a comprehensive understanding of how Chi-Square tests help identify associations between categorical variables, making it a versatile tool in statistics.
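To make the test for independence more concrete, here is a minimal sketch that compares observed counts in a contingency table with the counts expected if rows and columns were unrelated. The table values are invented for illustration (they are not the data shown in the lecture), and the function names are hypothetical. Expected counts use the standard formula: row total times column total, divided by the grand total.

```python
def expected_counts(table):
    """Expected cell counts if the row and column variables are independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a table."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts. Rows: grades 4, 5, 6. Columns: grades, popular, sports.
table = [
    [30, 20, 10],
    [25, 25, 10],
    [15, 35, 10],
]
stat, df = chi_square_independence(table)
print(round(stat, 3), df)
```

A larger statistic means the observed counts sit further from what independence would predict. Judging whether it is large enough requires the chi-square distribution for the given degrees of freedom, which this sketch does not cover.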
Chapters
- 00:00 - 01:00: Introduction to Categorical Data The chapter discusses the differences between categorical and continuous data, with a focus on the chi-squared test, which requires categorical data. Categorical data is a form of qualitative data where items fall into distinct categories, such as determining favorite pets.
- 01:00 - 03:00: Categorical vs. Discrete Data The chapter discusses the differences and similarities between categorical and discrete data using the example of pet preferences. It explains how people can be categorized based on their favorite types of pets like cats, dogs, birds, etc., and how these categories can be used to count and classify individuals into groups. The focus is on how these classifications are important in understanding data and preferences.
- 03:00 - 04:30: Discrete vs. Continuous Data The chapter 'Discrete vs. Continuous Data' explores the concept of qualitative data, emphasizing that these data types, like preferences for certain pets, cannot be ordered or ranked. It highlights the idea that people have personal preferences that aren't inherently better or worse and cannot be quantitatively measured or compared.
- 04:30 - 05:00: Introduction to Chi-Squared Tests The chapter provides an introduction to Chi-Squared Tests, exploring different types of data, including categorical data that falls into unordered categories, and discrete quantitative data which is finite and countable. The discussion also touches on how to handle subjective preferences and the importance of questioning to determine who might be 'right' in interpreting the data.
- 05:00 - 07:00: Chi-Squared Test for Independence The chapter discusses a program implemented by a British supermarket where customers, upon spending a certain amount, are given tokens that they can use to vote for a charity the supermarket will donate to. Customers insert their tokens into boxes labeled with different charities, and the charity with the most tokens at the end of the week receives the donation. Usually, there are three or four charities to choose from. This system encourages customer participation in charitable donations and showcases different charitable organizations.
- 07:00 - 08:00: Explaining Independence with Examples The chapter discusses how discrete data, such as tokens representing contributions to different charities, can be used to evaluate opinions about those organizations. Each token is countable and represents one unit, making it a finite way to measure views. The Great Backyard Bird Count is mentioned as another example of counting discrete items to gather information.
Chi Sq 2 Transcription
- 00:00 - 00:30 all right so let's talk about categorical versus continuous data so the test we're going to be talking about in the next parts of this lecture is a test called chi-squared and chi-squared requires us to have categorical data so I want to review what categorical data is so categorical data is a kind of qualitative data metric in which items fall into categories so you could ask the question what is your I could ask you the question in fact what is your favorite pet are you a dog lover are you
- 00:30 - 01:00 a cat lover do you prefer a bunny how about birds goldfish bearded dragons or uh I don't know cute dogs dressed up as squirrels um what kind of pet is your favorite and you could give me a category and what we could then do is we could count the number of people who like each one of these different kinds of pets so these are categories and they are classifications we could classify you as a dog lover or a cat lover or whatever and those classifications importantly
- 01:00 - 01:30 are not orderable I can't say that some people are better because they're dog lovers some people are better because they're cat lovers none of these pets if you're the person who loves that kind of pet you think that's the best pet and there's nothing anybody else can say so these are qualitative data categories that don't have an order to them and that cannot be ordered using any
- 01:30 - 02:00 anything other than different people's preferences and then you have to ask the question well who's right so categorical data are data that fall into categories that cannot be ordered we also have data that are quantitative data that are what we call discrete so these are finite countable things so one type of discrete data this is a
- 02:00 - 02:30 display from a British supermarket if you spend a certain amount of money what the supermarket will do is donate some money to a charity so they give you little tokens that are associated with whatever the name is here on this thing and you put whatever token into the box that you want to and so the token goes into the box and the charity that's on the box for that week usually there's a choice of three or four
- 02:30 - 03:00 um that charity will be given a certain amount of money for each token that's in its bin um so this is discrete data they're finite and countable so each of these tokens counts as one item we can count the number of tokens there are in this bin and that bin and this bin and that bin and that will give us a metric for evaluating people's views on these different bins or on these different charities right you can also do the Great Backyard Bird Count is another one
- 03:00 - 03:30 of these things they count the number of different kinds of birds that people report seeing during a particular weekend um another one is traffic surveys right where people actually count the number of cars using a particular intersection or using a particular road or maybe they're counting cars versus bikes or what have you so these are things that are discrete and countable you can't get partial cars you either count a car or you don't count a car you don't get six and a half cars
- 03:30 - 04:00 it's either six or seven cars those are your two choices um so these are not things that we think about in terms of sort of partial numbers they are discrete entities and each item is one thing and we also know that based on the counts the more of that particular item is counted the more frequently it occurs so the data are orderable right if um if I see more I don't know Toyota
- 04:00 - 04:30 Priuses on my way home um than Ford Mustangs um that will tell me something about people's preferences for Toyota Priuses versus Ford Mustangs um and those data are countable based on the number of items there are so these are discrete quantities they can't be measured in halves and finally we have continuous data so in continuous data
- 04:30 - 05:00 there is an infinite number of values between any two particular values in a defined area so we think about length that way so we think about a metric so I can say okay an inch two inches three inches but actually I can take the amount of space between this point and this point zero inches and one inch for instance and I could break that down into smaller and smaller quantities so as to get an
- 05:00 - 05:30 infinite number of values between those two points and those are continuous data temperature is another one speed is another one so these are continuous data so in chi-square we deal with categorical data so we often deal with frequency counts for different categories of things that cannot be ordered but we also often deal with discrete and countable data so those are the two kinds of data we typically use for chi-squared
- 05:30 - 06:00 so we're going to talk about that next we're going to talk about testing associations with categorical data so we use nominal data where the observations fall into non-ordered groups they're often labeled or named so you could think about gender you could think about ethnicity you could think about religion you could think about country where you were born you could think about M&M colors we also have ordinal data so the observations are ordered but they're not on a standard scale so this might be places in a race right so if we're
- 06:00 - 06:30 having a race and you get first place and I get second place and someone else gets third place the difference between first and second place might not be the same as the difference between second and third people don't come in with exactly measured precision between them sometimes they're closer to one another sometimes they're further apart so these are ordered but not on a standard scale another kind of ordinal data that we often deal with with chi-squared is
- 06:30 - 07:00 ratings data so if I say oh how much did you like this thing on a scale of one to seven you could give me a number and you know and I would know how to interpret that I could ask somebody else to tell me exactly I could give them exactly the same question and they could give me a different number and what we don't know is how different those numbers are I could ask you and you could give me if I said how much do you like to eat oatmeal for breakfast and you told me uh two out of seven um somebody else might tell me a two out of seven as well and mean something
- 07:00 - 07:30 different by it so ratings are a kind of ordinal data that we also often deal with in chi-squared although not all the time in psychology we like to analyze ratings data as if they are not exactly ordinal data even when they are and finally we can think about discrete data so these counts numbers of observations of a particular event are a pretty standard chi-squared analysis
- 07:30 - 08:00 so there are two major types of chi-squared tests there's the chi-squared goodness of fit test where we ask does a sample of data match a particular distribution that we might expect it to match and we have a categorical variable that we compare with an expected population distribution now we need to have a reasonable sample size for that outcome or for the results of that test to be reliable
- 08:00 - 08:30 the second kind of chi-squared is the chi-squared for independence where we're asking are two events or two outcomes associated with one another and so what we're doing is we're comparing proportions within a contingency table to expected proportions if there were no association between those events or between those categories again we need to have a reasonable sample size for that test to be reliable so the test we're going to be talking about in this class is the chi-square test for independence
- 08:30 - 09:00 it's very similar to the chi-square goodness of fit test we just calculate our expected proportions just a little bit differently so as a review we've talked about contingency tables in this class before so you'll remember from our gender discrimination example we had a really simple two by two contingency table where it was resume gender whether the name on the person's resume was a a man's name or a woman's name and whether
- 09:00 - 09:30 a bank manager promoted or did not promote that person so there we have our explanatory variable which is our independent variable or resume gender and our response variable which is the promotion decision both of those variables are categorical and that is our primary ingredient in chi-squared so I might give you a description on a test of a research study that someone has done that has a categorical independent variable
- 09:30 - 10:00 and a categorical dependent variable and I might ask you which kind of test you would do to analyze it I like these kinds of questions by the way I include them every year so one of the things that you should pay attention to moving forward is what are the ingredients for the test what kind of data do you need to use this particular type of test now we can also do chi-squared when we have more than two categories so here's an example for what we call the chi-square test of independence so in
- 10:00 - 10:30 this study kids in grades four five and six were asked whether good grades um athletic ability so sport ability at sport or popularity social popularity was most important to them and we can ask the question is there a relationship between the rows and columns do the kids' goals vary um by grade level and what you can see here this is called the carpet plot and the boxes that you
- 10:30 - 11:00 see are sized their area is the relative proportion of the sample so if you look here are fourth graders fifth graders and sixth graders here are the number of fourth graders who say grades are the most important thing here are the number of fourth graders who say being popular is the most important thing and here are the number of fourth graders who say being sporty is the most important thing and so these are relative proportions of the sample here
- 11:00 - 11:30 in these um squares in the carpet plot so what we can ask is is there a relationship between the rows and the columns so are the rows what grade kids are in related is that related to or independent of what they think is most important in school so the hypothesis for a chi-square test if we're going to
- 11:30 - 12:00 think about the null hypothesis there is no relationship between grade level and preference so we might say preferences do not vary by grade that is a null hypothesis there's nothing going on there's no relationship grade level and preferences are independent so that's a null hypothesis we can also make a research hypothesis which is that preferences do depend on grade level so people's or kids preferences are going to vary
- 12:00 - 12:30 depending on what grade they're in that is a research hypothesis in which preferences depend on grade level now let's go back and have a quick review of this idea of Independence because that's what we're going to be working with here in the in the chi-squared so in the probability lecture you learned that two variables were independent if the probability of one event given the other
- 12:30 - 13:00 is the same as the probability of the of the first event by itself so two events are totally independent so I think I gave you the example of um walking to work so I walked to work rain or shine it doesn't really matter what the weather is so getting up and knowing that it's raining today does not affect the probability that I will walk to work I will walk to work if I come to work I sometimes work from
- 13:00 - 13:30 home but by and large I walk to work and so the weather and my walking to work are independent of one another the probability that I walk to work event X is totally unrelated to the probability of event Y the weather so regardless of what event Y is regardless whether it's raining or sunny or anything else I'm walking to work and
- 13:30 - 14:00 so that probability of of me walking to work is independent of the weather now if I only walk to work when it's sunny once we know that knowing that it's sunny increases the probability that I will walk to work so if I'm walking to work and I only walk when it's sunny then those events are not independent so
- 14:00 - 14:30 in more mathy terms if the probability of event X given event Y equals the probability of event X then event X and event Y are independent and from a conceptual standpoint knowing something about Y whether it's sunny outside today tells us nothing about X whether I will walk to work so that's what independence is and so that's what we're actually asking when we do this chi-squared so let's talk about the chi-square itself we'll do that in the next segment
- 14:30 - 15:00 of the video
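The independence definition from the end of the transcript, that X and Y are independent when the probability of X given Y equals the probability of X, can be checked numerically. The sketch below uses invented day counts for the walking-to-work example, and the helper names `p_x` and `p_x_given_y` are hypothetical. It is only meant to show the comparison, not a formal test.

```python
def p_x(counts, x):
    """Marginal probability of outcome x, from (x, y) -> count data."""
    total = sum(counts.values())
    return sum(n for (xi, _), n in counts.items() if xi == x) / total

def p_x_given_y(counts, x, y):
    """Conditional probability of x among observations where y occurred."""
    y_total = sum(n for (_, yi), n in counts.items() if yi == y)
    return counts.get((x, y), 0) / y_total

# Hypothetical day counts: the speaker walks rain or shine.
rain_or_shine = {
    ("walk", "rain"): 18, ("walk", "sun"): 27,
    ("no walk", "rain"): 2, ("no walk", "sun"): 3,
}
# Hypothetical day counts: the speaker walks only when it is sunny.
sunny_only = {
    ("walk", "rain"): 0, ("walk", "sun"): 27,
    ("no walk", "rain"): 20, ("no walk", "sun"): 3,
}

for name, counts in [("rain or shine", rain_or_shine), ("sunny only", sunny_only)]:
    print(name, p_x(counts, "walk"), p_x_given_y(counts, "walk", "sun"))
```

In the first table the conditional and marginal probabilities of walking match, so the weather tells us nothing about walking. In the second, knowing it is sunny raises the probability of walking, so the two events are not independent.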