Chi Sq 3


    Summary

    Erin Heerey delves into the workings of the Chi-Squared test statistic, explaining how it measures the difference between observed and expected counts under the null hypothesis. The test sums, over all cells, the squared difference between observed and expected counts divided by the expected count, and the result helps determine statistical significance. Large Chi-Squared values suggest a significant effect, while degrees of freedom play a crucial role in interpreting results. The lecture outlines conditions for application, discusses independence, and encourages using Python's SciPy library for computation.

      Highlights

      • Chi-Squared is written with the Greek letter Chi and a square symbol 🔍.
      • The statistic sums, over all cells, the observed count minus the expected count, squared, divided by the expected count 🤓.
      • Example calculations derive the expected-count formula using algebra and probabilities 📐.
      • Degrees of freedom are calculated as (number of rows minus one) times (number of columns minus one) ➗.
      • Chi-Squared cannot be negative; a value of zero means observed and expected counts match exactly 🚫.
      • Degrees of freedom play a pivotal role in interpreting Chi-Squared results.
      • The SciPy library in Python facilitates Chi-Squared calculations in labs 🔬.

      Key Takeaways

      • Chi-Squared test examines the difference between observed and expected cell counts 📊.
      • Formula involves summing ratios of squared differences over expected counts 🔢.
      • High Chi-Squared values are linked to significant findings 🌟.
      • Degrees of freedom are crucial for calculating test statistics 📈.
      • Python’s SciPy library can be used for Chi-Squared test computations 🐍.
      • Independence and adequate sample size are assumptions of the test ✅.

      Overview

      The lecture, led by Erin Heerey, details the Chi-Squared test statistic, a pivotal tool in statistical analysis used to measure the deviation between observed and expected counts under the null hypothesis. Its formula divides the squared difference between the observed and expected count by the expected count, and sums this ratio over every cell. Expected counts rely on probabilistic reasoning under the assumption that there is no relationship between the variables.

      Through Erin's discussion, listeners grasp the Chi-Squared distribution, with higher values suggesting more significant effects. The session sheds light on degrees of freedom, computed as the number of rows minus one times the number of columns minus one, and their importance in interpreting the statistic. It also underlines that the Chi-Squared value cannot be negative, because squaring removes any negative differences.

      Finally, the interpretation of Chi-Squared distributions and significance tests becomes clearer, with the discussion covering assumptions such as adequate sample size and independence. The lecture suggests using Python's SciPy library to facilitate computations, steering learners toward efficient statistical testing via practical labs.
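
The formula summarized above can be sketched in a few lines of Python. The counts used here are the first row of the lecture's worked table (observed 63, 31, 25 against expected 61, 35, 23); the function itself is a minimal illustration, not the lecture's code.

```python
# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected.
def chi_squared(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# First row of the lecture's table: fourth graders nominating
# grades, popularity, and sports, with the expected counts in blue.
obs = [63, 31, 25]
exp = [61, 35, 23]
stat = chi_squared(obs, exp)
print(round(stat, 2))
```

The full statistic in the lecture sums this same ratio over all nine cells of the 3x3 table.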

            Chapters

            • 00:00 - 01:00: Introduction to Chi-Squared Test The chapter introduces the Chi-Squared test statistic, denoted by the Greek letter Chi and the square symbol. It is a statistical measure used to evaluate the deviation between observed counts for each cell and the expected counts under the assumption that the null hypothesis is true. The formula for Chi-Squared involves summing these deviations across all cells.
            • 01:00 - 06:00: Expected Probabilities and Count Calculations This chapter discusses the calculation of the chi-square test statistic, which involves summing the ratio of the observed count minus the expected count squared divided by the expected count for each cell in a table. The focus is on the process of obtaining expected probabilities.
            • 06:00 - 11:00: Chi-Squared Calculation and Interpretation This chapter explains how to calculate and interpret Chi-Squared in the context of determining probabilities. It focuses on the probability of fourth graders nominating grades, popularity, or sports as the most important thing, emphasizing the comparison between actual and expected counts of participants' responses.
            • 11:00 - 16:00: Degrees of Freedom Explained The chapter 'Degrees of Freedom Explained' provides a detailed discussion on understanding the concept of degrees of freedom in statistical analysis. It starts with a basic overview of how one would expect data items to be distributed in cells under the null hypothesis—that there is no relationship between variables being examined, such as grade level and opinions on importance of grades, popularity, or sports.
            • 16:00 - 22:00: Specific Chi-Squared Example and Analysis The chapter titled 'Specific Chi-Squared Example and Analysis' explains the computation of the probability of two events: being in fourth grade and liking sports. It discusses how this is the probability of item A, which is the probability of being in fourth grade, multiplied by the probability of item B, nominating sports as the most important thing. The chapter includes a complex equation represented with numbers, such as the marginal probability of being in fourth grade, denoted as 119 over 478.
            • 22:00 - 29:00: Chi-Squared Test Assumptions and Conditions This chapter discusses the application of the Chi-Squared (χ²) Test, focusing on the assumptions and conditions necessary for its validity. Examples include calculating probabilities from sample data (e.g., nominating Sports) and simplifying complex mathematical expressions to understand relationships within data, using algebra to find marginal probabilities. The discussion provides a grounding in applying the test correctly by understanding marginal totals and their importance in calculating expected frequencies in contingency tables.
            • 29:00 - 30:00: Python Implementation of Chi-Squared Test The chapter discusses the Python implementation of the Chi-Squared test. It starts with an explanation of the expected counts calculation using an algebraic formula, where the expected count is computed as (row total * column total) / grand total. Specifically, in a demonstration, the expected count was calculated as (119 * 90) / 478, which rounds to 23. This indicates that the expectation, in this case, is that 23 people would be in a specific cell of the contingency table.
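
The expected-count rule described in the chapters above (row total times column total divided by the grand total) and the degrees-of-freedom rule can be checked directly. The numbers are the marginals quoted in the lecture: 119 fourth graders, 90 sports nominations, 478 students in total. Note that the exact value is about 22.4; the lecture's slides show rounded expected counts.

```python
# Expected count for a cell = (row total * column total) / grand total.
row_total, col_total, grand_total = 119, 90, 478  # marginals from the lecture
expected = row_total * col_total / grand_total
print(round(expected, 1))  # about 22.4

# Degrees of freedom for an r x c table: (rows - 1) * (columns - 1).
rows, cols = 3, 3  # three grade levels, three response options
df = (rows - 1) * (cols - 1)
print(df)  # 4
```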

            Chi Sq 3 Transcription

            • 00:00 - 00:30 all right so let's talk about the chi-squared test statistic so Chi Squared which you've seen before is written with the Greek letter Chi and the square symbol the chi-squared test statistic examines the deviation between the observed counts for each cell and the expected counts if the null hypothesis is true so our formula looks like this we have Chi Squared equals and this is our big summation operator you can guess that this is going to be a sum
            • 00:30 - 01:00 and what we're summing over is a ratio it's the ratio of The observed count in a Cell minus the expected count squared divided by the expected count so the chi-square test statistic is the sum of those over each of the cells in your table so how do we get expected probabilities
            • 01:00 - 01:30 so we might want to calculate the probability of being in fourth grade and thinking grades are the most important thing the probability of being in fourth grade and thinking being popular is the most important thing the probability of being in fourth grade and thinking sport is the most important thing now remember we're comparing our actual counts here to our expected counts and our expected counts are what the number of participants or the number
            • 01:30 - 02:00 of items we would expect in each cell given that the null hypothesis is true given that there's no relationship between which grade you're in and whether you think grades popularity or Sports is most important so that's our null hypothesis there's no relationship between the explanatory and response variables so how do we get these numbers well if we want to think about the expected count the expected count for being in a particular cell let's say
            • 02:00 - 02:30 being in fourth grade and liking Sports is the probability of item a which is the probability of being in fourth grade multiplied by the probability of B the probability of nominating Sports as the most important thing multiplied by the total now that becomes a very complicated equation here's what it looks like with numbers the probability of a this is a marginal probability right here the probability of being in fourth grade so 119 over 478
            • 02:30 - 03:00 the probability of nominating Sports and now we have the other marginal probability so that's 90 out of 478 and of course our total we know is 478. now we can do a little bit of algebra on this long equation that's going to look really terrible and we can simplify that down to 119 times 90 divided by 478. and if you don't remember
            • 03:00 - 03:30 your algebra you can go back and review that's beyond the scope of this class so with algebra simplified here our expected count is 119 times 90 divided by 478. so if we do a little bit of rounding there that's going to round to 23. so we would expect 23 people to be in this cell we can also look at for example these
            • 03:30 - 04:00 other cells here the probability of being in fourth grade and nominating grades probability of being in fourth grade and nominating popularity so this is of course 119 because we're still talking about the probability of being in fourth grade here multiplied by 247 in the case of this kind of orangey color here because that's the probability of nominating grades divided by 478 and the probability of nominating popularity
            • 04:00 - 04:30 and being in fourth grade is 119 times 141 divided by 478. and these are of course all rounded numbers so let's look at the chi-square calculation itself so what you see in this table is that expected counts are here in blue and The observed counts are in Black The observed counts are just exactly what you saw in the previous table
            • 04:30 - 05:00 and to calculate this test statistic what we're going to do is we're going to sum across the cells so the Chi Squared is going to be 63 minus 61 squared divided by 61 plus 31 minus 35 squared divided by 35 plus 25 minus 23 squared divided by 23 and all the way on through our table until we get to the very last cell
            • 05:00 - 05:30 32 minus 34 squared divided by 34. and so if we add up all of those quantities what we get is this number right here 1.31 now that's our chi-squared total what do you notice about it does it seem big to you does it seem not so big if you look at the numbers in the table itself do these observed numbers look close to their expected counts
            • 05:30 - 06:00 so what we want to know is what does the calculated chi-square mean so each element of the total chi-squared is a ratio of The observed minus the expected value squared divided by the expected value so the bigger the value for a particular cell the larger the deviation from its predicted outcome and when we add all of those things
            • 06:00 - 06:30 together remember none of these things are negative because we're doing squares here we're taking basically a sum of squares-ish not quite the same formula but we are squaring these numbers before we divide and then we're adding them up so larger values of the statistic are associated with smaller p-values so as the value of chi-square gets larger the statistic becomes closer to the rejection region of the graph so
            • 06:30 - 07:00 bigger values mean we're more likely to see a statistically significant effect can chi-square be negative well it can't because even though this number here might be negative and if you go back to the previous slide and look at the table you'll see there are a couple of them that are but when we go back and look at the observed minus expected some of those are negative but once we Square them that negative number drops away so
            • 07:00 - 07:30 what that tells us is that chi-squared cannot be negative if you have a negative Chi Squared you have made a mathematical error a chi-squared equal to zero tells you that there is absolutely no difference between the observed scores that you got and the expected scores in that case the two variables look perfectly independent here's our chi-square distribution
            • 07:30 - 08:00 so these are chi-square distributions based on contingency tables of different sizes these are derived by the way by randomization testing we talked about that in the lab last week and in the lecture last week we know that it's a universal and Powerful technique and we also know that there's a sort of more analytical or traditional or theoretical solution which is this sort of standard
            • 08:00 - 08:30 chi-squared calculation which is approximate its applicability depends on the assumptions of the test so we'll talk about these assumptions in a little bit so here's our chi-squared value and here's the probability for contingency tables where the degrees of freedom equals one two three four or five we're going to talk about
            • 08:30 - 09:00 that degrees of freedom metric in a minute and what you can see is the chi-square generally speaking takes on a shape that is somewhat skewed and it has a rather long tail edging toward the positive direction so when you're evaluating a chi-square you're looking for the point on this curve whichever curve you're evaluating against here where the probability drops to less than .05
            • 09:00 - 09:30 so that brings us to degrees of freedom degrees of freedom are an important concept they will play a role in all the test statistics we calculate from here on out and what degrees of freedom is is the number of independent elements of information that are used to calculate a test statistic and that could be any test statistic regardless of which one it is it's the number of independent elements of information that
            • 09:30 - 10:00 make up the calculation of that statistic so we usually calculate it as the number of elements that make up a statistic minus one so let's take a really easy statistic the mean n minus 1 elements can vary but in order to get a particular mean the remaining one must be fixed so let's take a data set as an example and I'm going to have you do this calculation on your own
            • 10:00 - 10:30 so if a data set has five values in it and its mean is three the first four values you could pick as anything really so let's say the first value is one then we have a four we have a nine we have a one what does the fifth value have to be in order to give us a mean of three so in order to get a mean of three the fifth value in this data set based on this set of numbers here has to be equal
            • 10:30 - 11:00 to zero and zero is a number here zero is something that we add to other things we don't just discount it even nothing is something so in this data set if you add one plus four plus nine plus one plus zero and you divide by five the answer you will get is three so that's what we're looking at when we think about degrees of freedom go ahead
            • 11:00 - 11:30 and do that math all by yourself so that you can convince yourself that this is true you can even pick some different numbers here you can pick a different value for the mean you can pick some different numbers here using a random number generator if you like and once you get to that fifth value there is only one number that will give you the correct mean so it's the number of independent elements that we call free to vary that
            • 11:30 - 12:00 make up our test statistic and that could be any test statistic whether it's the mean whether it's the standard deviation whether it's a chi-squared whether it's a test statistic called T anything else that we learn about in the class they all have a degrees of freedom so what is degrees of freedom in chi-square well it's the degree to which scores and in chi-squared the scores are cell counts in a table
            • 12:00 - 12:30 are free to vary so the relevant statistics here are the row and column totals so the degrees of freedom for a chi-squared is the rows minus one times The Columns minus one we often denote this number as K in Chi Squared but it's also often called degrees of freedom as well so you'll see it written both ways so let's look at our specific chi-squared so we have a calculated Chi Squared of 1.31 it has four degrees of
            • 12:30 - 13:00 freedom remember because we have three grade levels fourth fifth and sixth and we have three things that kids are nominating as the most important thing to them in school which is grades popularity and sports so we have three rows and three columns so three minus one is two and three minus one is two two times two is four that gives us four degrees of freedom and that gives us a chi-squared distribution that really looks like this
            • 13:00 - 13:30 here that's the modeling of the chi-squared distribution this is from a little widget that you can go and look at at this website if you like and this is an analytical or theoretical solution the p-values from these kind of lookup tables we don't usually test two-tailed hypotheses by the way in chi-squared although you could um and you need to think about that when you're designing your hypotheses whoops so for our Chi Squared of one that we've
            • 13:30 - 14:00 calculated of 1.3 the critical value the point at which P equals 0.05 on this distribution is 9.49 now our calculated Chi Squared of 1.31 which we saw a couple of slides ago is not statistically significant its p-value is not less than our Alpha of 0.05 in fact the exact p-value associated with a Chi
            • 14:00 - 14:30 Squared of 1.31 is 0.86 so our Chi Squared here does not fall into this nice blue rejection region where we would hope it would be or maybe not depending on what you think so what that tells us about our contingency table is that kids have their own personalities and those personalities are shaping their preferences across the grade levels so
            • 14:30 - 15:00 there's no association between grade level and whether kids care very much about being popular or about playing sports or about grades in general one thing that was nice to see from that table is that most of the kids in that table think grades are the most important thing but there are other students who have other impressions of what's important so let's have another quick look at Independence
            • 15:00 - 15:30 we calculated the chi-square test for Independence in this example but we could have calculated the goodness of fit we could have asked the question are the proportions of students who prefer to focus on grades versus Sports versus popularity the same and to do that we would have had to recompute the expected values to test that hypothesis so then it would then become a goodness of fit test so all we're doing to change the chi-square test of Independence into the goodness of fit is changing the calculation of expected values I'm not going to go into
            • 15:30 - 16:00 the exact calculations here but know that you can do it other ways there's not just one way to do this here we're asking just the basic question about whether those two things are independent we could have asked questions about their distributions and whether they were similar or different across the grade levels and finally there are some conditions or assumptions that we need to meet for chi-squared the first one is Independence each case that contributes data to a table must be independent of
            • 16:00 - 16:30 all the other cases in the table so each of the students in our little example needed to be an independent person they can't have more than one value that they've contributed to that table um the next thing is a sample size thing you'll remember that I said in order for the results to be really reliable we need to have larger samples so each cell in our table should have at least five cases in it
            • 16:30 - 17:00 finally our degrees of freedom should be at least one so our minimum table size is a two by two table in a two by two table we have two rows and two columns two minus one is one and two minus one is one and one times one is one so we must have a minimum of one degree of freedom in order to calculate a chi-square and if we are violating these
            • 17:00 - 17:30 assumptions the error rates for the test will not be accurate now I will also tell you that the sample size convention is more of a suggestion than a rule but you should remember that results can be unreliable with small sample sizes in cells and I will end by telling you that in Python there's a really nice library that has a chi-square function and instead of randomization this week we're gonna cheat and we're going to use
            • 17:30 - 18:00 the sort of standard theoretical model for calculating the Chi Squared you can also calculate this independently so you can do it the same way and you should be able to get the same results but we will import a new library this week called SciPy and we will use SciPy's chi-squared function to test chi-squared in a data set and that's what the lab associated with this lecture will be doing
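
As the final segment of the lecture suggests, SciPy can reproduce these results. The sketch below uses `scipy.stats.chi2_contingency`. The fourth-grade row and all marginal totals (row totals 119, column totals 247/141/90, grand total 478) are quoted in the lecture; the fifth- and sixth-grade rows are a reconstruction consistent with those marginals (they match the classic "Popular Kids" dataset this example appears to be based on), so treat them as an assumption.

```python
from scipy.stats import chi2_contingency

# Rows: grades 4, 5, 6; columns: grades, popularity, sports.
# The first row and the marginal totals are quoted in the lecture;
# the other two rows are a reconstruction consistent with them.
observed = [
    [63, 31, 25],   # 4th grade (row total 119, from the lecture)
    [88, 55, 33],   # 5th grade (reconstructed)
    [96, 55, 32],   # 6th grade (reconstructed)
]

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2))  # 1.31, matching the lecture's hand calculation
print(dof)             # 4 degrees of freedom: (3 - 1) * (3 - 1)
print(round(p, 2))     # 0.86, far from significance at alpha = 0.05
```

`chi2_contingency` also returns the table of expected counts, so you can check each cell against the blue values on the lecture slide.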