Probability4


Summary

In this lecture, Erin Heerey discusses contingency tables and their role in understanding dependency between events. A key example involves a study on treatments for cocaine addiction, showcasing how relapse rates differ across three treatments: desipramine, lithium, and placebo. Heerey also delves into concepts like marginal and conditional probabilities, using them to illustrate ideas of independence and the gambler's fallacy. The lecture concludes with discussions on probability distributions and their relevance to statistical sampling, highlighting the convergence of a sample mean to a population mean.

Highlights

Contingency tables show how events, like treatment outcomes, depend on each other using categorical variables. 📊
In a cocaine addiction study, desipramine, lithium, and placebo were tested for relapse rates among participants. 💊
Marginal probabilities describe likelihoods of outcomes without considering other variables, focusing on totals. 🔢
Conditional probability assesses the impact of one event's outcome on another, highlighting dependencies. 🔄
Independent events illustrate how individual outcomes do not influence each other's probabilities. 🎲
Clarifying the gambler's fallacy: separate events do not alter each other's probabilities, even in repeated scenarios. 🎲
Understanding probability distributions is key to comprehending how sample means align with population means. 📈

Key Takeaways

Contingency tables help illustrate dependencies between categorical variables like coin flips and treatment outcomes. 🎲
The lecture uses a study on cocaine addiction treatments to explain concepts of relapse rates and treatment efficacy. 💊
Marginal probabilities are calculated using the totals of specific conditions against the overall total. 🔢
Conditional probabilities explore the chance of one event given another, essential for understanding dependencies. 🔍
Independent events have probabilities that remain unchanged regardless of other outcomes, debunking the gambler's fallacy. 🎲
Probability distributions help in understanding sample means and their relation to population means, crucial for statistical analyses. 📊

Overview

The lecture starts by introducing contingency tables, tools that display dependencies between different categorical variables such as treatment types and outcomes. These tables provide a clear way to see how often events happen in conjunction with one another. The example of a study on cocaine addiction treatment helps illustrate how different treatments have varied relapse rates, offering real-world application of these concepts.

Moving deeper, the notion of marginal probabilities is explained. A marginal probability is the probability of one variable's outcome, taken from the row or column totals of the table, regardless of the other variable. The lecture stresses the importance of understanding both marginal and conditional probabilities, particularly in experiments and data analysis, where assumptions on dependence and independence of events play a critical role.

To wrap up, the session explores probability distributions and their impact on statistics, particularly how sample means can approximate population means. This segment reinforces the importance of probability theory in understanding statistical outcomes, debunking misconceptions like the gambler’s fallacy, and emphasizing the independence of separate events in probabilistic scenarios.

Chapters

00:00 - 00:30: Introduction to Contingency Tables This chapter introduces contingency tables, which are used to describe how events depend on each other. These tables display categorical variables, which can be events like a coin flip or determining the sex of a kitten, in terms of their frequencies or relative frequencies. The chapter sets up for a famous example to illustrate the concept.
00:30 - 01:00: Study on Cocaine Addiction Treatments This chapter investigates the efficacy of various treatments for cocaine addiction, known for being particularly challenging to overcome. It finds that certain antidepressants have shown promise in aiding recovery. The study referenced was conducted some time ago, involving 72 individuals who were in recovery from cocaine addiction and were randomly assigned to different treatment groups.
01:00 - 01:30: Description of the Study and Treatments This chapter focuses on a study involving three different treatments aimed at preventing relapse in individuals who have completed a treatment program. The treatments included an antidepressant called desipramine, a mood stabilizer named lithium, and a placebo. The primary measure of the study was the number of participants in each group who relapsed, meaning they started using again after their treatment. The chapter discusses the setup and results of the contingency tables used in the research.
01:30 - 02:00: Analysis of the Relapse Data The chapter titled 'Analysis of the Relapse Data' provides a breakdown of the number of participants in each condition and the outcomes they experienced. The study included a total of 72 participants, with 24 participants randomly assigned to each group. Random assignment was executed, likely using a random number generator to ensure unbiased distribution. This chapter seems to focus on the methodology and data representation concerning the distribution and the outcomes of the groups involved in the study. The main emphasis is on how participants were grouped and the statistical handling of the participant distribution across conditions.
02:00 - 02:30: Understanding Contingency Tables The chapter titled 'Understanding Contingency Tables' discusses how to analyze relapse data in the three treatment groups: desipramine, lithium, and placebo. It details the number of people who relapsed and those who did not in each group. Specifically, 10 people relapsed in the desipramine group, 18 in the lithium group, and 20 in the placebo group. On the non-relapse side, there were 14 non-relapsers in the desipramine group, 6 in the lithium group, and 4 in the placebo group. The chapter illustrates how these numbers sum up and relate to contingency tables.
02:30 - 03:30: Independent vs Conditional Probability The chapter titled 'Independent vs Conditional Probability' begins with a focus on calculating 'marginal' probabilities. The instructor explains the concept using a table and emphasizes determining the likelihood that a participant did not relapse. The example provided includes a scenario where there are 24 people in a certain category, and the marginal probability is calculated by dividing this number by the total number of participants. This sets the stage for exploring more complex ideas around independent and conditional probability.
03:30 - 04:00: Calculating Conditional Probabilities In this chapter, the concept of calculating conditional probabilities is discussed with a specific example. A calculation is made to determine that one-third (33%) of a sample did not experience relapse, which is termed as a marginal calculation due to its basis on table margins. Despite the seemingly low odds, it is noted that these results are relatively positive for the context of addiction. The chapter suggests that further exploration of other cells in the data could provide additional insights into participant probabilities.
04:00 - 05:30: Gender and Faculty Example The chapter discusses the intersection of the outcome (relapse versus no relapse) and the type of treatment participants were receiving. It focuses on those who took desipramine and experienced a relapse, explaining how to derive this figure from the number of people in the desipramine group who relapsed, as highlighted in a specific cell. The method is to divide that cell count by the total number of participants to get the probability of the conjunction.
05:30 - 06:30: Independent Events and the Gambler's Fallacy The chapter titled 'Independent Events and the Gambler's Fallacy' begins with an example in which the probability of being in the desipramine group and relapsing is calculated as 10 divided by 72, which is approximately 0.139, or about 14 percent. This is a joint probability, and it sets the stage for examining independence and misconceptions like the gambler's fallacy.
06:30 - 07:30: Normal Distribution and Sampling In the chapter titled 'Normal Distribution and Sampling,' the concept of conjunctions, specifically 'AND' conjunctions, is explored. The chapter explains that in AND conjunctions, calculations involve taking the individual cell value and dividing it by the total. This differs from marginal calculations, where a marginal cell is used, and the calculation involves taking the marginal total and dividing it by the overall total to find the marginal probability. Marginal probability helps in determining the likelihood of being in one of the groups, irrespective of the treatment received. The chapter introduces the concept of joint probability as well.
07:30 - 08:00: Conclusion In the conclusion chapter, the concept of probability in the context of treatment and outcomes is revisited. The discussion focuses on the probability of a specific treatment, such as desipramine, leading to a particular outcome, like relapse. The chapter emphasizes the idea of independence between two variables. It explains that two variables are considered independent if the probability of one event occurring is the same, regardless of the occurrence of another event. This concept is likened to independent events, such as separate coin tosses or weather conditions, where the outcome of one does not influence the probability of the other.

Probability4 Transcription

00:00 - 00:30 In this final part of the lecture we're going to talk about what we call contingency tables. These describe how events depend upon one another. A contingency table is a display of categorical variables, like a coin flip or the sex of one kitten, in terms of their frequencies or relative frequencies. The example that I'm going to show you is a rather famous one and it's about
00:30 - 01:00 the efficacy of various treatments for cocaine addiction. Cocaine addiction is a really notoriously problematic addiction, and it's very hard to treat. But what has had some success are certain kinds of antidepressants. So what I'm showing you are some data from a study that was conducted a very, very long time ago actually, in which a total of 72 people who were recovering from cocaine addiction were randomly assigned to
01:00 - 01:30 one of three treatments, and what was measured was the number of people in each group who relapsed, meaning they started to use again after they had done their treatment program. The three treatments were an antidepressant called desipramine, a mood stabilizer called lithium, and placebo. And what I'm showing you here is the contingency
01:30 - 02:00 table for the number of participants in each condition that experienced each outcome. What you can see is there were a total of 72 participants in the study; 24 of them were assigned to each group, and they were randomly assigned - so basically there was some dicing out of which participant numbers were in which group. Usually a random number generator does this so there's no actual dice involved. But what we're calculating here, what we're showing here, is the sum of the number of
02:00 - 02:30 people in each group who relapsed and who did not relapse. So we have the full sample space covered, and what you can see in the relapse condition is 10 people in the desipramine group relapsed; 18 people in the lithium group did; and 20 people in the placebo group. On the non-relapse side of the equation, there were 14 non-relapsers in desipramine, six in lithium, and four in placebo. So we can see how the numbers sum up. And that's what a contingency
02:30 - 03:00 table looks like. So the first thing we're going to do is calculate what we call some 'marginal' probabilities. What is the likelihood that a participant did not relapse? What's the likelihood that they were in this column? So we know there are 24 people in that column, and what we're going to do to calculate that marginal probability is we are going to take the number 24 and we're going to divide it by the total number of participants there were and
03:00 - 03:30 that's going to give us a probability of about 33 percent: 24 divided by 72 is 0.33, and so a third of the sample did not relapse. [This is a marginal calculation because the numbers are in the table margins]. Now, for this addiction actually, that's not bad odds. I know it seems terrible but it's not bad odds. Let's look at some of these other cells now. What is the probability that a participant
03:30 - 04:00 took desipramine and relapsed? So now we're looking at the intersection of the outcome here, relapse versus no relapse, and the treatment type that they were in. So how do we get this number? Well, we have the number of people who were in the desipramine row and the relapse column, so that's this cell right here that's highlighted in blue, that has a little blue circle over it, and what we need to do to look at the conjunction here is we need to divide that by the total number of participants
04:00 - 04:30 we have, so 10 divided by 72 gives us the probability of being in the desipramine group and relapsing. It's 10 divided by 72, and if we do the math on our calculators that comes out to 0.139, so about 14 percent, if we round, of participants were in the desipramine group and relapsed,
04:30 - 05:00 so that's an AND conjunction. When we're doing the AND conjunctions we're taking the individual cell and dividing it by the total, whereas for a marginal we're taking a marginal cell rather than an individual cell. Here we're taking the marginal total and dividing it by the overall total to get the marginal probability, the probability of being in one group or another. So the probability of being here or here, regardless of which treatment you got. So the joint probability is
05:00 - 05:30 the probability of getting a specific treatment and then being in one of these outcome cells. So here we're looking at the probability of desipramine and relapse. So let's talk about the idea of independence. Two variables are independent if the probability of one event, event X, given the other event, is the same as the probability of that event by itself. So given that coin toss one was heads, what's the probability that coin toss two is heads? Given the fact that it's a sunny day today, what
05:30 - 06:00 is the probability that I will eat a chocolate for dessert? These two events are totally unrelated. If events are non-independent then we need to think about a thing called conditional probability. So you can think about that in the context of my rain example from earlier - the chances of rain given the environmental conditions. If it's a nice sunny day and there's not a cloud in the sky, the chance of rain is probably lower.
06:00 - 06:30 If it's cloudy and grey and it's warmer than about three degrees Celsius, the chance of rain becomes higher because the chance of rain is dependent on the other environmental conditions. So when we're thinking about the idea of independence we're asking whether the presence of one event makes the possibility of a second event more or less likely. Two events are independent if the presence of one of them is totally independent of the presence of the other. So if we think about
06:30 - 07:00 Poppy's first kitten being male, the probability that her second kitten is male is totally independent of the probability that her first one was male. The probability that kitten number two is male is totally independent of the probability that kitten number one was male. So we can think about conditional probabilities in these tables as well. So conditional probability
07:00 - 07:30 is the outcome of interest, event X, given the presence of another condition, event Y. We usually write that using an equation that looks like this: the probability of X; this up and down bar here, this vertical bar, you read that as 'given'. So the probability of X given Y equals the probability of X and Y divided by the probability of Y. So let's look at where those numbers come from in this table. So we have the probability of relapse and desipramine which
07:30 - 08:00 we've already calculated. We know the probability of relapse and desipramine is 10 divided by 72. And we need to divide that by the probability of desipramine. We know that 24 of 72 people were assigned to that condition, so we get that number by dividing the marginal total for that condition by the total number of participants there are. Now we're going to simplify that math
08:00 - 08:30 so the 72s are both going to go away [review your algebra if you don't remember this]. This is going to become 10 divided by 24, and that gives us a probability of 0.417, or 41.7 percent. So the probability of relapse given desipramine is about 42 percent. When you look at the numbers, that's pretty close to reasonable and right: 10 of the people in desipramine relapsed out of a total of 24. That's give or take 42 percent.
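The three calculations worked through so far (marginal, joint, and conditional) can be sketched in a few lines of Python. This is only an illustration using the counts quoted in the lecture; the dictionary layout and the function names are assumptions made for the example, not anything shown on the slides.

```python
# Counts quoted in the lecture: treatment -> outcome -> number of participants.
counts = {
    "desipramine": {"relapse": 10, "no_relapse": 14},
    "lithium": {"relapse": 18, "no_relapse": 6},
    "placebo": {"relapse": 20, "no_relapse": 4},
}

# Grand total of participants (the bottom-right corner of the table).
total = sum(sum(row.values()) for row in counts.values())


def marginal_outcome(outcome):
    """P(outcome): column total divided by the grand total."""
    return sum(row[outcome] for row in counts.values()) / total


def marginal_treatment(treatment):
    """P(treatment): row total divided by the grand total."""
    return sum(counts[treatment].values()) / total


def joint(treatment, outcome):
    """P(treatment AND outcome): single cell divided by the grand total."""
    return counts[treatment][outcome] / total


def conditional(outcome, treatment):
    """P(outcome | treatment) = P(outcome AND treatment) / P(treatment)."""
    return joint(treatment, outcome) / marginal_treatment(treatment)


p_no_relapse = marginal_outcome("no_relapse")      # 24 / 72
p_joint = joint("desipramine", "relapse")          # 10 / 72
p_cond = conditional("relapse", "desipramine")     # (10 / 72) / (24 / 72)

print(f"P(no relapse)                = {p_no_relapse:.3f}")
print(f"P(desipramine and relapse)   = {p_joint:.3f}")
print(f"P(relapse | desipramine)     = {p_cond:.3f}")
```

The printed values should match the lecture's figures of roughly 0.333, 0.139, and 0.417. The conditional probability is computed from the formula rather than as 10 / 24 directly, to mirror the step where the 72s cancel.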
08:30 - 09:00 Now we can think about conditional probabilities and getting those from tables. Now we're going to look at these over the rows here because that makes the most sense: we have treatment type in rows, and we have outcomes in columns. And we usually think about the independent variable or treatment in an experiment being the cause of the outcome, so that's why we're looking at it in rows. We
09:00 - 09:30 could do it the other way as well. So this is the probability of relapse given desipramine; we already said that was about 42 percent. We can look at the probability of relapse given lithium, and 18 people out of 24 relapsed in the lithium group. That's 75 percent. And we can look at the probability of relapse given placebo. We have 20 out of 24 people relapsed, so that probability is about 83%. It actually looks like the desipramine, even though it has a 42% relapse
09:30 - 10:00 rate, it's still a pretty good treatment. And that is what we're doing when we talk about conditional probability. And now you can see that this becomes interesting when we think about psychological experiments and testing psychological hypotheses. We can do this, of course, the other way around as well. We can calculate, if a participant relapses, what's the probability that they took desipramine, in which case we're going to use the totals here in the columns, not the rows. Again this doesn't make super much sense. I'm showing it to you just so you can get
10:00 - 10:30 a feel for where these numbers come from. So the probability of desipramine given relapse is 10 divided by 48, or 21 percent, and we can carry it forward from there. And that brings us to the multiplication rule. If X and Y are two outcomes or events, then the probability of X and Y equals the probability of X given Y times the probability of Y. Remember,
10:30 - 11:00 you've seen a formula that looks really similar to this earlier. This is the conditional probability formula, and it's just been rearranged a bit. It's useful to think about X as the outcome of interest and Y as the condition that caused it or was associated with it. So here is a hypothetical probability distribution. It's totally made up, and the numbers are made up because they work out in a very nice way. So we could talk about gender identity and faculty. So we could, you know, randomly sample a total of 100 students
11:00 - 11:30 over in UCC who are walking through. We could have them identify as either a woman or a man on a little survey and then tell us whether they are from the Faculty of Social Science or studying in Science. Now UCC is of course very close to the social science building and farther away from the science building, so actually we get more social science students than we do science students, and that kind of makes sense. So the probability that a randomly selected student is in social science,
11:30 - 12:00 and you should think about doing that math before I show you what the answer is, is 60 out of 100 or 0.6, so we have a 60% likelihood that a randomly sampled student in UCC is from Social Science. They're more likely to be from social science than from science, because social science is simply in closer proximity. And so where are you going to get a coffee if you're in science? Well, you're probably going to go somewhere like one of the coffee places that's over closer to your building, versus if you're in social science
12:00 - 12:30 you're probably going to go to the Starbucks at the UCC or maybe the Timmy's at the UCC. But we can also ask about gender identity and faculty, so what's the probability that a randomly selected student is a social science major given that they identify as a woman? Again, think about where the numbers come from before I tell you the answer. You can pause and do your calculations and check and see if you're right.
12:30 - 13:00 So that probability happens to be 30 divided by 60. That's the probability that a randomly selected student is a social science major given that they identify as a woman, so 30 out of 60. And the probability that a random social science student identifies as a man is actually equal to the probability that they identify as a woman.
13:00 - 13:30 And both of those are equal to the probability that they identify as social science. And that means that the probability of being in social science is independent of gender. And that's one way to think about these independence equations in the context of probability. So to sum that up, if the probability of event X given event Y equals the probability of X, then event X and event Y are independent. At a conceptual level, that means knowing something
13:30 - 14:00 about Y doesn't tell us anything about X. Mathematically if and only if X and Y are independent the probability of X AND Y equals the probability of X times the probability of Y. So the joint probability is the product of their two marginal probabilities if they are independent.
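As a rough illustration of the product rule just stated, the sketch below applies it to the cocaine treatment table from earlier in the lecture rather than to the made-up faculty example, whose cell counts are not all given in the transcript. It compares each joint probability with the product of the two marginals. The variable names are assumptions for the example, and this is a descriptive comparison on sample counts, not a significance test.

```python
import math

# Counts quoted earlier in the lecture: treatment -> outcome -> participants.
counts = {
    "desipramine": {"relapse": 10, "no_relapse": 14},
    "lithium": {"relapse": 18, "no_relapse": 6},
    "placebo": {"relapse": 20, "no_relapse": 4},
}

total = sum(sum(row.values()) for row in counts.values())
p_relapse = sum(row["relapse"] for row in counts.values()) / total  # 48 / 72

# Independence would require P(relapse AND treatment) to equal
# P(relapse) * P(treatment) in every row of the table.
product_rule_holds = True
for treatment, row in counts.items():
    group_size = sum(row.values())
    p_treatment = group_size / total
    p_joint = row["relapse"] / total
    p_product = p_relapse * p_treatment
    p_given = row["relapse"] / group_size
    if not math.isclose(p_joint, p_product):
        product_rule_holds = False
    print(
        f"{treatment:12s} joint = {p_joint:.3f}  "
        f"product of marginals = {p_product:.3f}  "
        f"P(relapse | treatment) = {p_given:.3f}"
    )

print(f"P(relapse) = {p_relapse:.3f}")
print("Product rule holds in every row:", product_rule_holds)
```

In this sample the product rule does not hold: the conditional relapse rates (about 0.417, 0.750, and 0.833) all differ from the marginal relapse rate of about 0.667, so relapse and treatment are not independent in these data.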
14:00 - 14:30 Let's do one final experiment: two rolls of a fair die. What's the probability that you roll a three given that you just rolled a two? And the sample space, of course, is here, we've identified this, so the probability of rolling a two is one in six. What's the probability that you roll a three on the next roll? Think about that very carefully before you answer the question. The probability of getting a three is
14:30 - 15:00 also one in six. Are they related to one another or are they independent of one another? It turns out that they're independent of one another. Two rolls of the same fair die: if we roll the two in our first roll, what's the probability that we then roll a three? It's still one in six. And this is the gambler's fallacy that you need to be careful of here. The
15:00 - 15:30 gambler's fallacy states that, you know, if you have a long run of bad luck, eventually you're going to get lucky. Now that might be true, but every roll of the dice is independent, every flip of a card if you're thinking about black cards or reds, or the roulette wheel is a good one, blacks or reds. These are independent probabilities, and it's important to remember that if you have two events that are unrelated to one another, the probability of one does not affect the probability of the other.
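The two-roll example can be checked directly. The sketch below first enumerates the 36 equally likely outcomes of two rolls of a fair die and applies the conditional probability formula, then runs a small seeded simulation as a sanity check. The helper name `prob`, the seed, and the number of simulated rolls are arbitrary choices made for this illustration.

```python
import random
from fractions import Fraction
from itertools import product

# Sample space for two rolls of a fair die: 36 equally likely (first, second) pairs.
sample_space = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event as the share of outcomes that satisfy it."""
    hits = sum(1 for rolls in sample_space if event(rolls))
    return Fraction(hits, len(sample_space))


p_first_two = prob(lambda rolls: rolls[0] == 2)
p_second_three = prob(lambda rolls: rolls[1] == 3)
p_two_then_three = prob(lambda rolls: rolls == (2, 3))

# P(second is 3 | first is 2) = P(first is 2 AND second is 3) / P(first is 2)
p_three_given_two = p_two_then_three / p_first_two

print("P(3 on second roll)               =", p_second_three)
print("P(3 on second | 2 on first roll)  =", p_three_given_two)

# Simulation check: among simulated pairs whose first roll was a 2,
# what share had a 3 on the second roll?
rng = random.Random(0)
pairs = [(rng.randint(1, 6), rng.randint(1, 6)) for _ in range(200_000)]
seconds_after_two = [second for first, second in pairs if first == 2]
estimate = seconds_after_two.count(3) / len(seconds_after_two)
print(f"Simulated P(3 | 2) is roughly {estimate:.3f}")
```

The exact calculation gives one in six for both the conditional and the unconditional probability, which is what independence means here; the simulated share should land close to 0.167 as well.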
15:30 - 16:00 So if these events are independent, the probability of a three given that you've just rolled a 2, is equal to the probability that you will roll a three. How does this relate to statistics? So remember a little bit ago when we talked about samples and distributions? What does this mean for sampling? We're going to revive this normal distribution - you're going to see this picture a lot in this class - so consider that normal
16:00 - 16:30 distribution: what's the probability of randomly drawing a score that falls within one standard deviation of the mean? Well, 34.13 percent of the scores fall between the mean and one standard deviation below it, and 34.13 percent of scores (these are rounded to four decimal places here) are between the mean of zero and plus one standard deviation.
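The areas quoted for the normal curve can be reproduced with Python's standard library. This is a small sketch that assumes a standard normal distribution (mean zero, standard deviation one), as in the picture described; the variable names are chosen just for the example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

# Area between one standard deviation below the mean and the mean,
# and between the mean and one standard deviation above it.
below_mean = z.cdf(0) - z.cdf(-1)
above_mean = z.cdf(1) - z.cdf(0)

# Probability of drawing a score within one or two standard deviations of the mean.
within_one_sd = z.cdf(1) - z.cdf(-1)
within_two_sd = z.cdf(2) - z.cdf(-2)

print(f"Between -1 SD and the mean: {below_mean:.4f}")
print(f"Between the mean and +1 SD: {above_mean:.4f}")
print(f"Within 1 SD of the mean:    {within_one_sd:.4f}")
print(f"Within 2 SD of the mean:    {within_two_sd:.4f}")
```

Each half comes out at about 0.3413, which together give about 0.6827 for one standard deviation either side of the mean, and about 0.9545 for two.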
16:30 - 17:00 So we have a total of a 68 percent and a bit chance: 34 plus 34 is 68 percent. We have about a 68 percent chance of selecting a score, by pure random chance, that falls within one standard deviation of the mean. Our chances of selecting a score within two standard deviations are this plus this, plus this, plus this. And so the thing to remember about probability distributions,
17:00 - 17:30 and why they're important in statistics is they tell us something about the means and about the scores that we are drawing, and about the likelihood of those scores being drawn from the population. And that's an important thing to keep in mind as you move forward. It's also why it is that the mean of a representative sample converges on the true population mean. So if you imagine we take our representative sample, a totally random sample,
17:30 - 18:00 from this population and it's large enough to be representative. So let's say it's not two people or three people, let's say we take a good size sample like 150, or 200, or 400 people, something like that, if we have a good representative sample, the mean of those participants, because we are substantially more likely to sample scores that are closer to the mean than scores that are far away from the mean, will converge on the true population mean. So that's the take-home
18:00 - 18:30 point for this probability idea - or at least one of the important ones. Thanks for listening.
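The convergence claim in the closing section can be illustrated with a short simulation. The sketch below assumes a hypothetical normally distributed population with mean 0 and standard deviation 1, draws many random samples at several sizes, and reports how far the sample mean typically lands from the population mean. The sample sizes, the seed, and the number of repetitions are arbitrary choices for the illustration.

```python
import random
import statistics

# Hypothetical population: normally distributed with mean 0 and SD 1.
POP_MEAN, POP_SD = 0.0, 1.0
rng = random.Random(42)


def sample_mean(n):
    """Mean of one random sample of n scores drawn from the population."""
    return statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(n))


# For each sample size, average the absolute distance between the sample
# mean and the population mean over many repeated samples.
typical_error = {}
for n in (3, 150, 400, 2000):
    errors = [abs(sample_mean(n) - POP_MEAN) for _ in range(200)]
    typical_error[n] = statistics.fmean(errors)
    print(f"n = {n:5d}  typical distance from population mean = {typical_error[n]:.3f}")
```

The typical distance should shrink as the sample size grows, which is the sense in which the mean of a large random sample converges on the population mean.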