Decoding Sampling Distributions
Sampling&Distributions3
Summary
In this lecture, Erin Heerey delves into the intricate process of estimating population parameters from sample data, emphasizing the construction of sampling distributions. These distributions differ from typical ones as they comprise statistics, like means and standard deviations, instead of individual scores. A key highlight is the convergence of sampling distribution means on the real population mean, demonstrating the central limit theorem in action. Heerey also introduces practical methods such as Monte Carlo sampling and bootstrapping to estimate population parameters efficiently, even when dealing with large or skewed datasets. The lecture accentuates the importance of understanding variability and highlights techniques to ensure samples are representative of a population.
Highlights
- Erin Heerey explains how sampling distributions differ as they are comprised of statistics like means rather than individual data points. 📈
- The central limit theorem means the distribution of sample means centers on the true population mean and approaches a normal shape, even if the original population is skewed. 📏
- Monte Carlo sampling involves repeated random sampling, aiding in estimating expected values of variables with varied distributions. 🎰
- Bootstrap resampling allows estimation of sample statistics using the original dataset, useful especially with non-normally distributed data. 🚀
- Having adequately large and representative samples is critical for effective bootstrapping and generalizing results to the population. 📊
Key Takeaways
- Sampling distributions use statistics rather than individual scores, providing insights into population parameters. 📊
- The central limit theorem means that, with large enough samples, sample means cluster around the population mean in an approximately normal distribution, even when the population is non-normal. 🔄
- Monte Carlo sampling and bootstrapping are valuable techniques for estimating population parameters from samples. 🎲
- Bootstrap resampling allows for the creation of sampling distributions from a single dataset, useful in non-normally distributed data. 🔄
- Large sample sizes are crucial for accurate population parameter estimation, ensuring sample representativeness. 🏋️
Overview
The lecture tackled the conceptual framework of sampling distributions, explaining their vital role in making statistical inferences about populations. Unlike basic distributions, sampling distributions use statistical measures, such as means and standard deviations, rather than individual data points. This facilitates a deeper understanding of population characteristics through sample-derived data.
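As a rough illustration of the idea above, the sketch below builds a sampling distribution of means. The five ages are the example sample from the lecture (mean 36.4); the population of ages, the seed, and the variable names are assumptions made only for this illustration, not the lecture's own code.

```python
import numpy as np

# The five ages used as the lecture's example sample.
example_sample = np.array([43, 27, 7, 84, 21])
example_mean = example_sample.mean()  # (43 + 27 + 7 + 84 + 21) / 5 = 36.4

# Hypothetical population of ages (an assumption for illustration only).
rng = np.random.default_rng(seed=0)
population = rng.integers(low=0, high=100, size=100_000)

n_samples = 10_000  # how many samples to draw
sample_size = 5     # people per sample

# Each entry is the mean of one random sample, not an individual age.
sample_means = np.empty(n_samples, dtype=np.float64)
for i in range(n_samples):
    sample = rng.choice(population, size=sample_size, replace=True)
    sample_means[i] = sample.mean()

print(f"example sample mean:  {example_mean:.1f}")
print(f"population mean:      {population.mean():.2f}")
print(f"mean of sample means: {sample_means.mean():.2f}")
```

A histogram of `sample_means` would be the sampling distribution: every point in it is a statistic computed from a sample rather than an individual score.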
One of the remarkable aspects discussed is the central limit theorem, a principle asserting that with sufficiently large sample sizes, the distribution of sample means will approximate a normal distribution centered on the population mean, irrespective of the population's original distribution shape. This concept underlined the importance of considering the size and nature of samples when making statistical estimations.
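The behaviour described here can be checked numerically. The sketch below assumes a right-skewed (exponential) population, which is a choice made for this illustration rather than the lecture's data, and uses the sample sizes the lecture steps through (3, 5, 10, 20). It compares the spread of the sample means against the textbook value sigma / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical right-skewed population (an assumption for illustration only).
population = rng.exponential(scale=10.0, size=200_000)
pop_mean = population.mean()
pop_sd = population.std()

n_samples = 10_000
results = {}
for sample_size in (3, 5, 10, 20):
    # Draw all samples at once: one row per sample, with replacement.
    samples = rng.choice(population, size=(n_samples, sample_size), replace=True)
    means = samples.mean(axis=1)
    results[sample_size] = (means.mean(), means.std())
    print(
        f"n={sample_size:>2}  mean of means={means.mean():6.2f}  "
        f"SD of means={means.std():5.2f}  "
        f"sigma/sqrt(n)={pop_sd / np.sqrt(sample_size):5.2f}"
    )
```

The mean of the sample means should stay close to `pop_mean` for every sample size, while the standard deviation of the means shrinks as the sample size grows.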
Practical methodologies such as Monte Carlo sampling and bootstrapping were introduced, providing powerful tools for estimating population parameters. These techniques enable researchers to leverage sample data efficiently, making them indispensable in the realm of statistics, particularly when addressing large or skewed datasets. The lecture concluded by emphasizing the importance of attaining large, representative samples for robust statistical analysis.
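A minimal sketch of the Monte Carlo procedure as the lecture describes it: draw single values from a normal population, then fill a numpy array of type float64 one draw at a time. The mean of 100, standard deviation of 20, and sample size of 500 are the example values mentioned in the lecture; the seed and variable names are assumptions, and this is not the lab's actual code.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

pop_mean = 100.0   # example population mean from the lecture
pop_sd = 20.0      # example population standard deviation from the lecture
sample_size = 500  # example sample size from the lecture

# A single draw from the theoretical population.
single_value = rng.normal(loc=pop_mean, scale=pop_sd)

# Pre-allocate an array as large as the sample, then fill it one
# independent draw at a time.
sample = np.empty(sample_size, dtype=np.float64)
for i in range(sample_size):
    sample[i] = rng.normal(loc=pop_mean, scale=pop_sd)

print(f"single draw: {single_value:.2f}")
print(f"sample mean: {sample.mean():.2f}, sample SD: {sample.std(ddof=1):.2f}")
```

The explicit loop mirrors the spoken description; in practice `rng.normal(pop_mean, pop_sd, size=sample_size)` produces the same kind of sample in one call.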
Chapters
- 00:00 - 01:00: Estimating Populations from Samples This chapter introduces the concept of estimating population parameters based on samples. It emphasizes the theoretical nature of population distributions and the reliance on sample statistics to make these estimations. Additionally, the chapter highlights the importance of understanding the shape or probability density function of the distribution while acknowledging the challenges in quantifying these estimates.
- 01:00 - 02:00: Understanding Sampling Distributions The chapter titled 'Understanding Sampling Distributions' focuses on the concept of sampling distributions. It explains the importance of determining whether a sample is representative of the actual population. To achieve this understanding, a new type of distribution, known as a 'sampling distribution,' is introduced. Unlike traditional distributions that contain individual data points, sampling distributions consist of statistical measures such as means or standard deviations.
- 02:00 - 03:00: Building Sampling Distributions The chapter titled 'Building Sampling Distributions' explores the concept of creating distributions from samples rather than individuals. It focuses on the mechanism of these distributions by referencing a population sample, specifically using Canadians as an example. The chapter details the process of generating a distribution by randomly sampling groups of individuals, illustrating with samples larger than five. The emphasis is on understanding how these sample-based distributions are constructed and operate.
- 03:00 - 04:00: Redefining Sampling Variability The chapter discusses the concept of sampling variability and how it can be redefined. It explains the process of calculating the mean of a given sample's ages and plotting that mean on a frequency distribution or histogram. For example, a sample consisting of ages 43, 27, 7, 84, and 21 has a mean age of 36.4. By repeating this process multiple times, a sampling distribution can be created.
- 04:00 - 05:00: Sampling Process and Reduction of Variability The chapter discusses the concept of a distribution of sample statistics, specifically focusing on means, rather than individual data points. It explores what a distribution of sample means looks like, providing an example and noting that these distributions are theoretical and created under certain conditions.
- 05:00 - 06:00: Understanding Standard Error of the Mean This chapter delves into the understanding of the Standard Error of the Mean (SEM) within the context of statistical distributions. The discussion begins with an exploration of a normally distributed population characterized by a mean and standard deviation, illustrated with red bars flanking either side of the mean. The focus then shifts to the concept of a sampling distribution of sample means, elaborating on how it emerges from repeated sampling from the population distribution.
- 06:00 - 07:00: Practical Applications of Sampling Distributions In this chapter titled 'Practical Applications of Sampling Distributions,' the concept of sampling distributions is explored with a focus on sample means. The chapter discusses how, with large enough samples, the mean of the sampling distribution will align with the mean of the population distribution. It elaborates on the idea by using an example where a computer generates a graph by selecting a sample of three people, demonstrating the process of approaching the population mean through samples.
- 07:00 - 08:00: On Population Distribution Shapes and Sampling Distributions The chapter discusses the concept of sampling distribution and illustrates it with a population distribution example. It explains how repeatedly sampling from a population, calculating the mean for each sample, and compiling these means builds a sampling distribution. The example specifically refers to samples of a size of three, and each point in the resulting histogram represents a sample mean rather than an individual score.
- 08:00 - 09:00: Illustrating Convergence with Game Theory The chapter 'Illustrating Convergence with Game Theory' discusses the properties of distribution in relation to population and sampling. It highlights that the mean of the sample distribution closely aligns with the population distribution. Additionally, it notes the standard deviation of the population distribution falls within specific boundaries and contrasts this with the sampling distribution of the sample means.
- 09:00 - 10:00: Central Limit Theorem in Practice
- 10:00 - 11:00: Theoretical Population Distributions This chapter explores the concept of theoretical population distributions, focusing specifically on sampling distributions. It explains that as sample size increases, the mean of the sampling distribution remains centered on the population mean while the standard deviation becomes narrower. This indicates the convergence of the sampling distribution mean on the population mean, while highlighting that the standard deviation is much smaller in comparison.
- 11:00 - 12:00: Monte Carlo Sampling Explained This chapter explains the concept of Monte Carlo sampling, focusing on the statistical term 'standard error of the mean,' which refers to the standard deviation of the distribution of sample means. It discusses the importance of understanding the standard error in assessing the variability and reliability of sample means when making inferences about a population.
- 12:00 - 13:00: Bootstrap Resampling Technique The chapter discusses the significance of sampling distributions in understanding populations. It highlights that sampling distributions are efficient and cost-effective compared to measuring entire populations, like determining the ages of everyone in Canada or globally, which is impractical given the population size over 8 billion.
- 13:00 - 14:00: Strength and Limitations of Bootstrapping The chapter titled 'Strength and Limitations of Bootstrapping' discusses the dynamic nature of populations and the difficulty of measuring demographic distributions due to daily changes such as births and deaths. It highlights the impracticality of measuring everyone in the world to understand age distributions, suggesting instead that random sampling in major geographic areas could be more feasible.
- 14:00 - 15:00: Considerations on Sample Size and Representativeness This chapter titled 'Considerations on Sample Size and Representativeness' discusses the importance of selecting a sample size that is representative of the entire population. It explains how having a good sample allows for accurate estimation of the probability of different outcomes occurring in the population. The chapter emphasizes that when analyzing data from a sample, the goal is to understand the likelihood of those outcomes happening within the broader population.
- 15:00 - 16:00: Conclusion on Bootstrapping and Sample Statistics The chapter discusses the significance of creating a sampling distribution based on specific data to understand a phenomenon in the general population. It explains that the shape of the population distribution may not be crucial because, with enough samples of adequate size, the sampling distribution of sample means tends to normalcy.
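The chapter summaries describe the standard error of the mean only in words. The lecture does not state a formula, but for independent draws of size n from a population with standard deviation sigma, the standard textbook result is shown below. The worked numbers assume sigma = 10 (the example value mentioned in the lecture) and the sample sizes 3, 5, 10 and 20.

```latex
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}},
\qquad
\widehat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{n}}
\quad \text{(estimated from a single sample with standard deviation } s\text{)}

% Worked example with \sigma = 10:
\frac{10}{\sqrt{3}} \approx 5.77, \qquad
\frac{10}{\sqrt{5}} \approx 4.47, \qquad
\frac{10}{\sqrt{10}} \approx 3.16, \qquad
\frac{10}{\sqrt{20}} \approx 2.24
```

This matches the pattern described in the chapters: the sampling distribution stays centered on the population mean while its spread narrows as the sample size increases.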
Sampling&Distributions3 Transcription
- 00:00 - 00:30 The next topic we need to pick up is the idea of how do we estimate populations based on samples. So we've said that our population distributions were mainly theoretical and we typically use our sample statistics to make an estimate about the parameters of the population from which it came. So that's why the shape or probability density function of the distribution is so important. But one of the tricky bits we have when we're estimating is to understand how to quantify the
- 00:30 - 01:00 uncertainty in this distribution. Did we get a sample that was representative of the shape of the real population? To understand that, we have to build a new type of distribution. And the type of distribution that we're going to build is called a 'sampling distribution'. A sampling distribution is a distribution that doesn't contain individuals anymore, it contains statistics. The means for example, or standard deviations, instead of individual scores.
- 01:00 - 01:30 This is a distribution made up not of individuals, but of samples of individuals. How do these things work? If we think about our population that we used earlier, we used a population of Canadians. And what I did when I showed you the population distribution is I randomly sampled samples of people. They were samples of, these were larger than five [as on the slide], but they were samples of people.
- 01:30 - 02:00 We could calculate the mean of their ages and we could plot that mean on a frequency distribution or histogram. So now, instead of plotting the actual ages of these people - let's say this is our sample here, we have someone who's 43, a person who's 27, a person who's seven, a person who's 84, and a person who's 21. So if we take the mean of their ages, it's 36.4. We could take that mean and plot it on a frequency distribution or histogram. If we repeated those steps many, many, many times, what we would build is what we call a sampling
- 02:00 - 02:30 distribution. It would be a distribution of sample statistics [means], not of individuals. So what does a distribution of sample means look like? Here is an example. Now these are theoretical; they're created by some
- 02:30 - 03:00 computer code, but if we think about a population distribution that looks like this, it's normally distributed, it's characterized by some mean which is right here and a standard deviation which is indicated by the red, the little red bars that are on either side of that mean. If we want to think about the sampling distribution of sample means, what we need to understand about it is that a distribution of samples drawn from this population will eventually with enough samples,
- 03:00 - 03:30 and large enough samples, will tell us something about the mean of this population distribution. So, the mean of a sampling distribution of sample means will converge on the population distribution mean. So this is the sampling distribution of sample means of samples of size three. So what the computer did to make this graph right here was it picked a sample of three people
- 03:30 - 04:00 randomly from this population, it calculated their mean, and then it threw that sample back, and then it did that again and again and again - probably ten thousand times. And what this built up is what we call a sampling distribution of sample means. So each individual score in this histogram is a mean and not an individual. And it's a mean of samples of size three. One of the things you can
- 04:00 - 04:30 immediately see about this distribution here is it has exactly the same mean, or at least close to within eyeball specificity, as the original population distribution from which it was sampled. Another thing you can immediately see is that our population distribution has a standard deviation that is within these red boundaries. But when we look at the sampling distribution of sample means from this population, samples
- 04:30 - 05:00 of size three, what you can see is we've reduced the variability pretty substantially. Our means have become closer to our true population mean. Here is the same process again. Now in this case, we're not doing samples of size three, we're doing samples of size five. Here it is with size ten; here it is with samples of 20 participants each. What you can see is that each time we do this
- 05:00 - 05:30 sampling process, the mean doesn't change; it stays centered on the population mean. But the standard deviation [of the distribution of sample means] gets narrower and narrower and narrower as we increase the sample size. So when we have a sampling distribution of sample means, the mean of that distribution is going to converge on the population mean. We can also see that the standard deviation of this distribution is much smaller. Now the
- 05:30 - 06:00 standard deviation of the distribution of sample means has a very fancy term in statistics. It's called the "standard error [of the mean]". So the standard error is the standard deviation of this distribution. The standard error is the standard deviation of the distribution of sample means. Now what do we use this for? Well it turns out we can use this to understand the degree to which our
- 06:00 - 06:30 own sample is likely to match this population. And that's one importance of sampling distributions. A sampling distribution really tells us about a population and it's much more efficient and cost-effective than trying to figure out, for example, the ages of everyone in Canada. What if we want to know the ages of people all over the world? There are over 8 billion people whose ages we would need to measure; we couldn't possibly do it in the time we have. Think about
- 06:30 - 07:00 the minute you measure somebody and they die tomorrow, or you measure somebody and they give birth tomorrow. Right now, we've got all of these numbers, but these populations change on a daily basis. It would be really difficult to come up with a distribution that told us about the ages of all people in the world by actually trying to measure everyone. However, if we randomly selected samples of people in all the major geographic regions
- 07:00 - 07:30 these samples would allow us to make a pretty good estimate about the full population. That allows us to calculate the probability of a particular outcome occurring. Now remember when we're examining data in a sample, when we're doing an experiment and we take a sample of data, what we're trying to do is understand the likelihood of the outcomes that we measure, in terms of understanding how likely those things are to happen out in the real population.
- 07:30 - 08:00 So if we could make a sampling distribution based on our own specific data, that would actually tell us potentially quite a lot about how our phenomenon works out in the world more generally - in the general population. Now we can also ask, does the population distribution shape matter? Well it turns out that with enough samples of large enough size, the sampling distribution of sample means converges to normal. So this is the graph
- 08:00 - 08:30 that you've already seen here, this one right here over on the left hand side. You see each one of these sampling distributions is sampled from a normal (Gaussian) population distribution. Well, it turns out you can do this from a skewed distribution and it also converges on the true population mean and it becomes normally distributed. Here are samples drawn from a
- 08:30 - 09:00 uniform distribution. So what you can see is that again, they converge on the distribution mean. And even though this is a uniform distribution, when we're looking at the means of samples, they are normally distributed. Why does this happen? Well, one way to think about it is... I don't know if anyone plays a game called Catan or Settlers
- 09:00 - 09:30 of Catan. This is a game in which one of the elements of play is the rolling of two dice. What happens is when you do the roll, you take the sum of the two numbers on the sides of the dice. So you could roll a seven and a three, or actually you couldn't roll a seven and three; you could roll a three and a four, is what I meant to say, and get seven. You could roll a four and a four and get eight, you could roll a three and a three and get six, and so forth.
- 09:30 - 10:00 But interestingly enough, some of those numbers come up more often than others, right? If you've played Catan, you'll know that on the chips in the center of the board, the boldness and color of those chips tells you how often that number is likely to come up (on average). Why does that happen? Well, 8 is going to come up a lot more often than 12. There is only one roll that will get you a score of 12 and it's a double six. Similarly there's only one
- 10:00 - 10:30 roll that will get you a score of two and that's snake eyes (two ones). So a roll of seven or eight or six, these rolls are going to come up a lot more often because there are many ways that you can make a number like seven. You can roll a one and a six, you can roll a two and a five, you can roll a three and a four, you can roll a four and a three, and so forth. All of these things are going to ... because they add up, they're going to lead to certain numbers becoming more
- 10:30 - 11:00 frequent than others. And so we can see that here in these distribution patterns. Because these are graphs of means and not individuals, they don't look like the populations from whence they came. Even this really irregular kind of randomish distribution here converges to normal with large enough sample sizes. Last year in Data Science 1000, you learned about a thing called the central
- 11:00 - 11:30 limit theorem. This is the central limit theorem in action. You use it every time you play Catan, every time you play a game where you have to add up the values on the roll of two dice. Those rolls tend to converge on the average. So, when we consider population distributions, which really live in the domain of theory,
- 11:30 - 12:00 we assume that a population distribution has a particular shape and we set its basic parameters, for example the mean and standard deviation. Those can be set in an arbitrary fashion. We are expecting to have in this population distribution a mean of a hundred and a standard deviation of 10. So they can be arbitrary; they can also be based on previous research. For example, you could look up in a paper some effect that you're interested in and you could find out what
- 12:00 - 12:30 the mean and standard deviation is of that effect in previous samples. You could base your theory about what the population should look like based on previous samples. So you can take a theoretical population with a mean and standard deviation that you find or that you set. And you could take random samples from this population to produce a sampling distribution. We will be doing that, in fact we'll be playing around in the lab this week with some of these theoretical populations.
- 12:30 - 13:00 One of the methods we'll be looking at in lab this week is a method called Monte Carlo sampling. Monte Carlo sampling is, just like the name suggests, based on gambling. It's the use of repeated random sampling to generate draws from a probability distribution. The important elements here are that these random samples are repeated and that they are each independent of one
- 13:00 - 13:30 another. Monte Carlo sampling allows us to guess the expected value of a variable. We can estimate that empirically, because over many samples, as we saw with that sampling distribution a slide or two ago, we saw that no matter what the shape of the population it was drawn from, the value of the sampling distribution converged on the true population mean.
- 13:30 - 14:00 So that's one of the really nice features of Monte Carlo sampling. You'll be doing some of that in Jupyter. For example, you'll be taking some Monte Carlo samples with a population mean that you'll set to some number and a population standard deviation that you'll set to some number. Also you can get a single sample from a population with a mean of say 100 and a standard deviation of 20. This for example is going to be what you draw for part of the lab. You're going to have a single sample. You're going to take it from a random distribution that's normally
- 14:00 - 14:30 distributed with a mean of something and a sigma of something else, a standard deviation of something else, that's going to give you a single value from that distribution. You can also do this multiple times, which is exactly what we do in Monte Carlo sampling. So we can take a sample size of 500, for example, and we will use numpy, just like we did last week, to create an array that's as large as our sample size and is
- 14:30 - 15:00 of data type 'float64'. And then what we can do is we can iterate over that sample and we can take a single sample and we can plop it in at the ith index of that sample until we have 500 scores in that sample. That gives us a sampling distribution of individual scores. Now the other thing we can do is we can take those scores and instead of plotting
- 15:00 - 15:30 individuals we can upscale them from a sample to a sampling distribution. We can make histograms of samples. So here's an example of what the code looks like. I'll be walking you through that in the lab as well this week. Another kind of sampling that is extremely popular when we're thinking about sampling distributions is a type of sampling called bootstrap resampling. Most of the time, in the real world, we only have one sample of data and it's the sample
- 15:30 - 16:00 that we actually drew. So say we collected data from 100 people on a particular task, or on a particular questionnaire, we can create our own probability density function from these data by resampling them randomly and independently. Now that seems weird doesn't it? To draw independent samples of the same size as the original, we have to sample with replacement. So what that means is we draw a score out of a hat, we note down what it is, and we throw
- 16:00 - 16:30 that score back in the hat; we shake it up; we draw another score, we note what that score is, we throw it back in the hat, so we replace it, and that's what it means to sample with replacement. Otherwise, the minute we drew one score, we would not have independent samples anymore, because that same score that we just picked can't be picked again. But if we sample with replacement, that same score can come up multiple times. Now the really neat thing about bootstrap
- 16:30 - 17:00 resampling, is it allows us to develop a sampling distribution of sample means, for instance, or of sample standard deviations, or whatever our statistic is, it allows us to estimate that using the EXACT sample that we have. So the proportions of data that we got across the probability density function of our sample distribution, we can actually use that to create a sampling distribution and that can sometimes help us estimate sample
- 17:00 - 17:30 statistics. It allows us to treat our own data as a population instead of as a sample. It's a really, really powerful statistical technique for understanding the distribution of a sample. What our method is there, is we take random and independent draws of scores from our sample. So we might start with an array that contains the data in our sample, and what we would do is we would pick a random number from that sample, we would note it down,
- 17:30 - 18:00 we would throw it back, we would do that again and again. And that means that in some samples the same score might come up five or six times; in other samples that score won't come up at all. And so because we're sampling independently and repeatedly, getting samples of the same size with replacement, we can make an estimate for how often each one of the scores in the sample is going to come up on average.
- 18:00 - 18:30 So what we do [in Bootstrapping] is we draw a sample, we calculate our sample statistic, like the mean or the standard deviation, and then we plot that statistic on a frequency histogram. And we repeat that process many, many times. You will want to understand what sampling distributions look like. We'll be talking about that pretty extensively in lab this week. Finally, bootstrapping can be useful, especially if we have non-normally distributed data and we
- 18:30 - 19:00 want to make comparisons across groups; if we want more precise estimates of correlation; of the parameters that are active in our given samples. So remember when we sample from a population, we are sampling specific statistics that we can observe. Those facts, those descriptive statistics we talked about in the last lecture, we talked about those relative to the sample. Now here we're interested in scaling those up to population parameters, right? So bootstrapping is
- 19:00 - 19:30 an extremely powerful way to do that. It gives us a very precise estimate of population parameters, especially when we don't know the underlying population distribution, which often we don't. It's also useful when we have a sort of smaller sample size, but not too small. If the sample is too small, if the sample is so small that it's not representative of the population, then bootstrapping might also provide results that are false or that fail to generalize.
- 19:30 - 20:00 So we need to make sure that in general, from an experimental design consideration, that our samples are representative of our populations. They need to be large enough to be representative. If we think about ... so one of the things I study is social behavior. Social behavior is really variable. So in order for me to get representative samples, I need to get really big samples of behavior, because behavior is so unique from one interaction
- 20:00 - 20:30 to the next. So the larger my samples the better able I am to estimate what a particular behavior looks like, or how it's distributed in real life, in a way that generalizes. So here we need to be careful of having samples that are too small so that they are not representative of the population, in which case bootstrapping isn't going to help us. What we really want is samples that are big enough to be useful, and then bootstrapping is an extremely powerful
- 20:30 - 21:00 technique for making estimates of population parameters based on sample statistics. I'll leave it there for this part of the lecture; we'll move on to probability in the next lecture.
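The bootstrap steps described in the closing segments (resample the one dataset with replacement, at the same size as the original, compute the statistic, and repeat many times) can be sketched as follows. The 100 skewed scores stand in for real data and are an assumption, as are the seed, the number of resamples, and the variable names.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical original sample of 100 scores (skewed); in practice this
# would be the one dataset that was actually collected.
data = rng.gamma(shape=2.0, scale=5.0, size=100)

n_boot = 5_000  # number of bootstrap resamples (an arbitrary choice)
boot_means = np.empty(n_boot, dtype=np.float64)
for i in range(n_boot):
    # Same size as the original sample, drawn with replacement, so a given
    # score can appear several times or not at all in one resample.
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = resample.mean()

# The spread of the bootstrap means estimates the standard error of the mean.
boot_se = boot_means.std(ddof=1)

print(f"sample mean:             {data.mean():.2f}")
print(f"mean of bootstrap means: {boot_means.mean():.2f}")
print(f"bootstrap SE of mean:    {boot_se:.2f}")
```

Swapping `resample.mean()` for another statistic, such as `resample.std(ddof=1)`, gives the bootstrap sampling distribution of that statistic instead. As the lecture cautions, none of this helps if the original sample is too small to be representative.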