Decoding Sampling Distributions

Sampling&Distributions3

    Summary

    In this lecture, Erin Heerey explains how population parameters are estimated from sample data, with an emphasis on constructing sampling distributions. These distributions differ from ordinary ones because they are made up of statistics, such as means and standard deviations, instead of individual scores. A key point is that the mean of a sampling distribution of sample means converges on the true population mean, while the distribution itself approaches normality as samples grow, which is the central limit theorem in action. Heerey also introduces practical methods such as Monte Carlo sampling and bootstrapping to estimate population parameters efficiently, even when the population is too large to measure or the data are skewed. The lecture stresses the importance of understanding variability and of ensuring that samples are representative of the population.

      Highlights

      • Erin Heerey explains that sampling distributions differ from ordinary distributions because they are made up of statistics, such as means, rather than individual data points. 📈
      • The mean of a sampling distribution of sample means converges on the true population mean, and by the central limit theorem its shape approaches normal even if the original population is skewed. 📏
      • Monte Carlo sampling uses repeated, independent random draws from a probability distribution to estimate the expected value of a variable. 🎰
      • Bootstrap resampling builds a sampling distribution by resampling the original dataset with replacement, which is especially useful with non-normally distributed data. 🚀
      • Having adequately large and representative samples is critical for effective bootstrapping and generalizing results to the population. 📊

      Key Takeaways

      • Sampling distributions use statistics rather than individual scores, providing insights into population parameters. 📊
      • With large enough samples, the sampling distribution of sample means is centered on the population mean and approaches normality, even when the population is not normal (the central limit theorem). 🔄
      • Monte Carlo sampling and bootstrapping are valuable techniques for estimating population parameters from samples. 🎲
      • Bootstrap resampling allows for the creation of sampling distributions from a single dataset, which is useful with non-normally distributed data. 🔄
      • Large sample sizes are crucial for accurate population parameter estimation, ensuring sample representativeness. 🏋️

      Overview

      The lecture tackled the conceptual framework of sampling distributions, explaining their vital role in making statistical inferences about populations. Unlike basic distributions, sampling distributions use statistical measures, such as means and standard deviations, rather than individual data points. This facilitates a deeper understanding of population characteristics through sample-derived data.
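The construction described here can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the five ages (43, 27, 7, 84, 21) and the figure of roughly ten thousand repetitions come from the transcription, while the population of ages, the seed, and the helper name `sampling_distribution_of_means` are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# The five ages mentioned in the lecture; their mean is the statistic
# that would be plotted as ONE point in the sampling distribution.
ages = np.array([43, 27, 7, 84, 21])
print(ages.mean())  # 36.4

# Hypothetical population of ages (an assumption, not the lecture's data).
population = rng.integers(low=0, high=100, size=100_000)


def sampling_distribution_of_means(population, sample_size, n_samples, rng):
    """Draw n_samples random samples and return the mean of each one."""
    means = np.empty(n_samples, dtype=np.float64)
    for i in range(n_samples):
        # Draws are made with replacement; for a population this large the
        # difference from drawing distinct people is negligible.
        sample = rng.choice(population, size=sample_size, replace=True)
        means[i] = sample.mean()
    return means


means = sampling_distribution_of_means(
    population, sample_size=5, n_samples=10_000, rng=rng
)

# Each entry of `means` is a sample mean, not an individual score.
print(means.mean(), population.mean())  # centers should be close
print(means.std(), population.std())    # spread of the means is smaller
```

A histogram of `means` (for example with matplotlib) would correspond to the sampling distribution of sample means shown on the lecture slides.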

        One of the remarkable aspects discussed is the central limit theorem, a principle asserting that with sufficiently large sample sizes, the distribution of sample means will approximate a normal distribution centered on the population mean, irrespective of the shape of the population's original distribution. The spread of that distribution, the standard error of the mean, narrows as the sample size grows. This concept underlines the importance of considering the size and nature of samples when making statistical estimations.
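A short simulation can make this behaviour concrete. The sketch below assumes a hypothetical right-skewed (exponential) population and reuses the sample sizes mentioned in the transcription (3, 5, 10, 20). The comparison against sigma / sqrt(n) is the standard textbook formula for the standard error of the mean; the lecture does not state that formula, so it is included here only as a reference point.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

# Hypothetical right-skewed population (exponential, mean 10): an assumption.
population = rng.exponential(scale=10.0, size=200_000)


def sample_means(population, sample_size, n_samples, rng):
    """Return the means of n_samples random samples of the given size."""
    # One row of random indices per sample, drawn with replacement.
    idx = rng.integers(0, population.size, size=(n_samples, sample_size))
    return population[idx].mean(axis=1)


results = {}
for n in (3, 5, 10, 20):
    means = sample_means(population, n, 10_000, rng)
    results[n] = (means.mean(), means.std(ddof=1))
    predicted_se = population.std() / np.sqrt(n)  # textbook sigma / sqrt(n)
    print(
        f"n={n:2d}  mean of means={means.mean():.2f}  "
        f"SD of means={means.std(ddof=1):.2f}  sigma/sqrt(n)={predicted_se:.2f}"
    )
```

Under these assumptions the mean of the sample means should stay near the population mean for every sample size, while the standard deviation of the means (the standard error) should shrink as n grows.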

          Practical methodologies such as Monte Carlo sampling and bootstrapping were introduced, providing powerful tools for estimating population parameters. These techniques enable researchers to leverage sample data efficiently, making them indispensable in the realm of statistics, particularly when addressing large or skewed datasets. The lecture concluded by emphasizing the importance of attaining large, representative samples for robust statistical analysis.
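The Monte Carlo procedure the transcription describes for the lab (around 14:00 to 15:00) can be sketched as follows. The sample size of 500, the 'float64' array, and the mean of 100 with standard deviation 20 are taken from the lecture; the variable names, the seed, and the use of NumPy's `default_rng` generator are assumptions, since the actual lab code is not shown.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed

pop_mean, pop_sd = 100, 20  # example values mentioned in the lecture
sample_size = 500

# Pre-allocate an array as large as the sample, with data type float64.
sample = np.zeros(sample_size, dtype=np.float64)

# Take one independent draw at a time and place it at index i.
for i in range(sample_size):
    sample[i] = rng.normal(loc=pop_mean, scale=pop_sd)

# The sample statistics should land near the parameters that were set.
print(sample.mean(), sample.std(ddof=1))
```

The explicit loop mirrors the step-by-step description; `rng.normal(loc=pop_mean, scale=pop_sd, size=sample_size)` would produce an equivalent array in a single call.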

            Chapters

            • 00:00 - 01:00: Estimating Populations from Samples This chapter introduces the concept of estimating population parameters based on samples. It emphasizes the theoretical nature of population distributions and the reliance on sample statistics to make these estimations. Additionally, the chapter highlights the importance of understanding the shape or probability density function of the distribution while acknowledging the challenges in quantifying these estimates.
            • 01:00 - 02:00: Understanding Sampling Distributions The chapter titled 'Understanding Sampling Distributions' focuses on the concept of sampling distributions. It explains the importance of determining whether a sample is representative of the actual population. To achieve this understanding, a new type of distribution, known as a 'sampling distribution,' is introduced. Unlike traditional distributions that contain individual data points, sampling distributions consist of statistical measures such as means or standard deviations.
            • 02:00 - 03:00: Building Sampling Distributions The chapter titled 'Building Sampling Distributions' explores the concept of creating distributions from samples rather than individuals. It focuses on the mechanism of these distributions by referencing a population sample, specifically using Canadians as an example. The chapter details the process of generating a distribution by randomly sampling groups of individuals, illustrating with samples larger than five. The emphasis is on understanding how these sample-based distributions are constructed and operate.
            • 03:00 - 04:00: Redefining Sampling Variability The chapter discusses the concept of sampling variability and how it can be redefined. It explains the process of calculating the mean of a given sample's ages and plotting that mean on a frequency distribution or histogram. For example, a sample consisting of ages 43, 27, 7, 84, and 21 has a mean age of 36.4. By repeating this process multiple times, a sampling distribution can be created.
            • 04:00 - 05:00: Sampling Process and Reduction of Variability The chapter discusses the concept of a distribution of sample statistics, specifically focusing on means, rather than individual data points. It explores what a distribution of sample means looks like, providing an example and noting that these distributions are theoretical and created under certain conditions.
            • 05:00 - 06:00: Understanding Standard Error of the Mean This chapter delves into the understanding of the Standard Error of the Mean (SEM) within the context of statistical distributions. The discussion begins with an exploration of a normally distributed population characterized by a mean and standard deviation, illustrated with red bars flanking either side of the mean. The focus then shifts to the concept of a sampling distribution of sample means, elaborating on how it emerges from repeated sampling from the population distribution.
            • 06:00 - 07:00: Practical Applications of Sampling Distributions In this chapter titled 'Practical Applications of Sampling Distributions,' the concept of sampling distributions is explored with a focus on sample means. The chapter discusses how, with large enough samples, the mean of the sampling distribution will align with the mean of the population distribution. It elaborates on the idea by using an example where a computer generates a graph by selecting a sample of three people, demonstrating the process of approaching the population mean through samples.
            • 07:00 - 08:00: On Population Distribution Shapes and Sampling Distributions The chapter discusses the concept of sampling distribution and illustrates it with a population distribution example. It explains how repeatedly sampling from a population, calculating the mean for each sample, and compiling these means builds a sampling distribution. The example specifically refers to samples of a size of three, and each point in the resulting histogram represents a sample mean rather than an individual score.
            • 08:00 - 09:00: Illustrating Convergence with Game Theory The chapter 'Illustrating Convergence with Game Theory' discusses the properties of distribution in relation to population and sampling. It highlights that the mean of the sample distribution closely aligns with the population distribution. Additionally, it notes the standard deviation of the population distribution falls within specific boundaries and contrasts this with the sampling distribution of the sample means.
            • 09:00 - 10:00: Central Limit Theorem in Practice
            • 10:00 - 11:00: Theoretical Population Distributions This chapter explores the concept of theoretical population distributions, focusing specifically on sampling distributions. It explains that as sample size increases, the mean of the sampling distribution remains centered on the population mean while the standard deviation becomes narrower. This indicates the convergence of the sampling distribution mean on the population mean, while highlighting that the standard deviation is much smaller in comparison.
            • 11:00 - 12:00: Monte Carlo Sampling Explained This chapter explains the concept of Monte Carlo sampling, focusing on the statistical term 'standard error of the mean,' which refers to the standard deviation of the distribution of sample means. It discusses the importance of understanding the standard error in assessing the variability and reliability of sample means when making inferences about a population.
            • 12:00 - 13:00: Bootstrap Resampling Technique The chapter discusses the significance of sampling distributions in understanding populations. It highlights that sampling distributions are efficient and cost-effective compared to measuring entire populations, like determining the ages of everyone in Canada or globally, which is impractical given the population size over 8 billion.
            • 13:00 - 14:00: Strength and Limitations of Bootstrapping The chapter titled 'Strength and Limitations of Bootstrapping' discusses the dynamic nature of populations and the difficulty of measuring demographic distributions due to daily changes such as births and deaths. It highlights the impracticality of measuring everyone in the world to understand age distributions, suggesting instead that random sampling in major geographic areas could be more feasible.
            • 14:00 - 15:00: Considerations on Sample Size and Representativeness This chapter titled 'Considerations on Sample Size and Representativeness' discusses the importance of selecting a sample size that is representative of the entire population. It explains how having a good sample allows for accurate estimation of the probability of different outcomes occurring in the population. The chapter emphasizes that when analyzing data from a sample, the goal is to understand the likelihood of those outcomes happening within the broader population.
            • 15:00 - 16:00: Conclusion on Bootstrapping and Sample Statistics The chapter discusses the significance of creating a sampling distribution based on specific data to understand a phenomenon in the general population. It explains that the shape of the population distribution may not be crucial because, with enough samples of adequate size, the sampling distribution of sample means tends to normalcy.

            Sampling&Distributions3 Transcription

            • 00:00 - 00:30 The next topic we need to pick up is the idea of  how do we estimate populations based on samples.   So we've said that our population distributions  were mainly theoretical and we typically use our   sample statistics to make an estimate about the  parameters of the population from which it came.   So that's why the shape or probability density  function of the distribution is so important.   But one of the tricky bits we have when we're  estimating is to understand how to quantify the
            • 00:30 - 01:00 uncertainty in this distribution. Did we get a  sample that was representative of the shape of   the real population and to do that, to understand  that, we have to build a new type of distribution.   And the type of the distribution that we're going  to build is called a 'sampling distribution'.   A sampling distribution is a distribution  that doesn't contain individuals anymore,   it contains statistics. The means for example, or  standard deviations, instead of individual scores.
            • 01:00 - 01:30 This is a distribution made up not of  individuals, but of samples of individuals.   How do these things work? If we think  about our population that we used earlier,   we used a population of Canadians.  And what I did when I showed you the   population distribution is I randomly sampled  samples of people. They were samples of,   these were larger than five [as on the  slide], but they were samples of people.
            • 01:30 - 02:00 We could calculate the mean of their ages and we could plot that mean on a frequency distribution or histogram. So now, instead of plotting the actual ages of these people - let's say we have, this is our sample here, we have someone who's 43, a person who's 27, a person who's seven, a person who's 84, and a person who's 21. So if we take the mean of their ages, it's 36.4. We could take that mean and plot it on a frequency distribution or histogram. If we repeated those steps many, many, many times, what we would build is what we call a sampling
            • 02:00 - 02:30 distribution. It would be a distribution of sample statistics [means], not of individuals. So what does a distribution of sample means look like? Here is an example one. Now, these are theoretical; they're created from, they're created by some
            • 02:30 - 03:00 computer code, but if we think about a population  distribution that looks like this, it's normally   distributed, it's characterized by some mean which  is right here and a standard deviation which is   indicated by the red, the little red bars that  are on either side of that mean. If we want to   think about the sampling distribution of sample  means, what we need to understand about it is   that a distribution of samples drawn from this  population will eventually with enough samples,
            • 03:00 - 03:30 and large enough samples, will tell us something  about the mean of this population distribution. So, the mean of a sampling distribution of  sample means will converge on the population   distribution mean. So this is the sampling  distribution of sample means of samples of size   three. So what the computer did to make this graph  right here was it picked a sample of three people
            • 03:30 - 04:00 randomly from this population it calculated their  mean and then it threw that sample back and then   it did that again and again and again - probably  ten thousand times. And what this built up is what   we call a sampling distribution of sample means.  So each individual score in this histogram is a   mean and not an individual. And it's a mean of  samples of size three. One of the things you can
            • 04:00 - 04:30 immediately see about this distribution here is  it has exactly the same mean, or at least close   to within eyeball specificity, of the original  population distribution from which it was sampled.   Another thing you can immediately see  is that so our population distribution   has a standard deviation that  is within these red boundaries.   But when we look at the sampling distribution  of sample means from this population, samples
            • 04:30 - 05:00 of size three, what you can see is we've reduced  the variability pretty substantially. Our means   have become closer to our true population mean.  Here is the same process again. Now in this case,   we're not doing samples of size three, we're doing  samples of size five. Here it is with size ten;   here it is with samples of 20 participants each.  What you can see is that each time we do this
            • 05:00 - 05:30 sampling process the mean doesn't change  it stays centered on the population mean.   But the standard deviation [of the distribution  of sample means] gets narrower and narrower and   narrower as we increase the sample size.  So when we have a sampling distribution   of sample means the mean of that distribution  is going to converge on the population mean.   We can also see that the standard deviation  of this distribution is much smaller. Now the
            • 05:30 - 06:00 standard deviation of the distribution of sample  means has a very fancy term in statistics. It's   called the "standard error [of the mean]". So the  standard error is the standard deviation of this   distribution. The standard error is the standard  deviation of the distribution of sample means.   Now what do we use this for? Well it turns out we  can use this to understand the degree to which our
            • 06:00 - 06:30 own sample is likely to match this population. And  that's one importance of sampling distributions.   A sampling distribution really tells us about  a population and it's much more efficient and   cost effective than trying to figure out, for  example, the ages of everyone in Canada. What   if we want to know the ages of people all over  the world? There are over 8 billion people   whose ages we would need to measure we couldn't  possibly do it in the time we have. Think about
            • 06:30 - 07:00 the minute you measure somebody and they die tomorrow, or you measure somebody and they give birth tomorrow. Right now, we've got all of these numbers, but these populations change on a daily basis. It would be really difficult to come up with a distribution that told us about the ages of all people in the world by actually trying to measure everyone. However, if we randomly selected samples of people in all the major geographic regions
            • 07:00 - 07:30 these samples would allow us to make a pretty  good estimate about the full population. That   allows us to calculate the probability of a  particular outcome occurring. Now remember   when we're examining data in a sample, when we're  doing an experiment and we take a sample of data,   what we're trying to do is understand the  likelihood of the outcomes that we measure,   in terms of understanding how likely those  things are to happen out in the real population.
            • 07:30 - 08:00 So if we could make a sampling distribution  based on our own specific data, that would   actually tell us potentially quite a lot about  how our phenomenon works out in the world more   generally - in the general population.  Now we can also ask, does the population   distribution shape matter? Well it turns out  that with enough samples of large enough size,   the sampling distribution of sample means  converges to normal. So this is the graph
            • 08:00 - 08:30 that you've already seen here, this one right here over on the left-hand side. You see each one of these sampling distributions is sampled from a normal (Gaussian) population distribution. Well, it turns out you can do this from a skewed distribution and it also converges on the true population mean and it becomes normally distributed. Here are samples drawn from a
            • 08:30 - 09:00 uniform distribution. So what you can see is that, again, they converge on the distribution mean. And even though this is a uniform distribution, when we're looking at the means of samples, they are normally distributed. Why does this happen? Well, you can... one way to think about it is... I don't know if anyone plays a game called Catan or Settlers
            • 09:00 - 09:30 of Catan so this is a game in which one of the  elements of play is the rolling of two dice.   What happens is when you do the roll, you take  the sum of the two numbers on the sides of the   dice. So you could roll a seven and a three, or  actually you couldn't roll a seven and three;   you could roll a three and a four - is what  I meant to say and get seven. You could roll   a four and a four and get eight you could roll  a three and a three and get six and so forth.
            • 09:30 - 10:00 But interestingly enough, some of those numbers come up more often than others, right? If you've played Catan, you'll know that right on the chips in the center of the board, the boldness and color of those chips tells you how often that number is to come up (on average). Why does that happen? Well, 8 is going to come up a lot more often than 12. There is only one roll that will get you a score of 12 and it's a double six. Similarly, there's only one
            • 10:00 - 10:30 roll that will get you a score of two and that's snake eyes (two ones). So a roll of seven or eight or six, these rolls are going to come up a lot more often because there are many ways that you can make a number like seven. You can roll a one and a six, you can roll a two and a five, you can roll a three and a four, you can roll a four and a three, and so forth. All of these things are going to ... because they add up, they're going to lead to certain numbers becoming more
            • 10:30 - 11:00 frequent than others. And so we can see that here in these distribution patterns. Because these are graphs of means and not individuals, they don't look like the populations from whence they came. Even this really irregular kind of randomish distribution here converges to normal with large enough sample sizes. Last year in Data Science 1000, you learned about a thing called the central
            • 11:00 - 11:30 limit theorem. This is the central limit theorem in action. You use it every time you play Catan, every time you play a game where you have to add up the values on the roll of two dice. Those rolls tend to converge on the average. So, when we consider population distributions, which really live in the domain of theory,
            • 11:30 - 12:00 we assume that a population distribution has a particular shape and we set its basic parameters, for example the mean and standard deviation; those can be set in an arbitrary fashion. We are expecting to have in this population distribution a mean of a hundred and a standard deviation of 10. So they can be arbitrary; they can also be based on previous research. For example, you could look up in a paper some effect that you're interested in and you could find out what
            • 12:00 - 12:30 the mean and standard deviation is of that effect  in previous samples. You could base your theory   about what the population should look like based  on previous samples. So you can take a theoretical   population with a mean and standard deviation  that you find or that you set. And you could take   random samples from this population to produce  a sampling distribution. We will be doing that,   in fact we'll be playing around in the lab this  week with some of these theoretical populations.
            • 12:30 - 13:00 One of the methods we'll be looking at in lab this week is a method called Monte Carlo sampling. Monte Carlo sampling is, just like the name suggests, based on gambling; it's the use of repeated random sampling to generate draws from a probability distribution. The important elements here are that these random samples are repeated and that they are each independent of one
            • 13:00 - 13:30 another. Monte Carlo sampling allows us to guess the expected value of a variable. We can estimate that empirically, because over many samples, as we saw with that sampling distribution a slide or two ago, we saw that no matter what the shape of the population it was drawn from, the mean of the sampling distribution converged on the true population mean.
            • 13:30 - 14:00 So that's one of the really nice features of Monte  Carlo sampling. You'll be doing some of that in   Jupyter. For example, you'll be taking some Monte  Carlo samples with a population mean that you'll   set to some number and a population standard  deviation that you'll set to some number. Also   you can get a single sample from a population  with a mean of say 100 and a standard deviation   of 20. This for example is going to be what  you draw for part of the lab. You're going   to have a single sample. You're going to take  it from a random distribution that's normally
            • 14:00 - 14:30 distributed with a mean of something and a sigma of something else, a standard deviation of something else; that's going to give you a single value from that distribution. You can also do this multiple times, which is exactly what we do in Monte Carlo sampling. So we can take a sample size of 500, for example, and we can create, we will use numpy just like we did last week, to create an array that's as large as our sample size and is
            • 14:30 - 15:00 of a data type called 'float64'. And then what we can do is we can iterate over that sample and we can take a single sample and we can plop it in at the i-th index of that sample until we have 500 scores in that sample. That gives us a sampling distribution of individual scores. Now the other thing we can do is we can take those scores and instead of plotting
            • 15:00 - 15:30 individuals, we can upscale them from a sample to a sampling distribution. We can make histograms of samples. So here's an example of what the code looks like. I'll be walking you through that in the lab as well this week. Another kind of sampling that is extremely popular when we're thinking about sampling distributions is a type of sampling called bootstrap resampling. Most of the time, in the real world, we only have one sample of data and it's the sample
            • 15:30 - 16:00 that we actually drew. So say we collected  data from 100 people on a particular task,   or on a particular questionnaire, we can  create our own probability density function   from these data by resampling them randomly and  independently. Now that seems weird doesn't it?   To draw independent samples of the same size as  the original, we have to sample with replacement.   So what that means is we draw a score out of  a hat, we note down what it is, and we throw
            • 16:00 - 16:30 that score back in the hat; we shake it up; we  draw another score we note what that score is,   we throw it back in the hat, so we replace it, and  that's what it means to sample with replacement.   Otherwise the minute we drew one score, we would  not have independent samples anymore because that   same score that we just picked, that can't be  picked again. But if we sample with replacement,   that same score, it can come up multiple times.  Now the really neat thing about bootstrap
            • 16:30 - 17:00 resampling, is it allows us to develop  a sampling distribution of sample means,   for instance, or of sample standard deviations,  or whatever our statistic is, it allows us to   estimate that using the EXACT sample that we  have. So the proportions of data that we got   across the probability density function of  our sample distribution, we can actually use   that to create a sampling distribution and  that can sometimes help us estimate sample
            • 17:00 - 17:30 statistics. It allows us to treat our own data as a population instead of as a sample. It's a really, really powerful statistical technique for understanding the distribution of a sample. What our method is there, is we take random and independent draws of scores from our sample. So we might start with an array that contains the data in our sample, and what we would do is we would pick a random number from that sample, we would note it down,
            • 17:30 - 18:00 we would throw it back, we would do that again and again. And that means that in some samples the same score might come up five or six times; in other samples that score won't come up at all. And so because we're sampling independently and repeatedly, getting samples of the same size with replacement, we can make an estimate for how often each one of the scores in the sample is going to come up on average.
            • 18:00 - 18:30 So what we do [in bootstrapping] is we draw a sample, we calculate our sample statistic, like the mean or the standard deviation, and then we plot that statistic on a frequency histogram. And we repeat that process many, many times. You will want to understand what sampling distributions look like. We'll be talking about that pretty extensively in lab this week. Finally, bootstrapping can be useful, especially if we have non-normally distributed data and we
            • 18:30 - 19:00 want to make comparisons across groups; if we want more precise estimates of correlation; of the parameters that are active in our given samples. So remember, when we sample from a population, we are sampling specific statistics that we can observe. Those facts, those descriptive statistics we talked about in the last lecture, we talked about those relative to the sample. Now here we're interested in scaling those up to population parameters, right? So bootstrapping is
            • 19:00 - 19:30 an extremely powerful way to do that. It gives us  a very precise estimate of population parameters,   especially when we don't know the underlying  population distribution, which often we don't.   It's also useful when we have a sort of smaller  sample size, but not too small. If the sample   is too small, if the sample is so small that  it's not representative of the population,   then bootstrapping might also provide results  that are false or that fail to generalize.
            • 19:30 - 20:00 So we need to make sure that in general, from an experimental design consideration, that our samples are representative of our populations. They need to be large enough to be representative. If we think about ... so one of the things I study is social behavior. Social behavior is really variable. So in order for me to get representative samples, I need to get really big samples of behavior, because behavior is so unique from one interaction
            • 20:00 - 20:30 to the next. So the larger my samples the better  able I am to estimate what a particular behavior   looks like, or how it's distributed in real  life, in a way that generalizes. So here we   need to be careful of having samples that are  too small so that they are not representative   of the population, in which case bootstrapping  isn't going to help us. What we really want is   samples that are big enough to be useful, and  then bootstrapping is an extremely powerful
            • 20:30 - 21:00 technique for making estimates of population parameters based on sample statistics. I'll leave it there for this part of the lecture; we'll move on to probability in the next lecture.
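The bootstrap procedure described in the transcription (draw with replacement, compute the statistic, repeat many times) might look roughly like the sketch below. The observed sample of 100 scores echoes the lecture's "100 people" example but is generated here from a hypothetical skewed distribution; the helper name `bootstrap`, the seed, and the 5,000 resamples are assumptions. The percentile interval at the end is a common way to summarise a bootstrap distribution and is not something the lecture itself covers.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed

# Hypothetical observed sample: 100 scores from a skewed (gamma) distribution.
observed = rng.gamma(shape=2.0, scale=5.0, size=100)


def bootstrap(data, statistic, n_resamples, rng):
    """Resample data with replacement and collect the statistic each time."""
    stats = np.empty(n_resamples, dtype=np.float64)
    for i in range(n_resamples):
        # Same size as the original sample, drawn WITH replacement, so a
        # given score can appear several times or not at all.
        resample = rng.choice(data, size=data.size, replace=True)
        stats[i] = statistic(resample)
    return stats


boot_means = bootstrap(observed, np.mean, n_resamples=5_000, rng=rng)

# The standard deviation of the bootstrap means estimates the standard error.
standard_error = boot_means.std(ddof=1)

# Percentile interval (an addition for illustration, not from the lecture).
low, high = np.percentile(boot_means, [2.5, 97.5])

print(observed.mean(), standard_error, (low, high))
```

Passing `np.std` or `np.median` as `statistic` would give a bootstrap distribution for that statistic instead, matching the lecture's point that the method works for "whatever our statistic is".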