Understanding Data Dispersion

DescriptiveStats4

    Summary

    In this week's lecture, Erin Heerey explores the concept of data dispersion, focusing on how data points spread across a data set. Key measures of dispersion such as range, interquartile range (IQR), variance, and standard deviation are discussed. Heerey explains the difference between population and sample statistics, showing how they inform inferential statistics. The lecture emphasizes the calculation and interpretation of variance and standard deviation, illustrating why these measures are crucial for understanding the spread of data. It highlights the significance of adjustments in the denominator for sample variance and the rationale behind using squared deviations.

      Highlights

      • Dispersion reveals data spread across a set, aiding in understanding metrics like range, IQR, and variance. 📊
      • Standard deviation connects to original measurement units, making it more comprehensible. 🎯
      • Box plots and violin plots visualize concepts like IQR effectively. 📉
      • Population vs sample statistics are differentiated using Greek and Roman letters. 🏛️
      • Sample variance adjusts with n-1 to avoid underestimating population variance. 🔄
      • Squaring deviations keeps positive and negative deviations from cancelling out to zero, and gives values far from the mean more weight, which matters for later statistical tests. 🔢
      • Standard deviation translates variance into original units, easing interpretation. 🏄‍♂️

      Key Takeaways

      • Data dispersion metrics help in understanding data spread within a set. 👓
      • Standard deviation is particularly intuitive because it is expressed in the same units as the data. 📏
      • Visual tools like box and violin plots are handy in interpreting dispersion metrics. 📊
      • Different notations clarify whether a statistic pertains to a sample or an entire population. 📜
      • A careful statistical approach is necessary when inferring population parameters from samples. ⚖️
      • Squared deviations are used in variance calculations because raw deviations from the mean sum to zero, which would hide the actual spread. 🧮
      • Understanding these statistical methods is vital for insightful data analysis. 🧠

      Overview

      Dive into the world of data dispersion with Erin Heerey, where we explore how data points beautifully scatter across a dataset. The lecture covers essential statistical measures like range, interquartile range (IQR), variance, and standard deviation, breaking them down to demystify how data spread helps in making sense of numbers. Understanding range and IQR helps in identifying where the major chunk of data sits, while variance and its close cousin standard deviation help in quantifying how 'spread-out' our data really gets.
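
As a minimal sketch of the range and IQR definitions above, the snippet below uses Python's standard `statistics` module. The `scores` list is invented for illustration and does not come from the lecture. Quartile conventions differ between tools, so the `"inclusive"` method used here is an assumption, and other software may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical scores, invented for illustration (not from the lecture).
scores = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Range: the maximum value minus the minimum value.
data_range = max(scores) - min(scores)

# Quartile cut points. "inclusive" is one of the two interpolation
# conventions that statistics.quantiles offers; it is an assumption here.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# IQR: the 75th percentile minus the 25th percentile.
iqr = q3 - q1

print(f"range = {data_range}")
print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
```

With a minimum of 0 and a maximum of 10 the range is 10, which matches the example given in the lecture.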

        Heerey takes us on a visual tour with box plots, which are as humble as they are informative, laying out data quartiles and spreads vividly. The session delves deep into differentiating between population and sample statistics, ensuring we never mix up these cousins by highlighting notational distinctions. The lecture also shares a golden nugget: for sample statistics, adjusting the denominator to 'n-1' guards against underestimating the population variance, a small tweak with big implications.
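
The lecture defers the full explanation of the 'n-1' adjustment to a later session. As a hedged illustration only, not a proof, the sketch below takes a tiny made-up population, lists every possible sample of size 2 drawn with replacement, and compares the average sample variance under the two denominators. The `variance` helper and its `ddof` parameter are hypothetical names chosen for this example.

```python
from itertools import product


def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by (len(values) - ddof)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - ddof)


# Hypothetical population, invented for illustration.
population = [1, 2, 3, 4, 5]
sigma_sq = variance(population, ddof=0)  # population variance: divide by N

# Every ordered sample of size 2, drawn with replacement (25 samples).
samples = list(product(population, repeat=2))

# Average sample variance when dividing by n versus by n - 1.
mean_var_n = sum(variance(s, ddof=0) for s in samples) / len(samples)
mean_var_n_minus_1 = sum(variance(s, ddof=1) for s in samples) / len(samples)

print(f"population variance:       {sigma_sq}")
print(f"average with n in denom:   {mean_var_n}")
print(f"average with n - 1:        {mean_var_n_minus_1}")
```

In this toy setup, dividing by n gives an average that falls below the population variance, while dividing by n - 1 brings the average back up to it.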

          Ultimately, the squaring of deviations serves a dual purpose: it stops negative deviations from cancelling positive ones, which would otherwise mislead us into thinking there's no spread, and it puts more weight on values far from the mean. By taking the square root of the variance to get the standard deviation, we return to the original measurement units, making the abstract tangible and measurements meaningful. This week's journey through dispersion leaves us well-equipped to understand and interpret data's intricate dance.
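
The cancelling effect is easy to check by hand. The sketch below uses a made-up data set with a mean of 3; the actual numbers on the lecture slide are not given in the transcript, so `data` is an assumption. It shows that the raw deviations sum to zero, that the squared deviations do not, and that the square root brings the result back to the original units.

```python
import math

# Hypothetical data set with a mean of 3 (the slide's values are not in the transcript).
data = [1, 2, 3, 3, 3, 4, 5]

mean = sum(data) / len(data)

# Raw deviations: positive and negative values cancel out.
deviations = [x - mean for x in data]
sum_dev = sum(deviations)

# Squared deviations: every term is non-negative, so the spread is visible.
sum_sq = sum(d ** 2 for d in deviations)

# Treating the data as a whole population: divide by N, then take the root.
pop_var = sum_sq / len(data)  # in squared units
pop_sd = math.sqrt(pop_var)   # back in the original units

print(f"mean = {mean}, sum of deviations = {sum_dev}")
print(f"sum of squares = {sum_sq}, variance = {pop_var:.3f}, SD = {pop_sd:.3f}")
```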

            Chapters

            • 00:00 - 00:30: Introduction to Measures of Dispersion This chapter introduces dispersion in data sets. Measures of dispersion describe how spread out the data points are. The chapter covers the range (the maximum value minus the minimum value), the interquartile range (the 75th percentile minus the 25th percentile), the variance (the average squared deviation of scores from their mean), and the standard deviation.
            • 00:30 - 01:00: Understanding Standard Deviation The chapter delves into the concept of standard deviation, highlighting it as the average deviation of a score from its mean. It is noted for being particularly useful because it is expressed in the same units as the original measurement, unlike variance, which involves squared deviations and often results in large, less intuitive numbers. The standard deviation, therefore, provides a more comprehensible metric that directly relates to the original units of measurement, making it easier to understand the dispersion of data values.
            • 01:00 - 01:30: Using Box and Violin Plots to Understand IQR The chapter discusses how to understand the Interquartile Range (IQR) using box and violin plots. It explains that in a box plot, the IQR is found by subtracting the 25th percentile from the 75th percentile, helping to identify where 50% of the data is located. The illustration uses the visual aid of a box plot to further elucidate the distribution and summary statistics of a data set.
            • 01:30 - 02:00: Calculating Range and Variance The chapter explains that the interquartile range covers the 25% of the data just above the median and the 25% just below it. It describes how to calculate the full range of a data set by subtracting the minimum value from the maximum value. It also introduces variance as the next measure of dispersion, setting the stage for further discussion.
            • 02:00 - 02:30: Population vs Sample Statistics The chapter discusses the importance of understanding the distinction between population and sample statistics. It emphasizes how sample statistics are often used to make inferences about population parameters, and highlights the need for careful application of these methods. The use of variance calculation to derive the standard deviation is mentioned as a common practice, particularly in a classroom and lab context.
            • 02:30 - 03:30: Population Variance Formula Explained The chapter delves into the representation of variables in statistical formulas, particularly focusing on population variance. It highlights the use of Greek letters in representing populations, such as sigma squared (σ²) for population variance and mu (μ) for population mean. The discussion also touches on the usage of uppercase letters when representing certain elements.
            • 03:30 - 04:30: Sample Variance Formula Explained The chapter titled 'Sample Variance Formula Explained' discusses the representation of population size using 'N'. It is noted that 'N' can have different meanings, and its use in this context pertains to representing a population size, which is often theoretical in nature. Additionally, the chapter introduces the concept of population variance (sigma squared) and mentions how it is calculated, hinting at a more detailed examination of the variance formula.
            • 04:30 - 05:00: Squared Deviations and Their Importance The chapter discusses the concept of 'squared deviations' and their significance in statistical analysis. It explains the operator used in calculating the mean, focusing on 'X of i', which represents each individual value of a variable 'X'. The text describes the process of subtracting the mean from each individual score, squaring the result, and then summing these squared values. This summation is referred to as the 'sums of squares'. Finally, the chapter notes that this sum is divided by the capital 'N', which denotes the population size, to complete the calculation.
            • 05:00 - 06:00: Why Use Squared Deviations? This chapter explains the concept of population variance and how it differs from sample variance. It highlights the similarities between the two, noting that while the population variance uses Greek letters, the sample variance uses Roman letters. The formula for sample variance is presented and analyzed.
            • 06:00 - 06:30: Calculating Standard Deviation The chapter titled 'Calculating Standard Deviation' explains the concept of sample variance in comparison to population variance. It highlights the notations used, such as x_bar for sample mean and lowercase 'n' for sample size. The lecture uses a PowerPoint presentation to illustrate these points, although some parts like the bar symbol may not be clearly visible. The chapter emphasizes the formula and method to compute sample variance, referencing previous discussions on population variance for context.
            • 06:30 - 07:00: Graphical Representation of Standard Deviation The chapter covers the sample variance. It explains that the sample variance is found by taking each individual score, subtracting the mean, squaring the difference, summing the squared differences, and then averaging them. The transcript notes that a small adjustment is made in the denominator (n minus 1 instead of n) because the sample variance is otherwise likely to underestimate the population variance. The reason for this adjustment is deferred to a future lecture.
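
For reference, the formulas described in the chapters above can be written out as follows. This is a reconstruction from the spoken description rather than a copy of the slide. The summation limits are added notation, and the sample mean is written as $\bar{X}$ where the transcript says "x_bar".

```latex
\begin{align}
\sigma^2 &= \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}
  & \sigma &= \sqrt{\sigma^2}
  && \text{(population)} \\
s^2 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}
  & s &= \sqrt{s^2}
  && \text{(sample)}
\end{align}
```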

            DescriptiveStats4 Transcription

            • 00:00 - 00:30 So to finish out this week's lecture, we're going to talk about dispersion, how data are distributed across a data set. Measures of dispersion tell us about how spread out the data points happen to be, so we can gain information like the range, the maximum value minus the minimum value, we can get the interquartile range, the 75th percentile minus the 25th percentile, we can get the variance which is the average squared deviation of scores from their means. We can get the standard deviation
            • 00:30 - 01:00 which is the average deviation of a score from its mean. The standard deviation is interesting because it's expressed in the units, the same units as the original measurement. So when we look at variance, often those numbers are very large because they're squared - so we're getting squared deviations and they're large enough that sometimes they can be hard to understand. But the standard deviation is nice because it's related right back to the original measurement units.
            • 01:00 - 01:30 So to get the range and the interquartile range we're going to look once again at our humble box plot. We can also do this off of a violin plot as well but let's leave the box plot up here. This is the same image you saw before. It has the interquartile range marked on it. Here's the 25th percentile; here's the 75th percentile. To get the IQR, you take the 75th percentile and you subtract off the 25th percentile and that gives you the interquartile range so that gives you kind of where 50% of the data are located. The 25%
            • 01:30 - 02:00 that's above the median and the 25% that's below the median is the interquartile range. The full range of the data is the maximum value in the data set minus the minimum value in the data set. So, if your minimum value is zero and your maximum value is 10 you have a range of 10. The next measure of dispersion that we need to look at is the variance, and we're going to
            • 02:00 - 02:30 be talking a lot about this one in the class. In fact, we're going to be using the variance calculation to get us to the standard deviation relatively regularly, and it's very explicitly in the lab this week. So one of the things I want to mention to you here is the difference between population and sample statistics, because we are often going to use sample statistics to tell us something about population parameters, but we need to do that very carefully.
            • 02:30 - 03:00 Let's look at a few of the elements here. Let's look at, for starters, how we represent these variables. When we're talking about populations you will often see these represented in Greek letters so this is sigma squared, it's the population variance, mu is the population mean, you will also often see uppercase values when we're representing a
            • 03:00 - 03:30 population, not always and later in the course you'll see N used in a couple of different ways, but for right now N is going to represent a population size. Do we know the population size? Usually not really - this is often a theoretical quantity. So we have sigma squared, the population variance, is equal to, here's this very fancy 'sum of'
            • 03:30 - 04:00 operator that we saw when we looked at the mean, and then we have X of i, this is the ith value of the variable X so each individual score minus its mean then squared. So that's this special quantity called the 'sums of squares' and it's the sum of each score minus its mean squared (you square before you add), all divided by capital N the population size and that is
            • 04:00 - 04:30 the population variance formula. The sample variance is what we have here. This is what we get based on our specific sample and you can see there are lots of elements here that look very similar to the population variance. I will point out to you that here when we're talking about samples we're using Roman letters so the standard ones. This is s and X this is
            • 04:30 - 05:00 again supposed to be a bar but it doesn't come out very clearly in the PowerPoint display, x_bar the sample mean, and here when we're talking about n we're using lowercase it's the actual sample size. So this is the sample variance. It looks very similar to the population variance that you saw in the previous half of this slide, or in this column over here.
            • 05:00 - 05:30 The sample variance equals the sum of each individual score minus its mean, squared, just like what we were doing there, and we are simply taking an average but in the sample variance we have a little adjustment in the denominator. We're using a little adjustment because we are likely to underestimate the population variance when we take the sample variance. I'm not going to unpack that for you in this week's lecture, we're going to talk specifically about
            • 05:30 - 06:00 that in a future lecture, but suffice it to say for this, and for right now it's enough to understand that we typically use n minus 1 in the denominator of the sample, and we do that because we are taking a sample of a population and we're likely to underestimate it. By making a little adjustment to the denominator we can get a little bit closer to where we need to
            • 06:00 - 06:30 be. Now, it's not a perfect correction, and we'll talk about that, as I said, in a future lecture. So the variance is the average squared deviation from the mean and if you unpack this equation a little bit, it's probably nicer to look at it on this side, what you'll see, remember when we took the mean we took the sum of some numbers divided by the count of those numbers, we're doing exactly the same thing here. We're just doing a little bit more math in
            • 06:30 - 07:00 the numerator before we add things up, so this is the average squared deviation from the mean. Now why do we use squared deviations? Let's look at that for a minute. I'm going to give you a visual example here. Let's say we have a bunch of numbers, a data set, that looks like this. What's our mean? The mean of this data set it turns out is 3. And what you can see is that we have some numbers that are sitting exactly on three they have zero
            • 07:00 - 07:30 deviation from three. We also have some numbers that are one different from three so here's one, two, three of them that have a deviation of one, here's another one, a fourth one that also has a deviation of one this happens to be minus one, and then we have some further away numbers. Now if we look at all of these deviations and add them up, and I will ask you to pause the video and do that yourself, what you'll find is that they come out to a big fat zero.
            • 07:30 - 08:00 But that's not really right because you can see that there's more deviation than zero deviation in this data set. So we need to account for that, and the way we do that is by squaring these numbers. And by squaring these numbers, you can now see that there are no more negative numbers. Remember when you multiply a negative number by a negative number it gives you a positive number. So now we have deviations that we can add up that add up to more than zero and that's really important.
            • 08:00 - 08:30 The other reason we use these squares is that it allows these more deviant values to carry a little bit more weight. That's not quite so important for the variance here, but it becomes more important when we're talking later on about different kinds of statistical tests that also rely on squared deviations from various measures of central tendency.
            • 08:30 - 09:00 So the standard deviation is the average difference from the mean in the original units. That's more easily interpretable than squared units. The population standard deviation is sigma, and it equals the square root of sigma squared. So we calculate our variance, and then we take the square root and that gives us the population standard deviation. The sample standard deviation is the same but obviously using the sample calculations rather than the population calculation. So the standard deviation of the sample,
            • 09:00 - 09:30 denoted as 's', is equal to the square root of the sample variance. Here's what this looks like graphed. What you'll see is that bars are the height of the mean, and the standard deviations around the means are equal to plus or minus one standard deviation. Those are measures of dispersion and that brings us to the end of this week.
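
The graph described at the end of the transcript, with bars at the height of the mean and plus or minus one standard deviation around them, can be sketched numerically without any plotting library. The group names and values below are invented for illustration, and `bar_with_sd` is a hypothetical helper. `statistics.stdev` is the sample standard deviation, with n minus 1 in the denominator.

```python
import statistics

# Hypothetical groups, invented for illustration (not from the lecture).
groups = {
    "group_a": [1, 2, 3, 4, 5],
    "group_b": [2, 4, 6, 8, 10],
}


def bar_with_sd(values):
    """Return (bar height, lower end, upper end) for a mean +/- 1 SD bar."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample SD: n - 1 in the denominator
    return m, m - s, m + s


summary = {name: bar_with_sd(values) for name, values in groups.items()}

for name, (height, lower, upper) in summary.items():
    print(f"{name}: bar height = {height}, SD bar from {lower:.2f} to {upper:.2f}")
```

The three numbers per group are what a plotting tool would need to draw each bar and its standard deviation bar.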