Understanding Advanced Statistical Concepts
Statistics Lecture 6
Summary
In this engaging lecture, Zach's Math delves into statistical measures beyond mean and median, focusing on percentiles, quartiles, and the construction of box plots. The session introduces the concept of percentiles as a measure of positional significance within a data distribution and explains the calculation and visual interpretation of quartiles, specifically through box plots. These box plots are critical for understanding the distribution of data and the concept of interquartile range as a spread measure. Despite their use, the lecturer notes their limitations and encourages a more intuitive approach to data visualization.
Highlights
- Percentiles show where a data point ranks within a distribution 📈.
- Quartiles split data into four sections, each holding 25% of the values 🎉.
- Box plots visualize statistical data but can be non-intuitive 🔍.
- Interquartile ranges help identify data spread while resisting outliers 🚀.
- Excel's functionalities streamline statistical analysis, particularly with quartiles 📊.
Key Takeaways
- Percentiles help track data's relative position within a distribution 📊.
- Quartiles divide data into four parts with equal counts of values, helping identify spread and center 🎯.
- Box plots, though common, are often criticized for lacking intuitive clarity 🤔.
- Interquartile range provides a robust measure against outliers 🧮.
- Using Excel simplifies the computation of quartiles and interquartile range 💻.
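To make the outlier-resistance point concrete, here is a minimal Python sketch (the lecture itself uses Excel). The 12 values are hypothetical, not the lecture's data set, and the `inclusive` quartile method is an assumption; other tools may follow a slightly different quartile convention. It mimics the lecture's scenario of mistyping the maximum as 798.

```python
from statistics import quantiles

def spread(values):
    """Return (range, IQR) for a list of numbers."""
    # quantiles() sorts the data itself; n=4 yields Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1

# Hypothetical sample of 12 values (not the lecture's data).
clean = [61, 64, 66, 67, 69, 70, 71, 72, 74, 75, 77, 79]
# Same sample with the maximum mistyped as 798.
typo = clean[:-1] + [798]

print(spread(clean))  # (18, 7.5)
print(spread(typo))   # (737, 7.5)
```

The range jumps from 18 to 737 because of a single bad entry, while the IQR stays at 7.5 because it only looks at the middle half of the data.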
Overview
Zach's Math takes us on a statistical journey in this lecture, exploring beyond the fundamentals of mean and median. The discussion focuses on expanding our understanding of statistics by introducing percentiles as a method to gauge how a data point or value ranks within the entire distribution. Percentiles are used to compare individual scores or measurements against a larger population; the lecture's examples are adult male height, household income, and IQ scores.
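As a rough illustration of the percentile idea, the sketch below computes the percentage of values that fall strictly below a given value, which is how the lecture describes a percentile. The function name `percentile_rank` and the ten scores are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def percentile_rank(values, x):
    """Percentage of values strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical exam scores (not from the lecture).
scores = [55, 62, 68, 70, 73, 77, 81, 85, 90, 96]

rank = percentile_rank(scores, 85)
print(rank)        # 70.0 -> 85 sits at the 70th percentile of this sample
print(100 - rank)  # 30.0 -> the remaining 30% are at or above 85
```

The two printed numbers add to 100, matching the lecture's point that the below and above percentages always split the whole distribution.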
Furthering our statistical toolkit, quartiles are introduced as a structured way to partition data into four regions, each containing an equal share (25%) of the values rather than an equal width on the number line. The middle quartile, Q2, is the median, and the distance from Q1 to Q3 gives the interquartile range. Quartiles are then applied in box plots, which show a distribution's center, spread, and shape in one picture, although the lecturer acknowledges that they can be hard to read.
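The five-number summary behind a box plot can be sketched in a few lines of Python. This is only an illustration: the lecture computes these values in Excel, the 12 values here are hypothetical, and the `inclusive` method is an assumed quartile convention (results can differ slightly between tools).

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical sample of 12 values (not the lecture's data).
sample = [2, 4, 5, 7, 8, 9, 11, 12, 15, 18, 21, 30]

low, q1, med, q3, high = five_number_summary(sample)
print(low, q1, med, q3, high)  # 2 6.5 10.0 15.75 30
print(q3 - q1)                 # 9.25 (the interquartile range)

# Width of each of the four regions on the number line.
print([q1 - low, med - q1, q3 - med, high - q3])  # [4.5, 3.5, 5.75, 14.25]
```

Each region holds three of the twelve values, yet the widths range from 3.5 to 14.25, which is the "equal count, unequal width" effect the lecture warns about.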
The lecture concludes with a critical look at box plots, highlighting their limitations despite their widespread use. The speaker shows alternative visualizations, such as jitter plots and strip plots, that are more intuitive than traditional box plots, and recommends Excel for computing quartiles. The class is geared towards solidifying the understanding of these statistical tools as part of the coursework.
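One way to see why box plots can oversimplify is to build two samples that share a five-number summary but are distributed differently. The sketch below does that with made-up numbers; `group_a` and `group_b` are hypothetical and are not the control and test groups from the article the lecture cites, and the `inclusive` quartile method is an assumption.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical data: group_a is bunched into clusters,
# group_b is spread evenly across the same span.
group_a = [10, 11, 12, 20, 20, 21, 30, 39, 40, 40, 48, 49, 50]
group_b = [10, 13, 17, 20, 23, 27, 30, 33, 37, 40, 43, 47, 50]

print(five_number_summary(group_a))  # (10, 20.0, 30.0, 40.0, 50)
print(five_number_summary(group_b))  # (10, 20.0, 30.0, 40.0, 50)
```

Both groups would produce the same box plot, even though one is clustered and the other is evenly spread; a strip or jitter plot of the raw values would show the difference.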
Chapters
- 00:00 - 01:00: Introduction and Recap of Lesson 5 The chapter 'Introduction and Recap of Lesson 5' begins with a welcome to the lecture recording for lesson six. It serves as a direct continuation of lesson five, which covered the common statistics mean, median, and standard deviation, and it sets out to cover a few more. The chapter also clarifies that a 'statistic' (singular) is a number that describes some characteristic of a sample of data, as opposed to the field of study.
- 01:00 - 10:00: Percentiles In this chapter, we delve into the concept of percentiles as a measure of positional value within a data set. Initially, we explored measures of center and spread, but now we focus on determining the relative position of particular values within a distribution. By understanding the concept of percentiles, we can effectively describe how a specific value compares to all possible values in the dataset.
- 10:00 - 25:00: Quartiles and Box Plots The chapter defines quartiles as specially named percentiles (Q1, Q2, and Q3 at the 25th, 50th, and 75th percentiles) and uses them to measure the spread of data through the interquartile range (IQR). It then builds the five-number summary (minimum, Q1, median, Q3, maximum) into a box plot, a type of data visualization.
- 25:00 - 40:00: Issues with Box Plots The chapter examines why box plots can mislead: tightly clustered data produces a small-looking region, and data sets with different shapes can produce near-identical plots. It then walks through more intuitive alternatives, such as jitter plots and strip plots, from an article in Nightingale, the journal of the Data Visualization Society.
- 40:00 - 43:00: Conclusion and Classwork Expectations The chapter summarizes percentiles, quartiles, interquartile range, and box plots, and outlines what the classwork will cover: practicing percentile concepts, finding five-number summaries, using the interquartile range, and drawing and interpreting box plots.
Statistics Lecture 6 Transcription
- 00:00 - 00:30 hello welcome to our lecture recording for lesson six today we're going to do a pretty direct followup from what we covered in lesson five where we sort of talked about a few common statistics namely mean median and standard deviation we're just going to cover a few more here I do want to give the quick reminder that when I say statistics I'm not talking about the field of study I'm talking about a statistic singular which is a number that describes some characteristic of a
- 00:30 - 01:00 sample of data and so we're also going to talk here about measures of position before it was just measures of center and spread but now we're going to give numbers that describe where in a distribution a particular value is relative to all the other possible values that we could have gotten now the language for that just to be up front the words we're going to use are percentile
- 01:00 - 01:30 quartile and then those will let us do interquartile range which is a measure of spread and gets my spread color which is IQR for short and finally we'll be able to do a visual called a box plot now without further ado I'll go ahead and just start defining these things we'll start with a percentile a percentile is a position or a rank within a
- 01:30 - 02:00 distribution a percentile indicates the percentage of the data that falls below a certain threshold so the way this works the best way to do this is with an example I can say if a data point is at the 70th percentile it means that 70% of all possible data points will be below it so we're giving a measurement a
- 02:00 - 02:30 number that tells us where that value is is it at the 50/50 mark the 70/30 mark something like that and as a note there is a distinction here between position and value the word percentile can mean either it can be like the number that is at a location or the location itself and in my usage I will treat the word percentile as a
- 02:30 - 03:00 position just to keep everything a bit more distinct and clear so the way I'll say things is the value at a percentile and what that means is the value is the actual number that corresponds to the position described by the percentile and percentiles are commonly used to compare individual scores or measurements to a larger population so they help us understand where a particular value ranks relative to
- 03:00 - 03:30 others to give something kind of real world here where I am genuinely using real world height data it's sourced at the bottom there a 6-foot-tall adult human male is at the 84th percentile for height so an adult man who is 6 feet tall is taller than 84% of men which means he is shorter than 16% of men there's different ways we can say this
- 03:30 - 04:00 we can say he's at like the bottom 84% of height or the top 16% of height and we can also use a picture to demonstrate this now we haven't really seen a picture like this before so I do want to label this and explain what we're looking at on the bottom we have a number line that is actually listing values of the variable in this case and I'll finish writing before I talk so giving values of the
- 04:00 - 04:30 variable which for this picture are heights then I can pick a particular height these heights are in inches so 6 feet tall means 72 inches tall this guy right here is the value for a 6-foot-tall
- 04:30 - 05:00 man and then above that on the bell curve we have the percentages labeled so I have 84% and 16% and so these are the percentages above and below the dividing line
- 05:00 - 05:30 and I can describe that dividing line as being like at x = 72 inches and we'll see a lot of pictures like this in future lessons the like we'll have an entire lesson that is on curves like this it's lesson 11 so this is our first introduction to pictures like this this also isn't the only way that pictures like this can show up so I
- 05:30 - 06:00 want to show just a recap or I'm going to jump back to last lesson I'll say it that way and in lesson five I had this picture of distribution of household income and I said these numbers at the top the 10th 50th 90th 95th these are percentiles so if I go to translate this the 10th percentile value is $12,300 so 10% of households make
- 06:00 - 06:30 $12,300 or less the 50th percentile here is $53,700 so half of households make $53,700 or less half make that much or more the value at the 90th percentile is $157,000 so if you made $157,000 that year you were in the top 10%
- 06:30 - 07:00 of earners and for the 95th percentile so the top 5% of earners you would need $206,600 and so these percentiles just show up as a way to label specific places in a distribution of data I think I probably most commonly see them on like an income distribution or something like this but they show up in other contexts too you could look at like the percentiles of letter grades on an exam or something like that but just a real
- 07:00 - 07:30 world picture this is using that census data for household income I'll go back to lesson six here so I can keep moving and then as we go through I've got another example next one is IQ scores at the 50th percentile so I'll go ahead and label this here at the 50th percentile we would have an IQ of 100 so I can draw my dividing line
- 07:30 - 08:00 my x = 100 and again the number line at the bottom represents the variable which for this is IQ score and these numbers up above these percentages are the percent above or below and we could use this like if we
- 08:00 - 08:30 have 115 that's an above average IQ if we have a 70 that's a pretty well below average IQ so on and so on I got some bullet point notes here just to hit key Concepts so I'm going to emphasize here percentiles the number that they give is always for the percentage less than a value so percentiles themselves are always the number on the left they're like this
- 08:30 - 09:00 number right here which in a 50/50 it's the same number but any other situation they will be different numbers reason for that is that 100% of all values must exist somewhere they need to like if they do exist they they are somewhere so the less than and greater than the above or below percentages must add up to a 100 so that's why we have a 50/50 here
- 09:00 - 09:30 or an 84/16 on the previous one or like a 70/30 an 80/20 a 60/40 as long as those two numbers add to 100 we have a dividing line somewhere on this distribution this also gives us an alternative way we can think of percentiles so the nth percentile can be thought of as the value that separates the bottom n percent
- 09:30 - 10:00 from the top 100 minus n percent so if I think back to that income I can say the 10th percentile is the cut off for being in the bottom 10% of earners or being in the top 90% of earners those numbers add to 100 I can phrase it either way then we also get 50 is in the middle so something above 50 is above average below 50 is below average and 50 exactly I'll emphasize
- 10:00 - 10:30 that exactly word is the number exactly in the middle the exact center of a sorted list of data which is something we defined before the median and so the median is actually just a special case of a percentile it's a named percentile there's a few more that we're going to describe as well that are called quartiles so I'll go ahead and scroll down and get that on the screen here so quartiles are just
- 10:30 - 11:00 specially named percentiles they can be defined as percentiles but we'll use this quartile language what these things are is three dividing lines like I was showing above but just specific ones so there are three quartiles which cut a set of data they split a distribution into four regions
- 11:00 - 11:30 Each of which has an equal likelihood that values fall within it and I will change that to a yellow highlight just so I can highlight specific stuff here so when I say Each of which has equal likelihood that values are in it I'm saying each of the four regions has 25% of all the possible data values inside of it a value has a 25% chance of being in each region and these dividing
- 11:30 - 12:00 lines will occur at the 25th 50th and 75th percentiles we call those Q1 Q2 and Q3 if you're wondering why we only have three of these for four different quartiles it's because if you want to cut anything into four pieces you only need to make three cuts you cut once let's say you cut something in half you get two halves one cut two halves you cut again then you'll get three pieces so
- 12:00 - 12:30 two cuts three pieces you cut one more time you get four pieces so three cuts four pieces that's why we only actually have three quartiles and again just to label here the middle one is something we have already introduced it is the median so these quartiles are going to give us another sort of visual that we can use but they can also be a little bit counterintuitive because there's an equal number of
- 12:30 - 13:00 values in each of the four regions we can end up with situations where the denser and like the more closely packed the values are the smaller their region will look on a number line and I'll show some visuals for why this can end up counterintuitive but before we get there let's go ahead and go through one more page this is where I kind of elaborate on that a bit further so let's note the four regions separated by the quartiles
- 13:00 - 13:30 won't necessarily have equal widths since an equal likelihood is not the same as an equal width or size on the number line now if we want to use a normal bell curve as an example we get a situation that looks like this so I can highlight my middle two regions region one region two I'll draw a little dividing line here just to make them stand
- 13:30 - 14:00 out and that covers 25% of the data but if I want to shade farther if I want to look at the outer two regions I have to cover a lot more space those blue lines are literally longer than the yellow lines so the less common values farther from the center need more range to make
- 14:00 - 14:30 up a full 25% of the data so I'll try and underline that less common values need more range and that's where the counterintuitiveness will come into play when I fully get there I'm going to go ahead and keep scrolling here I want to show what it's going to look like for us to compute these just like lesson five we're going to use Excel so we're not going to be necessarily doing this by hand it's a
- 14:30 - 15:00 little bit of a pain to do it by hand but I'll go ahead and pull up Excel here the sample data that I'm going to use is based on the lesson five classwork if you haven't gone to that yet it's no big deal you don't need to have done it before it's just you'll recognize the set of data um from both this lecture and that classwork whichever order you do them in I've got an Excel print out here I'll go ahead and also swap to Excel view on the video and you can see that I
- 15:00 - 15:30 have these 12 values entered into Excel here it doesn't take much time to do I can do this in you know 10 15 seconds but I do want to go through whenever I do this and just double check for typos make sure they match the numbers that are provided because if they don't it's a really easy way to make a mistake without having a good way of catching it so I just check one by one everything looks good and I can you
- 15:30 - 16:00 know lazily double check my print output and the handout here those also match so I don't think I made any typos and I'm good to go I can see from here that I have my quartiles Q1 Q3 and technically the median we've already been working with that's Q2 so I already have the values without needing to do any extra work over what I do for mean median standard deviation so just
- 16:00 - 16:30 bonus stuff for free like I said in the last lesson five video now from that output so on the screen over there I can also include the minimum and maximum I'll highlight this here will also include these two things and this will give us something called the five number summary so minimum Q1 Q2
- 16:30 - 17:00 Q3 maximum if I show this on a number line I get four regions that have different sizes the four regions do not have equal widths here despite each of them containing three data values so there's an equal number of values here here here and here but we need more or less space to actually get those values and this is where the issue comes into play cuz a
- 17:00 - 17:30 smaller region right here actually means one that is more dense with data we did not need as much space to get our three values a wider region which looks bigger actually has less dense data and so it can just be unintuitive I keep using that word because that's the best word I've got but we're kind of stuck with these because they are incredibly commonly
- 17:30 - 18:00 used now what's the purpose of this five number summary it'll let us do two things so five number summary essentially provides a summary of center spread and shape that's based on percentiles rather than on like mean standard deviation or median from before we got a little labeled picture here the
- 18:00 - 18:30 range in red goes from the full minimum to maximum this is very vulnerable to outliers because let's say we typo the maximum we make it 798 now the range is enormous so instead we use what's called the interquartile range in blue that's the distance from Q1 to Q3 it spans the middle half of the data and so it's resistant to those outliers
- 18:30 - 19:00 it describes the spread of the middle 50% and then circled in the center there in green is the actual median so that could be like the center of our figure if we take this five number summary and we draw it out on a number line we actually get a plot that will be called a box plot before I move too much I'll go ahead and highlight here what I said verbally so the center is the
- 19:00 - 19:30 median spread is measured by the width from Q1 to Q3 which is defined as interquartile range the range of only the middle 50% of the data which trims out outliers and keeps things behaving more like a median does I can update my picture here with IQR included with the median labeled I have a new diagram I want to look slightly past that because what we're
- 19:30 - 20:00 ultimately going to do is kind of connect a box together here in the middle can loosely draw like this if I fill in those two lines and fill in a box I get what is creatively called a box plot so I will scroll down and here's the same picture with those two lines this is a box plot this lets us visualize the shape this lets us see range and interquartile range this has a line right here
- 20:00 - 20:30 for the median so it shows us all in one our center spread and shape description for a set of data these were historically used because they're actually pretty straightforward to make by hand you don't need a computer for it but as we'll see they're not very good if you have a computer because computers can just generate more intuitive images so I got one more page before I get
- 20:30 - 21:00 there but here's an example this is from a real world research study if you tried to like Google image search this you'll see tons of charts that look very similar we can just see how these different um sets of data look so I don't actually even remember what they are here but I've got data one data set 2 data set 3 data set 4 and it looks like this this and this are all reasonably similar like the fourth one right here is a tiny
- 21:00 - 21:30 bit lower than the others just by literal height um but then this one's much more different from the rest it's more spread out there's more range to it the center is smaller because it's lower on our number line on the left hand side so we can compare these sets of data by drawing side-by-side box plots little note here just for the box on the side is that these dots represent outliers we're not going to worry much
- 21:30 - 22:00 about finding these outliers if you're interested you could Google like fences for outliers literally like the fence on a yard um there's a computation you can do based on interquartile range that formally determines what's an outlier or not we don't really care we'll just if it's far away from the rest we'll call it an outlier and then from here I do also have some pictures for what the box plots look like for our four most
- 22:00 - 22:30 common shapes of data let me zoom so this fits well on the screen there and so top left I have a normal distribution we've got a pretty small box in the middle and longer tails like we saw originally top right is a uniform distribution where everything's equally likely so all four regions are pretty much the same width for a positively skewed distribution a skewed right distribution is the other wording
- 22:30 - 23:00 we get that longer right tail so the whisker on the right is longer and maybe there's some outliers out there opposite for negative skew or left skew we get a longer left tail but again we got to keep in mind when we're looking at these and I'm going to zoom way in just to be obnoxious here that technically speaking there's an equal number of data values here here here and here so that right little whisker has
- 23:00 - 23:30 the same amount of data as that very long left whisker and that's where the issue comes into play because if you show this to someone who's never been trained in how box plots are defined it looks like there's way more stuff on that left tail it's just longer it's bigger even though that's not actually the case and that is why box plots in my opinion aren't actually that good of a graph or of a chart we just have to cover them
- 23:30 - 24:00 because they are so common and so my last little bit here is going to dive into the issues of those box plots there and like I said the core issue is that the more the data is clustered together the smaller it looks on the box plot and I stole some pictures here from a you know I can drop the Excel view we'll go back to just this
- 24:00 - 24:30 um but completely lost my train of thought let me regather that I'm borrowing some images here from that blog post or that publication that I mentioned earlier so you'll see those when I show that as well but right away we can see the issue what looks like a small quantity on the bottom is actually the single biggest region that is present and box plots can also overly
- 24:30 - 25:00 simplify the data so on the left hand side here I have what looks like two identical groups but we can see on the right hand side with a harder to create plot but you know if we got a computer who cares we can see that those sets of data are actually quite different the control group has a cluster of people in their teens and a cluster of people in their early 40s the test group is more evenly spread out throughout those ages so
- 25:00 - 25:30 there are just differences that get hidden and I've got a link to this page at the bottom down here this Nightingale DVS and I'm going to go ahead and just pull that article up right here like I said this is from the journal of the Data Visualization Society so it's not just some random blog post it is an actual thing and we've got that first picture I won't go in full detail here I would recommend reading this if you're interested just if you think
- 25:30 - 26:00 you're going into anything like STEM or science related where you will likely encounter box plots in just workshops and conferences and publications it's good to understand I'm mostly going to look through the pictures here where we have some alternatives for the visualization this one is using the core idea of a box plot but it's using like the darkness of the region to show where those values
- 26:00 - 26:30 are more or less spread out we can see that the more spread out region is a lighter color on that left hand chart or we can just literally make it so there are rectangles with the same area and we can see how stretched out that third region is in Miss naab's class because it's such a skinny long rectangle versus the others that are less like that so this is an example of an
- 26:30 - 27:00 alternative we can use that is more visually intuitive and they're not the only ones so if I scroll down we can see the jitter plots here this is one that I put in my handout and we can also just see a straight up um what we call a strip plot which again shows the pattern more clearly than a box plot itself does so I can see how the age distributions differ in groups a b and c very directly
- 27:00 - 27:30 I don't have a layer of abstraction applied to it and if we spread those dots out a little bit to hide like overlap we can see it as well it sort of Blends together the um darkness of the shading and all of that so there's just better ways of presenting data than box plots themselves now that we have more sophisticated tools and we don't have to draw these by ourselves I think that's pretty much it I've got a side by side
- 27:30 - 28:00 comparison of the three alternatives shown right here and then that should more or less finish the article so you know if you want take a moment pause here look at those differences and then think what these would look like as box plots like from the journal example that I provided and think what's more straightforward to understand I'll move on from this in three two one just
- 28:00 - 28:30 double checking that I did finish the article I did so I can go back here and all I really have left to talk about is that we're really not going to be making that many box plots this semester it's just they are sort of a legacy type thing archaic would be another word what you can expect here I do have to cover them because they do show up all the time people do what I consider bad
- 28:30 - 29:00 practice pretty often but they will for us show up in two assessment questions I don't remember the exact numbers of them off the top of my head but I know there is one each where you will draw a box plot after finding a five number summary which again finding the five number summary just use Excel to do it it's right there you type the set of data in you get the five number summary and then you can draw the box plot by
- 29:00 - 29:30 ticking off each of those five numbers on a number line then other than drawing we'll have interpretation where we have a comparison so oh I do not know why that jumped there I apologize for that sometimes my tablet can be a little weird but we'll interpret a box plot comparison like the sciencey one that I included on page four or
- 29:30 - 30:00 five I got to slide this back over there we go so those are the two skills I expect you to have and again you're only going to be asked to do this once this isn't going to be something that shows back up on the final so worst case I'm hoping you get it on a reattempt or something like that I don't want these to be a big sticking point for the course despite how common they are and so my closing summary here we
- 30:00 - 30:30 have percentiles as a thing so percentiles describe a relative rank within a distribution of values if you can find the percentile of a value you know exactly where it is compared to the rest and as far as skills relating to percentiles go it's mainly going to be understanding comprehension so doing interpretations and specifically I
- 30:30 - 31:00 will ask you to understand and be able to distinguish between a percentile and a percentage they are very closely related but they are different so make sure you understand that distinction percentile versus percentage then I have quartiles which are specially named percentiles so Q1 is the 25th percentile Q2 is the median or the 50th percentile Q3 is the
- 31:00 - 31:30 75th percentile and these separate a distribution into four regions of equal size the quartiles are the dividing marks not the regions themselves so we only have three of them um and then whoops skipped one interquartile range is the difference between Q1 and Q3 which is the range for the middle 50% of the data this means that median and IQR give us
- 31:30 - 32:00 an alternative measure of center and spread so they're not as common as mean or standard deviation we'll see that throughout the semester most of what we'll be doing is mean and standard deviation but they're both resistant to outliers and skew I demonstrated this for medians in lesson 5 during the lecture you'll demonstrate it yourselves for IQR in the classwork for them so that's what you can expect to do next
- 32:00 - 32:30 class then lastly here we have box plots these are a type of graph that shows the minimum Q1 median Q3 and maximum values but box plots are kind of confusing not intuitive we won't use them that much but as I've been saying they're very common so I got to cover them and I want to do at least as good a job as I can to help you interpret them so for the classwork you can expect
- 32:30 - 33:00 to do these things practice some percentile concepts find five number summaries use interquartile range draw and interpret box plots if you got any questions feel free to email me and I hope to see you in our next class
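The lecture mentions "fences for outliers" as a computation based on interquartile range without working through it. A common convention, which is an assumption here since the lecture does not state the multiplier, places the fences 1.5 × IQR below Q1 and above Q3. The sketch below applies that rule to a hypothetical 12-value sample; the function names and the `inclusive` quartile method are illustrative choices, not something the course prescribes.

```python
from statistics import quantiles

def fences(values, k=1.5):
    """Return (lower fence, upper fence) using k * IQR beyond Q1 and Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = fences(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical sample (not the lecture's data); 60 sits far from the rest.
sample = [2, 4, 5, 7, 8, 9, 11, 12, 15, 18, 21, 60]

print(fences(sample))    # (-7.375, 29.625)
print(outliers(sample))  # [60]
```

For the course itself the informal rule from the lecture still applies: a value far away from the rest is treated as an outlier, and the dots on a box plot mark such values.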