DescriptiveStats3

Summary

In this lecture, Erin Heerey delves into various measures of central tendency, including the arithmetic mean, median, and mode, explaining their significance in data analysis. The discussion extends to the use of histograms, box plots, and violin plots to visually represent data distributions and highlight key trends, such as shifts in demographics or financial data over time. Heerey emphasizes understanding how data is distributed to draw meaningful conclusions about broader populations.

Highlights

The arithmetic mean, commonly known as the average, is the central focus of many data discussions 💬
Histograms show the distribution of data and are useful for visualizing the mode 📊
The Great Backyard Bird Count serves as a real-world example of data collection 🐦
Box plots and violin plots are tools for displaying the median and understanding data spread 📊
Changes in generational debt illustrate shifts in societal financial trends 💰
Violin plots provide more detailed insights into data distribution than box plots 🎻
A histogram example highlighted changes in ages of first-time mothers between 1980 and 2016 👶
The median divides data into two equal parts and is often represented in a box plot 📦

Key Takeaways

Measures of central tendency include mean, median, and mode 📊
The arithmetic mean is calculated by dividing the sum of all numbers by the count 📐
Median is the middle value when data points are ordered 🌟
Mode is the most frequently occurring value ⚡
Histograms visually display data distributions 📈
Box plots and violin plots provide insights into data distribution and median 🎻
The Great Backyard Bird Count is a practical example of data collection and analysis 🐦
Histograms can reveal shifts in data trends over time, such as age of first-time mothers 👶

Overview

In this enlightening session with Erin Heerey, the focus is primarily on understanding measures of central tendency in statistics, with a particular emphasis on the arithmetic mean, median, and mode. Each of these measures provides unique insights into the data set, revealing different aspects of central values that aid in interpreting statistical data more effectively. Heerey also introduces students to the role of histograms, a key tool in illustrating the distribution of data points.

Heerey's lecture uses real-world examples like the Great Backyard Bird Count to illustrate the practical application of data analysis. By participating in this event, data about bird populations can be visualized and interpreted, offering insights into migratory patterns and population sizes. This example underscores the importance of data visualization tools such as histograms, which can reveal trends and distributions at a glance.

Moreover, the discussion on graphical data representation extends to the use of box plots and violin plots. These tools not only showcase the median and dispersion of data but also highlight shifts in statistical measures over time, such as changing trends in the ages of first-time mothers or generational differences in household debt. By the end of this lecture, students gain a comprehensive understanding of these essential statistical tools and the stories they can tell through data.

Chapters

00:00 - 00:30: Introduction to Measures of Central Tendency The chapter introduces the concept of measures of central tendency, which provide information about the average data point. It highlights that there are various kinds of averages, with a specific focus on the arithmetic mean. The arithmetic mean is explained as the sum of all numbers in a data set divided by the number of data points.
00:30 - 01:00: Arithmetic Mean, Median, and Mode The chapter discusses three primary measures of central tendency: mean, median, and mode. The mean, also known as the average, is the sum of all values divided by the number of values. The median is the middle value when all numbers are ordered from least to greatest, effectively placing it at the 50th percentile. This means half of the data is below the median and half is above. Lastly, the mode is identified as the most frequently occurring value in a set of data.
01:00 - 02:00: Understanding the Mode and its Visualization This chapter introduces the concept of the mode in statistical data analysis, highlighting it as the most frequently occurring data value in a dataset. An example is provided with a dataset where the number 6 is identified as the mode. The chapter also mentions that histograms are a common way to visualize modes and that data sets can have more than one mode. Histograms will be used frequently throughout the class for data visualization.
02:00 - 04:00: Histograms and Data Distribution The chapter titled 'Histograms and Data Distribution' introduces histograms as tools to display modes without using numeric terms. Histograms are described as plots that represent the distribution of data, available in various shapes. The x-axis of these plots typically represents the individual values a specific variable can take. More details on histograms and their applications are promised to be covered in subsequent lectures.
04:00 - 05:30: Histogram Example: Age of First-Time Mothers This chapter explains the concept of a histogram, specifically using the example of 'Age of First-Time Mothers'. It describes how histograms are used to count and display the frequency of various values within a dataset, with values plotted along the x-axis and frequency on the y-axis. The chapter briefly mentions the Great Backyard Bird Count as a hypothetical example to illustrate how data collection and plotting occur.
05:30 - 06:30: Shift in Distribution Over Time This chapter discusses the shift in bird distribution over time, focusing on how people contribute to studying this phenomenon. It highlights the role of birdwatchers and citizen scientists who use apps to record sightings, which assist researchers in estimating bird populations, distribution, and migratory patterns. This data collection is crucial for understanding changes in bird habitats and movements.
06:30 - 08:30: Introduction to Mean and Two Examples The chapter provides an introduction to the concept of 'mean' using examples from bird watching. It suggests counting common birds like cardinals and robins in a neighborhood to illustrate the idea. By observing and reporting the numbers of these birds, various data points can be collected.
08:30 - 10:00: Introduction to the Median This chapter introduces the concept of the median in statistical analysis. It starts by discussing common real-world examples such as counting birds in a yard to illustrate how data can be collected and summarized. The chapter then transitions to explaining the use of histograms, which are tools that help us understand the distribution of data within a sample. It touches on the importance of determining whether data is normally distributed or follows a different kind of distribution. This foundational knowledge serves as an entry point to more advanced statistical concepts.
10:00 - 14:00: Box Plot and Violin Plot for Median The chapter titled 'Box Plot and Violin Plot for Median' discusses how histograms can provide insights into the data distribution within a larger population. Histograms can indicate if a sample is skewed, suggesting possible skewness in the population itself. This provides valuable information about the distribution characteristics of the data.
14:00 - 15:30: Comparison: Box Plot vs Violin Plot The chapter discusses methods for understanding statistical distribution, specifically focusing on the comparison between box plots and violin plots.
15:30 - 17:00: Conclusion on Violin Plot The chapter titled 'Conclusion on Violin Plot' discusses the changes in the age distribution of first-time mothers from 1980 to 2016. In 1980, the most common age for first-time mothers was between 18 and 20, primarily 19-year-olds who graduated from high school, got married, and started families. By 2016, this mode had shifted to around 20 years old, and the distribution had become clearly bimodal.
17:00 - 18:00: Preview of Next Section: Dispersion The chapter discusses the concept of multiple modes in data, particularly focusing on how data shape indicates trends rather than just numeric counts. A specific example is given, showing a graph illustrating a substantial shift in the age at which people have their first child. Compared to 1980, the data reveals that people are now waiting longer, indicating interesting trends in societal behavior.

DescriptiveStats3 Transcription

00:00 - 00:30 So let's talk about measures of central tendency. Measures of central tendency tell us about the average data point. Now, there are different kinds of averages that we can calculate, and we'll talk about a bunch of them in today's lecture. The one that you're probably most familiar with, and that we are going to focus on most heavily, is what we call the arithmetic mean. The arithmetic mean is the sum of the numbers in a data set, divided by the number of data points
00:30 - 01:00 that we have in that data set. It's often referred to as the 'mean' or the 'average'. We will also talk about the median, which is the middle value in the data set. It's the 50th percentile, so it's right where 50% of the data are lower than that number and 50% of the data are higher than that number. And then we have the mode, which is the most common value in a data set.
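As a rough illustration of the three definitions just given, here is a minimal Python sketch. The data set is hypothetical (it is not taken from the lecture slides), and the variable names are chosen only for this example. The frequency table at the end holds the counts that a histogram would draw as bar heights.

```python
from collections import Counter
from statistics import median, multimode

# Hypothetical data set, made up for illustration only.
data = [1, 1, 2, 2, 3, 3, 6, 6, 6, 6, 7]

# Arithmetic mean: sum of the values divided by how many there are.
arithmetic_mean = sum(data) / len(data)

# Median: the middle value once the data are sorted (the 50th percentile).
middle_value = median(data)

# Mode(s): the most frequent value(s); multimode returns every tied value.
modes = multimode(data)

# Frequency table: how often each distinct value occurs.
counts = Counter(data)

print(f"mean   = {arithmetic_mean:.1f}")
print(f"median = {middle_value}")
print(f"modes  = {modes}")
for value in sorted(counts):
    # One '#' per occurrence gives a crude text histogram.
    print(f"{value:>2} | {'#' * counts[value]}")
```

`multimode` is used rather than `mode` so that a data set with several equally common values reports all of them.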
01:00 - 01:30 Let's talk about the mode first, mostly because it's the easiest one to describe. It is simply the most frequent data value that exists. So if I give you this pretend data set here, with just a few data points in it, what you can see is that it's really clear that 6 is our mode. Six is the most frequently occurring value. One of the ways we can see modes, and in fact one of the most common displays of these data, is the histogram. We can also have data sets with more than one mode. By and large we'll be looking at a lot of histograms in the context of the class
01:30 - 02:00 and histograms are how we display modes when we don't want to use numeric terms. Histograms are simply plots of distributions that come in many shapes, and we'll talk more about these in future lectures, and they provide a picture of the distribution of the data. The x-axis typically plots the individual values a specific variable can take. For example, you might
02:00 - 02:30 have a variable for things that you can count, and so you plot each of the values on the x-axis, and the y-axis is the frequency with which each of these individual values occurs. So let's say you're participating in, a good example is the Great Backyard Bird Count. This is a worldwide event that happens, I think, in February,
02:30 - 03:00 where all kinds of people who put bird feeders in their yards and like to go around with their binoculars walk around and count birds. There are apps that you can download, and you can put in what kind of bird you saw and the location where you saw it. And that allows people who study birds, and their migratory patterns, and their habits, to estimate how many of these birds there are, where they're living, where they're migrating,
03:00 - 03:30 and that tells us things about climate change and so forth. But what we might do is, let's say you saw a really common bird in our area, like cardinals. Robins are another really common bird in our area as well. And there are many of these birds. And so you could count the number of robins that you see coming into your yard and you could report that on your app. So you could have 1 robin, you could have 5 robins, you could have 26 robins, but you can count the number
03:30 - 04:00 of each of these birds that shows up in your yard, and that would give people a count of the number of times you saw that bird, or you could count the number of different birds you saw, from cardinals to robins to finches to sparrows to whatever. Histograms tell us about the way the data in a sample are distributed. Are they normally distributed, or what kind of distribution do they have if they're not
04:00 - 04:30 normal? And another interesting thing that we can learn from the histogram of a sample is it can tell us something about how the data are likely distributed in the larger population. So histograms give an estimate of this. If we have a histogram that looks skewed for example, that might suggest that the population itself has a skewed distribution, and that can tell us something. We can sort of look at the degree to which a sample might be
04:30 - 05:00 representative of a larger population or might be generalizable to another population with the same or a different kind of distribution. It also tells us about whether certain values are more frequent than others. Here's an example of a histogram. This is a great histogram that was produced by the New York Times a number of years ago, and this is the age of first-time mothers in 1980 compared to the age of first-time mothers in 2016. These are all women who got pregnant the good old-fashioned way, not using any high-tech methods, and what you can see is that the distribution of their ages has
05:00 - 05:30 changed dramatically between 1980 and 2016, from a peak where the greatest proportion of first-time mothers were getting pregnant sometime between the ages of 18 and 20. So these are 19-year-olds, people who graduated from high school, got married and got pregnant. And you can see that that same mode has now increased to 20, at least in 2016, and there's a clear bimodal distribution here
05:30 - 06:00 where there are probably multiple modes - not necessarily by the exact numbers of counts, but certainly by the shape. What you can see is this graph shows a substantial shift, so that people are waiting longer to have their first child than they did back in 1980. So, we can see these really interesting data trends from the
06:00 - 06:30 distribution of scores that are presented by these histograms. This tells a really interesting story. The mean is our next statistic, our next measure of central tendency. It's the balance point in a data set. So here's the little data set that I made up, that I showed you before, and its mean is 3.9. You can think about that mean as a place where, if we were to take the blocks just as they're balanced on this number line here, the balance point would
06:30 - 07:00 be 3.9. This is where the blocks weigh the same amount on one side as they do on the other side. How do we find the mean in the sample? Well, the mean is oftentimes represented by the name of a variable. In this case the variable is called X and it has a little bar sitting right over the top of it. We call this 'x-bar'. So it's the mean of the variable X and it's presented by
07:00 - 07:30 an equation that looks terrible but really isn't as bad as it looks. In plain English this is the sum of each of the numbers from the beginning of the data set (the first number) all the way to the last number in the data set. We add up all those numbers and then we take the count of the numbers (how many there are) and we divide the sum of the numbers by the count. So it looks terrible but it's not nearly as bad as it looks. The mean
07:30 - 08:00 of a variable is the sum of the data that it contains, divided by how many data points there are, and here's a good graphic. This is an infographic showing means of U.S. household debt across different generations of people. So the Silent Generation, the Baby Boomers, Gen X, Millennials, Gen Z and so forth. This shows you the increase between 2019 and 2020, or
08:00 - 08:30 decrease, depending on who you're looking at, of debt across these communities. So it's an interesting visualization of what's happening in the context of COVID for people who are born in different time eras. The infographic also tells you when these time eras are, just so that you can get a feel for that,
08:30 - 09:00 and how much increase in debt they have now relative to the beginning of the pandemic. So it's a really interesting infographic that tells us something about how unequally the pandemic has affected various individuals. The median is the center point of our data set. It's the 50th percentile and in order to find it
09:00 - 09:30 we put the numbers in order. If there's an odd number of data points we take the very middle value in the data set; if there's an even number of data points we take the mean of the two middle values. So what we do is, the numbers are in order and we work our way to the very middle of the data set, and it happens that, in this data set, the median is three. Now, if we add in a value, let's say we added a 4 to this, that number would cause our mean to shift ever so slightly and it would cause our median to shift to 3.5 because,
09:30 - 10:00 when there's an even number of data points (originally this data set had an odd number), we take the mean of the two middle values. The median is often displayed using what we call a box plot. A box plot is a very fancy plot; I'm not going to walk you through all the elements here, but you probably should know what they are.
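The odd/even rule for the median can be sketched in a few lines of Python. The data set below is an assumption: the slide data are not in the transcript, so these values were chosen only because they reproduce the numbers quoted in the lecture (mean of about 3.9, median of 3, and a median of 3.5 after a 4 is added). The function name `median_of` is likewise made up for this example.

```python
def median_of(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)          # step 1: put the numbers in order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                    # odd count: take the very middle value
        return ordered[mid]
    # even count: take the mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data set with an odd number of values (11).
data = [1, 1, 2, 2, 3, 3, 6, 6, 6, 6, 7]
# Adding a 4 makes the count even (12).
with_four = data + [4]

print(median_of(data), round(sum(data) / len(data), 2))
print(median_of(with_four), round(sum(with_four) / len(with_four), 2))
```

With these assumed values the median moves from 3 to 3.5 while the mean barely changes, which is the behaviour described above.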
10:00 - 10:30 A box plot is a plot that has a box around the central part of the distribution of the data. The box usually runs from the 25th percentile at the bottom, the median is the 50th percentile there, and the top of the box is the 75th percentile. So the median is usually displayed as a horizontal line in the middle of the box plot if the box plot is oriented this way. Then you can see, you know, what we
10:30 - 11:00 would expect for the lower values of the data, what we would expect for the upper values of the data, and then whether there are outliers as well. So a box plot is how we typically represent the median. And there's another plot that we often also use that you will see a lot in this class called the violin plot. It's another plot that also shows us the median, but it shows us something else. It shows us a fancy thing called the 'kernel density estimate'. This is called the violin plot because
11:00 - 11:30 it actually is sort of a bit wiggly - it has wiggly outsides, instead of these nice, square tidy boxplot bars here. But what this shows is how the data are distributed across these different bands so it can provide a bit more information than what you might anticipate from a simple box plot. The median is this dot right here. It's this central value in the interquartile range. So this
11:30 - 12:00 box here marks out exactly the same space that this box here marks, with the median at the same point, but what it's doing here is it's showing you where and how the data are distributed. We see these often in research as well. Here's another version of these same plots kind of together, so here's the box plot, which you can obviously see - you can see there are some outliers in this box plot. Here's the kernel density estimate, based on
12:00 - 12:30 a histogram. A kind of smooth histogram is one way to think about the kernel density estimate and it shows you how the data are distributed across the graph, from the lower bound of the data set to the upper bound of the data set. And so you can see where the median is on these plots and you can see how the data are distributed so these are all plots of the median.
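To make the "smooth histogram" idea concrete, here is a minimal sketch of a kernel density estimate in plain Python. It assumes a Gaussian kernel and a hand-picked bandwidth of 0.8; the lecture does not say which kernel or bandwidth the plotted figures use, and the data set is hypothetical. A violin plot draws a curve like this one mirrored on both sides of a central axis.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density of `data` at a point x."""
    n = len(data)
    # Normalising constant so that the estimated density integrates to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each data point contributes a small Gaussian bump centred on itself.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# Hypothetical data set and bandwidth, chosen only for illustration.
data = [1, 1, 2, 2, 3, 3, 6, 6, 6, 6, 7]
density = gaussian_kde(data, bandwidth=0.8)

# Crude text rendering: longer bars mean more data near that value.
for x in range(0, 9):
    print(f"{x} | {'#' * round(density(x) * 100)}")
```

A smaller bandwidth makes the curve wigglier and closer to the raw histogram; a larger one smooths more detail away.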
12:30 - 13:00 Here's an interesting GIF that shows how much more informative violin plots are than box plots. Specifically, each one of these bars in the raw data, no matter how they're arranged here, produces box plots that look like this, and the violin plots themselves look very different depending on how the data are distributed. So when the data distribution changes, the
13:00 - 13:30 violin plot really changes shape, even though the box plot is unaffected by that. And we often see these in scientific or research presentations. This is a very nicely formatted one, looking at social preferences in infants and showing how these data are distributed. This one's interesting as well because what the researchers have done here is they've drawn the individual data points right inside of the plots. These are nice violin plots that are rather descriptive.
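The point made with the GIF, that differently shaped data can produce the same box plot, can be checked numerically. The sketch below uses two made-up data sets and Python's `statistics.quantiles` with the "inclusive" method; other quartile conventions exist and may give slightly different box edges. The helper name `five_number_summary` is an assumption for this example.

```python
from collections import Counter
from statistics import quantiles


def five_number_summary(values):
    """Return (min, 25th percentile, median, 75th percentile, max)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


# Two hypothetical data sets with very different shapes.
evenly_spread = [0, 1, 2, 3, 4, 5, 6, 7, 8]   # flat: every value occurs once
two_clusters = [0, 2, 2, 2, 4, 6, 6, 6, 8]    # bimodal: peaks at 2 and 6

# The box plot ingredients are identical for both data sets...
print(five_number_summary(evenly_spread))
print(five_number_summary(two_clusters))

# ...but the frequency tables, and therefore the violin shapes, differ.
print(sorted(Counter(evenly_spread).items()))
print(sorted(Counter(two_clusters).items()))
```

A box plot drawn from either data set would have the same box, median line, and whisker ends, while a violin plot would show one wide band for the first and two bulges for the second.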
13:30 - 14:00 The next section we'll talk about is dispersion and we'll talk about that in the final element of this week's lecture.