Behind the Bars: Data Visualization Demystified
DescriptiveStats2
Estimated read time: 1:20
Summary
In this captivating and enlightening video, Erin Heerey delves into the often overlooked but crucial step of data visualization in statistical analysis. The video opens by introducing the Datasaurus dataset, a series of visually distinct plots that share identical statistical properties like means and standard deviations. Erin explains the pitfalls of misleading graphs using examples like an erroneous pizza topping pie chart and distorted sales comparisons between McDonald's and Burger King. She emphasizes the importance of clarity and honesty in visual representation, referencing data visualization expert Edward Tufte's principles of minimizing 'chart junk' and optimizing the 'data to ink ratio.' Through eye-opening examples, Erin makes a compelling case for why visualizing data is indispensable for understanding complex datasets and avoiding bias in interpretation.
Highlights
- The Datasaurus example illustrates how graphs with identical statistics can look dramatically different. 📉
- Erin criticizes a misleading pizza topping pie chart that skews proportional data. 🍄
- Erin highlights a faulty graph comparing McDonald's and Burger King sales, noting its deceptive scaling. 🍔
- She covers a Fox News graph on COVID-19 whose unevenly spaced axes distorted timelines and case numbers, misleading the public. 🦠
- She introduces Edward Tufte's concepts of 'chart junk' and 'data-to-ink ratio' to design better graphs. 📐
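The axis problem in the Fox News graph can be checked with a few lines of arithmetic. The sketch below is only an illustration: the tick values are the ones read aloud in the transcript, the year 2020 is assumed for all dates, and the helper names `gaps` and `is_evenly_spaced` are hypothetical.

```python
from datetime import date

# Tick labels as read out in the video (year 2020 assumed for every date).
x_ticks = [date(2020, 2, 21), date(2020, 3, 1), date(2020, 3, 10),
           date(2020, 3, 15), date(2020, 3, 20), date(2020, 3, 24)]
y_ticks = [0, 5000, 20000, 35000, 50000, 65000]

def gaps(ticks):
    """Differences between consecutive tick values."""
    return [b - a for a, b in zip(ticks, ticks[1:])]

def is_evenly_spaced(ticks):
    """True when every consecutive pair of ticks is the same distance apart."""
    return len(set(gaps(ticks))) <= 1

day_gaps = [g.days for g in gaps(x_ticks)]
case_gaps = gaps(y_ticks)

print("days between x ticks:", day_gaps)
print("cases between y ticks:", case_gaps)
print("x evenly spaced?", is_evenly_spaced(x_ticks))
print("y evenly spaced?", is_evenly_spaced(y_ticks))
```

If the ticks are drawn at equal distances on the page while the gaps between their values differ, the slope of the plotted line is distorted, which is the complaint made in the video.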
Key Takeaways
- Visualize before you analyze! Always look at your data graphically first. 📊
- The Datasaurus dataset shows visually different plots can have identical stats. Mind-blowing! 😲
- Misleading graphs are everywhere. Check out that tricky pizza topping pie chart example! 🍕
- A good graph is simple and organized. Remember Edward Tufte's advice on avoiding 'chart junk.' 🗑️
- Data-to-ink ratio is key! More ink doesn't mean better graphs. 🎨
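The pizza topping takeaway above can be made concrete with a quick check. This is a rough sketch using only the four percentages quoted in the transcript; the function name `pie_slices` is hypothetical, and rescaling multi-select answers into a pie is just one possible repair (a plain bar chart of the raw percentages may well be the better choice).

```python
# Shares quoted in the video; respondents could pick several toppings.
liked = {"pepperoni": 56, "mushrooms": 65, "peppers": 60, "sweetcorn": 42}

def pie_slices(shares):
    """Slice angle in degrees for each category, proportional to its share."""
    total = sum(shares.values())
    return {name: 360 * value / total for name, value in shares.items()}

total = sum(liked.values())
slices = pie_slices(liked)

print(f"Shares add up to {total}%, so they cannot be parts of one whole.")
for name, angle in sorted(slices.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {angle:6.1f} degrees")
```

After rescaling, each slice shows a topping's share of all selections rather than the share of people who chose it, so the original percentages should still be printed alongside the chart.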
Overview
When you begin analyzing a dataset, the first step is to visualize it. This isn't just about making it look pretty; graphs can reveal patterns and relationships hidden in the numbers. Erin Heerey introduces us to the "Datasaurus"—a brilliant demonstration that shows visually distinct graphs can have the same statistical properties. It's a fantastic reminder of the crucial role visualization plays in proper data interpretation.
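To see how differently shaped point clouds can share the same summary statistics, here is a minimal sketch. It is not how the Datasaurus dataset was built: it simply takes two shapes (a circle and an X) and rescales each so that the means, standard deviations, and correlation match exactly. The correlation of -0.06 comes from the video; the target means and standard deviations, and the names `summary` and `force_stats`, are assumptions made for illustration.

```python
import math

def summary(xs, ys):
    """Return (mean_x, mean_y, sd_x, sd_y, pearson_r) using population SDs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)
    return mx, my, sx, sy, r

def force_stats(xs, ys, mean_x, mean_y, sd_x, sd_y, r):
    """Linearly transform a point cloud so it has exactly the given statistics."""
    n = len(xs)
    mx, my, sx, sy, r0 = summary(xs, ys)
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    # Part of y that is uncorrelated with x, rescaled to unit SD.
    resid = [b - r0 * a for a, b in zip(zx, zy)]
    s_res = math.sqrt(sum(v * v for v in resid) / n)
    z_res = [v / s_res for v in resid]
    # Mix x and the residual so the new y has SD 1 and correlation r with x.
    new_zy = [r * a + math.sqrt(1 - r * r) * b for a, b in zip(zx, z_res)]
    return ([mean_x + sd_x * a for a in zx],
            [mean_y + sd_y * b for b in new_zy])

# Hypothetical targets: means 50, SDs 20; r = -0.06 as quoted in the video.
TARGET = (50.0, 50.0, 20.0, 20.0, -0.06)

angles = [2 * math.pi * k / 60 for k in range(60)]
circle = ([math.cos(a) for a in angles], [math.sin(a) for a in angles])
ts = [k / 15 - 1 for k in range(31)]
cross = (ts + ts, ts + [-t for t in ts])  # two diagonals forming an X

circle_fit = force_stats(*circle, *TARGET)
cross_fit = force_stats(*cross, *TARGET)
circle_stats = summary(*circle_fit)
cross_stats = summary(*cross_fit)

print("circle:", [round(v, 4) for v in circle_stats])
print("cross: ", [round(v, 4) for v in cross_stats])
```

Both lines print the same five numbers even though one cloud is a ring and the other is an X, which is why the summary table alone cannot tell you what the data look like.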
Erin highlights several real-world examples of misleading graphs, cautioning viewers about the pitfalls of poor data visualization. Remember that colorful pizza topping pie chart? It is eye-catching, but the area given to each topping is not proportional to the share of people who liked it, and the percentages add up to well over 100% because respondents could pick several toppings. Similarly, a graph comparing McDonald's and Burger King's sales scales the company logos in both width and height, making McDonald's look far larger relative to Burger King than the real figures suggest.
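The size of the distortion in the logo graph can be estimated from the sales figures quoted in the video. The sketch below assumes each logo's height tracks sales and its width is scaled by the same factor, so its area grows with the square of sales. The final ratio is a simplified stand-in in the spirit of Tufte's 'lie factor', not his exact formula.

```python
# Sales figures quoted in the video, in billions of dollars.
mcdonalds_sales = 41.0
burger_king_sales = 11.3

# How much larger McDonald's is in the data.
data_ratio = mcdonalds_sales / burger_king_sales

# Assumption: logo height follows sales and width is scaled by the same
# factor, so the visible area grows with the square of the sales ratio.
area_ratio = data_ratio ** 2

# Simplified distortion measure: apparent size difference / real difference.
distortion = area_ratio / data_ratio

print(f"McDonald's sales are about {data_ratio:.1f}x Burger King's")
print(f"but its logo covers about {area_ratio:.1f}x the area")
print(f"so the picture overstates the gap by a factor of about {distortion:.1f}")
```

Under this assumption the Burger King logo has less than an eighth of the area of the McDonald's logo, which is consistent with the description in the video, even though its sales are roughly a quarter of McDonald's.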
Finally, Erin shares advice from data visualization expert Edward Tufte, who urges graph creators to minimize 'chart junk' and maximize their 'data-to-ink ratio.' Simple, clear, and truthful representations lead to better understanding and analysis. The video closes with a transition that promises further exploration into measures of central tendency in future content, sparking curiosity about what's next on this data journey.
Chapters
- 00:00 - 00:30: Introduction to Visualizing Data The chapter emphasizes the importance of visualizing data when analyzing a dataset. It highlights the use of graphs for this purpose. An example discussed is the 'datasaurus', a set of graphs that, despite having the same means and standard deviations, visually represent data in strikingly different ways.
- 00:30 - 01:30: Correlation in Data Sets This chapter explores the concept of correlation in data sets, focusing on the relationship between two variables, X and Y. The transcript provides an explanation of how each data point is represented in terms of X and Y coordinates on a plot. Notably, the chapter highlights that X and Y have the same mean and standard deviation across all of the data sets, emphasizing the importance of understanding the arrangement of data points to interpret their correlation effectively.
- 01:30 - 02:30: Importance of Data Visualization The chapter discusses the importance of data visualization, especially in understanding relationships in data sets. It starts by examining a plot that appears to show no significant relationship between data points. However, it is revealed that all the data sets being discussed actually exhibit a correlation between X and Y, albeit a small negative one, highlighting how visualization can clarify data relationships that might not be immediately obvious from statistics alone.
- 02:30 - 04:00: Good vs. Bad Graphs The chapter 'Good vs. Bad Graphs' explores the visualization of data through different types of graphs. It demonstrates how the appearance of data can vary significantly even with similar statistical correlations. The transcript describes various graph configurations including clouds of points, aligned points forming lines (horizontal, vertical, and diagonal), an X-shaped plot, tightly clustered data in grid patterns, star shapes, and ellipses. The chapter emphasizes understanding these visual differences to distinguish between effective and ineffective data representations.
- 04:00 - 06:30: Examples of Misleading Graphs This chapter discusses the importance of examining graphs critically to avoid being misled by them. It uses the example of the 'datasaurus', a dot-to-dot picture resembling a Tyrannosaurus Rex, to illustrate how data can form misleading shapes or patterns, emphasizing the need to thoroughly analyze data representations.
- 15:00 - 15:30: Conclusion: Measures of Central Tendency The chapter discusses the importance of data visualization to identify patterns. It highlights that despite correlations, patterns may not always be evident without visual aids. The example provided shows multiple data sets with a correlation of -0.06, but only one conforms to commonly seen data structures in Psychology, emphasizing the uniqueness of each dataset's visual appearance.
DescriptiveStats2 Transcription
- 00:00 - 00:30 All right, so when we are starting to look at a data set, the first thing we need to do is LOOK at the data. You need to visualize it (by visualizing it, I mean make graphs). This is a famous graph called the datasaurus, and the interesting thing about each one of these little plots that you see in this picture here is that they all have the same means and standard deviations. So in each of these plots that you
- 00:30 - 01:00 see, you see a relationship between a variable called X, which is down here on the x-axis for each of the plots, and a variable called Y, which is up here on the y-axis for each one of those plots. So each point in each one of these data sets has one value for x and one value for y, and the locations of the dots in the pictures tell us where, in X Y coordinates, that dot is located. Now the X variable and the Y variable both have exactly the same mean and standard deviation
- 01:00 - 01:30 across all of these data sets, and this plot here, the one in the upper left corner, looks very similar to a data set that really doesn't show much of a relationship between the points. Now interestingly all of these data sets have a correlation between X and Y, a relationship between the data points. There is a very small negative correlation - a negative
- 01:30 - 02:00 correlation of -.06. Now, what you can see from these images is that every single one of these images looks very very different. There is this kind of cloud of points. Here we have points that are aligned in some lines, both horizontal and vertical and diagonal. We have a plot that looks like an X. We have some data points that are very tightly clustered into little tiny groups in a grid pattern. We have a star, we have some ellipses
- 02:00 - 02:30 and we even have the datasaurus up here. This is a little picture that looks like a Tyrannosaurus Rex if you were going to give a Tyrannosaurus Rex dot-to-dot picture for a kid to draw. Now in order to understand your data what you can see from these pictures is that you really need to look at them. You need to view the graphs of your data.
- 02:30 - 03:00 Without a visualization of your data it's really difficult to see the patterns that exist. The correlation may or may not tell you what patterns exist in your data so all of these correlations are -.06, a tiny negative correlation, and yet only one of them looks like the standard data we see in Psychology: this one in the upper left corner. All the rest of these look really really unusual and so that's why
- 03:00 - 03:30 it's important to plot your data, because you don't know what you have until you've actually seen a picture of the data - a picture that is worth a thousand words. But of course not all graphs give us the appropriate story. One of the best ways to understand data is to graph them and graphs provide us with a broad overview of the data especially of the data distribution. They give us information about patterns that occur in the data and they also allow us to have a really quick and intuitive understanding of what it is that the data look like.
- 03:30 - 04:00 However, there are good graphs and there are bad graphs. Good graphs simplify and organize the data clearly and, this is according to Edward Tufte, who is the guru of data visualizations, and he says that graphs should be free of what he calls 'chart junk'. They should be low in 'lie factor' and they should be high in 'data to ink ratio'. Let me show you what these things are.
- 04:00 - 04:30 In this example these are two pictures that have a lot of chart junk. The one on the left is a poll from YouGov in the UK, and it was a poll of people; I think they did this polling outside of a tube station in London, and they were asking what people's most liked pizza topping was. They told people they could select as many of
- 04:30 - 05:00 the toppings as they wanted, which of course makes the data very difficult to interpret. Now these pizza pies, these slices of pizza here, this is a very eye-catching graph. What you'll see are these evenly sliced pieces of pizza, each of which contains different kinds of toppings, so there's a pepperoni and cheese one, there's mushrooms, there's ham and pineapple, and so forth. It looks beautiful except it's really misleading. And here's why:
- 05:00 - 05:30 let's take a topping like pepperoni. Lots and lots of people like this - it's my kids' most favorite pizza topping - and 56% of people say they like pepperoni pizza, so we have this great big huge piece of pepperoni. Now interestingly, 65 percent of people,
- 05:30 - 06:00 substantially more than the people who say they like pepperoni on pizza, also say they like mushrooms. But look at the picture of the mushroom here. It's much smaller than the picture of this pepperoni, and even though this piece of pizza here has three mushrooms on it, it probably takes up less surface area, or probably about the same surface area, as this piece of pepperoni. Peppers are 60%, sweet corn is 42%. What you can see is that the surface area that these toppings take up is not proportional to how many people said they liked them. What would
- 06:00 - 06:30 be better is to slice the pie into how many things people liked best on their pizza, and have different pieces of pizza that were sliced according to the proportion of the pie that these things should have. You'll also notice that these numbers here, if you add them up, make up substantially more than 100 percent, which they probably shouldn't in a pie chart. So this is a graph that's pretty uninterpretable because it is full of chart junk. Here's another chart junk
- 06:30 - 07:00 picture. This graph I particularly like - it's eye-catching, the color scheme is really nice... This is about commuting to work in a number of different cities: Chicago, Los Angeles, New York City, Atlanta, San Francisco, Houston, Washington, and Seattle. These are all American cities. The bright red bars show what proportion of people drive themselves to work, then there are dark red bars for the proportion of people who carpool, then there's, in this lighter blue color,
- 07:00 - 07:30 it's people who take public transit, in dark blue people who walk to work, and then the last one includes people who work at home, people who bike, and so forth. They probably should have mixed the bikers in with the people who walk to work but oh well. Why is this full of chart junk? Well, one of the reasons is that the cities themselves, their abbreviations, are where we're getting the statistics, and it becomes really difficult
- 07:30 - 08:00 to compare across cities if they're not next to one another. If you think about Los Angeles and Houston Texas, both of these cities actually have a fairly large portion of people who drive in private cars, but we can't actually compare them to one another without lots of rulers and little devices because they're not sitting next to one another. It becomes
- 08:00 - 08:30 very difficult to compare across the cities. The axis is difficult to read. You actually have to do the counting if you want to know. This is presumably a zero, and this presumably is a hundred percent. So you actually have to count through these lines to figure out the relative proportions. Even though I like the look of this graph, it would be much more straightforward to display the data in such a way that they were comparable across these cities. They are simply not done this way - again, eye-catching graphic, but probably not a great way of showing the data because there's a lot of elements here that are
- 08:30 - 09:00 being represented that really don't need to be represented, and in fact, that are obfuscating. They're not lying but they're obfuscating the real data and the relationships between them. Here are charts that are actually lying by distorting the data. Let's look at the one on the left for starters. This is a graphic that was produced a number of years ago that compares
- 09:00 - 09:30 multinational corporations in terms of sales. Now let's look at Burger King versus McDonald's. McDonald's, of course, is king! The golden arches, everybody knows McDonald's - we've probably all eaten there more times than is a good idea. And McDonald's, the way this graph is scaled, what we should be reading are the heights of the bars, so McDonald's has 41 billion in sales; you
- 09:30 - 10:00 can see the top of the M here comes just above 40 to somewhere in the neighborhood of 41 billion. We know that this axis is billions of dollars because this little bar here tells us that. Okay that's all fine and good, but let's look at Burger King. Burger King takes in 11.3 billion in sales so that's approximately a quarter of McDonald's take - not quite but close, and it's less than
- 10:00 - 10:30 an eighth of the size; its logo is less than an eighth of the size of the McDonald's logo. And that's really problematic, it's really misleading, because it suggests that Burger King is basically an eighth of McDonald's, when in fact it's really a quarter of McDonald's. The reason they did it this way is because if they scaled it only in this direction and not in this direction you'd have a really really long skinny M and
- 10:30 - 11:00 the rest of these logos will be unreadable. You would have a hard time telling what they were. So they scaled it in both the width and the height directions and that was problematic because the simple surface area calculation makes it look misleading. It makes McDonald's look like the king but like a substantially larger King than it should be. Here's another graph. This is produced by Fox News and it was produced early on in the pandemic. In fact, this is a graph from
- 11:00 - 11:30 actually just within the first month of the pandemic. Here within the first month or so, as you'll remember, Covid was declared a pandemic on March the 12th of 2020, and so you can see what they're plotting here. This is according to Johns Hopkins, which is a med school that has a great public epidemiology program. So, they were plotting covid-19 cases in the U.S. in various cities. Now let's look at this graph in some detail. This slope looks like the bunny hill,
- 11:30 - 12:00 right? This is a pretty shallow ramp. So here is the 21st of February here's the first of March so that's less than 10 days. Here's the 10th of March, now we move on to the 15th of March so the difference between 10 and 15 well that's only five days and then there's 15 to 20 that's only
- 12:00 - 12:30 five days and then 20 to 24, that's four days, so the scaling on the bottom of this graph is not accurate. Each one of these should be the same number of days in terms of increments and it's not. That causes this slope to look substantially shallower than it actually is. They do the same thing on the y-axis here, right? Here's zero cases, here's 5,000 cases, and look at
- 12:30 - 13:00 this jump to 20,000 cases, then there's a jump to 35,000 and then there's a jump to 50,000 and then there's a jump to 65,000. So this part of the graph is substantially compressed: from zero to five is a substantially different number than from 5 to 20, 20 to 35, 35 to 50, to 65 and so forth. So this graph should not have a zero as its bottom indicator, should it? It should have five as
- 13:00 - 13:30 its bottom indicator, and so the number of cases in the U.S. looks like it's ramping up very slowly, but in fact this ramp was extremely steep; it looked much more like the Olympic ski jump than the bunny hill in practice. So graphs like this really distort the way these data look and I'm sure that graphs like this really misled a lot of people in terms
- 13:30 - 14:00 of the degree to which covid was a significant and problematic illness. This is probably one of the reasons that it took the US so long to mount a reasonable response to that illness. And then we have the data-to-ink ratio. This is another element of graph design - the portion of a graphic's ink devoted to the display of data. It's about how much of that graphic could be erased without affecting interpretability. If you look at these, these are two graphs that show exactly the
- 14:00 - 14:30 same data. This is one that has a low data-to-ink ratio. You can see there's sort of a lot of fancy bits, right? They have these graded bars, they have lots of grid lines drawn on the graph, they have categories, they have orders, they have all the decimals, they have these little icons that represent the type of thing being purchased or ordered from this company. Here,
- 14:30 - 15:00 here's a much cleaner version of that graph. Office supplies, Furniture, Technology: three categories of things that get ordered from a particular company. It gives the exact numbers right here so there's no need then for this fancy axis with lots of decimal places; and there's the graph itself. The colors are scaled according to how much more of one thing is being purchased than the next thing, so these two are very similar in color because they are
- 15:00 - 15:30 similar numbers of orders, and then you see that office supplies which are consumables are ordered at a substantially higher proportion. So this is a much easier graph to read, it's much more intuitive. All the junk has been removed so you can just see the patterns much more clearly. All right, we'll jump into measures of central tendency now, and we'll do that in the next video.