Behind the Bars: Data Visualization Demystified

DescriptiveStats2

    Summary

    In this captivating and enlightening video, Erin Heerey delves into the often overlooked but crucial step of data visualization in statistical analysis. The video opens by introducing the Datasaurus dataset, a series of visually distinct plots that share identical statistical properties like means and standard deviations. Erin explains the pitfalls of misleading graphs using examples like an erroneous pizza topping pie chart and distorted sales comparisons between McDonald's and Burger King. She emphasizes the importance of clarity and honesty in visual representation, referencing data visualization expert Edward Tufte's principles of minimizing 'chart junk' and optimizing the 'data to ink ratio.' Through eye-opening examples, Erin makes a compelling case for why visualizing data is indispensable for understanding complex datasets and avoiding bias in interpretation.

      Highlights

      • The Datasaurus example illustrates how graphs with identical statistics can look dramatically different. 📉
      • Erin criticizes a misleading pizza topping pie chart that skews proportional data. 🍄
      • Erin highlights a faulty graph comparing McDonald's and Burger King sales, noting its deceptive scaling. 🍔
      • She covers a Fox News graph on COVID-19 whose unevenly spaced axes made case growth look slower than it was. 🦠
      • She introduces Edward Tufte's concepts of 'chart junk' and 'data-to-ink ratio' to design better graphs. 📐

      Key Takeaways

      • Visualize before you analyze! Always look at your data graphically first. 📊
      • The Datasaurus dataset shows visually different plots can have identical stats. Mind-blowing! 😲
      • Misleading graphs are everywhere. Check out that tricky pizza topping pie chart example! 🍕
      • A good graph is simple and organized. Remember Edward Tufte's advice on avoiding 'chart junk.' 🗑️
      • Data-to-ink ratio is key! More ink doesn't mean better graphs. 🎨

      Overview

      When you begin analyzing a dataset, the first step is to visualize it. This isn't just about making it look pretty; graphs can reveal patterns and relationships hidden in the numbers. Erin Heerey introduces us to the "Datasaurus"—a brilliant demonstration that shows visually distinct graphs can have the same statistical properties. It's a fantastic reminder of the crucial role visualization plays in proper data interpretation.
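The point about identical statistics can be reproduced with a small sketch. It does not use the actual Datasaurus data. Instead it takes two made-up shapes (a circle and an X) and rescales each so that both end up with the same means, standard deviations, and correlation. The target means and standard deviations (50 and 10) are arbitrary assumptions chosen for illustration; only the correlation of -0.06 comes from the video. The helper names are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Population standard deviation."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (sd(x) * sd(y))

def standardize(values):
    m, s = mean(values), sd(values)
    return [(v - m) / s for v in values]

def force_stats(x, y, mean_x, sd_x, mean_y, sd_y, r):
    """Rescale a point cloud so it has exactly the requested summary statistics."""
    zx, zy = standardize(x), standardize(y)
    slope = corr(zx, zy)
    # The part of y that is uncorrelated with x, rescaled to unit variance.
    resid = standardize([b - slope * a for a, b in zip(zx, zy)])
    # Mix x and the residual so the correlation with x is exactly r.
    new_zy = [r * a + math.sqrt(1 - r * r) * e for a, e in zip(zx, resid)]
    return ([mean_x + sd_x * a for a in zx],
            [mean_y + sd_y * b for b in new_zy])

# Two very different shapes: a circle and an X.
n = 60
circle_x = [math.cos(2 * math.pi * i / n) for i in range(n)]
circle_y = [math.sin(2 * math.pi * i / n) for i in range(n)]
ts = [-1 + 2 * i / 29 for i in range(30)]
cross_x = ts + ts
cross_y = ts + [-t for t in ts]

# Arbitrary targets, except r = -0.06, which is the value quoted in the video.
TARGET = dict(mean_x=50.0, sd_x=10.0, mean_y=50.0, sd_y=10.0, r=-0.06)
circle = force_stats(circle_x, circle_y, **TARGET)
cross = force_stats(cross_x, cross_y, **TARGET)

for name, (xs, ys) in {"circle": circle, "cross": cross}.items():
    print(name, round(mean(xs), 2), round(sd(xs), 2),
          round(mean(ys), 2), round(sd(ys), 2), round(corr(xs, ys), 2))
```

Both rows report the same rounded statistics, yet a scatter plot of the two point sets would look nothing alike, which is the reason to plot before analyzing.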

        Erin highlights several real-world examples of misleading graphs, cautioning viewers about the pitfalls of poor data visualization. Remember that colorful pizza topping pie chart? It looks appealing, but its evenly cut slices bear no relation to how many people chose each topping, and the percentages add up to well over 100. Similarly, a graph comparing McDonald's and Burger King's sales scales each logo in both width and height, making McDonald's lead look far larger than the sales figures justify.
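The arithmetic behind both examples is short enough to check directly. The sketch below uses only figures quoted in the video's transcript: the two sales totals and the four topping percentages Erin reads aloud. It assumes, as she describes, that the logos were scaled by the same factor in width and height.

```python
# Sales figures quoted in the video, in billions of dollars.
mcdonalds = 41.0
burger_king = 11.3

# What the data say: Burger King's sales as a fraction of McDonald's.
value_ratio = burger_king / mcdonalds

# Scaling a logo by value_ratio in both width and height
# multiplies its area by value_ratio squared.
area_ratio = value_ratio ** 2

print(f"Sales ratio:     {value_ratio:.3f}")
print(f"Logo area ratio: {area_ratio:.3f}")
print(f"Area understates Burger King by a factor of {value_ratio / area_ratio:.2f}")

# Pizza poll percentages mentioned in the video.
# Respondents could select as many toppings as they wanted.
toppings = {"pepperoni": 56, "mushrooms": 65, "peppers": 60, "sweet corn": 42}
total = sum(toppings.values())
print(f"Sum of just these four toppings: {total}%")
```

The sales ratio is a little over a quarter, while the area ratio falls below an eighth, matching Erin's description. The topping percentages pass 100 after only four toppings, so they cannot be shares of a single pie.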

          Finally, Erin shares advice from data visualization expert Edward Tufte, who urges graph creators to minimize 'chart junk', keep the 'lie factor' low, and maximize the 'data-to-ink ratio.' Simple, clear, and truthful representations lead to better understanding and analysis. The video closes by announcing that the next video covers measures of central tendency.
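For reference, the data-to-ink ratio can be written as a simple fraction. This follows the description given in the video (the portion of a graphic's ink devoted to displaying data, or equivalently how much could be erased without hurting interpretability). The notation is a sketch, not a quotation:

```latex
\[
\text{data-ink ratio}
  = \frac{\text{ink used to display data}}{\text{total ink used in the graphic}}
  = 1 - \text{proportion of the graphic that can be erased without losing information}
\]
```

A ratio close to 1 means almost every mark on the page carries data. Gradient fills, dense grid lines, and decorative icons push it toward 0.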

            Chapters

            • 00:00 - 00:30: Introduction to Visualizing Data The chapter emphasizes the importance of visualizing data when analyzing a dataset. It highlights the use of graphs for this purpose. An example discussed is the 'datasaurus', a set of graphs that, despite having the same means and standard deviations, visually represent data in strikingly different ways.
            • 00:30 - 01:30: Correlation in Data Sets This chapter explains what each plot shows: every point has one X value and one Y value, and its position gives its X-Y coordinates. X and Y have exactly the same means and standard deviations across all of the data sets, and every data set also has the same small negative correlation between X and Y, so the summary statistics alone cannot tell the plots apart.
            • 01:30 - 02:30: Importance of Data Visualization The chapter discusses the importance of data visualization, especially in understanding relationships in data sets. It starts by examining a plot that appears to show no significant relationship between data points. However, it is revealed that all the data sets being discussed actually exhibit a correlation between X and Y, albeit a small negative one, highlighting how visualization can clarify data relationships that might not be immediately obvious from statistics alone.
            • 02:30 - 04:00: Good vs. Bad Graphs The chapter argues that summary statistics alone can hide patterns: every data set shown has a correlation of -0.06, yet only one looks like typical Psychology data. It then introduces Edward Tufte's criteria for good graphs: free of 'chart junk', low in 'lie factor', and high in 'data-to-ink ratio'.
            • 04:00 - 06:30: Examples of Misleading Graphs This chapter examines a YouGov pizza topping graphic as an example of chart junk. The slices are evenly sized, so their area does not reflect how many people chose each topping, and because respondents could pick several toppings, the percentages add up to well over 100 percent.
            • 15:00 - 15:30: Conclusion: Measures of Central Tendency The chapter finishes the data-to-ink comparison: in the cleaner graph, office supplies stand out as ordered at a substantially higher rate than the other two categories, and removing the junk makes the pattern easier to see. It closes by announcing that the next video covers measures of central tendency.

            DescriptiveStats2 Transcription

            • 00:00 - 00:30 All right, so when we are starting to look at a data set, the first thing we need to do is LOOK at the data. You need to visualize it (by visualizing it, I mean make graphs). This is a famous graph called the datasaurus, and the interesting thing about each one of these little plots that you see in this picture here is that they all have the same means and standard deviations. So in each of these plots that you
            • 00:30 - 01:00 see, you see a relationship between a variable called X, which is down here on the x-axis for each of the plots, and a variable called Y, which is up here on the y-axis for each one of those plots. So each point in each one of these data sets has one value for X and one value for Y, and the locations of the dots in the pictures tell us where, in X-Y coordinates, that dot is located. Now, the X variable and the Y variable both have exactly the same mean and standard deviation
            • 01:00 - 01:30 across all of these data sets, and this plot here, the one in the upper left corner, looks very similar to a data set that really doesn't show much of a relationship between the points. Now, interestingly, all of these data sets have a correlation between X and Y, a relationship between the data points. There is a very small negative correlation - a negative
            • 01:30 - 02:00 correlation of -.06. Now, what you can see from these images is that every single one of these images looks very, very different. There is this kind of cloud of points. Here we have points that are aligned in some lines, both horizontal and vertical and diagonal. We have a plot that looks like an X. We have some data points that are very tightly clustered into little tiny groups in a grid pattern. We have a star, we have some ellipses
            • 02:00 - 02:30 and we even have the datasaurus up here. This is a little picture that looks like a Tyrannosaurus Rex, if you were going to give a Tyrannosaurus Rex dot-to-dot picture for a kid to draw. Now, in order to understand your data, what you can see from these pictures is that you really need to look at them. You need to view the graphs of your data.
            • 02:30 - 03:00 Without a visualization of your data it's really difficult to see the patterns that exist. The correlation may or may not tell you what patterns exist in your data. All of these correlations are -.06, a tiny negative correlation, and yet only one of them looks like the standard data we see in Psychology: this one in the upper left corner. All the rest of these look really, really unusual, and so that's why
            • 03:00 - 03:30 it's important to plot your data, because you don't know what you have until you've actually seen a picture of the data - a picture that is worth a thousand words. But of course not all graphs give us the appropriate story. One of the best ways to understand data is to graph them, and graphs provide us with a broad overview of the data, especially of the data distribution. They give us information about patterns that occur in the data, and they also allow us to have a really quick and intuitive understanding of what it is that the data look like.
            • 03:30 - 04:00 However, there are good graphs and there are bad graphs. Good graphs simplify and organize the data clearly. This is according to Edward Tufte, who is the guru of data visualizations, and he says that graphs should be free of what he calls 'chart junk'. They should be low in 'lie factor' and they should be high in 'data-to-ink ratio'. Let me show you what these things are.
            • 04:00 - 04:30 In this example, these are two pictures that have a lot of chart junk. The one on the left is a poll from YouGov in the UK. It was a poll of people - I think they did this polling outside of a Tube station in London - and they were asking what people's most liked pizza topping was. They told people they could select as many of
            • 04:30 - 05:00 the toppings as they wanted, which of course makes the data very difficult to interpret. Now these pizza pies, these slices of pizza here, this is a very eye-catching graph. What you'll see are these evenly sliced pieces of pizza, each of which contains different kinds of toppings, so there's a pepperoni and cheese one, there's mushrooms, there's ham and pineapple, and so forth. It looks beautiful except it's really misleading. And here's why:
            • 05:00 - 05:30 let's take a topping like pepperoni. Lots and lots of people like this - it's my kids' most favorite pizza topping - and 56% of people say they like pepperoni pizza, so we have this great big huge piece of pepperoni. Now interestingly, 65 percent of people,
            • 05:30 - 06:00 substantially more than the people who say they like pepperoni on pizza, also say they like mushrooms. But look at the picture of the mushroom here. It's much smaller than the picture of this pepperoni, and even though this piece of pizza here has three mushrooms on it, it probably takes up almost less surface area, or probably about the same surface area, as this piece of pepperoni. Peppers are 60%, sweet corn is 42%. What you can see is that the surface area that these toppings take up is not proportional to how many people said they liked them. What would
            • 06:00 - 06:30 be better is to slice the pie into how many things people liked best on their pizza, and have different pieces of pizza that were sliced according to the proportion of the pie that these things should have. You'll also notice that these numbers here, if you add them up, make up substantially more than 100 percent, which they probably shouldn't in a pie chart. So this is a graph that's pretty uninterpretable because it is full of chart junk. Here's another chart junk
            • 06:30 - 07:00 picture. This graph I particularly like - it's eye-catching, the color scheme is really nice... This is about commuting to work in a number of different cities: Chicago, Los Angeles, New York City, Atlanta, San Francisco, Houston, Washington, and Seattle. These are all American cities. The bright red bars show what proportion of people drive themselves to work, then there's dark red bars, the proportion of people who carpool, then there's, in this lighter blue color,
            • 07:00 - 07:30 it's people who take public transit, in dark blue people who walk to work, and then the last one includes people who work at home, people who bike, and so forth. They probably should have mixed the bikers in with the people who walk to work, but oh well. Why is this full of chart junk? Well, one of the reasons is that we have the cities themselves, their abbreviations, are where we're getting the statistics, and it becomes really difficult
            • 07:30 - 08:00 to compare across cities if they're not next to one another. If you think about Los Angeles and Houston, Texas, both of these cities actually have a fairly large portion of people who drive in private cars, but we can't actually compare them to one another without lots of rulers and little devices because they're not sitting next to one another. It becomes
            • 08:00 - 08:30 very difficult to compare across the cities. The axis is difficult to read. You actually have to do the counting if you want to know. This is presumably a zero, and this presumably is a hundred percent. So you actually have to count through these lines to figure out the relative proportions. It would be much more straightforward, even though I like the look of this graph, to display the data in such a way that they were comparable across these cities. They are simply not done this way - again, eye-catching graphic, but probably not a great way of showing the data because there's a lot of elements here that are
            • 08:30 - 09:00 being represented that really don't need to be represented, and in fact, that are obfuscating. They're not lying but they're obfuscating the real data and the relationships between them. Here are charts that are actually lying by distorting the data. Let's look at the one on the left for starters. This is a graphic that was produced a number of years ago that compares
            • 09:00 - 09:30 multinational corporations in terms of sales. Now let's look at Burger King versus McDonald's. McDonald's, of course, is king! The golden arches, everybody knows McDonald's - we've probably all eaten there more times than is a good idea. And McDonald's, the way this graph is scaled, what we should be reading are the heights of the bars, so McDonald's has 41 billion in sales; you
            • 09:30 - 10:00 can see the top of the M here comes just above 40, to somewhere in the neighborhood of 41 billion. We know that this axis is billions of dollars because this little bar here tells us that. Okay, that's all fine and good, but let's look at Burger King. Burger King takes in 11.3 billion in sales, so that's approximately a quarter of McDonald's take - not quite but close, and it's less than
            • 10:00 - 10:30 an eighth of the size; its logo is less than an eighth of the size of the McDonald's logo. And that's really problematic, it's really misleading, because it suggests that Burger King is basically an eighth of McDonald's, when in fact it's really about a quarter of McDonald's. The reason they did it this way is because if they scaled it only in this direction and not in this direction, you'd have a really, really long skinny M and
            • 10:30 - 11:00 the rest of these logos would be unreadable. You would have a hard time telling what they were. So they scaled it in both the width and the height directions, and that was problematic because the simple surface area calculation makes it look misleading. It makes McDonald's look like the king, but like a substantially larger king than it should be. Here's another graph. This was produced by Fox News, and it was produced early on in the pandemic. In fact, this is a graph from
            • 11:00 - 11:30 actually just within the first month of the pandemic. Here, within the first month or so, as you'll remember, Covid was declared a pandemic on March the 12th of 2020. And so you can see what they're plotting here - this is according to Johns Hopkins, which is a med school that has a great public epidemiology program. So, they were plotting COVID-19 cases in the U.S. in various cities. Now let's look at this graph in some detail. This slope looks like the bunny hill,
            • 11:30 - 12:00 right? This is a pretty shallow ramp. So here is the 21st of February, here's the first of March, so that's less than 10 days. Here's the 10th of March; now we move on to the 15th of March, so the difference between 10 and 15, well, that's only five days, and then there's 15 to 20, that's only
            • 12:00 - 12:30 five days, and then 20 to 24, that's four days, so the scaling on the bottom of this graph is not accurate. Each one of these should be the same number of days in terms of increments, and it's not. That causes this slope to look substantially shallower than it actually is. They do the same thing on the y-axis here, right? Here's zero cases, here's 5,000 cases, and look at
            • 12:30 - 13:00 this jump to 20,000 cases, then there's a jump to 35,000, and then there's a jump to 50,000, and then there's a jump to 65,000. So this part of the graph is substantially compressed: from zero to five is a substantially different number than from 5 to 20, 20 to 35, 35 to 50, to 65, and so forth. So this graph should not have a zero as its bottom indicator, should it? It should have five as
            • 13:00 - 13:30 its bottom indicator, and so the number of cases in the U.S., it looks like it's ramping up very slowly, but in fact this ramp was extremely, extremely steep. It looked much more like the Olympic ski jump than the bunny hill in practice. So graphs like this really distort the way these data look, and I'm sure that graphs like this really misled a lot of people in terms
            • 13:30 - 14:00 of the degree to which Covid was a significant and problematic illness. This is probably one of the reasons that it took the US so long to mount a reasonable response to that illness. And then we have the data-to-ink ratio. This is another element of graph design - the portion of a graphic's ink devoted to the display of data. It's about how much of that graphic could be erased without affecting interpretability. If you look at these, these are two graphs that show exactly the
            • 14:00 - 14:30 same data. This is one that has a low data-to-ink ratio. You can see there's sort of a lot of fancy bits, right? They have these graded bars, they have lots of grid lines drawn on the graph, they have categories, they have orders, they have all the decimals, they have these little icons that represent the type of thing being purchased or ordered from this company. Here,
            • 14:30 - 15:00 here's a much cleaner version of that graph. Office supplies, Furniture, Technology: three categories of things that get ordered from a particular company. It gives the exact numbers right here, so there's no need then for this fancy axis with lots of decimal places; and there's the graph itself. The colors are scaled according to how much more of one thing is being purchased than the next thing, so these two are very similar in color because they are
            • 15:00 - 15:30 similar numbers of orders, and then you see that office supplies, which are consumables, are ordered at a substantially higher proportion. So this is a much easier graph to read, it's much more intuitive. All the junk has been removed so you can just see the patterns much more clearly. All right, we'll jump into measures of central tendency now, and we'll do that in the next video.
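The axis problem described for the COVID-19 graph can be checked with a few lines of Python. The sketch below uses only the tick labels read aloud in the transcript and assumes, as the description implies, that the graph drew those ticks at equal distances. The helper names are hypothetical.

```python
from datetime import date

# x-axis tick labels read aloud in the video (all in 2020).
x_ticks = [date(2020, 2, 21), date(2020, 3, 1), date(2020, 3, 10),
           date(2020, 3, 15), date(2020, 3, 20), date(2020, 3, 24)]

# y-axis tick labels read aloud in the video (number of cases).
y_ticks = [0, 5_000, 20_000, 35_000, 50_000, 65_000]

def gaps(ticks):
    """Differences between consecutive tick values."""
    return [b - a for a, b in zip(ticks, ticks[1:])]

def evenly_spaced(tick_gaps):
    """True when every step between ticks covers the same amount."""
    return len(set(tick_gaps)) == 1

x_gaps = [g.days for g in gaps(x_ticks)]
y_gaps = gaps(y_ticks)

print("Days between x ticks: ", x_gaps, "even:", evenly_spaced(x_gaps))
print("Cases between y ticks:", y_gaps, "even:", evenly_spaced(y_gaps))
```

If equally spaced tick marks stand for unequal steps like these, the drawn slope no longer reflects the real rate of change. A check of this kind is cheap to run on any axis before trusting the shape of a curve.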