Data Analytics - Descriptive Statistics & Exploratory Data

    Summary

    In this engaging tutorial by CareerFoundry, Dr. Humera takes viewers through an introduction to descriptive statistics and exploratory data analysis using a practical example from the New York taxi dataset. The video covers fundamental concepts such as outliers, mean, median, and how to set up a pivot table to answer key analysis questions. This tutorial is part of CareerFoundry's free data analytics short course designed to give newcomers a taste of life as a data analyst.

      Highlights

      • The video introduces fundamental concepts of descriptive statistics and exploratory data analysis.
      • Dr. Humera guides viewers on how to identify data outliers and potential errors.
      • Viewers learn to use Google Sheets to calculate statistical values like mean, median, min, and max.
      • The tutorial demonstrates setting up pivot tables to analyze pickup location frequency and payment methods.
      • There is an emphasis on defining important analysis questions before conducting data exploration.
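For readers who prefer code to a spreadsheet, the four statistics named in the highlights can be sketched in Python. This is a minimal illustration, assuming a plain list of fare amounts. The values are made up and are not taken from the New York taxi dataset, and the names `fares` and `summary` are hypothetical.

```python
import statistics

# Hypothetical fare amounts in dollars, invented for illustration only.
fares = [6.5, 9.5, 12.0, 7.0, 52.0, 9.5, 4.5]

# The same four descriptive statistics the tutorial computes in Google Sheets.
summary = {
    "mean": statistics.mean(fares),
    "median": statistics.median(fares),
    "min": min(fares),
    "max": max(fares),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In Google Sheets, the corresponding functions are AVERAGE, MEDIAN, MIN, and MAX.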

      Key Takeaways

      • Mastering the basics of descriptive statistics is crucial for data exploration!
      • Outliers can skew your data analysis; identifying them is key!
      • Google Sheets and other tools make statistical calculations like finding means and medians easy-peasy!
      • Pivot tables are magic for summarizing data and answering critical business questions!
      • Remember to define your analysis questions clearly before diving into the data!
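To see why outliers matter, here is a small Python sketch based on the height example from the video: heights between six and seven feet, plus one implausible value of 10 feet. The individual numbers are invented. The 1.5 × IQR rule used to flag the outlier is a common convention and an assumption here; the video itself spots outliers by inspecting the minimum and maximum.

```python
import statistics

# Hypothetical heights in feet; the last value is the implausible one.
heights = [6.0, 6.2, 6.4, 6.5, 6.6, 6.8, 7.0, 10.0]
without_outlier = heights[:-1]

# The mean moves noticeably when the extreme value is included,
# while the median barely changes.
print("mean with / without:", statistics.mean(heights), statistics.mean(without_outlier))
print("median with / without:", statistics.median(heights), statistics.median(without_outlier))

# Flag values outside 1.5 * IQR of the quartiles (a convention, not the video's method).
q1, _, q3 = statistics.quantiles(heights, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [h for h in heights if h < lower or h > upper]
print("flagged outliers:", outliers)
```

Flagged values are candidates for a closer look rather than automatic deletion; whether to keep them depends on the analysis question.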

      Overview

      Welcome to the ultimate guide on kickstarting your data analysis journey with descriptive statistics and exploratory data analysis. Hosted by CareerFoundry and led by the experienced Dr. Humera, this tutorial dives into the world of data inspection using clear, practical examples from the New York taxi dataset. Perfect for beginners and enthusiasts alike, this video equips you with foundational knowledge to uncover insights from raw data.

        Embark on a statistical adventure as you discover the power of basic numerical operations. Dr. Humera, with her 20-year expertise spanning academia and industry, takes you step-by-step through identifying outliers, calculating mean, median, and employing pivot tables. Watch as complex concepts are broken down into digestible parts, allowing you to reveal hidden patterns and insights.

          This tutorial prepares you to tackle real-world data challenges confidently. By the end, you'll be equipped with the skills to pose critical questions and use tools like Google Sheets for descriptive statistics and exploratory data analysis. Set the stage for your next learning journey into data visualization in the subsequent tutorial.

            Chapters

            • 00:00 - 00:30: Introduction to Descriptive Statistics and Exploratory Data Analysis - This chapter introduces the fundamental concepts of descriptive statistics and exploratory data analysis. It features a practical tutorial on exploring datasets to extract meaningful insights, and guides the viewer on setting up a pivot table. The tutorial is conducted by a professional data analyst, Dr. Humera, providing a step-by-step walkthrough. This session is part of CareerFoundry's data analytics short course, designed to give participants a taste of the data analyst role in a concise five-lesson structure.
            • 00:30 - 01:00: Welcoming Dr. Humera - The speaker introduces Dr. Humera, who will guide the audience through exploratory data analysis (EDA) and descriptive statistics. Dr. Humera has over 20 years of experience in data analysis, in both academia and industry. The chapter sets the stage for deeper data analysis, building upon the previous lesson on data cleaning.
            • 01:00 - 01:30: Importance of Descriptive Data Analysis - Descriptive data analysis is crucial because it provides the first insights into the data. It helps in understanding the overall structure, identifying outliers, and informing data cleanup. It goes beyond removing duplicates and missing values and helps spot other potential issues.
            • 02:00 - 03:00: Identifying Outliers - This chapter introduces the concept of descriptive analysis, highlighting its fundamental elements: the mean (average), maximum, minimum, and median. These basic calculations help in understanding the dataset's basic characteristics and guide further exploration. The chapter also hints at practical applications, like Google's potential investment in a data center, emphasizing the importance of data analysis in decision-making.
            • 03:00 - 04:00: Practical Example of Descriptive Analysis - The chapter delves into a practical example of descriptive analysis using New York taxi data. It begins by addressing the need to analyze data to identify frequently asked questions. The focus is on understanding outliers, defined as data points that lie outside the normal range. An illustrative example is measuring people's heights to identify any anomalies. The goal is to use this analysis to gain insights into the dataset, which in this case is the New York taxi data.
            • 04:00 - 05:00: Creating Descriptive Statistics Sheet - This chapter discusses the identification and significance of outliers in datasets, particularly within the context of descriptive statistics. It explains that outliers are values that fall outside the normal range, such as a height of 10 feet when the typical range is between six and seven feet. Outliers indicate anomalous behavior and are critical in data analysis for identifying abnormalities in data.
            • 05:00 - 06:00: Calculating Mean, Median, Min, and Max - The chapter focuses on conducting descriptive statistics and exploratory data analysis on the New York taxi dataset. It continues from the previous session, where the data was cleaned by removing duplicates and missing values. The chapter emphasizes calculating statistical measures such as the mean, median, minimum, and maximum to understand the dataset better.
            • 06:00 - 10:00: Identifying and Handling Incorrect Data - The focus is on descriptive analysis. Descriptive statistics are highlighted as a crucial tool for gaining insights into data patterns. The mean (average), the median, and the minimum and maximum values are emphasized as foundational measures for understanding the data. The chapter encourages examining specific columns and parameters of interest to gain a better understanding of the data's behavior.
            • 12:00 - 13:30: Defining Key Questions for Analysis - In this chapter, the focus is on defining key questions for analysis, particularly concerning taxi fares. The discussion revolves around the 'fare amount' parameter found in column K, exploring statistical queries such as the average, minimum, and maximum fare. The process includes creating a new sheet for descriptive statistics, separate from the raw data, and shows how to add a new sheet in the spreadsheet.
            • 15:00 - 18:00: Using Pivot Tables for Analysis - The chapter explores the basics of using pivot tables in a spreadsheet to analyze data. It starts with renaming a sheet to 'Descriptive Statistics' and outlines the steps to calculate various statistics.
            • 19:00 - 21:00: Question 2: Passenger Count vs Trip Distance - In this chapter, the focus is on exploring the relationship between passenger counts and trip distances in the dataset. It involves analyzing metrics such as the mean and median, particularly in the context of fare amounts. Attention is given to providing descriptive headings that accurately represent the content under discussion. This analysis aims to provide insights into average fare amounts and their relationship with the number of passengers and the distances traveled.
            • 21:00 - 24:00: Question 3: Payment Types Statistics - This chapter discusses statistics related to payment types, focusing on the mean and the median. The mean is explained as the sum of all data points divided by their count, which makes it sensitive to outliers. In contrast, the median is the middle value when the data points are arranged in order, which makes it robust against extreme values. The chapter aims to explore the practical implications of these measures, specifically in determining values like the minimum fare paid.
            • 24:00 - 27:00: Question 4: Payment Type vs Vendor ID - This chapter focuses on performing calculations related to payment types and vendor IDs. It primarily discusses using Google Sheets to make calculations like the mean and maximum easier. The chapter highlights the simplicity of starting a formula by typing '=' followed by the function name, such as 'average', and using autocomplete.
            • 28:00 - 30:00: Conclusion and Next Steps - This chapter provides a practical tutorial on calculating average values in a spreadsheet. The instructor describes how to select an entire column on the original sheet by clicking the header of column K, which automatically includes all the values in that column. They explain closing the bracket and pressing Enter to execute the function, after which the average fare amount is calculated automatically. This chapter acts as a practical conclusion to the steps taught previously.
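The tutorial handles incorrect fare values by sorting the fare column in ascending order so that negative amounts rise to the top, deleting those rows, and keeping zero fares for now. A rough Python equivalent might look like the sketch below. The records and the field names `trip_id` and `fare_amount` are assumptions made for illustration, not the dataset's actual schema.

```python
# Hypothetical trip records; field names and values are invented for illustration.
trips = [
    {"trip_id": 1, "fare_amount": 9.5},
    {"trip_id": 2, "fare_amount": -243.0},
    {"trip_id": 3, "fare_amount": 0.0},
    {"trip_id": 4, "fare_amount": 14.0},
    {"trip_id": 5, "fare_amount": -2.5},
]

# Sort ascending so negative fares come first, mirroring the spreadsheet sort.
sorted_trips = sorted(trips, key=lambda row: row["fare_amount"])

# Drop only the negative fares; zero fares are kept, as in the tutorial.
cleaned = [row for row in sorted_trips if row["fare_amount"] >= 0]
removed = len(trips) - len(cleaned)

print("rows removed:", removed)
print("new minimum fare:", min(row["fare_amount"] for row in cleaned))
```

Filtering directly would make the sort unnecessary in code; it is kept here only to stay close to the spreadsheet steps.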

            Data Analytics - Descriptive Statistics & Exploratory Data Transcription

            • 00:00 - 00:30 Descriptive statistics and exploratory data analysis. It sounds like a mouthful, right? In this video, we're going to give you a practical tutorial on how to start exploring a dataset for meaningful insights. We're also going to teach you how to set up a pivot table. We're working with a professional data analyst here, Dr. Humera, who's going to take you through every step. This is tutorial three of CareerFoundry's data analytics short course. If you haven't already, you can sign up for free to join the course in the description below. It's a snappy five-lesson course, and we'll be covering everything that you need to know to get your first flavor of what it is like to be a data analyst.
            • 00:30 - 01:00 I'm going to hand over now to Dr. Humera, and she's going to take you through exploratory data analysis and descriptive statistics. Over to you, Dr. Humera. Thanks a lot, Will. Hi everyone. Again, this is Dr. Humera, and I've been associated with data analysis for over 20 years now, both in academia and in industry. In this tutorial, we will cover exploratory data analysis and descriptive statistics. We have already seen in the previous tutorial how to clean up the data, and now we are good to go and dig deeper into the data analysis.
            • 01:00 - 01:30 Descriptive data analysis is actually very important because it gives you the first insights into your data. It will tell you what your data looks like and whether there are any outliers, which means there is anything which is not right. While doing data cleanup, we have already seen how to remove duplicates and how to remove missing values, but there is much more. Some further things might be off, which we can only identify
            • 01:30 - 02:00 if we look into the data from its descriptive point of view. Descriptive analysis will include very simple mathematical operations, like identifying the mean, or average, of the values. We'll find the maximum, the minimum, and the median. These analyses will be good enough to get us started and let us know what the data looks like. This will be enough to allow us to explore the data further and make sense of it. Let me give you a very simple example. Imagine Google wants to invest in a new data center where they only want to keep
            • 02:00 - 02:30 their most frequently asked questions. How would they know what the most frequently asked questions are? This is actually where this kind of analysis is going to help us. So let's now dive deeper into our data and actually see what the New York taxi data has to tell us. So, one question: you mentioned outliers, but what exactly do you mean by that? So outliers are actually data points which do not lie within the normal range of the data. So imagine you are collecting the heights of different people and you
            • 02:30 - 03:00 have a height range between six and seven feet. Now if you notice the height of a person as 10 feet, you can instantly see that it's an outlier, because it doesn't fit into the normal range and it couldn't be the right one. So outliers are those which lie outside of the normal range. Okay. It sounds like some kind of anomaly, then? You're totally right. Outliers actually indicate an anomalous behavior.
            • 03:00 - 03:30 Definitely. So without further ado, let's dive into tutorial three. We are looking at the data and doing descriptive statistics and exploratory analysis. Here we are back at our Google Sheet, looking at our New York taxi dataset. In the previous session, we actually cleaned up the data, removed the duplicates, and removed missing values. Now let's have a look at what descriptive analysis has to tell us and explore the data a bit more.
            • 03:30 - 04:00 We'll start with the descriptive analysis. Descriptive statistics is important because it gives us insight into the patterns of the data. Some of the methods used here are, for example, the mean of the data (the average), the median, and the minimum and maximum values. These are actually good enough to get us started, to get a feel of what the data is about. So let's pick a couple of columns here, some of the parameters that we find interesting, and we'll have a look at what they do.
            • 04:00 - 04:30 If you look at the column names, in column K we have an interesting parameter. That's basically the fare amount. Let's try to have a look at, for example, for these taxi services, what is the average fare that people are paying? What's the minimum fare that they are paying? What's the maximum one, and so on and so forth. To do this, we'll begin by creating a new sheet so that our descriptive statistics are separate from the raw data. To create a new sheet, you simply go to the bottom left of the screen and click Add sheet.
            • 04:30 - 05:00 You see a Sheet1. You can double-click on it and rename it to the name that you want to give. Let's give it the name Descriptive Statistics, and you hit Enter. At this point in time, we want to calculate the four elements that we just spoke about. So let's first identify what the different stats are that we want to calculate.
            • 05:00 - 05:30 So we give a heading, Stats, and we also give a heading for which parameter we are talking about. It was fare amount, so we can take the same. It doesn't have to be the exact same spelling, just representative of what we are talking about. The next thing we want to find out is the mean, so the mean fare, and the median. What mean and median actually represent is a kind of average
            • 05:30 - 06:00 of the data, but the mean is very sensitive to extreme values. For the mean, you basically add all the elements and divide by the number of elements. So in case there are outliers, the mean might be a bit distorted. The median actually helps in that case: for the median, all the data points are arranged in order and the central point is picked. We'll have a look and see what that does. We are also interested in identifying the minimum fare that someone
            • 06:00 - 06:30 paid and the maximum, so we can give the headings Min and Max. Now let's do some calculations. Google Sheets actually makes all these calculations very easy, and this is the same for any other data analysis tool that you use. You start typing a formula, which you begin by pressing the equals sign and typing the formula that you want; in this case, it's the average. So you start typing AVERAGE, and you can autocomplete by hitting Enter.
            • 06:30 - 07:00 Now, you want to get the average of what? The data that you want to take the average of is in the other sheet. So you select the original sheet and the column K. By clicking the column name, K, it will basically select the whole column and include all the values that are in that column. Once you are done, you can close the bracket and hit Enter. And here you see the average fare amount calculated automatically.
            • 07:00 - 07:30 It's really as simple as it looks. In a similar way, we can calculate the median, min, and max. To start calculating the median, you start again with the equals sign, start typing MEDIAN, complete by hitting Enter, go to the original sheet, select column K, close the bracket, and hit Enter. And here you have 9.5. As you can already see, this actually shows that maybe some people were paying a very low
            • 07:30 - 08:00 or very high fare, but the normal scenario was that people were paying around 9.5 as an average fare. To calculate the minimum, you start typing with the equals sign and type MIN. Again, go to the first sheet and select column K. You can actually do this with any column that you want; right now we are interested in the fare amount, so this is where we are focusing.
            • 08:00 - 08:30 Oh, you already see that there is a minimum value of minus 243. This doesn't look right. We'll explore it in a bit, but let's first calculate the max. I'm sure you're already appreciating the importance of this
            • 08:30 - 09:00 descriptive statistical analysis, because just by calculating some basic numbers, you are already able to identify a further problem with your data, where the fare amount that somebody paid is not a correct number. What do we want to do? In its simplest form, we would like to remove those values where the numbers are not right. So let's just go back to our original data and delete those rows.
            • 09:00 - 09:30 You already know how to filter the data for specific values. We have already done that for missing values, that is, blanks. Now we can do this for this negative number. But wait a second: it's quite possible that minus 243 is not the only negative number in the dataset. There might be others, so we can do one more thing that will actually help us identify all those negative numbers in one go. For that, let's first go to our original data.
            • 09:30 - 10:00 One simple thing that we can do here is to sort our data with respect to the fare amount. By sorting our data in ascending order, which means the lower values come first and the higher values go down, we'll be able to bring all the negative values to the top of the dataset, and then we can remove them. So let's do that. One way to do this is to select the whole sheet.
            • 10:00 - 10:30 We go to Data and we sort the range. When you click on Sort range, you can go to the advanced range sorting options. You want to specify that the data has a header row, because your first row is a heading row, and you want to sort with respect to the fare amount, so you can sort by fare amount. We want to sort from lowest to highest, so we let it be A to Z and then we sort. Once the sorting is done,
            • 10:30 - 11:00 you can see that all the rows with negative fare amounts populate the top of the screen. We can actually pick them up. So I'm scrolling to the last value which is negative, so that I can select everything from there to the top. I also see a lot of zeros in here, which means many people didn't pay at all. You may want to decide whether to keep them or delete them as well.
            • 11:00 - 11:30 We will keep them at the moment. So these are almost 400 data points, which we will remove. Now you pick the one from where you want to start the selection, and by pressing Shift and selecting the other endpoint, you are now able to select all the data points between those two. Right-click and delete the selected rows. This will actually remove the data points with negative fare amounts. We can now go to the descriptive stats again, and we can see here that the
            • 11:30 - 12:00 minimum value is indeed zero. With this, we are able to identify some more problems and fix them before proceeding. With this, we are also able to see, with respect to the fare amount, what people were paying at the minimum, at the maximum, and on average. Now it's time to dig further into the data and perform more exploratory data analysis. The first question, which is actually also part of the responsibility of a data analyst, is to identify what these questions are.
            • 12:00 - 12:30 What is it that you want to find out? What is it that may be important for the business? Before we start doing any kind of analysis, we need to take a step back and think about the questions that we want to answer. Defining these questions is as important as finding the answers by looking at the data. Some of the questions that we may be interested in are, for example, about the pickup locations and the drop-off locations.
            • 12:30 - 13:00 Is there a link between the two? How many times are people getting on at a certain point and dropping off at a certain location? Or trip distance: what is the typical distance that people are traveling when they use the taxi service? Is there a relation between passenger count and trip distance? How are people paying, that is, the payment type, and is the payment type in any way connected to the vendor ID? All these questions are what we can explore further.
            • 13:00 - 13:30 But let's take a brief moment and note down all the questions that we need to answer so that we can start our analysis. Now we can dive deeper into the exploratory analysis. Let's say the four key questions that we want to answer are the following. The first one is pickup location frequency. We want to find, for each pickup location, how many times a ride was initiated at that point.
            • 13:30 - 14:00 The second question that we want to answer is the relation between the passenger count and the trip distance. The third one is basically about the payment statistics, so we want to identify what kinds of payment modes people are using. And the fourth one is where we want to note whether there is any relation, and what the relation is, between the payment types and the vendor ID. So let's now dig deeper into these questions one at a time, and we'll start with the pickup location frequency.
            • 14:00 - 14:30 To identify these relationships between the different data elements, pivot tables are actually very helpful. They help us extract data points, group them together with respect to different parameters, like count, minimum, maximum, average, and so on and so forth, and help us visualize data in a concise way. Let's now have a look at how we can use pivot tables to answer these questions, starting with pickup location frequency.
            • 14:30 - 15:00 We go to the Insert menu and click Pivot table. This will open a pop-up, which will ask you where you want to create your pivot table. You select a new sheet. In this new sheet, you will see two sides. On the left side, you'll see the labels Rows, Columns, and Values, and on the right side, you'll see a pivot table editor. What elements are we interested in? We are interested in the pickup location.
            • 15:00 - 15:30 So the rows that we want to pick are the pickup locations. You start with Add and you specify the pickup location. We are not done yet. We have only specified that the first part of this equation is handling the pickup location. So what we see here are all the pickup locations populated in a column, but we want to group them with respect to their frequencies. And what are frequencies? Frequencies are basically counts.
            • 15:30 - 16:00 So what kind of values do we want? You can click Add for the values and specify that you want the pickup location ID. Then you can specify that you want to count the pickup locations. With this, our table is complete. So here, what you see is that for the first pickup location, a ride was initiated five times, and similarly, for location ID four, a ride was initiated seven times.
            • 16:00 - 16:30 Before we do any processing on this, it's always better to copy this data from the pivot table into a separate sheet, so that we can make more changes in the pivot table to answer more questions without affecting our analysis. So let's now create a new sheet and copy the data from here to that new sheet. You click the plus at the bottom left, and it creates a new sheet. This is all about pickup location frequency.
            • 16:30 - 17:00 So we rename this sheet to Pickup Location Frequency and begin typing the column names: pickup location and frequency. Of course, you don't need to have the same names as in the original dataset, but they should be something representative. Now you can go to your pivot table and copy the data. We need to be careful not to copy the whole columns, because in
            • 17:00 - 17:30 that case, your pickup location frequency data points will be connected to the original pivot table, and in case you make any change on this sheet, it will actually affect your other sheet. So just be careful and copy only the desired values. Let's copy these data. Let's see how many data points we have. There are almost 204. You select all of them, right-click,
            • 17:30 - 18:00 and copy, and now we come to the Pickup Location Frequency sheet and paste. Of course, you can do a number of things just by looking at this data. So how about we simply sort them based on frequency, and we'll be able to identify which are the most popular pickup locations. You know how to sort the data: you select the data points, you go to Data, Sort range, and under the advanced range
            • 18:00 - 18:30 sorting options, your data has a header row, and you want to sort by frequency. This time we want to do it in descending order, so the highest value first. When you sort them, you can see that, for example, pickup location 237 was the most frequent one, where a pickup happened 4,192 times,
            • 18:30 - 19:00 and so on and so forth. With this, we have answered the first question: what is the frequency of our individual pickup locations? Now let's move on to our next question. Question number two is whether there is any relationship between the passenger count and the trip distance. Let's go back to our pivot table and extract information for these two variables. We are back in our pivot table. We can remove the entries from the first question, and that's why it was important that the
            • 19:00 - 19:30 table is not linked to your real analysis. So the first variable that we have here is the passenger count. So let's add passenger count as rows. Now we have our passenger count. What's the other variable? That's the trip distance. So let's use Values for trip distance. And here, what do we want to check? Let's say we want to have the average of the trip distances. So here we have our passenger count,
            • 19:30 - 20:00 and as we can see, there were passengers from zero to a maximum of six. Now let's see what the average distance is that they have been covering in their trips. So in the case of values, you want to get the trip distance, but this time we don't want to count; we want to get the average of these distances. So here we have that. Just like before, we don't want to leave this data in our pivot table; rather, we want to copy it into another sheet. So let's just do this.
            • 20:00 - 20:30 Now we are creating a new sheet and naming it Passenger vs. Trip Distance. We have the two labels, passenger count and trip distance, and now we can copy the data from our pivot table.
            • 20:30 - 21:00 So here we have our passenger count versus trip distance, and now we are good to go with question number three. Question number three is about the payment types. We remove the existing elements from the pivot table as before. So we select the payment type and add values. So we have the payment type, select it as rows, and now we
            • 21:00 - 21:30 can add values for payment type. Again, this time we want to count. As before, we will copy this into a new sheet, and now we can copy the data. And finally, we can move on to our fourth question. Our fourth question is a bit different. Here we want to compare the payment type with respect to the vendor ID. Let's see how we can do that. As before, remove the earlier elements, so remove the payment type. The first variable that we want to talk about is the payment type,
            • 21:30 - 22:00 and the second one is the vendor ID. This one is going to be different from earlier. Here we are not only talking about absolute counts; rather, we want to relate two variables, payment type and vendor ID, and then compare them by their absolute counts. So here we are not only talking about rows, but rather rows and columns. So let's see how we can do this. First you add your payment type as rows. Now you can add the column, which is your vendor ID, and now you want to do the calculations for counting.
            • 22:00 - 22:30 You can count either vendor ID or payment type; it will give you the same result. Now you see it's not just a two-column table. Rather, it has four rows because there are four types of payment, and there are two columns because there are two types of vendors, and each cell corresponds to one payment type versus vendor ID combination. As before, we'll copy all this information to a separate sheet. We can see that for vendor number two, payment types three and four, there is no data.
            • 22:30 - 23:00 Apparently, for vendor two, nobody actually used payment types three and four. That's basically it. With this, we are done with our initial exploratory analysis. In this video, you have learned how you can do descriptive statistics, which actually helps you uncover patterns in the data and also identify further problems with your data, and you are able to answer several questions about your data by doing exploratory analysis. In the next tutorial, we'll see how we can visualize this data. Back to you, Will.
            • 23:00 - 23:30 Wow. Thanks, Humera. That was a really exciting deep dive into descriptive statistics. Coming up in tutorial four, we're going to be looking at data visualization. We'll be showing you how to do graphs, charts, and visual displays for the data that we've covered so far. Thank you for making it this far, and congratulations on successfully completing tutorial 3!
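The four pivot-table questions from the walkthrough can also be sketched in plain Python, which may help clarify what each pivot table computes. This is only an illustration under stated assumptions: the rows and their values are invented, and the field names (`pickup_location`, `passenger_count`, `trip_distance`, `payment_type`, `vendor_id`) are hypothetical stand-ins for the dataset's columns.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical rows; values are invented and field names are assumed.
rows = [
    {"pickup_location": 237, "passenger_count": 1, "trip_distance": 1.0, "payment_type": 1, "vendor_id": 1},
    {"pickup_location": 237, "passenger_count": 2, "trip_distance": 3.0, "payment_type": 1, "vendor_id": 2},
    {"pickup_location": 4, "passenger_count": 1, "trip_distance": 2.0, "payment_type": 2, "vendor_id": 1},
    {"pickup_location": 237, "passenger_count": 1, "trip_distance": 3.0, "payment_type": 1, "vendor_id": 2},
    {"pickup_location": 4, "passenger_count": 2, "trip_distance": 5.0, "payment_type": 3, "vendor_id": 1},
    {"pickup_location": 161, "passenger_count": 1, "trip_distance": 2.0, "payment_type": 2, "vendor_id": 2},
]

# Question 1: pickup location frequency (rows = pickup location, values = count).
pickup_frequency = Counter(row["pickup_location"] for row in rows)

# Question 2: average trip distance for each passenger count.
distances = defaultdict(list)
for row in rows:
    distances[row["passenger_count"]].append(row["trip_distance"])
avg_distance = {count: mean(values) for count, values in sorted(distances.items())}

# Question 3: how often each payment type is used.
payment_counts = Counter(row["payment_type"] for row in rows)

# Question 4: payment type (rows) against vendor ID (columns), as counts.
crosstab = Counter((row["payment_type"], row["vendor_id"]) for row in rows)

print("most frequent pickups:", pickup_frequency.most_common())
print("average distance by passenger count:", avg_distance)
print("payment type counts:", dict(payment_counts))
for payment_type in sorted(payment_counts):
    # A missing combination counts as zero, like an empty pivot-table cell.
    print(payment_type, [crosstab[(payment_type, vendor)] for vendor in (1, 2)])
```

`Counter.most_common()` returns the counts in descending order, which corresponds to the descending sort applied to the copied pivot-table values in the walkthrough.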