Summary
In this comprehensive video tutorial by Derek Banas, viewers learn the essential statistics needed to understand data science and machine learning. The video covers key statistical concepts such as sample populations, data types, variance, standard deviation, the normal distribution, hypothesis testing, and various data visualization techniques. Derek also demonstrates practical applications of these concepts, such as using statistics for business decision-making and analyzing relationships within data. The tutorial concludes with a discussion of chi-square tests and how probabilities integrate with these statistical methods.
Highlights
Statistics is the science of collecting and analyzing data from sample populations
Various data types include categorical, numerical, and qualitative data, each with distinct characteristics
Common data visualizations include pie charts, bar charts, and histograms
Understanding the mean, median, and mode helps summarize data effectively
Hypothesis testing and p-values help verify assumptions about data distributions and populations
Key Takeaways
Understanding statistical concepts is crucial for data science and machine learning success
A strong grasp of statistics and probabilities can lead to effective business decision-making
Visualizing data with charts and tables helps identify patterns and relationships
Hypothesis testing allows data scientists to validate assumptions and forecast outcomes
Statistical methods like standard deviation and variance describe how data is spread around the mean
Overview
Statistics is a crucial discipline in data science and machine learning, dealing primarily with data collection and analysis. Through comprehensive explanations, Derek Banas educates viewers on how statistical concepts like sample populations and data types form the backbone of these fields. Whether handling categorical or numerical data, the tutorial emphasizes accuracy and detail in every analysis.
Incorporating practical examples, Derek walks through complex data visualizations, explaining elements like cross tables, pie charts, and histograms that help make sense of big data. He further explains measures such as mean, median, mode, and introduces statistical notions like variance and standard deviation to enrich comprehension. These methods are practical for identifying trends and drawing insights.
Lastly, the video elaborates on hypothesis testing, Z-scores, and the significance of confidence intervals when testing assumptions. Derek illustrates how applying these techniques can guide strategic decisions based on reliable data interpretations. Also covered are concepts like linear regression and chi-square tests, demonstrating their use in predicting outcomes and assessing relationships within data.
Chapters
00:00 - 00:30: Introduction to the Tutorial The 'Introduction to the Tutorial' chapter introduces viewers to a statistics video tutorial aimed at teaching essential statistics knowledge necessary for understanding data science and machine learning. It mentions a previous video on probabilities, suggesting that watching both will provide a comprehensive understanding of statistics and probabilities. The chapter highlights statistics as the science of collecting and analyzing data.
00:30 - 01:00: Understanding Population and Sample The chapter titled 'Understanding Population and Sample' explains the concepts of population and sample. The population includes all items or people of interest in a particular analysis. A sample, on the other hand, is a subset of the population that is used for analysis. The focus is on the successful outcomes or results desired from the sample, such as age, car ownership, or college graduation status.
01:00 - 02:00: Types of Data This chapter covers the different types of data, including categorical data, which describes the distinguishing characteristics of an item. Symbols such as uppercase M, lowercase s, and lowercase n are used to denote population successes, sample successes, and the total sample drawn from the population, respectively.
02:00 - 04:30: Visualizing Data The chapter titled 'Visualizing Data' covers the importance of identifying unique characteristics in data, such as age, car ownership status, gender, and educational background. It also emphasizes the usefulness of pausing the tutorial to take notes, which can serve as a comprehensive cheat sheet, before the discussion moves on to numerical data.
04:30 - 06:00: Measures of Central Tendency The chapter 'Measures of Central Tendency' begins by discussing the characteristics of data, specifically focusing on whether data is finite or infinite. Infinite data does not have an ending value, while finite data does. The chapter continues by explaining continuous data, which can be broken into infinitely smaller amounts, exemplified by measurements such as distance, height, and weight. Additionally, qualitative data is introduced, described as either nominal, meaning it consists of named categories, or ordinal, meaning the categories have an order.
06:00 - 08:00: Variance and Standard Deviation The chapter titled 'Variance and Standard Deviation' discusses different types of data classifications. First, it explains nominal data, which is used for naming or labeling variables without any specific order, such as race. Next, it covers ordinal data, which is named and possesses a specific order, with examples including categories like bad, okay, good, or great. Finally, the chapter introduces quantitative data, which can be categorized into ratio or interval data, characterized by specific quantities or amounts.
08:00 - 11:00: Covariance and Correlation In the given chapter titled 'Covariance and Correlation', the concept of quantitative data is introduced through an example: picking a number between two defined amounts, such as between 8 and 16. The chapter discusses various methods of data visualization, particularly focusing on cross tables which illustrate relationships between rows and columns of data. It also touches on the concept of frequency, describing it as a measure of how often something occurs, as demonstrated through sampling. The chapter likely seeks to build a foundation for understanding how covariance and correlation can be analyzed through visual representation and frequency analysis.
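To make the cross-table idea concrete, here is a minimal Python sketch (not taken from the video) of a cross table built with pandas; the column names and sample values are assumptions for illustration.

```python
# A minimal sketch of a cross table using pandas; the data below is hypothetical.
import pandas as pd

data = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male", "female"],
    "exercises": ["no", "yes", "no", "yes", "no", "yes"],
})

# pd.crosstab counts how often each (row, column) combination occurs,
# i.e. the frequency relationship between the two categorical variables.
cross = pd.crosstab(data["sex"], data["exercises"], margins=True)
print(cross)
```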
11:00 - 14:00: Applications in Real World The chapter titled 'Applications in Real World' covers different chart types and their application in real-world scenarios. It starts by discussing a study of 100 random men, of whom 78 did not exercise. The use of pie charts is explained, highlighting that each slice represents a category with its size indicating frequency, and the slices must always total 100%. Bar charts are described next, emphasizing how the length of each bar represents the frequency of its category. Finally, Pareto charts are mentioned, which list categories in descending order of frequency.
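As a brief illustration of those chart types, the following matplotlib sketch draws a pie chart and a bar chart from the 78/22 exercise split mentioned above; everything else in the snippet is assumed for the example.

```python
# A minimal matplotlib sketch of a pie chart and a bar chart.
# The 78 / 22 split comes from the 100-man exercise example; the layout is illustrative.
import matplotlib.pyplot as plt

categories = ["Did not exercise", "Exercised"]
counts = [78, 22]  # frequencies; the pie slices must total 100% of the sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.pie(counts, labels=categories, autopct="%1.0f%%")  # slice size = frequency
ax1.set_title("Pie chart")
ax2.bar(categories, counts)                            # bar length = frequency
ax2.set_title("Bar chart")
plt.tight_layout()
plt.show()
```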
14:00 - 16:00: Probability Distribution The chapter focuses on probability distributions, highlighting the importance of frequency distribution tables. It explains how these tables are used to list a range of test scores and the number of students who scored within each range. Additionally, it discusses categories in descending order and the inclusion of a cumulative frequency line, which represents the sum of all preceding frequencies.
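A minimal sketch of such a frequency distribution table with a cumulative-frequency column, using hypothetical score ranges rather than the exact ones shown in the video:

```python
# A sketch of a frequency distribution table with a running cumulative-frequency column.
# The score ranges and counts are made up for illustration.
import pandas as pd

freq = pd.DataFrame({
    "score_range": ["0-59", "60-69", "70-79", "80-89", "90-100"],
    "students": [4, 9, 14, 8, 5],
})
# cumulative frequency = sum of all preceding frequencies plus the current one
freq["cumulative"] = freq["students"].cumsum()
print(freq)
```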
16:00 - 18:00: Normal Distribution This chapter discusses the concept of a normal distribution, emphasizing the differences between histograms and bar graphs. Histograms display the distribution of data across ranges, whereas bar graphs categorize data. Importantly, histograms have touching bars. Additionally, the chapter introduces the mean (average), which is calculated by summing all values and dividing by the number of values.
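For illustration, here is a small matplotlib sketch of a histogram over grade ranges (note the touching bars, unlike a bar chart); the grade values are made up.

```python
# A sketch of a histogram: data is grouped into ranges (bins) and the bars touch.
# The grades below are illustrative, not the video's data.
import matplotlib.pyplot as plt

grades = [55, 62, 64, 68, 71, 73, 75, 75, 78, 81, 84, 86, 88, 91, 95]

plt.hist(grades, bins=[50, 60, 70, 80, 90, 100], edgecolor="black")
plt.xlabel("Grade range")
plt.ylabel("Frequency")
plt.show()
```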
18:00 - 19:00: Central Limit Theorem The chapter discusses the concept of the mean in statistics, represented by the symbol 'mu' for a population mean and 'x bar' for a sample mean. It emphasizes the importance of noting these definitions and cautions about potential issues with the mean, particularly how outliers can significantly impact the results. An example is given to illustrate this effect.
19:00 - 22:00: Z-Score and Confidence Intervals This chapter delves into the concepts of Z-scores and Confidence Intervals, discussing their significance in statistical analysis. It begins with explaining how outlier values can affect the mean of a dataset, potentially causing issues in data interpretation. To address this, the chapter highlights the use of the median, which can mitigate the impact of outliers by focusing on the central data point. In cases where there's an even number of data points, it takes the average of the two central values. Examples are promised to further illustrate these statistical tools.
22:00 - 25:00: T-Distributions In this chapter titled 'T-Distributions', the focus is on statistical concepts related to data analysis and probability, specifically discussing the mode and variance. The mode is defined as the value that appears most frequently within a data set. If all values occur at equal frequency, there is no mode, whereas multiple modes exist if different values occur with the same highest frequency. Variance is explained as a measure of how data is distributed around the mean, providing insight into the spread or dispersion within the set.
25:00 - 26:00: Dependent and Independent Samples The chapter 'Dependent and Independent Samples' focuses on statistical concepts related to variance. It distinguishes between the variance of a population and a sample, explaining how each is represented by different symbols. It further elaborates on the calculation process of variance for a sample, which involves finding the mean, summing the squared differences between each sample value and the mean, and then dividing by the number of samples minus one.
26:00 - 33:00: Hypothesis Testing In the chapter titled 'Hypothesis Testing', the discussion first continues the calculation of variance, particularly the difference between calculating variance from a sample versus the entire population. When using a sample, the sum of squared differences is divided by the number of samples minus one; if data on the entire population were available, this adjustment would not be needed. The chapter also points out that squaring values while calculating variance gives more weight to outliers, and demonstrates with a sample variance calculated as 2.5.
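A short sketch of the sample-variance calculation with the minus-one divisor described above; the chapter does not list the underlying sample values, but 1 through 5 reproduces the 2.5 result it mentions.

```python
# Sample variance with the (n - 1) divisor, compared to the population variance.
from statistics import variance, pvariance

sample = [1, 2, 3, 4, 5]
m = sum(sample) / len(sample)                               # mean = 3
s2 = sum((x - m) ** 2 for x in sample) / (len(sample) - 1)  # (4+1+0+1+4) / 4 = 2.5
print(s2)                  # 2.5
print(variance(sample))    # 2.5 -- sample variance (divides by n - 1)
print(pvariance(sample))   # 2.0 -- population variance (divides by n)
```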
33:00 - 39:00: Linear Regression The chapter 'Linear Regression' introduces the concept of standard deviation, which is the square root of variance. It explains that a larger standard deviation indicates that numbers are more spread out, while a smaller standard deviation suggests that the results or samples are more closely clustered.
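To illustrate the point about spread, here is a small example with made-up data: two samples that share a mean but have very different standard deviations.

```python
# Two samples with the same mean but different spread; a larger standard deviation
# means the values are more spread out around the mean. The data is illustrative.
from statistics import mean, stdev

tight = [9, 10, 10, 11]
wide = [2, 8, 12, 18]

print(mean(tight), stdev(tight))  # 10, ~0.82 -- values cluster near the mean
print(mean(wide), stdev(wide))    # 10, ~6.73 -- values are far more spread out
```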
39:00 - 41:00: Coefficient of Determination This chapter discusses the coefficient of determination. It highlights the usage of the coefficient of variation for comparing measurements on different scales, such as miles and kilometers. The chapter emphasizes how the same distance measured in different units can affect standard deviation calculations.
41:00 - 43:00: Root Mean Squared Deviation The chapter discusses the concept of standard deviation in relation to measuring distances in miles and kilometers. It highlights the issues that arise due to the difference in standard deviation values: 0.645 for miles and 1.038 for kilometers. However, when these values are divided by their respective means, the dispersion becomes equal, calculated to be 0.1721. This equal dispersion is presented as crucial information to be utilized in the ongoing tutorial.
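A minimal sketch of the coefficient of variation; the distance values are assumptions, but they show that converting the same data from miles to kilometers changes the standard deviation while leaving the dispersion (standard deviation divided by the mean) unchanged.

```python
# Coefficient of variation = standard deviation / mean.
# The distances are hypothetical; the 1.6 factor is the rough conversion used in the video.
from statistics import mean, stdev

miles = [3.0, 3.5, 4.2, 4.8, 3.9]
kilometers = [m * 1.6 for m in miles]

cv_miles = stdev(miles) / mean(miles)
cv_km = stdev(kilometers) / mean(kilometers)
print(round(cv_miles, 4), round(cv_km, 4))  # identical dispersion on both scales
```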
43:00 - 47:00: Chi-Square Tests The chapter explains the concept of covariance and its role in determining if two groups of data move in the same direction, indicating a correlation. It uses the example of comparing earnings to the market capitalization of a corporation, which is the total value of a corporation, to illustrate the calculation method.
47:00 - 48:00: Conclusion The final chapter provides a detailed explanation of the covariance calculation, which involves subtracting the means from the paired values and multiplying the results. The steps involve taking each individual value of x, calculating x minus the mean of x, multiplying it by y minus the mean of y, summing these products, and finally dividing by the number of samples minus one. The resulting value of 5803 is greater than zero, which indicates that the two variables move together.
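For reference, here is a small Python sketch of the sample covariance formula described above; the earnings and market-cap figures are hypothetical, so the result will not be the video's 5803, but the sign carries the same interpretation.

```python
# Sample covariance: sum of (x - mean_x) * (y - mean_y), divided by n - 1.
def sample_covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

earnings = [10, 12, 15, 11, 18]         # hypothetical quarterly earnings
market_cap = [100, 110, 140, 105, 160]  # hypothetical market capitalization

print(sample_covariance(earnings, market_cap))  # > 0: the two series move together
```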
Statistics for Data Science & Machine Learning Transcription
00:00 - 00:30 well hello internet and welcome to my statistics video tutorial in this one video you're going to learn pretty much everything you're going to need to know statistics wise to be able to understand data science and machine learning now i have a previous video which i have a link to in the description on probabilities and if you watch both videos you will have a really strong grasp on statistics and probabilities now statistics is the science of collecting and analyzing data taken from a
00:30 - 01:00 sample population and the population is going to represent all items or people of interest in what you're analyzing a sample is a subset of the population that we can analyze and we mainly focus on successes or results we are looking for in a sample examples of this would be things like age whether somebody is a car owner or not whether somebody is a college graduate or not
01:00 - 01:30 sex whether they're a homeowner etc and in this diagram you're looking at here the uppercase m is going to represent the successes in the population uppercase n is the total population lowercase s represents the successes in the sample and lowercase n is going to represent the total sample from the population now there are many different types of data categorical data is going to describe what makes a
01:30 - 02:00 thing unique like for example age or whether somebody is a car owner or sex or whether they're a graduate or simply any answer to a yes or no question and as you're watching this video tutorial pause your way through it and write down notes and if you do that you will have like an awesome cheat sheet that you can use to really really understand this topic numerical data
02:00 - 02:30 is either going to be finite meaning that it has an ending value or infinite meaning the opposite it has no ending value continuous data is data that can be broken down into infinitely smaller amounts so you can think of things like distance and height and weight you can constantly break them down into smaller and smaller units qualitative data can be either nominal meaning that it is named data it is
02:30 - 03:00 mainly data for naming something which doesn't have an order for example race would be an example because there are many different races but there is no order to them ordinal data is also named but it has an order and then examples of that would be things like bad okay good or great and then we have quantitative data and it is like a ratio or interval being an amount
03:00 - 03:30 between two defined amounts for example you could say pick a number between 8 and 16. that would that would be an example of quantitative data now there are many different ways to visualize data this is what we call a cross table and it shows relationships between rows and columns of data frequency shows how often something happens and here we can see that when we sampled
100 random men that 78 were men that did not exercise with pie charts each slice is going to represent a category and the size of the slice is going to represent its frequency and mainly what differentiates it from other charts is that it must always equal 100 percent bar charts have bars that represent the categories and the bar's length is going to represent the frequency now pareto charts are going to list
04:00 - 04:30 categories in descending order and they are also going to include a line that represents the cumulative frequency or the sum of all of the other frequencies that proceeded a frequency distribution table is going to focus on the number of occurrences or the frequency and here what we're doing is we list a range of test scores and how many students scored in that
04:30 - 05:00 range a histogram is going to differ from a bar chart or a bar graph in that histograms are going to show the distribution of grades in a range and in this example versus using categories like we do with a bar graph and also histograms are drawn with the bars touching now the mean or average is going to provide an average value by summing all values
05:00 - 05:30 and dividing by the number of components and mu the symbol mu this guy right here is used to represent the mean of the population x bar is going to represent the mean of a sample and you should definitely be writing these things down and while it can be very useful meaning the mean or the average often outliers are going to dramatically affect your results for example one two three four five has a mean of
05:30 - 06:00 three which looks like it makes sense however one two three four 100 is going to have a mean of 22 which could cause some issues and i'm going to provide a ton of examples of how we can use all these things very soon now the median is going to try to eliminate the influence of outliers by returning the number at the center of the data set and if you have an even number of components instead it's going to take the two center values and return the average of
06:00 - 06:30 those two values the mode is going to return the value that occurs the most often if components occur at an equal rate however there is no mode meaning there are no double values in that situation there's no mode and if there are multiple values that occur at the same frequency in that situation you would have more than one mode variance is going to measure how data is spread around the mean and there is both a
06:30 - 07:00 symbol for variance of the population which is this guy right here as well as another symbol that represents the variance for a sample which you can find right here and to find it we first calculate the mean and then we sum all sample values minus the mean squared then we divide by the number of samples -1 in a situation in which we're
07:00 - 07:30 calculating a variance from a sample versus the entire population which is what we'll almost always do if we instead had data on the entire population we would not include this minus one part here and you can see right here if we calculated using this sample right here and our variance formula that this sample would give us a total variance of 2.5 now because we square values with variance that's going to give us some extra weight with our outliers and for
07:30 - 08:00 this reason we find the square root of the variance to find what is called the standard deviation that's what the standard deviation is it is just simply the square root of the variance and in situations in which the standard deviation is large that means that the numbers are more spread out and when the standard deviation is smaller that means that the results or the samples
08:00 - 08:30 are closer to the mean now the coefficient of variation is going to be used to compare two measurements that operate on different scales so what i'm doing here is comparing miles to kilometers three miles is approximately equal to 4.8 kilometers now even though they measure the same distance because they use different units that is not seen whenever we calculate our standard deviation as you can see right here we have a
08:30 - 09:00 standard deviation for miles which is 0.645 and a standard deviation for kilometers which is 1.038 that can cause some issues however if we come in and we divide by the mean which you can see mean for miles mean for kilometers we can see that they actually have the same exact dispersion and that works out to 0.1721 very useful information that we'll be using as the tutorial continues
09:00 - 09:30 now covariance is going to tell us if two groups of data are moving in the same direction meaning they are correlated together or they just simply influence each other and here what i'm going to do is compare whether earnings are going to affect the market capitalization of a corporation market capitalization just a big word for the total value of a corporation now you're going to make this calculation by plugging in the values
09:30 - 10:00 minus their mean and then multiply this guy right here means sum so you're going to go through get each individual value of x minus the mean of the total of x and then multiply that times each individual value of y minus the mean of all the values of y and then you're going to divide by the number of samples minus one and if you do that you will get a value of 5803. now if the value is greater than zero that means those values are moving
10:00 - 10:30 together if they are instead less than zero that means they are moving in opposite directions and zero just means that they are completely independent and as the tutorial continues i'm going to show you way better ways to calculate how samples are influencing upon each other we're going to get into something called regression and before i get into a bigger example i'll talk about the correlation coefficient and what it does is it adjusts the
10:30 - 11:00 covariance so that it is easier to see the relationship between x and y and its value can't be greater than one or less than negative one and the closer you get to one the closer the relationship between the values and in this example what we're doing is we're plugging in the standard deviations of the market cap as well as earnings and whenever we do this we get a value of 0.660
11:00 - 11:30 and what that means is they are correlated perfect correlation would have a value of 1 0 shows independence and once again negative values show an inverse correlation so now what we're going to do is we're going to take all these different things that we've been talking about and show you how extremely powerful they are in a real world situation okay so let's say a company comes to you you're a data scientist and they say we have all this information about sales who we what companies we sold to
11:30 - 12:00 our contact information the sex the age where we sold it what exactly we did sell it what type of computer we sold sale price profit all of these the different lead sources that led to sales month of sales years of sales we have all this data however we don't have any idea if it's useful or if there's anything we can do with it well they come to you and you go and you plug in all of this information and look at this awesome information you could then provide to them
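For readers who want to reproduce this kind of analysis, the next few segments describe grouping the sales data by sex, state, product, lead source, month, and age; a minimal pandas sketch of that general approach might look like the following (the file name and column names are assumptions, not the video's actual code).

```python
# A minimal sketch of a profitability breakdown by category using pandas.
# "sales.csv" and the column names are hypothetical stand-ins for the dataset described above.
import pandas as pd

sales = pd.read_csv("sales.csv")  # one row per sale

# For each categorical column, compare average profit per group to find the most
# profitable segment -- e.g. females vs. males, or one state vs. another.
for column in ["sex", "state", "brand", "product_type", "lead_source", "month", "age_range"]:
    print(sales.groupby(column)["profit"].mean().sort_values(ascending=False), "\n")
```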
12:00 - 12:30 all right so i went and wrote up some code for this and basically now what i can do based off of what i know about the company i can tell them that it is more profitable for them to sell to females versus males i can come down inside of here and in regards to which state is most profitable here we have a state of west virginia 0.26 we have ohio 0.3 new york point one and we can say that pennsylvania is the most profitable
place we can sell so we already know that it's better to sell to women in pennsylvania i can then come down inside of here and tell them that hp or hewlett-packard products are more profitable so we're selling hewlett-packard products to women in pennsylvania what are we going to what type are we going to try to sell well we're going to try to sell laptops those are the most profitable and what lead sources work the best well we have 0.39 here and 0.40 and that works
13:00 - 13:30 flyer 2 has been found to be the most profitable way to market to people and we also found that february is the most profitable month to sell and there's additional information on year and also we have age ranges and we know that it is most profitable to sell to the age range of 50 to 80. so with the calculations and formulas that i just
13:30 - 14:00 showed you you can see very quickly extremely how valuable all of these formulas are because we can very very specifically target that customer that is going to provide us with the greatest opportunity to dramatically increase our profits and that brings us to the probability distribution and what it's going to do for us is find the probability of different outcomes now a coin flip i'm sure you can understand has a probability distribution of 0.5 you can
only get either a head or a tail 50 percent a single die roll has a probability distribution of 1/6 or 0.167 there are six sides and hence that's what your probability would be it's very important to remember that the sum of all of the possible probabilities is always going to be equal to one you could also see the probability distribution here if we would roll two dice now a relative frequency histogram is going to chart out
all of these different probabilities and pay particular attention to the shape of the chart because next we're going to talk about something called the normal distribution and a normal distribution is when data forms what looks like a bell that's why it's called a bell curve and in this situation in which you have a normal distribution one standard deviation is going to represent 68 percent of all of your data
while two standard deviations is going to cover 95 percent and 3 is going to cover 99.7 percent and remember from what we talked about previously standard deviation is just going to measure how much your data is spread out from the mean or the average now to have what we call a normal distribution the mean as well as the median and mode must all be equal also 50 percent of values are
15:30 - 16:00 both less than as well as greater than that mean and also you're going to have a standard deviation of one that is in what we call a normal distribution now a standard normal distribution has a mean of zero and a standard deviation of one like i said and if we calculate the mean we see that we get a value of 4 based off of our sample and this is the sample that we're working with on the right side of the screen
16:00 - 16:30 and if we calculate the standard deviation we see we would have a value of 1.58 now we're going to be able to turn this into a standard normal distribution by simply subtracting the mean from each value and then dividing by the standard deviation and if we do that we get the chart that you see here as you can see we have our mean value now at zero and you can see here the process of subtracting that mean and then dividing
16:30 - 17:00 by the standard deviation now the central limit theorem states that the more samples you take the closer you get to the mean and from looking at this formula i think it becomes clear that that would be true also the distribution will approximate the normal distribution and as you can see as the sample size increases the standard deviation is going to decrease now the standard error is going to measure the accuracy of an
17:00 - 17:30 estimate and to find it we're going to divide the standard deviation by the square root of the sample size and again notice as the sample size increases the standard error is going to decrease and that brings us to something that is very powerful called the z-score and the z-score is going to give us the value in standard deviations for the percentile that we are looking for for example if we want 95 percent of the data
it tells us how many standard deviations are going to be required and the formula is going to ask for the length from the mean to x and then divide by the standard deviation and this will all make complete sense if we just look at a simple example now on the right side of the screen you can see a z table and if we know our mean is 40.8 and the standard deviation is 3.5
18:00 - 18:30 and we want the area to the left of the point 48 we perform our calculation to get 2.06 we then find 2 on our z table see there's two and then we look for .06 at the top of the z table and we look for where those columns and rows meet which is right here at .98030 so this tells us that the area under the
18:30 - 19:00 curve is going to make up 0.9803 of the total so now let's talk about confidence intervals now point estimates which is what we have pretty much exclusively used so far can be somewhat inaccurate an alternative to them is an interval so for example if we had three sample means as you see here we could instead say that they lie in the interval of between five and seven we can then state how confident we are
in this interval and common amounts that you would be confident about would be ninety percent ninety-five percent and ninety-nine percent and just to break down exactly what that means if we had a 90 percent confidence that means that we expect 9 out of our 10 intervals to contain our mean value alpha is another guy we're going to be hearing a lot about and it's going to represent the doubt we have which is basically just going to be the value of 1
19:30 - 20:00 minus our confidence or how confident we are that we will be right and now what i want to do is show you how to calculate a confidence interval now we are going to need to do this a sample mean we're going to need an alpha a standard deviation and the number of samples represented by the lower case n right here so there's your alpha this is going to be your z table or this is standard deviation i mean this is going to be your alpha this is
20:00 - 20:30 going to be a value taken from your z table and this is going to be the sample mean and this guy right here that follows the plus or minus is going to be representative of your margin of error so let's take us through an example of exactly you know something kind of fun that would explain this a little bit better so let's say we wanted to we were going to get a new a new job as a player for the basketball team the houston rockets and we wanted to calculate the probable
salary we would receive as a new player well to start off we know the mean salary which is a big number eight million nine hundred seventy eight thousand so forth and so on and we're looking for our results to be precise to ninety five percent meaning that we want to have a confidence of 95 percent that this salary will be accurate we're going to be able to get our alpha from this confidence which is just going to be 5 percent the critical probability is found
21:00 - 21:30 by taking 1 minus alpha divided by 2 and we're going to get a value of 0.95 whenever we calculate that and then what we need to do is look up our z in our z code in our table that we have here and if we search for 0.975 right here we find that the z code works out to 1.9 and 6 right there then we're going to find our standard deviation and then plug in our values and whenever
21:30 - 22:00 we do this we can find that our new confidence interval in regards to how much we can expect to receive as a new player for the houston rockets is going to fall between 2.6 million and 15.29 million dollars all right so good stuff and congratulations on that new salary now the students t distributions are going to be used whenever your sample sizes are either small or the population variance is unknown and a t distribution looks like a normal
22:00 - 22:30 distribution except that it just simply has fatter tails and what this means is just that there is wider dispersion between the variables and in situations which we know the standard deviation we can just go and compute all this information using our z scores and our z tables and use our normal distribution to calculate probabilities however we are not always provided with that information the formula for calculating these values
22:30 - 23:00 is going to be just our sample mean that we have right here the value we're going to receive from our t table multiplied times the standard deviation divided by the square root of the number of samples minus one and the number of samples minus one as you're going to see is very often also referred to as the degrees of freedom all right so let's go and use t-distribution with a real-life example let's say that a manufacturer is promising that their brake pads will last for 65
000 kilometers with a 0.95 confidence level we however go and make some calculations and our sample mean works out to sixty two thousand four hundred and fifty six we then do further calculations and find that our standard deviation works out to 2418 degrees of freedom is the number of samples taken minus one so if we take 30 samples that means our
23:30 - 24:00 degrees of freedom is going to equal 29 and if we know the confidence is 0.95 then we subtract 0.95 from 1 to get a value of 0.05 and if we look up 29 and 0.05 in our t table here's the 29 right there and the 0.05 we get a value of 1.699 if we then go on and plug our values into our formula we find the interval
24:00 - 24:30 for our sample and you can see that that interval is going to work out to 61 693 and 63 218 the manufacturer is promising 65 and you can see that we are very confident that the manufacturer is incorrect and in that situation we would either just decline to buy the brake pads and say that the manufacturer doesn't know what they're talking about or we would go and take additional
24:30 - 25:00 samples i'd like to now talk about the difference between dependent and independent samples now with dependent samples one sample can be used to determine the other samples results and you'll often see examples of cause and effect or pairs of results whenever you're dealing with dependent samples an example would be is if i roll a die what is the probability that it is odd well i first have to roll the die
25:00 - 25:30 to then be able to judge whether it's even or odd or not and i think it's clear how those are dependent another thing you could do is if you could have subjects people lift dumbbells each day and then record results before and after the week um that would be another situation in which you would be dependent because you would need to find the first value before you would find the value that comes at the end of our recording period independent samples are those in which
25:30 - 26:00 samples from one population has absolutely no relation to another group and normally whenever you're looking at them you're going to see the word random many many times versus not necessarily the cause and effect terms that we saw before with dependent samples an example of this would be basically blood samples are taken from two random people that are then judged or tested at lab a and then you would
26:00 - 26:30 say that they were going to take 10 more random samples from lab b then you know that you're dealing with a situation in which you have independent samples another situation would be is if you give one random group a drug and then you give another random group a placebo and then test the results i think it's kind of clear that both of those things are completely independent from each other now whenever we are thinking about probabilities
26:30 - 27:00 we first must create a hypothesis and i'm guessing you probably know what it is but a hypothesis is just simply an educated guess that you can test now if you would say restaurants in los angeles are expensive that is not a hypothesis that is simply a statement and why is it a statement well it's because it's a statement that you cannot test against if however we say restaurants in los angeles are expensive versus restaurants in pittsburgh
27:00 - 27:30 we can test against that now the technical name for the hypothesis we are testing is going to be called the null hypothesis an example of this is a test to see if average used car prices fall between a value of nineteen thousand dollars and twenty one thousand dollars now the alternative hypothesis which is going to include all other possible prices so that means all used car prices are less than nineteen thousand or all used
car prices are greater than twenty one thousand dollars is the alternative to the null hypothesis now whenever you test a hypothesis the probability of rejecting the null hypothesis when it is actually true is called the significance level and it once again is represented with alpha and common significance levels are going to be 0.01 0.05 and 0.1 and previously we talked about z tables
28:00 - 28:30 if the sample mean and the population mean are equal then that means that the z is going to be equal to zero and if we create a bell graph and we know that alpha is 0.05 then we know that the rejection for the null hypothesis is going to be found at alpha divided by 2 or 0.025 then we can use a z table and we know that mu is going to be equal to 0 and alpha divided by 2
28:30 - 29:00 is going to be equal to 0.025 and what we would then need to do is find that the rejected region is going to be less than negative 1.96 as you can see here 1.9 and 06. there you go and the other situation would be that it is greater than 1.96 and what we're basically doing is just calculating this area inside of here and because there are two sides that lie
29:00 - 29:30 outside of the null hypothesis this is known as a two-sided test now with a one-sided test for example if i say i think used car prices are greater than 21 000 the null hypothesis is everything to the right of the z code so in that situation we would use alpha instead of alpha divided by two which is one minus 0.05 or 0.95 and in the z table if we
29:30 - 30:00 look that up 0.95 you can see that works out to 1.65 now whenever it comes to hypothesis errors there are going to be two types there's going to be type one errors and that would be called a false positive and they refer to a rejection of a true null hypothesis and the probability of making this error is just simply going to be alpha then you also have type 2 errors which
30:00 - 30:30 are called false negatives which is when you accept a false null hypothesis and this error is normally going to be caused by poor sampling so just so you know that and the probability of making this error is going to be represented with the symbol beta now the goal of hypothesis testing is to reject a false null hypothesis which has a probability of one minus beta and you increase the power of the test
30:30 - 31:00 just simply by increasing the number of samples and i think this example here will clear up everything if you were at all confused so if you believe that the null hypothesis is that there is no reason to apply for a job because you don't think you're going to get it you could call this the status quo belief if you then don't apply for the job and the null hypothesis was correct you'd see that your decision was correct also in the situation in which you rejected the null hypothesis
31:00 - 31:30 and applied for the job and you got the job you would see again that you made the correct decision if however the hypothesis was correct and you applied anyway that would be an example of a type 1 error and again if you chose not to apply but the hypothesis was false this would be an example of a type 2 error and that brings us to mean testing so let's say i want to calculate if my sample is higher or lower than the
population mean to find out i'm going to need a two-sided test and the population mean is going to be the null hypothesis and that null hypothesis is that brake pads should last for 64 000 kilometers here is all of my sample brake pad data and we are going to calculate our sample mean our standard deviation our sample size as well as our sample error now we are
32:00 - 32:30 going to need to standardize our means so that we can compare them even if they have different standard deviations and we standardize our variable by subtracting the mean and then divide by the standard deviation and whenever we do this we normalize our data meaning that we get a mean of zero and a standard deviation of one like we saw in the previous example then we're going to take that result and we're going to get the absolute value of it now in the situation in which my
32:30 - 33:00 confidence is 0.95 that means that my alpha is going to be 0.05 and since we are using a two-sided test we use alpha divided by two which works out to 0.025 if we subtract 0.025 from 1 that's going to give us a value of 0.975 and then if we look up 0.975 on our z table we get a z-score of 1.96
33:00 - 33:30 now what we need to do is compare the absolute value of the z-score we calculated before which is 8.99 to the critical value which is 1.96 and if 8.99 is greater than 1.96 which it is we reject the null hypothesis and to be very specific what we are saying here is that at a 0.95 confidence level we reject that the brake pads have an average life cycle
33:30 - 34:00 of 64 000 kilometers and this brings us to the p-value which is the smallest level of significance at which we can reject the null hypothesis now in our example we found a z-score of 8.99 which isn't even on our chart let's say instead that the null hypothesis was that the brake pads would last 61 750 kilometers that would mean the hypothesis would be correct at 1 minus 0.99996
or at a significance of point zero zero zero zero four so here the p-value for a one-sided test is point zero zero zero zero four and for a two-sided test that would work out to point zero zero zero zero four times two which works out to point zero zero zero zero eight all right so enough with the p values and on with linear regressions now we've
34:30 - 35:00 been talking about things like machine learning and such basically you might say to yourself why do i need all the statistics and probability stuff well basically neural networks are going to be made up from huge data sets that are very very hard to work with and we can statistically calculate outputs based off of our sample inputs if we believe there is a linear relationship between two different types of data meaning as one increases
so does the other and in those situations we can make predictions and quite simply linear regression is just looking at samples and fitting a line to those samples now we do this like we do with any linear equation we find the slope and then we find b0 in this situation which is going to represent our y-intercept or where our line intercepts the y-axis and basically we are
35:30 - 36:00 averaging the sample points to our line and this specifically is called the regression line and we are specifically going to say that this is a regression line by using y hat instead of the symbol y and here is the basic formula for calculating b1 what we need to do is sum all values of x minus their means and the same for all the values of y and then we square the results to eliminate any possible negative values
36:00 - 36:30 and then after that we divide by the sum of x minus the mean again and square that value and in that situation we would have our slope for our line and then quite simply after we have that we can calculate our y-intercept down here by just plugging in our slope and solving for x and y so in this situation what i want to do now is i want to provide an example of how a linear regression can be useful
36:30 - 37:00 and how it can analyze data so what i have here is i have temperatures average temperatures per month and the number of sales of ice cream on those individual months and what i want to do is my null hypothesis is as the temperature rises and falls across the course of the year that the number of ice creams increases as it gets hotter and decreases as it gets colder so i want to test that and regression is a great way to do that
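As a rough illustration of the slope and intercept formulas just described, the following sketch fits a regression line to made-up monthly temperature and ice-cream figures (the video's own table is not reproduced in this summary); the intercept uses the standard rearrangement b0 = mean(y) - b1 * mean(x).

```python
# Least-squares slope (b1) and intercept (b0):
# b1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2), b0 = mean_y - b1 * mean_x.
# The monthly temperature / sales figures below are illustrative, not the video's data.
temps = [30, 40, 52, 63, 72, 80, 85, 83, 74, 61, 48, 35]              # avg temperature per month
sales = [210, 260, 340, 420, 480, 540, 570, 560, 500, 410, 320, 240]  # ice creams sold per month

mean_x = sum(temps) / len(temps)
mean_y = sum(sales) / len(sales)

b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, sales))
      / sum((x - mean_x) ** 2 for x in temps))
b0 = mean_y - b1 * mean_x

print(b1, b0)                               # slope and y-intercept of the regression line
print([round(b0 + b1 * x) for x in temps])  # predicted sales from the fitted line
```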
37:00 - 37:30 so i'm going to get my means for x and y because i need that for a little formula here i'm going to sum the product of each value of x minus the mean and do the same for each of the values of y and then i am going to come in and get the sum of all the values of x minus the mean squared then i need to come down and find the slope by dividing those values to get a value of 5.958 and then i'm going to calculate
37:30 - 38:00 the value for our y-intercept which is going to come out to 35.56 and now that we have this we can create the formula for the line which you can see all the values for in the table on the right so that's where i went and calculated the new values of y using our regression line versus the original sales and you can see how they differ so you may ask yourself well how do we find out if our regression line is a good fit or not for our data well we do that with something we have
38:00 - 38:30 already covered which is the correlation coefficient now remember that the correlation coefficient is going to calculate whether the values of x and y are related or correlated and we calculate it by finding the covariance of x and y and then divide by the product of the standard deviations of x and y and if the value is close to 1 then the data is highly correlated which means our regression line should
38:30 - 39:00 have a very easy time modeling our data so let's work through an example where we find the correlation coefficient well first what we're going to need to do is calculate the covariance for all x and y values which is going to equal 1733 and then now that we have the covariance we can divide it by the standard deviation of x multiplied by the standard deviation of y and whenever we do that we get a value of 0.9618
39:00 - 39:30 and since 0.9618 is so extremely close to 1 we know that our linear regression line will very very tightly match our data and now i want to talk about the coefficient of determination now there are numerous calculations involved in creating a regression line like i think you saw there in a previous example meanwhile however i think you can definitely agree that it takes seconds and zero to no
39:30 - 40:00 thought to calculate the mean line you can see here is the mean line which is just simply 400. so our big question here is is it worth it to go through the hassle of calculating this regression line versus just simply barely thinking and getting our mean line well basically the coefficient of determination is going to tell us now the coefficient of determination is going to be calculated as a percentage and what we need to do is to calculate the sum of the square errors between
40:00 - 40:30 the mean and the sample points and basically what we do is we build a square from the points to the mean line so all these individual squares that we have here and these would be the mean line squares and they have to be perfect squares and then what we're also going to do is create squares from the regression line to all of our sample points so you can see the black squares represent the
40:30 - 41:00 areas that we're going to calculate in regards to the difference between the samples and our regression line what we can do then is sum the areas of the squares for both subtract and then find out how much area we eliminated with our regression line so for example let's say the sum of the square areas for the mean line works out to 1000 and again let's say the sum of areas for the regression line works out to be 150
41:00 - 41:30 we can then calculate that 85 percent of the error is eliminated whenever we calculate the regression line and i think it is quite clear now that yes it makes sense to take the time to calculate the regression line now the root mean squared deviation is the measure of the differences between sample points and our regression line and we are using all of these formulas just to be quite clear here to better understand
41:30 - 42:00 how well our regression linear equation is estimating our data versus the alternatives that are very often much easier to come upon so we find the residual for each of our different data points so that just means the difference between our regression this line right here the residual all right and that's going to be represented with the letter e so it just shows the distance from the sample point to the actual regression line and then what we're going to do is take
42:00 - 42:30 the sum of all the residuals squared and then divided by the total number of samples minus 1. and on the right side of the screen we have the table with both the samples and the regression line so what i'm going to do is find the root mean squared deviation now if i calculate e by subtracting the value of my regression line from the sample y i then square all those values and find their result
or their sum i then divide by the number of samples minus 1 and then find the square root and if i do all of that i get a final root mean squared deviation of 28.86 and what that means is for one standard deviation which makes up for 68 percent of all my samples our regression line will be off at most plus or minus 28.86 and we could then add
as well as subtract 28.86 and create two more lines that will capture 68 percent of all the possible values and we could even go further and add in another line on the top and the bottom and capture 95 percent of all the different points all right and we are getting very close to covering everything so what i want to do now is finish up by talking about chi-square tests however it's important
43:30 - 44:00 to know that before you can perform the tests you must meet the conditions that the data must be random the data must be large and what they mean by that is each cell must have a value greater than five and that your data must be independent which means that you either use sample with replacement or you target 10 or less of the population and i talked about sample
with replacement in the probability tutorial series so check that out basically it just means that if you're sampling from the population that after you take a sample you then put that person or that thing back in to potentially get picked again all right that's all it basically means so basically this chi-square test of homogeneity is going to be used whenever you want to look at the relationship between different categories of
44:30 - 45:00 variables and this is mainly used whenever you sample from two groups and want to compare their probability distributions now what we're trying to find is if age has an effect on people's preferences for a favorite sport our null hypothesis is that age doesn't affect the favored sport and the alternative of course is that the age does affect so if we calculate the percentages for all columns
we get those results and now to prove that the null hypothesis is indeed true we should expect that 25 percent of 18 to 29 year olds should prefer the nba for example also the percentages should work out for all other sports organizations and the easiest way to calculate the expected value for each cell in the chart is to multiply the cell column value which would be 35 in this situation if
we're working on this value by the row total which would be 66 and then divide by the total number of people that were sampled which would be 142. so if we do this i can have an expected value for 18 to 29 year olds that like the nba to work out to 66 times 35 divided by 142 or 16.3 and you can see how i rounded these different values around here and i basically just calculated this
46:00 - 46:30 expected value for each of those cells based off of age and the different sports teams and you can see that the row column totals are still going to be the same and the chi-square formula is going to be found right over here which is basically observed minus expected squared the sum of those values divided by expected and if we perform this calculation we get a final value of 7.28
and the larger this value the more likely these values are going to affect each other which would mean our null hypothesis is false and we're going to look up these values in what is called a chi-square table but first we have to also calculate the degrees of freedom and the way we get that is to multiply the number of columns minus one by the number of rows minus one so we can see four minus one is three times two minus one is one which gives us a final
degree of freedom value of three and then we jump over and get our chi-square table and we find our degrees of freedom and the closest match to 7.28 and whenever we do that we find that we have between a 90 to 95 percent confidence that age does not affect a person's favorite sport okay so there you go guys that is a rundown of just about everything you're gonna learn about statistics and if you combine it with my probability tutorial probabilities out of an average textbook
47:30 - 48:00 with examples for a whole bunch more hopefully you guys found this useful if you actually made it to the end of the tutorial please leave me a comment i would greatly appreciate it and like always please leave your questions and comments down below otherwise till next time