Statistical Insights for Part 1
Sampling and Estimation (2025 Level I CFA® Exam – Quantitative Methods – Module 5)
Summary
This video transcript introduces Level I CFA candidates to the essential concepts of Sampling and Estimation within the Quantitative Methods module. The discussion kicks off with a practical illustration involving dividend yields of Nasdaq and NYSE firms, emphasizing the practicality of sampling to understand large data sets. Various sampling techniques such as simple, stratified, and cluster sampling are explained, and errors like sampling and non-sampling errors are discussed. Key statistical theories, like the Central Limit Theorem, are introduced along with concepts of bias in sampling, ending with a discussion on the crucial statistical tools of bootstrapping and jackknifing to enhance the accuracy of population estimations.
Highlights
- Sampling helps in deriving conclusions from large data sets by analyzing a small, manageable portion. 🎯
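- Probability sampling gives every member of the population an equal chance of selection, while non-probability sampling depends on the researcher's judgment or convenience. 🧠
- The Central Limit Theorem says that, for sufficiently large samples, sample means are approximately normally distributed, which supports interval estimates. 🔄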
- Biases are sneaky! Ensure your sample represents the population to avoid misleading conclusions. ⚠️
- Bootstrapping repeatedly resamples the observed data with replacement, while jackknifing drops one observation at a time; both gauge how precise an estimate is. 🔁
Key Takeaways
- Sampling is a cost-effective way to study large populations, like the 6,000 firms trading on major exchanges. 📊
- Two main sampling types: probability (where everyone has an equal chance) and non-probability (based on researcher discretion). 🎲
- Central Limit Theorem is crucial: it explains why sample means are approximately normally distributed around the population mean when the sample size is large. 🎯
- Bootstrapping and Jackknifing are resampling methods to draw better conclusions about populations. 🔄
- Recognize biases in sampling like data snooping and survivorship bias to avoid skewed results. 🚫
Overview
Sampling and estimation are fundamental concepts in quantitative finance, providing a framework for analyzing large data sets by studying a manageable subset. This approach helps reduce costs and increases efficiency by allowing us to draw informed conclusions about an entire population from a small sample. The video explores various sampling techniques, such as simple random sampling and stratified sampling, integral to constructing an accurate picture of a population.
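A minimal sketch of the stratified approach mentioned above, assuming a universe of 1,000 stocks tagged as small, mid, or large cap. The 600/300/100 split, the 10% sampling fraction, and the helper name `stratified_sample` are illustrative assumptions, not figures from the video.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, fraction, seed=0):
    """Take a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)

    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))       # sampling without replacement
    return sample


# Hypothetical universe: 1,000 stocks with a made-up size split.
stocks = (
    [("small", i) for i in range(600)]
    + [("mid", i) for i in range(300)]
    + [("large", i) for i in range(100)]
)

sample = stratified_sample(stocks, stratum_of=lambda s: s[0], fraction=0.10)
counts = {cap: sum(1 for s in sample if s[0] == cap) for cap in ("small", "mid", "large")}
print(counts)
```

Because each stratum is sampled at the same rate, the combined sample keeps the size mix of the population, which is the point of stratifying.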
Diving deeper, the transcript explains fundamental statistical principles, such as the Central Limit Theorem, which is pivotal in understanding how sample means tend to follow a normal distribution. This understanding is crucial for creating accurate estimations and confidence intervals, enabling analysts to make sound predictions and decisions based on sample data.
The discussion concludes by highlighting advanced resampling techniques like Bootstrapping and Jackknifing. These methods improve the reliability and validity of statistical estimates. The video also warns against common biases like data snooping and survivorship bias, emphasizing the importance of critical thinking and rigorous methodology in statistical analysis.
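The two resampling ideas can be sketched as follows. This is an illustration under stated assumptions rather than the method shown in the video: the ten dividend yields are made up, and the function names `bootstrap_se_mean` and `jackknife_se_mean` are hypothetical. Both functions estimate the standard error of the sample mean.

```python
import random
import statistics


def bootstrap_se_mean(data, n_resamples=2000, seed=0):
    """Resample the data with replacement many times; the spread of the
    resampled means estimates the standard error of the mean."""
    rng = random.Random(seed)
    n = len(data)
    means = [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(means)


def jackknife_se_mean(data):
    """Leave one observation out at a time and use the spread of the
    leave-one-out means to estimate the standard error of the mean."""
    n = len(data)
    total = sum(data)
    loo_means = [(total - x) / (n - 1) for x in data]
    grand = statistics.fmean(loo_means)
    return ((n - 1) / n * sum((m - grand) ** 2 for m in loo_means)) ** 0.5


# Made-up dividend yields in percent, for illustration only.
yields = [0.0, 1.2, 2.5, 3.1, 3.8, 4.4, 5.0, 5.6, 6.3, 7.9]

classic = statistics.stdev(yields) / len(yields) ** 0.5
print(f"s / sqrt(n):  {classic:.4f}")
print(f"bootstrap SE: {bootstrap_se_mean(yields):.4f}")
print(f"jackknife SE: {jackknife_se_mean(yields):.4f}")
```

For the sample mean, the jackknife figure matches the textbook s / sqrt(n) exactly, while the bootstrap figure is a simulation-based approximation that varies slightly with the seed.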
Chapters
- 00:00 - 01:00: Introduction The chapter introduces the topic of quantitative methods in the context of the CFA program, focusing specifically on sampling and estimation. It starts with an illustrative example centered on the study of dividend yields from companies listed on the NASDAQ and New York Stock Exchanges.
- 01:00 - 02:00: Sampling and Estimation In this chapter titled 'Sampling and Estimation', the focus is on the process of collecting data from a large population by selecting a smaller sample to reduce cost and improve efficiency. Instead of gathering data from all 6000 firms, a sample size ranging from 50 to 1000 firms is proposed to represent the whole. The chapter underlines the importance of sampling as a strategy to make data collection more feasible.
- 03:00 - 07:00: Probability Sampling vs Non-Probability Sampling This chapter discusses the concept of probability sampling versus non-probability sampling in statistical analysis. The core focus is on how to select a sample that can help in making general conclusions about a larger population. Key objectives include estimating parameters such as the mean and standard deviation of dividend yields using the sample data. This involves understanding different sampling techniques to ensure accurate estimations and conclusions.
- 07:00 - 09:00: Sampling Errors and Non-Sampling Errors The chapter discusses how to estimate parameters for a population using sample data, specifically focusing on the concept of sampling errors and non-sampling errors. Using tools and methodologies, the chapter aims to teach how to draw conclusions about populations of data, such as dividend yields, operating cash flows, and board composition, while acknowledging the presence of errors in estimation.
- 09:00 - 10:00: Simple and Stratified Sampling In this chapter, the focus is on introducing the concepts of simple and stratified sampling. While the terms may have appeared in previous readings, such as probabilities and errors, this section aims to delve into these specific types of sampling. The chapter makes connections with previously discussed topics like the central limit theorem, standard errors, estimators, and confidence intervals. It sets the stage for a deeper exploration of simple and stratified sampling techniques, which have not been covered before this point.
- 11:00 - 14:00: Cluster Sampling The chapter discusses important concepts such as bootstrapping and jackknifing, which are foundational for more advanced study in the following levels of the course. There is an emphasis on understanding these concepts early. Additionally, the chapter includes a critical learning outcome that involves self-assessment and the evaluation of potential biases in sampling. This evaluative aspect is highlighted as a significant area for examination questions.
- 14:00 - 17:00: Non-Probability Sampling Methods This chapter introduces and explains the concept of non-probability sampling methods by contrasting it with probability sampling methods. It begins by laying down the foundational difference, noting that in probability sampling, each member of the population has an equal chance of being selected. The chapter is set to delve deeper into when and why one might choose non-probability sampling over probability sampling, although specific non-probability methods are not detailed in the provided excerpt.
- 17:00 - 21:00: Central Limit Theorem The chapter introduces the concept of sampling, focusing on simple random sampling and stratified random sampling.
- 21:00 - 25:00: Standard Error of the Sample Mean This chapter discusses different sampling methods, focusing on non-probability sampling techniques such as convenience sampling and judgmental sampling. The chapter aims to differentiate between sampling methods and introduces the concept of sampling error, which is the difference between the observed value of a statistic and the true value.
- 25:00 - 30:00: Properties of Estimators The chapter discusses the concept of the properties of estimators, using the example of dividend yields. It starts by defining the population dividend yield, assumed to be five percent in the example, and examining a sample of 30 large firms from exchanges. If the sample average yield is also five percent, it leads to a sampling error of zero, highlighting the estimation accuracy.
- 30:00 - 35:00: Confidence Intervals This chapter discusses the concept of confidence intervals, emphasizing the difference between the sample mean and the population mean. The primary reason for discrepancies is identified as sampling error, which occurs when a selected sample is not representative of the entire population. An example involving firms and their dividend yields is used to illustrate this point.
- 35:00 - 43:00: Resampling Methods: Bootstrapping and Jackknifing The chapter discusses resampling methods, specifically focusing on bootstrapping and jackknifing techniques. It illustrates the potential errors in sample selection, using the example of choosing 30 firms that do not pay dividends, which leads to a biased mean dividend yield. The chapter emphasizes that non-dividend-paying firms are not representative of dividend-paying ones. Additionally, it touches on considerations in qualitative research methods.
- 43:00 - 57:00: Biases in Sampling This chapter discusses biases in sampling, particularly focusing on non-sampling errors related to survey formation and question phrasing during interviews. It emphasizes the importance of random sampling and including the whole population in the sample. The chapter acknowledges the practical challenges of handling population data, primarily due to its potentially large size.
Sampling and Estimation (2025 Level I CFA® Exam – Quantitative Methods – Module 5) Transcription
- 00:00 - 00:30 this is level one of the cfa program the topic on quantitative methods and the reading on sampling and estimation allow me to begin this recording with a quick example let's suppose that you and i are super interested in dividends and in particular we want to study the dividend yields of the 6 000 or so firms that trade on the nasdaq and the new york stock exchanges notice that first part of the title of
- 00:30 - 01:00 this reading sampling well we could go out and collect data on all six thousand of those firms maybe that data is available maybe it's not but what we're likely to do is to probably reduce our cost and try to improve efficiency let's just take a sample of those 6 000 firms maybe our sample size will be 50 maybe it'll be 100 maybe it'll be a thousand but what we're going to do is take that sample
- 01:00 - 01:30 and probably try to extract some important characteristics about that sample so that we can make general conclusions about the population for example we're probably interested in the mean dividend yield and probably the standard deviation of the dividend yield and so what we're going to do the second part of that title estimation we're going to use that stuff from the sample to estimate and make conclusions
- 01:30 - 02:00 regarding the entire population of dividend yields now the one thing that we're going to know is that we're going to make that estimation with some kind of an error so really what we're doing in this reading is trying to be able to use some tools to make conclusions about a population of data like dividend yields like operating cash flows like composition of boards of directors
- 02:00 - 02:30 now you'll see that going through these learning outcome statements that there are terms in there that we've seen in some previous uh readings probabilities we did that errors now we haven't done simple or stratified or cluster sampling so that's a new one that third one but we've talked about central limit theorem we've talked about standard errors we've talked about estimators and confidence intervals i will introduce this concept of
- 02:30 - 03:00 bootstrapping and jackknifing these are going to be important concepts when we get into level two so it's important that we understand that here early in level one and then a final learning outcome statement uh asks us to look at ourselves and look at the surrounding area and say okay we've done this sample are there any kind of biases in there and those are really really good potential exam questions
- 03:00 - 03:30 all right so let's go ahead and start off with the difference between probability sampling and non-probability sampling and based on our discussions from the previous uh readings you ought to be able to figure out what the difference between these two are look at that first diamond point under probability sampling each member of the population has an equal chance of being selected inside of a sample so we can do that at least two
- 03:30 - 04:00 ways that we'll talk about here in this reading simple random sampling and stratified random sampling and there's a good example if we have a list of 500 hedge funds and we want to take a sample of 50 what we could probably do is some kind of a random number generator think of an excel spreadsheet where you put these 500 somewhere and you assign numbers and then you hit your random number generator function to get those uh 50 hedge funds
- 04:00 - 04:30 so equal chance of being selected non-probability sampling on the other hand has a lot to do with the researcher and how he or she views the population and or the sample so there's one called convenience sampling and there's one called judgmental sampling we'll talk about that in just a few minutes let's go ahead and get some definitions out of the way sampling error difference between the observed value of a
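The hedge fund example can be sketched in a few lines instead of a spreadsheet. The fund names are placeholders and the seed is only there to make the draw repeatable; both are assumptions for illustration.

```python
import random

# Hypothetical list of 500 hedge funds, numbered as in the spreadsheet analogy.
funds = [f"Fund {i:03d}" for i in range(1, 501)]

rng = random.Random(42)            # fixed seed so the draw can be reproduced
sample = rng.sample(funds, k=50)   # simple random sample without replacement

print(len(sample), "funds selected, for example:", sample[:5])
```

`random.sample` draws without replacement, so no fund appears twice and every fund has the same 50-in-500 chance of ending up in the sample.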
- 04:30 - 05:00 statistic and the quantity it is intended to estimate so let's go back to my example with the dividend yields let's suppose that the population dividend yield is uh five percent and we pick a sample of let's say 30 firms that are the largest firms traded on either of those two exchanges and the average dividend yield for some crazy reason turns out to be five percent well then the sampling error would be zero
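A small sketch of the sampling error definition, using a made-up population of 6,000 dividend yields (roughly a quarter of them zero). The distribution, the seed, and the function name `sampling_error` are assumptions; only the definition itself comes from the recording.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population: 6,000 dividend yields in percent, many of them zero.
population = [0.0 if rng.random() < 0.25 else rng.uniform(0.5, 10.0) for _ in range(6000)]
mu = statistics.fmean(population)


def sampling_error(sample, population_mean):
    """Observed sample mean minus the population mean it is meant to estimate."""
    return statistics.fmean(sample) - population_mean


random_sample = rng.sample(population, 30)
non_payers = [y for y in population if y == 0.0][:30]  # deliberately unrepresentative

print(f"population mean:           {mu:.2f}%")
print(f"error, random sample:      {sampling_error(random_sample, mu):+.2f}")
print(f"error, non-payers only:    {sampling_error(non_payers, mu):+.2f}")
```

A sample whose mean happens to equal the population mean has a sampling error of exactly zero, while a sample made up only of non-payers misses by the full population mean.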
- 05:00 - 05:30 so notice what we have written as the example any difference between the sample mean and the population mean and of course this can be caused by a number of reasons but the primary reason is what's known as the sampling error the selection of the sample is not representative of the population as a whole so let me take a crazy example here we have these 6 000 firms some of those are going to have dividend yields of zero so let's suppose that in our
- 05:30 - 06:00 selection of a sample that we pick 30 firms that don't pay a dividend so you get their mean dividend yield it's going to be zero right but you erred in selecting your sample because it wasn't really representative right non-dividend-paying firms are not really representative of dividend-paying firms and then in some qualitative research there are ways to
- 06:00 - 06:30 incur non-sampling errors that have lots to do with formations of surveys and questions during interviews look what we have written down there on the bottom line the sample must be drawn randomly and encompass the whole population remember that we would love to have population data but the population data might be you know it might be too big right we can't carry it on our shoulders or and this is probably a more likely outcome that it's way too
- 06:30 - 07:00 expensive to find out and to collect all of the population data so what we try to do right marginal cost marginal benefits so that's why we put together this whole sampling process all right let's look at the difference between simple and stratified sampling simple is exactly what you would think that it would mean that there's an equal probability of being picked and it's just selecting a sample from an entire population so it doesn't favor
- 07:00 - 07:30 one set of the population and it doesn't exclude another set of the population but what if what if we go back to my dividend yield example and we ask ourselves the question hey what happens if there are interesting characteristics about the firms that pay dividends maybe there are interesting characteristics about the industries of those firms that
- 07:30 - 08:00 pay dividends so notice what we have in that gray box glaring differences within the population ah unreliable simple sample inferences so we stratify so all we do is we divide it here and here and here and here and so look at that last circle point each stratum is composed of elements that have a common characteristic and so here's a good example
- 08:00 - 08:30 so suppose we have a thousand stocks some are small some are mid some are large cap stocks so you have a thousand right but then you have these different categories or strata and you just simply divide those and so let's go back up to the very top once the population has been subdivided a simple random sample is taken from each stratum and it's combined to form an overall final sample and so what that's going to do is that's going to be representative
- 08:30 - 09:00 of the population and that should make perfect sense now back here let me go back here real quick here we kind of uh were aware of some of those differences right small mid and large cap that makes sense but let's suppose that we're kind of maybe unaware of some of those differences yet those underlying differences might influence our sampling process so to handle that we can do what's known
- 09:00 - 09:30 as cluster sampling and there are a variety of ways of doing this but look at the picture there we could just draw a picture and plot all these variables on some type of a graph some are more complex than others obviously but notice that we call this good clustering and let's go back to my dividend yield example something that maybe we wouldn't think of but maybe we'd find out if we do some clustering how about this how about the age of the boards of directors that are declaring these
- 09:30 - 10:00 dividends so maybe there is a sub-sample maybe there's a cluster of boards that have relatively younger board members and when those board members sit around talking about dividends they say something like oh you know what let's reinvest all of those cash flows into positive net present value projects which will allow us to pay higher dividends at some time in the future
- 10:00 - 10:30 on the other hand maybe you have a population of board members that are in their 50s or 60s and these uh smart men and women say well we want to pay a dividend to our shareholders because they need it as income so it's really just one of those differences of willingness to take risk you know in general younger people are more willing to take risk than people in their 50s or 60s so maybe the clustering here and we can do this
- 10:30 - 11:00 in one stage or two stage maybe that cluster can identify those important variables and then of course those clusters are also useful even if it doesn't identify those individual elements but clearly if there is clustering then we need to make sure that we include all of the clustering in the sampling i told you we'd get back to these
- 11:00 - 11:30 non-probability sampling so convenience sampling members in the sample are selected because of their convenient accessibility and proximity to the researcher so this is the path of least resistance let me tell you a quick story when i was in graduate school working on my dissertation and then trying to publish some other papers i was doing research on dutch auction repurchases and back in those days the only
- 11:30 - 12:00 uh available data set or information set came from what was known as the wall street journal index it was this big old thick book and i would spend tons and tons of hours in the library looking inside of this annual wall street journal index to find firms to identify firms that used a dutch auction repurchase hours and hours down there i would prefer to call that inconvenient
- 12:00 - 12:30 sampling because it wasn't easy to do for me but you can get the sense here convenience if it's right here then i'm going to use that sample judgmental sampling this is even more interesting because this is based on what you would think is the researcher's knowledge and professional judgment so that the researcher says something like okay going back to your example jim with the dividends well i know that these 100 firms they better represent
- 12:30 - 13:00 the entire population than all these others based on my knowledge and professional judgment now of course when you have this judgmental sampling you're going to be subject to some bias and that's really why the institute puts that last learning outcome statement in this reading and we'll go ahead and look at that at the very end all right here's an los that we've seen before explain central limit theorem and
- 13:00 - 13:30 its importance we did this just in a previous reading and remember central limit theorem tells us lots and lots of stuff about a sample and you remember i gave you a couple of guys names chebyshev and lyapunov and these guys lived a long long time ago and they were studying statistics and they said all right we can take a sample but then let's suppose that we take other samples you know suppose the population is this big you know how many firms pay a dividend
- 13:30 - 14:00 out of those 6 000 on the nasdaq and the new york stock exchange i don't know maybe i don't know 4 500 let's say so you know we take a sample here of 100 and a sample here of 100 and a sample here of 100 well this central limit theorem tells us that if you take all these samples and then you take the mean of all of those samples and then the standard deviations and the variances well you get the picture that shows down there on the bottom right of the slide deck so let me go ahead and read that box just
- 14:00 - 14:30 to make sure you understand this from a statistical standpoint uh simple random samples each of size n from a population with a mean and a variance the sample mean x-bar approximately has a normal distribution so the means of those samples if you do enough of them right they're going to look like a normal distribution and so look at the circle point this is a very useful tool of course it's a very
- 14:30 - 15:00 useful tool and then we can go look a little bit more specifically at this illustration and you should remember this from the very last recording that i made 68 95 and 99 that's one and two and three standard deviations away from the mean and so the really cool thing about this is we're going to use the basis of this central limit theorem to do things like confidence intervals
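A simulation sketch of the central limit theorem as described here: repeated samples of 100 from a population of about 4,500 dividend payers. The skewed (exponential) population with a mean near 5 is a made-up assumption chosen to show that the sample means still look normal; the 2,000 repetitions and the seed are also arbitrary.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical, deliberately skewed population of 4,500 yields (mean near 5%).
population = [rng.expovariate(1 / 5.0) for _ in range(4500)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 100
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(2000)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)


def share_within(k):
    """Fraction of sample means within k standard deviations of their centre."""
    hits = sum(abs(m - mean_of_means) <= k * sd_of_means for m in sample_means)
    return hits / len(sample_means)


print(f"population mean {mu:.3f} vs mean of sample means {mean_of_means:.3f}")
print(f"sigma / sqrt(n) {sigma / n ** 0.5:.3f} vs sd of sample means {sd_of_means:.3f}")
for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.1%}")
```

The shares within one, two, and three standard deviations should land close to the 68, 95, and 99 percent figures quoted above, even though the underlying population is far from normal.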
- 15:00 - 15:30 but before we head on to confidence intervals let's go ahead and talk about the standard error of the sample mean that's the standard deviation of all of the sample means and what it does is it gives us an estimate an estimate of variability and look at that second half blocked point there while the standard deviation measures the variability obtained within one sample the standard error gives the estimate of
- 15:30 - 16:00 the variability between all of those samples so that's a really really super definition and distinction that you need to make remember the difference between standard deviation and standard error if i were creating exams i would absolutely ask that question so then the question becomes part of that los calculate right so we need to calculate it so there's a good old formula there if we know if we know the population standard deviation then we just divide
- 16:00 - 16:30 by the square root of n now why is this so important gives the analyst an idea of how precisely the sample mean estimates the population mean you know for example going back to my dividend yield suppose that population mean is five percent and our sample mean we do a sample over here and a sample over here and a sample over here the means are 10 and 20 and 30 boy we have to scratch our heads and say all
- 16:30 - 17:00 right we did something wrong but that something wrong is going to be formalized in this standard error so look at the second arrow point a lower value indicates more precision a larger value indicates less precision and then there's a simple simple example at the bottom of the page where we just take our population standard deviation so it's known remember that's got to be known 3 divided by the square root of 30 gives
- 17:00 - 17:30 us about 0.55 all right let's bury ourselves back into an undergraduate statistics class about an estimator so this is a sample statistic used to estimate something that's not known inside of the population and of course the great example is the sample mean but there are other things out there like variances and correlation coefficients and all
- 17:30 - 18:00 sorts of other things out there that we can estimate but those are probably the two the first moment and the second moment of the distribution so notice that part of the los describe desirable properties of an estimator all right so we need to be unbiased all right so the point estimator and the reading uses the notation greek letter beta so beta sub i and notice it has a little squiggly thing on top of it and that's known as a hat so we say beta hat is an
- 18:00 - 18:30 unbiased estimator of the true population parameter and that's just a regular beta sub i so in order to prove this from a statistical standpoint we just put an expectations operator around the beta hat and say the expected value of that sample is equal to the true population value all right so think about this so we have
- 18:30 - 19:00 a mean we don't know what that uh true population mean is but we have samples all right and so we take an expected value of that sample mean and we expect it to be equal to on average right an expected value is an average we expect it to be equal to the true population parameter on average now what that means from kind
- 19:00 - 19:30 of a practical maybe even a tactical standpoint is let's go back to my dividend yield examples what did i say that population was let's suppose it's five percent whether or not we know it but here we don't know it but think of it this way that let's suppose that none of our estimates in these samples are exactly five percent some are above and some are below think of this to be unbiased the ones above and the ones below they have to cancel each other out
- 19:30 - 20:00 so that's the last arrow point down there this difference should be zero if an estimator is unbiased and i remember my econometrics professor in graduate school i had this guy for a couple of classes uh he emphasized the unbiased nature of an estimator as being super important but it was uh met with equal importance by this idea of efficiency and so if we have two of these
- 20:00 - 20:30 unbiased estimators then what we want to do is we want to pick the one that has the smaller of the two variances and then consistency this is also important but think of it this way that as we increase our sample size so we go from 30 to 50 to 80 to 100 to a thousand whatever that is the probability that our estimate approaches the exact true value of the population that has to approach 100 percent all right so let's go ahead and swing to
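The cancelling-out idea can be checked with a quick simulation. This sketch assumes a normally distributed population with a true mean of 5 percent and a standard deviation of 3; the normal shape, the 5,000 repetitions, and the seed are illustrative assumptions.

```python
import random
import statistics

rng = random.Random(3)

TRUE_MEAN = 5.0   # population dividend yield in percent (running example)
TRUE_SD = 3.0     # assumed population standard deviation


def one_sample_mean(n=30):
    """Mean of one random sample of n yields drawn from the assumed population."""
    return statistics.fmean([rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)])


errors = [one_sample_mean() - TRUE_MEAN for _ in range(5000)]

above = sum(e > 0 for e in errors)
below = sum(e < 0 for e in errors)
average_error = statistics.fmean(errors)

print(f"estimates above the true mean: {above}")
print(f"estimates below the true mean: {below}")
print(f"average error across samples:  {average_error:+.4f}")
```

Individual samples miss in both directions, but the average error sits close to zero, which is what E(sample mean) = population mean says.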
- 20:30 - 21:00 confidence intervals and to understand a confidence interval we need to introduce a new term so a point estimator a sample statistic used to estimate an unknown population parameter and this is just another name for the sample mean it's going to be a point estimator so you have all this data and you just figure it out you know put your finger right in the middle somewhere there's my point i'm pointing it and forgive me i don't mean to point at you guys as i'm doing this my mother always
- 21:00 - 21:30 taught me don't ever point at someone so i didn't mean to do that but i was pointing at the data set so what are we doing with the confidence interval we know that when we come up with a mean we know that's an estimate and we're going to make that estimate with error so we need to figure out is our error is it this big is it this big or is it this big and one way we can do this is by constructing a confidence interval so to construct a confidence interval we're
- 21:30 - 22:00 going to go ahead and start with the mean so we call that the point estimator and then we're going to add and subtract from that and you'll see that here in just a second but what we need to do is we need to know about precision so we're going to have a reliability factor times that standard error that we talked about just a few moments ago and that reliability factor is going to depend on the level of confidence that the
- 22:00 - 22:30 researcher wants to illustrate you know for example i think i've used this in a previous recording whenever i uh whenever i can't remember someone's name you know you have all these people in your life and you forget some of these people's names uh i always will ask a friend i'll say hey you see that dude what's his name and uh so i know it right and so the person you know my friend will say oh you know that's todd over there and i'll immediately ask i'll immediately say what's your percentile on that and
- 22:30 - 23:00 they'll say uh 50 percent you know if they say 50 percent i'm not gonna go say hey todd how are you doing but if my buddy says 100 percent well then i'm gonna go say hey todd how are you doing with confidence all right so what is the confidence and so in statistics we like 90 percent confidence but we love 95 percent confidence and sometimes we even like uh 99 percent confidence so there's the table right there ten percent uh five percent and one percent those are called levels of significance and they
- 23:00 - 23:30 correspond to degrees of confidence what did i just say 90 95 and 99 percent and remember in the previous recording i showed you how to look at a z table and so there are the corresponding z values for the level of significance and the degree of confidence so what we want to say is something like all right let's go back to my dividend yield what was that population mean that population mean was 5 percent
- 23:30 - 24:00 what was the sample mean i'm not sure if i gave you a sample mean suppose the sample mean was 4.5 percent so what we want to be able to say is with 95 percent confidence we want the population mean to fall within that range so what is that confidence interval you know maybe it'll be between let's say four percent and six percent we're 95 percent confident that that interval will contain our
- 24:00 - 24:30 population mean so look in that first block point 95 percent of the intervals would contain the true parameter which means just 5 percent would not contain it all right so there's our confidence interval equation right point estimate plus or minus a reliability factor times the standard error so what do we call that that's our precision how precise do
- 24:30 - 25:00 we want to be and notice it depends on a couple of factors it depends on the standard error so if we have a huge standard error then the confidence interval is going to be super wide and so imagine if we say hey jim i'm 95 percent confident that the true dividend yield of the population of those 6 000 stocks on the two exchanges is between zero percent and 40 percent oh my gosh that does us almost no good
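To make the link between standard error and interval width concrete, here is a short sketch using the numbers from the recording (population standard deviation 3, sample size 30, sample mean 4.5 percent). The 1.96 reliability factor is the usual two-tailed 95% z value; the larger sample sizes are assumptions added to show the effect of n.

```python
import math


def standard_error(sigma, n):
    """Standard error of the sample mean when the population sigma is known."""
    return sigma / math.sqrt(n)


def half_width(reliability_factor, sigma, n):
    """Precision of the interval: reliability factor times standard error."""
    return reliability_factor * standard_error(sigma, n)


Z_95 = 1.96          # two-tailed 95% reliability factor
SAMPLE_MEAN = 4.5    # percent
SIGMA = 3.0          # known population standard deviation, in percent

for n in (30, 100, 1000):
    se = standard_error(SIGMA, n)
    hw = half_width(Z_95, SIGMA, n)
    print(f"n={n:>4}: SE={se:.4f}, interval = {SAMPLE_MEAN - hw:.2f}% to {SAMPLE_MEAN + hw:.2f}%")
```

With n = 30 the standard error is 3 / sqrt(30), roughly 0.55, and the interval narrows as the sample grows because the standard error shrinks with the square root of n.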
- 25:00 - 25:30 right so our sampling process has to be efficient and consistent and unbiased all those things that we just talked about because what do we want to do we want to find this remember i've said this to you multiple times in other recordings you know what we're trying to do is squeeze we're squeezing down so we have the lowest possible standard deviation and we do that based on our statistical knowledge here in quantitative methods and we're going
- 25:30 - 26:00 to apply that to lots and lots of stuff in level 2 and level 3 all right so there's a good old formula there confidence interval for the mean right and look at scenario one this is with a known variance so all we're going to do is take our mean which is x-bar and plus or minus and there's our z value and we're going to get that from the bottom right column in that table and we're cutting the alpha in half because it's two-tailed right we're going plus
- 26:00 - 26:30 and minus so there's a tail up and there's a tail down but maybe it's better if you look tail left and tail right and we're going to multiply it by the known population standard deviation right there's that sigma divided by the square root of n so the key here is reading the question stem somewhere in the question stem they say hey the variance is known to be 20 well there you go
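The known-variance interval described here (sample mean plus or minus a z value times sigma over the square root of n) can be sketched in a few lines. This is only an illustration under stated assumptions: the sample mean of 4.5 and the variance of 20 are the numbers mentioned in the recording, while the sample size of 100 and the function name `z_confidence_interval` are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, population_variance, n, confidence=0.95):
    """Two-sided confidence interval for a mean when the population variance is known."""
    alpha = 1.0 - confidence
    # Two-tailed, so alpha is split in half between the left and right tails.
    reliability_factor = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Standard error of the sample mean: sigma / sqrt(n).
    standard_error = sqrt(population_variance) / sqrt(n)
    margin = reliability_factor * standard_error
    return sample_mean - margin, sample_mean + margin


# Sample mean 4.5 and known variance 20 come from the discussion; n = 100 is assumed.
low, high = z_confidence_interval(4.5, 20.0, 100)
print(f"95% interval with n=100: {low:.2f} to {high:.2f}")

# Larger samples shrink the standard error, so the interval narrows.
widths = {}
for n in (25, 100, 400):
    low_n, high_n = z_confidence_interval(4.5, 20.0, n)
    widths[n] = high_n - low_n
    print(f"n={n}: width {widths[n]:.2f}")
```

The loop at the end is there to show the precision idea: with the same variance, quadrupling the sample size cuts the width of the interval in half.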
- 26:30 - 27:00 but how about if the variance is unknown so back here if the variance is known just think about it this way you're going to use z if the variance is unknown then you have to go back to the student's t distribution and we talked about the t distribution in a previous recording notice with the t-distribution we really care about n so that's what we're doing down the left-hand column there remember degrees of freedom and degrees of freedom for a
- 27:00 - 27:30 t-test in this case is going to be n minus 1 degrees of freedom i know i've done this before but let me go ahead and do this just quickly again just to make sure you understand degrees of freedom if you were in my class and you took two tests and you averaged 80 on those two tests you knew that you made a 90 on the first test but you forgot what you made on the second test well knowing the mean which was 80
- 27:30 - 28:00 right knowing the first score 90 well you can back into it you can figure out that you had to have made a 70 on the second test to be able to have a mean of 80 percent that's degrees of freedom n minus 1 so if you took a hundred tests of mine and you knew the mean was eighty percent you'd have to know 99 of those scores so that you can back into that last one that's what degrees of freedom really means and so over on the right-hand side each of those
- 28:00 - 28:30 columns are different levels of confidence and those are the corresponding t values so you will probably have to read at least a partial table on the exam keeping in mind what i said to you in the last recording remember that as n gets larger the t distribution approaches the z distribution the standard normal distribution
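For the unknown-variance case, a rough sketch of the same interval with a t value looked up by n minus 1 degrees of freedom might look like the following. The small dictionary stands in for a partial t table (two-tailed, 95 percent); the sample size of 25 and the sample standard deviation of 2.0 are assumptions for illustration, and `t_confidence_interval_95` is a hypothetical helper name.

```python
from math import sqrt

# Partial table of two-tailed 95% t critical values, keyed by degrees of freedom.
T_CRITICAL_95 = {
    1: 12.706, 2: 4.303, 5: 2.571, 10: 2.228,
    24: 2.064, 30: 2.042, 60: 2.000, 120: 1.980,
}
Z_CRITICAL_95 = 1.960  # the value the t column approaches as n grows


def t_confidence_interval_95(sample_mean, sample_std, n):
    """95% interval for a mean when the population variance is unknown."""
    degrees_of_freedom = n - 1
    if degrees_of_freedom not in T_CRITICAL_95:
        raise KeyError(f"df={degrees_of_freedom} is not in this partial table")
    reliability_factor = T_CRITICAL_95[degrees_of_freedom]
    # The sample standard deviation replaces the unknown sigma.
    standard_error = sample_std / sqrt(n)
    margin = reliability_factor * standard_error
    return sample_mean - margin, sample_mean + margin


# Assumed inputs: mean 4.5 (from the discussion), s = 2.0 and n = 25 (made up).
t_low, t_high = t_confidence_interval_95(4.5, 2.0, 25)
print(f"95% t interval with n=25: {t_low:.2f} to {t_high:.2f}")

# Walking down the table, the t values shrink toward the z value.
for df in sorted(T_CRITICAL_95):
    print(f"df={df}: t={T_CRITICAL_95[df]:.3f} (z={Z_CRITICAL_95:.3f})")
```

Every t value in the table is larger than 1.960, so for the same data the t interval is a little wider than the z interval, and the gap fades as the degrees of freedom increase.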
- 28:30 - 29:00 and so if you have an unknown variance and your sample size is a million you could probably use the z table all right how about scenario three the population mean when the variance is unknown and the sample size is large enough this is what i just described we're going to go ahead and throw that central limit theorem at it and we can use the z all right how about a couple of general
- 29:00 - 29:30 comments here larger samples are preferred to smaller ones we probably didn't even need to read this reading to know that right larger samples are preferred just because of regular old common sense but the confidence intervals are narrower so that's important and look at that second one this is probably even more important this is why we bolded it standard error is smaller when the sample size increases now of course i don't want you just to go ahead and think that oh we want tons
- 29:30 - 30:00 and tons of observations in every one of our samples because a it costs money to collect this extra data but then look at that first one underneath the orange block however right with large sample sizes it's possible that old data gets mixed in with new data and that it might not be relevant it might not be reliable and it may violate all of those kinds of assumptions that we made about
- 30:00 - 30:30 consistency etc etc so let's be careful marginal cost marginal benefit to decide what is the optimal sample size let's move on to this idea of resampling so what have we done so far we've taken our population of 6,000 u.s. firms and their dividend yields and we've drawn samples from them maybe we did some clustering maybe we did some kind of simple random sampling maybe we did strata maybe we
- 30:30 - 31:00 did some other things but then we get this idea of i wonder if there's something else that we can do in order to improve efficiency because remember what we want to do is we want to be able to draw a conclusion from the sample that applies to the population so bootstrapping and jackknifing are attempts to try to help us make and draw better conclusions
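As a rough sketch of the two resampling methods named here, the code below estimates the standard error of a sample mean both ways. It assumes the usual textbook definitions: the bootstrap redraws samples of the original size with replacement, and the jackknife leaves out one observation at a time, giving n resamples for a sample of size n. The ten dividend yields are invented for illustration, and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def bootstrap_standard_error(sample, num_resamples, rng):
    """Standard error of the mean from resamples drawn with replacement."""
    n = len(sample)
    # The researcher picks num_resamples; each resample has the original size n.
    resample_means = [mean(rng.choices(sample, k=n)) for _ in range(num_resamples)]
    return stdev(resample_means)


def jackknife_standard_error(sample):
    """Standard error of the mean from n leave-one-out resamples."""
    n = len(sample)
    # Drop observation i, keep the other n - 1, and take the mean each time.
    leave_one_out_means = [mean(sample[:i] + sample[i + 1:]) for i in range(n)]
    center = mean(leave_one_out_means)
    squared_deviations = sum((m - center) ** 2 for m in leave_one_out_means)
    return sqrt((n - 1) / n * squared_deviations)


# Hypothetical sample of dividend yields, in percent.
yields = [1.2, 3.4, 2.8, 4.5, 0.0, 5.1, 2.2, 3.9, 6.3, 1.7]
rng = random.Random(2025)  # fixed seed so the run is repeatable

bootstrap_se = bootstrap_standard_error(yields, 2000, rng)
jackknife_se = jackknife_standard_error(yields)
analytic_se = stdev(yields) / sqrt(len(yields))

print(f"bootstrap SE: {bootstrap_se:.3f}")
print(f"jackknife SE: {jackknife_se:.3f}")
print(f"s / sqrt(n):  {analytic_se:.3f}")
```

For the sample mean, the jackknife figure matches s divided by the square root of n, and the bootstrap figure should land close to it. The methods earn their keep on statistics that have no simple standard error formula.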
- 31:00 - 31:30 all right so notice what we have right above the big old bootstrapping one resampling refers to repeatedly drawing samples from the original observed data sample all right so we can do this in one of two ways we can do it through the bootstrapping and so look at the illustration there going down the original data set right so there it is and maybe we've divided that into different kinds of groups or maybe not maybe those are just different samples and those are
- 31:30 - 32:00 different means or any other kind of statistic so what we do in the bootstrap is that we go and draw random samples repeatedly with replacement from the original sample and you know it's probably a lot easier to do this with some type of a computer algorithm and so what this allows us to do is it allows us to take and continue to take and so all of a sudden we have say a thousand resamples which then over in the green
- 32:00 - 32:30 area that might be more efficient it might be more representative of the entire population so notice that second arrow point the number of repeated samples to be drawn is determined by the researcher now what we can also do is we can jackknife we can draw samples by leaving out one observation at a time without replacement so here let's go back here with the bootstrap we're just drawing and drawing and
- 32:30 - 33:00 drawing so we're resampling repeatedly here with the jackknife we're doing this without replacement and so for a sample of size n we need n repeated samples in the jackknife method and so there we have it it can be used to reduce the bias of an estimator and to estimate the standard error and the confidence interval of an estimator now remember when we had that slide about using our judgment so when we
- 33:00 - 33:30 have judgments then we're probably going to be subject to some kind of a bias i mean i'm guilty of a bias just like anybody when i flick on the red zone on sunday afternoons to watch nfl if i see a notre dame football player playing i root for that team so i'm biased of course i'm biased now that really doesn't affect anybody else in my life and it doesn't affect the outcome of a football game and it's
- 33:30 - 34:00 really not that big of a deal but if we have biases in sampling they're going to at least have the potential to dramatically impact our conclusions about the population and in fact they can be so dramatic that we can make the wrong conclusion data snooping bias you probably don't even need to read anything about this to understand what this means and let me assure you that our professors in graduate school when my buddies and i were
- 34:00 - 34:30 writing our dissertations you know we're collecting tons and tons of data and our professors said look whatever you collect that's what you get i don't want you going in and saying oh here's an outlier let's throw that out here's something that i don't like let's throw that out all right so let me read this to you analyzing a large amount of historical data to discover trends and then data mining analyzing the data you know until you find something even if it's statistically irrelevant all those kinds of good
- 34:30 - 35:00 things sample selection bias this is prevalent in lots and lots of studies out there you will read this when you get to the reading on alternative investments when it comes to hedge funds there's sample selection bias in some hedge fund studies it means you exclude a section of the population this erodes the idea of randomness right what did we say earlier
- 35:00 - 35:30 that the samples they must be randomly selected but if you're specifically excluding a section of the population well then your sample is not going to be representative survivorship bias this goes back to hedge funds and mutual funds as well and there's the example a study of hedge funds over a 10-year period would exclude all hedge funds that closed during that
- 35:30 - 36:00 period backfill bias successful funds report their past performance to a database of course successful fund managers are going to say yeah look at me i did a great job but if i have jim's mutual fund and i stink there's no way i'm sending that information over there so that everybody knows that i stink look ahead bias this is an interesting one because what are we taught to do as good financial analysts we're taught to
- 36:00 - 36:30 process all this information that we get from financial statements that we get from reading the wall street journal from listening to the chairman of the fed and what we can expect is that we can expect something like hey you know what next week there's a fed meeting and i believe that the fed is going to announce this and this and this and i'm pretty sure they're going to do it at 10 o'clock in the morning and so therefore i'm going to anticipate and i'm going to look ahead
- 36:30 - 37:00 etc etc but it also includes most companies taking you know a long time let's say two or three months to release their results and so there's some kind of a bias in there time period bias our professors used to tell us about this all the time whenever we had assignments inside of a class like our econometrics class we would go out and collect data for let's say a year and we'd present the data and our professor would say all right that sounds good but a year what's cool about
- 37:00 - 37:30 this year nothing does that represent three years or five years and so there's a seasonality issue here with time period bias as well so look for time period bias when we do cyclicality and seasonality in future readings and that takes us through all of our learning outcome statements well i love to give students and candidates an idea
- 37:30 - 38:00 of what's more important but wait i'm having a tough one here with this reading i think all of these are important [Music]
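Going back to the survivorship bias example from the recording (a 10-year hedge fund study that leaves out the funds that closed), a small simulation can show why the surviving funds look better than the full population. Everything here is made up for illustration: the return distribution, the closure rule that funds below a threshold shut down, and the function names.

```python
import random
from statistics import mean


def simulate_fund_returns(num_funds, rng):
    """Hypothetical average annual returns, in percent, one per fund."""
    return [rng.gauss(5.0, 10.0) for _ in range(num_funds)]


def survivors_only(returns, closure_threshold):
    """Keep only funds above the threshold; the rest are assumed to have closed."""
    return [r for r in returns if r > closure_threshold]


rng = random.Random(42)  # fixed seed so the run is repeatable
all_returns = simulate_fund_returns(1000, rng)
surviving = survivors_only(all_returns, -5.0)

full_mean = mean(all_returns)
survivor_mean = mean(surviving)

print(f"funds in the full population: {len(all_returns)}")
print(f"funds still open at the end:  {len(surviving)}")
print(f"mean return, all funds:       {full_mean:.2f}%")
print(f"mean return, survivors only:  {survivor_mean:.2f}%")
```

Because the closed funds are the weakest performers by construction, a study built only on the survivors overstates the average return of the population.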