Iterations using combination of filtering conditions
Summary
The lecture explores the process of filtering data using various conditions. The instructor demonstrates filtering through gender and city, finding girls from Chennai by applying two-stage filtering. The exercise progresses to count students born in the first half of the year, comparing males and females. Finally, the discussion centers on comparing maths proficiency between genders, underscoring the use of algorithms in making data-driven conclusions. The session highlights the practical application of filtering conditions and iteration patterns in problem-solving.
Highlights
- The instructor demonstrates filtering girls from Chennai using a two-step process. 🌟
- The data filtering technique is extended to find students born in the first half of the year. 📅
- There's a detailed example on calculating average mathematics scores for boys and girls to determine who's better. 📐
- Through data-driven approaches, algorithmic solutions address subjective questions like gender performance in maths. 🤔
- The class learns to apply multiple conditions to data, ensuring decisions are data-backed. 🔍
Key Takeaways
- Filtering is a powerful tool for sorting data based on conditions, like gender and location. 🎯
- Two-step filtering can isolate specific data subsets, such as finding Chennai-based girls from a data pool. 🏙️✨
- Iterating through data allows real-time decisions based on multiple conditions. 💡
- Counting conditions, like birth dates, can reveal interesting demographic insights. 📊
- Comparing average maths marks by gender shows the girls marginally ahead (73.15 against 71.76), though the class is too small to conclude much. 📚⚖️
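The two-step filtering mentioned in the takeaways can be sketched roughly as follows. This is only an illustration: the `students` records and the field names `gender` and `city` are assumptions standing in for the classroom cards, not the lecture's actual data set.

```python
# Hypothetical records standing in for the classroom cards.
students = [
    {"name": "A", "gender": "F", "city": "Chennai"},
    {"name": "B", "gender": "M", "city": "Chennai"},
    {"name": "C", "gender": "F", "city": "Bengaluru"},
    {"name": "D", "gender": "F", "city": "Chennai"},
    {"name": "E", "gender": "M", "city": "Madurai"},
]

# Stage 1: gender filter, which puts all the girls in their own pile.
girls = [s for s in students if s["gender"] == "F"]

# Stage 2: city filter, applied only to the pile from stage 1.
chennai_girls = [s for s in girls if s["city"] == "Chennai"]

# The same two filters in the opposite order pull out the same cards.
chennai = [s for s in students if s["city"] == "Chennai"]
chennai_girls_alt = [s for s in chennai if s["gender"] == "F"]

print(len(chennai_girls), chennai_girls == chennai_girls_alt)
```

Each stage makes its own pass over a pile, so this version goes through the data twice, which is what motivates the single-pass variant discussed in the lecture.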
Overview
Welcome to the exploration of filtering techniques with the IIT Madras B.S. Degree Programme, where we delve into data handling with a dual focus on gender and location.
The lecture starts by guiding through extracting girls from Chennai using step-by-step filtering, highlighting efficiency in targeting data subsets.
We then journey through exercises aimed at understanding demographic distributions and computational evaluations, reinforcing filtered iteration and algorithmic analysis.
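As a rough sketch of the demographic exercise, the loop below counts boys and girls born in the first half of the year in a single iteration. The records, the field names `gender` and `dob`, the helper `born_in_first_half`, and the birth years are all assumptions made for illustration; the lecture only states the rule "between 1st January and 30th June, inclusive".

```python
from datetime import date

# Hypothetical records; years are made up and not used by the check.
students = [
    {"gender": "M", "dob": date(2000, 7, 22)},
    {"gender": "M", "dob": date(2000, 3, 4)},
    {"gender": "F", "dob": date(2000, 5, 13)},
    {"gender": "F", "dob": date(2000, 9, 10)},
    {"gender": "M", "dob": date(2000, 6, 30)},
]

def born_in_first_half(dob):
    """True if the day falls between 1 January and 30 June, inclusive."""
    # Comparing (month, day) tuples ignores the year.
    return (1, 1) <= (dob.month, dob.day) <= (6, 30)

# Two counters, both starting at zero, updated in the same pass.
male_first_half = 0
female_first_half = 0

for s in students:
    if born_in_first_half(s["dob"]):
        if s["gender"] == "M":
            male_first_half += 1
        elif s["gender"] == "F":
            female_first_half += 1

print(male_first_half, female_first_half)
```

The date test is a range comparison on one field, and the gender test decides which of the two counters receives the increment.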
Chapters
- 00:00 - 00:30: Introduction to Filtering The chapter 'Introduction to Filtering' discusses the process of going through cards and evaluating them based on certain characteristics or properties to determine their usefulness. This involves examining specific elements such as pronouns and verbs to make informed decisions about the relevance and utility of the cards.
- 00:30 - 01:30: Filtering with Multiple Conditions This chapter discusses the process of filtering data with multiple conditions. It starts by highlighting a basic concept that an item is either a pronoun or a verb, implying exclusive categories. The narrative then shifts to demonstrate a data analysis task using a classroom data set. The task involves identifying the number of girls from a specific city, Chennai, as an example of applying multiple filtering criteria. The conditions for filtering include gender and city, which are demonstrated as an effective way to extract specific insights from the dataset.
- 01:30 - 02:30: First Filter: Gender The chapter titled 'First Filter: Gender' focuses on the process of filtering data based on gender and geographical location, specifically selecting females from Chennai. The conversation suggests a step-by-step approach, starting with separating females from a dataset, indicating this as an initial filtering phase. The dialogue reflects a practical scenario where the speaker actively organizes the data by setting females aside into a separate group, demonstrating an example of data sorting and organization.
- 02:30 - 03:30: Second Filter: City of Chennai In this chapter titled 'Second Filter: City of Chennai,' the process of sorting data into two distinct categories based on gender is illustrated. The transcript lists out data entries as either 'boy' or 'girl.' By the end of this process, the data is organized into two separate categories or 'piles' where one pile contains data entries identified as 'boy.' This resembles a filtering process, likely used for organizing or categorizing information by gender.
- 03:30 - 04:30: Combining Conditions in One Pass The chapter discusses filtering a list of girls based on their city, specifically looking for those from Chennai. It involves examining each entry and segregating those from Chennai, while ignoring those from other cities like Bengaluru. The process is described as methodically checking each item and separating it based on the city field.
- 04:30 - 05:30: Count People Born in First Half of Year This chapter focuses on data filtering techniques to count students who are both female and from Chennai. Initially, 30 students were considered, and through a two-stage filtering process of selecting gender and city, it was found that 5 girls match these criteria. This exemplifies a method for narrowing down data based on multiple conditions.
- 05:30 - 07:00: Filtering by Birthdate and Gender In this chapter, the focus is on filtering data based on birthdate and gender. The process involves applying a gender filter to isolate all female students, which is followed by applying a city filter to select individuals from Chennai. The discussion highlights that the order of applying these filters may not affect the final result, as long as the same criteria are used. The chapter suggests exploring a more streamlined method of combining these filters into a single step to achieve the result efficiently.
- 07:00 - 08:30: Calculation Summary for Date and Gender The chapter discusses an algorithmic approach to filter data based on multiple conditions. Specifically, the focus is on processing a collection of records to identify and segregate data entries pertaining to females from Chennai. The transcript highlights a conversation about optimizing this filtering process by consolidating checks in a single iteration. The discussion emphasizes the use of logical 'and' operations to ensure both conditions — being female and from Chennai — are met for each record as it is processed. This strategy is suggested to streamline and possibly enhance the efficiency of the data computation process.
- 08:30 - 09:30: Calculating Average Math Scores by Gender This chapter focuses on the calculation of average math scores, specifically separated by gender, with a case study based on the city of Chennai. It discusses the process of filtering data based on gender ('F' for female) and location ('Chennai'), and counting the number of valid entries meeting both conditions. The discussion begins with initializing a count variable at zero and proceeds with data traversal and condition checking to increment the count only when both criteria are met.
- 09:30 - 10:30: Iterating and Accumulating Math Scores In this chapter titled 'Iterating and Accumulating Math Scores,' the focus is on filtering and counting specific data from a dataset. The task is to count females from Chennai. As each data point is examined, the process filters out males and examines only female entries to check if they are from Chennai. If so, the count is incremented. The narrative provides specific examples of how data is evaluated and decisions are made based on the conditions set for filtering – such as not needing to check the city if the individual is male.
- 10:30 - 11:30: Conclusion on Math Performance by Gender In this chapter, the focus is on analyzing math performance by gender, with particular attention to geographic data. The narrative involves counting and categorizing individuals from various locations such as Chennai, Madurai, Erode, Nagercoil, and Bangalore. A pattern in female representation from Chennai is highlighted, while mentioning male representation from Bangalore. The chapter attempts to synthesize demographic information before concluding.
- 11:30 - 12:30: Deriving Algorithms from Questions The chapter explores how algorithms can be derived by framing questions about the data at hand. Through an example, it demonstrates maintaining a count of "Chennai girls" using a conditional loop: iterating over data while checking if the gender is female and the city is Chennai. This approach exemplifies filtering data based on specific attributes and serves as an effective technique to extract relevant elements from a complex dataset. Such methodologies allow for targeting and pulling specific data entries of interest, thus simplifying data processing and extracting meaningful information.
- 12:30 - 13:30: Conclusion on Filtering and Algorithms This chapter explores advanced filtering and algorithmic applications by considering practical scenarios. Initially, it discusses filtering shopping bills based on total amount and specific item types. Similarly, it explains word filtering using criteria like word type and length. The chapter then suggests creative challenges, such as finding individuals born within a certain date range, encouraging a deeper exploration of algorithmic possibilities.
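The single-pass count of "Chennai girls" described in the chapters above might look like the following minimal sketch. The records and the field names `gender` and `city` are hypothetical; only the variable name and the two conditions come from the lecture.

```python
# Hypothetical records standing in for the classroom cards.
students = [
    {"gender": "M", "city": "Chennai"},
    {"gender": "F", "city": "Trichy"},
    {"gender": "F", "city": "Chennai"},
    {"gender": "F", "city": "Madurai"},
    {"gender": "F", "city": "Chennai"},
]

# Start the count at zero, as in the lecture.
chennai_girls = 0

for s in students:
    # Both conditions must hold. Python's `and` short-circuits, so the
    # city is not even examined when the gender check fails.
    if s["gender"] == "F" and s["city"] == "Chennai":
        chennai_girls += 1

print(chennai_girls)
```

The short-circuit behaviour of `and` mirrors the observation in the lecture that for a male card there is no need to look at the city.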
Iterations using combination of filtering conditions Transcription
- 00:00 - 00:30 So last time we looked at this question of filtering, so we said we would go through the cards and based on some characteristics of the cards, some property of the card, we will decide whether it is useful or not. For instance, we looked at pronouns, we looked at verbs we also said we could keep track
- 00:30 - 01:00 of both pronouns and verbs at the same time because something is either a pronoun or a verb. So, now let us do something more interesting. So I am going back to this data set, classroom data set. Now let us say in this data set I want to find out all the girls; not just do I want to find out how many girls are there, but I want to find out how many girls are there from a specific city, let us say Chennai. Okay. So how many girls are from Chennai, how many Chennai girls are in this... So there are two, so there is a gender and there is a town/city, so there are two items and
- 01:00 - 01:30 what we want is this should be female and the town/city should be Chennai. Yeah, how would we do that, I mean is there a simple way of doing it... So what we could do first is we could first separate out all the girls, so that would be one step of filtering. Want to try that? Yeah. So here is a girl, so I guess I put it in a different pile, here is also girl, girl,
- 01:30 - 02:00 girl, boy, boy, girl, girl. Girl. Girl, boy. Boy, boy, boy, boy, boy, one more girl, boy, boy, girl, boy, boy, boy, boy, girl, boy, boy, girl, girl, girl. So now we have filtered this data into two piles, everything on this pile is a boy
- 02:00 - 02:30 Not useful. Everything there is a girl: this is the one which we are interested in. Now, among the girls, we want those who are from Chennai. So we move this aside, keep it separately, and now we want to go through this again, look at the town/city thing. Yeah so... So this is not Chennai, Bengaluru, not Chennai, this is Chennai, this is Chennai, this is Chennai, this is Chennai, not Chennai, this is Chennai, not Chennai, not Chennai, not Chennai, not Chennai and not Chennai.
- 02:30 - 03:00 So now we have pulled out those cards which are a combination of two different interesting things for us, that they are both girls and they are from Chennai. And we can count this I guess. We shall. 2, 3, 4, 5. So among this entire, remember we had 30 students we had counted at the beginning, so among those we have found that there are 5 girls from Chennai. So in this what we have done is we have done filtering in two stages, so first we have
- 03:00 - 03:30 applied a gender filter to pull out all the girls and then we have applied the city filter to pull out all the Chennai, I guess we could have done it the other way, we could have first pulled out all the Chennai people and look for the girls in that. Should give the same answer, right? Because the same cards would come out. Yes First, we should have filtered the Chennai. Yeah, but supposing we did it in a one step, can we do it in a one single, so what I said is instead of going through and first explicitly pulling out all the girls, and then going
- 03:30 - 04:00 through all the girls and then pulling out all the Chennai people, do you think we can just do it in one shot like we have done for some of the other iterations overlap this thing that we could check? We should I guess that would be I mean just go through the cards in one iteration and at each stage you look at the card and see whether it is a female, that is not enough. And we also need to check So we have to do, so you have to check two conditions. So you have to check So both of them have to be true Both of them have to be true, so it is an 'and', and of two conditions, and of two
- 04:00 - 04:30 conditions, the first condition is the gender F and the second condition is this the town/city Chennai. So if it is an and of, if both of them are true, then only we select the card for counting we keep count I guess, so we have a count variable and so maybe we start with a count variable to 0. Yeah. So we could do that, maybe I could do that by count And then we go through this So I will say, Chennai
- 04:30 - 05:00 Girl, Chennai girls equal to 0. So here is a card which is a male, so it does not satisfy the condition. So if it is a male we do not even have to look at the city. Do not have to look at the city, so that is interesting. So, again here it is a male, so do not have to look, male, male, so female now we look at the city, Trichy So it is not Chennai. So again, female Teni, female Bangalore, female Madurai, female Chennai. So now I have to increment my count.
- 05:00 - 05:30 Increment the count, another female Chennai. Increment again. One more female Chennai, they all come together I guess because you put the cards together, another female Chennai and then one more female Chennai, and there is a female Madurai, Erode, female Nagercoil, Bangalore, male, male, male, male so the rest I think are not cards that are useful to us, so just going through all of them. So now basically in one
- 05:30 - 06:00 So we got the count, in one iteration we kept track of a variable called Chennai girls and we kept incrementing that variable provided that both the conditions were satisfied, it is 'and' of two conditions, gender being female and the city being Chennai. So this is a very useful thing now, so we know that we can take the entire stack of data or sequence of data or cards whatever we want and we can pull out those which are interesting based on some property, in this case it was the city, for example if it was
- 06:00 - 06:30 the shopping bills it could be based on whether the total is bigger than something and they have bought at certain type of item, in the case of the words for instance it could have been if it is a noun and it is more than five letters long, so we can take various combinations of conditions and filter on them simultaneously. Is there something bit more interesting we can do, for example, can we find all those people who are born between one date and other date? Yeah, let us try that, so let us see if we can find all people who are born in the first
- 06:30 - 07:00 half of the year, say between 1st January and 30th June. It should be half, we do not know. We do not know, but let us see, is it true that, let us see. It should be half, I mean if it is random it would be half, but this may not be random... So let me keep track of a count for this, so this is... So here we are actually now not checking against two of these fields, we are checking one field only, we are checking the date but now we are comparing the date, we have two...
- 07:00 - 07:30 So we could still do two, so we could say for male and for female, so we can say they should be born in the first half and they are male born in the, so how many boys were born in the first half of the year, how many girls... Girls are born in the first half. So to check whether it is the first half the date has to be, so if it is 22 July is it first half or it is not first half? No. It is not first half. So anything should be up to June 30th. So which means that it should be So we should first check the date. It should be more than 1st January or equal to, more than or equal to 1st January and
- 07:30 - 08:00 less than or equal to 30th of June. That is right. So that is what it means. So let us check this this is, so this is 22 July, so does not meet the criteria. 4th March Yes, and it is It meets the criteria because it is greater than its between 1st of January and 30th of June, so it is in that space And it is a male. And it is a male. So I will increment the male count by 1. The Next one is 17th of September, so again in the second half.
- 08:00 - 08:30 30th of August 2nd half. 2nd half because it is greater than After June. After 30th of June. 6th May this one lies between 1st of January and 30th of June. And it is a male. We should count it, it is a male, so it should go to the male one. 13th October again is in the second half. 3rd June is before 30th of June so should be in the first half male. 4th Jan, again male, looks like there are more males. 14 December No. So no, 30th December no, 7th November no, 30th April male.
- 08:30 - 09:00 So that is now 5. 26 December male, 13th May, so first time we got a female, 13th May first half. 17th July No, too late. Again no, 9th October No. 10th September. No. 12th Jan. Yes. Female, 16th May again, female. Yes. 8th February female, 14th Jan female, 5th May female, wow, 17th November, so this does
- 09:00 - 09:30 not count, 15th March So I am running out of space I am moving to the left, so I will say 7 now. 22nd September. No. 23rd July. No. 23rd March. Yes, so male now goes to 6. 15th March, again male. Male goes 7. 28th February male. Okay, 8. 6th December does not count. So what do we get? So of the 30 students, 8 males were born in the first half, 7 females.
- 09:30 - 10:00 So, roughly half the students, actually exactly half, 15... 15 of them were born in the first half... This is not bad. Which is surprising, and of that, since we have an odd number, 15 is roughly equal, 8 males and 7 females. So this is a very equally Looks like a balanced, looks like a very balanced data. So this is interesting, so now so basically so you can filter on multiple things at the same time and then when you are filtering on multiple things you can also keep track
- 10:00 - 10:30 of them separately. So we have two conditions, when they were born and whether they are male or female and based on the combination we have one count for males born in the first half, one count for females born in first half. And we have used the same pattern, we have used the iterator pattern, initial values of so we kept two variables, male female born first half variable count, that is a count so we have two variables that we are keeping track and going through the iteration each time filtering on the gender as well as the date of birth and checking whether the date
- 10:30 - 11:00 of birth field falls between two values, 1st of January and 30th of June, inclusive and if it falls then we add to the male count or the female count, that is what we have done. So it is just an iteration pattern with a filter added. So filtered iteration, very good. So let us do something more interesting with this set, so here we have the maths marks highlighted and earlier we computed the average marks for the whole class but supposing I
- 11:00 - 11:30 wanted to find out as a teacher who is better in general in the class, are the girls doing better in maths or the boys are doing better in maths. So what would be a good way to look at it, should we look at the highest marks maybe the which I am not very sure about that, highest marks because there could be one exceptional boy let us say, but you know the girls could generally be doing better than the boys, but only one exceptional boy may be there and just by looking at the maximum you will be biasing it by looking
- 11:30 - 12:00 at this exceptional boy candidate. So I do not think we should look at the maximum. So then, if you want to look at the general trend I guess average is as good as Average is, average should work I think, if you find the average of the boys and the average of the girls and compare it, if the average of the girls is higher, which I think is likely than the boys, then the girls are doing better, this is what I would think is right way of doing that. So, we saw how to compute the average, we have to find out how many cards there are
- 12:00 - 12:30 and then we have to add up the marks across those cards and divide by the number of cards but now we have to separate the cards, we have to filter the cards into those for the boys and those for the girls. So we are doing total, we are trying to find the total average total marks for boys. For maths total. Maths total, we are checking which, whether girls are doing better than boys in maths? In maths, yes. So we can find the average maths marks for girls, average maths marks for boys and compare
- 12:30 - 13:00 the two, that is what we want to do. So simple way would be to simply separate the boys from the girls, and then find the average for each set like we did last time, last time we did the average for the whole set So we filter it out into two sets boys and girls Boys and girls... Then we process the boys separately Boys separately find the average, girls separately find the average and compare the average. So why do we need to separate them, why cannot we just do it as we are going along, so surely
- 13:00 - 13:30 like we said we did it earlier, we can keep track of all these things in one single iteration, so we could keep track of the We need to keep track of to keep to find average of the boys, what we need we need the total mathematics marks of only the boys and we need to find the total number of boys and then if you divide the total mathematics marks of the boys by the total number of boys that gives you the average for the boys.
- 13:30 - 14:00 Similarly, for the girls you have find the total mathematics marks of the girls and find the total number of girls and then divide the total mathematics marks of girls by the total number of girls that will give the average for the girls. So we have to keep track So we see it looks like we keep track of 4 things. So let me just write it down, so we need say let me just break up this space into four things, so we need say the boy count We need the boy count. And of course we need the girl count. So this will tell us how many of each there are and then we will keep track of the boys
- 14:00 - 14:30 sum and the girls sum. And we are going to do all of it in one iteration, as we go through the cards As we go through in one iteration. So here there is some filtering going on because we are looking at whether it is a boy or a girl and then doing the so these few things And then this, and then use accumulation because Accumulation because we are adding the mathematics marks and then we are counting, so we are counting, accumulating, filtering all together in one shot, so let us try it. So here is the first card, this is a boy
- 14:30 - 15:00 So boy is now count is 1. And mathematics is 72. So the sum is now 72. Second one is a girl. Girl count is 1. Girl count is 1 and 74 is the maths marks. Third one is again a boy, count Count is 2. Maths is 81. So 72 plus 81 is 153. The next card is a male again boy 3.
- 15:00 - 15:30 74 is the maths marks. So that is 227. The next is girl and 62 is the maths marks. So it is 136. So here is a fantastic maths mark by a girl who is getting 97. So that is 233. So here is a boy who is not doing that well, maths mark 44. So that is 271. We can do like this we can go through the whole set I guess.
- 15:30 - 16:00 So at the end we would have got some total. We have got a total, so we have gone through the whole list now, what do we get? So I think the whole list there were 17 boys and 13 girls and the sum of the boys was 1220 and the sum of the girls was 951. So now what we wanted to do was find the average... Average. So we need to divide, so if we actually divide the left hand side you get some 71.76 and
- 16:00 - 16:30 this is some 73.15, so 1220 divided by 17, 71.76, 951 divided by 13 is 73.15. So it looks like the girls are doing marginally better than the, not that much 73.15 is just a little bit more than 71.76, so the girls are doing better but one, could not because it is small number 13 and so on you cannot really say much, but slightly marginally better
- 16:30 - 17:00 than the boys. But what is interesting is that we were able to do this now in one scan, so we were able to iterate this set of cards once and keep track of these four outputs Four variables. As we were going along, so we were kind of doing a filtering on the gender in this case and we were accumulating both the number of cards of that gender and the maths marks total of that gender and finally we got all the totals and we were able to then...
- 17:00 - 17:30 So would you say that I mean we had this original question, are the girls doing better than boys, which seem like very interesting question, I mean it is kind of question that we normally ask but we have been able to turn that into a procedure, an iterator with some filtering and some variables and keeping track and all that and we have been able to give an answer which is numerically Yes, that we can justify. Justify, so would you say that this is an algorithm that we have been able to take this problem, are girls doing better than boys and we have found an algorithm for it, would
- 17:30 - 18:00 you describe it like that? Yeah, that is a good way of thinking about it, so we have taken a question which is say subjective which you can debate about and we have given a systematic procedure on using the data available and a way to calculate and either confirm or to deny the hypothesis. So if we assume that girls are doing better than boys, then this is one way to validate whether this is true or not true, so that is true so this is very interesting that you
- 18:00 - 18:30 can actually take a question which looks somewhat vague and make it precise by giving a criterion for that and giving an algorithm to compute that. Okay. Very good.
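The one-scan procedure with four variables discussed in the transcript can be sketched as below. The seven cards are the ones read out between 14:30 and 15:30; representing each card as a `(gender, maths)` tuple and the variable names are assumptions made for this illustration. The last two lines simply redo the division on the class totals quoted in the lecture (1220 over 17 boys, 951 over 13 girls).

```python
# The first seven cards as read out in the lecture: (gender, maths marks).
cards = [
    ("M", 72), ("F", 74), ("M", 81), ("M", 74),
    ("F", 62), ("F", 97), ("M", 44),
]

# Four variables, all initialised to zero before the iteration.
boy_count = girl_count = 0
boy_sum = girl_sum = 0

for gender, maths in cards:
    # Filter on gender, then count and accumulate in the same step.
    if gender == "M":
        boy_count += 1
        boy_sum += maths
    elif gender == "F":
        girl_count += 1
        girl_sum += maths

# Guard against an empty group before dividing.
boy_avg = boy_sum / boy_count if boy_count else None
girl_avg = girl_sum / girl_count if girl_count else None
print(boy_count, boy_sum, girl_count, girl_sum)

# Averages from the full-class totals quoted in the lecture.
lecture_boy_avg = round(1220 / 17, 2)
lecture_girl_avg = round(951 / 13, 2)
print(lecture_boy_avg, lecture_girl_avg)
```

After these seven cards the running totals match the ones spoken in the transcript (271 for the boys, 233 for the girls), and the final division reproduces the 71.76 and 73.15 used to compare the two groups.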