Lecture 23: MapReduce-Tutorial

Estimated read time: 1:20


    Summary

    In Lecture 23 of the Cloud Computing series, the focus is on a MapReduce tutorial. The session begins with a brief recap of the MapReduce programming model, originally developed by Google for processing large-scale search tasks. The lecture then examines how problems can be decomposed into MapReduce tasks, emphasizing parallel computing across numerous processors. Key topics include the fault-tolerant nature of the system, the deployment of MapReduce in implementations such as Hadoop, and practical examples illustrating problem decomposition. The lecture works through an average-calculation problem and a salary-processing example, demonstrating the practical application of this framework in handling large datasets efficiently.

      Highlights

      • Introduction to solving problems using MapReduce. 🔍
      • Recapping the basics and historical context of MapReduce. 📜
      • Understanding the importance of parallel processing in dealing with large datasets. 📊
      • Detailed explanation of the map and reduce phases in the programming model. 🧩
      • Examples showing MapReduce application in averaging and salary computations. 💼

      Key Takeaways

      • MapReduce simplifies processing huge datasets by parallel computing. 🗂️
      • Developed by Google, later adopted widely with open-source versions like Hadoop. 🤖
      • Fault tolerance in MapReduce ensures computations continue even with failures. 🔄
      • MapReduce divides tasks into map and reduce phases, enhancing parallel efficiency. ⚡
      • Practical code examples help understand MapReduce better, even in simple problems. 💡

      Overview

      The lecture starts by explaining how MapReduce is employed to handle problems involving vast amounts of data. The model, initially developed by Google, addresses large-scale data processing effectively. It leverages parallel computing to provide scalable and efficient solutions, often coordinating thousands of processors over distributed storage systems such as GFS and Bigtable.

        Key features of MapReduce, such as its fault-tolerant design, are highlighted: computations progress even if a processor or network link fails. Open-source implementations such as Hadoop are discussed, underlining the accessibility and adaptability of the MapReduce framework across various platforms, including cloud services like Amazon EC2.

          Practical examples are given throughout the session to show how MapReduce can be applied to common problems. Examples include calculating the average from a dataset and analyzing staff salaries by gender. These case studies aim to demonstrate how task decomposition into map and reduce phases makes handling and processing large data more efficient and how MapReduce can transform simple tasks into scalable solutions.
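The map and reduce phases summarized above can be sketched with a minimal, in-memory word-count example in Python. This is an illustrative sketch only (the function names and the single-process driver loop are our own, not part of any Hadoop API); a real deployment would run the mappers and reducers on separate nodes under Hadoop or a similar runtime.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this mapper's chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: group the intermediate pairs by key (word) and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Three "mappers", each handling one chunk of the document.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = []
for chunk in chunks:  # on a real cluster these run in parallel
    intermediate.extend(map_phase(chunk))

result = reduce_phase(intermediate)
print(result["the"])  # → 3
```

Each mapper produces partial counts independently, which is exactly the inherent parallelism the lecture exploits; only the grouping-and-summing step needs to see all the intermediate results.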

            Chapters

            • 00:00 - 03:00: Introduction to MapReduce The chapter introduces the concept of MapReduce. It begins with a tutorial and indicates that previous discussions on MapReduce have already taken place. The focus of this chapter is to work through one or two practical problems or applications of MapReduce.
            • 03:00 - 10:00: MapReduce Framework and Its Components This chapter provides an introduction to the MapReduce framework, which is designed to process large volumes of data efficiently. It covers the decomposition of problems into a MapReduce format and discusses the parallel processing capabilities that make it suitable for handling big data workflows. Originating from Google, MapReduce is now utilized in a variety of large-scale data processing applications.
            • 10:00 - 15:00: Fault Tolerance and Parallelism The chapter begins with a quick recap before tackling one or two problems related to the MapReduce framework. It explains that this model, a programming construct developed by Google, was primarily designed to implement large-scale search tasks. These tasks involve processing massively scalable web data efficiently, which is stored using Bigtable and Google's File System (GFS) distributed storage.
            • 15:00 - 23:00: Example: Word Count using MapReduce The chapter discusses the design objectives of Google in processing and generating large volumes of data. It introduces the concept of massively parallel computation using tens of thousands of processors. The chapter explores the inherent parallelism in data processing and how it can be effectively exploited.
            • 23:00 - 29:00: Hadoop File System and Storage The chapter discusses the design of the Hadoop File System, emphasizing its fault-tolerant nature. It explains that the Hadoop system is engineered to continue processing and computation even if there is a failure in the processor or network. The chapter underlines that maintaining operation despite such failures is not just essential but a fundamental precondition for the system's design.
            • 29:00 - 52:00: Example: Average Calculation using MapReduce The chapter titled 'Average Calculation using MapReduce' discusses the open-source implementation of MapReduce through Hadoop, developed by Yahoo. It is available on Amazon EC2 as pre-packaged AMIs. The chapter elaborates on the history of MapReduce as a parallel programming abstraction that is widely used in various applications for large-scale computations, utilizing thousands of processors.
            • 52:00 - 63:00: Example: Salary Calculation using MapReduce The chapter titled 'Example: Salary Calculation using MapReduce' explains the fault-tolerant implementation of the system on the data side as well as the processor and network sides. This fault tolerance spans the two main phases: the map phase, which divides a problem into intermediate results, and the reduce phase, which reduces those intermediate results into the actual, final results.
            • 63:00 - 65:30: Conclusion and Wrap-up The chapter discusses the concept of parallel efficiency in MapReduce, emphasizing its substantial benefits when handling large volumes of data. It highlights the inherent parallelism in processes, explaining the roles of mapper and reducer processors in assigning tasks based on specific problems.

            Lecture 23: MapReduce-Tutorial Transcription

            • 00:00 - 00:30 Hi. Today we will work through a tutorial on MapReduce. We have already discussed MapReduce, so today we will try to solve one or two problems, or rather see
            • 00:30 - 01:00 how we can decompose a problem into a MapReduce problem and how to work on it. If you remember, the MapReduce paradigm is used for huge volumes of data where parallelization is possible; it was primarily developed by Google and later used in various
            • 01:00 - 01:30 fields. We will start with a couple of slides as a quick recap before we take up one or two problems related to this MapReduce framework. As we discussed already, it is a programming model developed at Google. The basic objective was to implement large-scale search: processing massively scalable web data stored using Bigtable and the GFS distributed
            • 01:30 - 02:00 file system (Google File System). That was the objective Google started with: a design for processing and generating large volumes of data by massively parallel computation, utilizing tens of thousands of processors at a time. So if there is inherent parallelism in a problem, can we exploit it? The framework
            • 02:00 - 02:30 is designed to be fault tolerant, meaning it ensures progress of the computation even if a processor or the network fails. Even with such failures, the work should still go on; that was the basic assumption, or rather the precondition, of the design that was
            • 02:30 - 03:00 taken up. There are several implementations, such as Hadoop, an open-source implementation of MapReduce incidentally developed at Yahoo and available as pre-packaged AMIs on Amazon EC2, and so forth. Beyond its history, MapReduce is a parallel programming abstraction used by many different parallel applications that carry out large-scale computation involving thousands of processors. As we
            • 03:00 - 03:30 have discussed, the underlying implementation is fault tolerant both on the data side and on the processor and network side. The computation is divided into two phases: a map phase, in which the given problem is mapped into intermediate results, and a reduce phase, in which the reduce function reduces those intermediate results to the actual results.
            • 03:30 - 04:00 In doing so, as we saw in our earlier lecture on MapReduce, we can achieve substantial parallel efficiency when we are dealing with a large volume of data and there is inherent parallelism in the process. There are M mapper processors and R reducer processors, which are assigned work based on the problem.
            • 04:00 - 04:30 There is a master controller, M mapper processors, and R reducer processors which work on the problem. In some cases the mapper and reducer processors can sit on the same physical infrastructure; a node may act as a mapper at one stage and as a reducer at a later stage. It is up to the developer
            • 04:30 - 05:00 how to devise the map and reduce functions. The implementation also depends on the language the developer works in; people work in Python, C++, and other languages, so the coding depends on the problem and the environment you are working in.
            • 05:00 - 05:30 But the primary philosophy is this: there are M mappers and a set of reducers, with intermediate results in between. As we discussed earlier, each mapper reads one M-th of the input from the global file system, using locations given by the master; the master controller tells each mapper which chunks to read. The map function consists of a transformation from one key-value pair to another key-value
            • 05:30 - 06:00 pair: (k1, v1) is mapped to (k2, v2). Each mapper writes its computation results as one file per reducer; typically, if there are R reducers, it prepares R files, one per reducer (with one reducer it creates a single file). The files are sorted by key and stored in the local file
            • 06:00 - 06:30 system, where the output of the mapper resides. The master keeps track of the locations of these files. In the reduce phase, the master informs each reducer where the partial computations (each mapper has done a partial computation of the whole process) are stored on the local file systems of the respective mappers. Each reducer then
            • 06:30 - 07:00 makes remote procedure calls to the mappers to fetch its files. Each reducer groups the results of the map stage that share the same key and applies some function f to the list of values for that key; that is, given (k2, v2) pairs, it produces (k2, f(list of v2)). The function may be as simple as averaging, or a frequency or
            • 07:00 - 07:30 count, or some more complex operation. The results are written back to the Google File System, which takes care of them. Now a MapReduce example: there are three mappers and two reducers. The map function in this case, as you may remember from our previous lecture, operates on a
            • 07:30 - 08:00 huge volume of words, and we want to do a word count. Every mapper has a chunk of the data: one mapper has d1, d2, d3, another has d4 ... d7, d8, etcetera. Every mapper does a partial count of the words w1, w2, w3, and so on, and there are two reducers,
            • 08:00 - 08:30 so each mapper creates a file for both reducers. Reducer one is responsible for w1 and w2, while reducer two handles w3 and w4, and together they do the word count. So there is a mapping function where the partial counts are produced, and a reducing function that sums up the counts for each word; that is how the word count problem is divided into a MapReduce problem. In the last
            • 08:30 - 09:00 lecture we showed that this can give parallel efficiency in the system. We now look at a couple of problems. This first one is not exactly a MapReduce problem; it concerns the Hadoop file system (HDFS) or GFS (Google File System). The block size is 64 MB; if you remember, these file systems use a much larger block size than a normal file system. Another point
            • 09:00 - 09:30 was that there are three replicas of every instance of the data; these replicas give you a fault-tolerant mode, and read and write operations are based on that. So in this particular problem: if the HDFS block size is 64 MB, we want to find out,
            • 09:30 - 10:00 given three files of 64 KB, 65 MB, and 127 MB, how many blocks will be created by the HDFS framework. For the 64
            • 10:00 - 10:30 KB file, one block is created; 65 MB needs two, because one block covers the first 64 MB; and 127 MB also needs two. So the total is five. But in reality there are replicas, typically three, so the effective
            • 10:30 - 11:00 number of blocks will be 5 × 3 = 15. This is very straightforward; there is no complexity in it. For different file sizes we can calculate it the same way. Again, this has nothing to do with MapReduce directly, but nevertheless
            • 11:00 - 11:30 the data is stored either in HDFS (in the open-source case) or in GFS (the Google File System), and the storage needs to be budgeted when you are working with large data sets: how much storage you require to work on this type of data set. Now let us see one problem, again a very straightforward one, on the MapReduce
            • 11:30 - 12:00 framework: whether we can apply the MapReduce framework to it. Again, we may not appreciate it much because of the simplicity of the problem, but it is good for understanding the framework. We want to write pseudo-code, or code in any language, to calculate
            • 12:00 - 12:30 the average of a set of integers in MapReduce. A set of integers is pumped into the system; it may be direct input from the keyboard or elsewhere, and we want to find the average of the set. In this typical case we have the set of integers A = {10, 20, 30, 40, 50};
            • 12:30 - 13:00 we want to compute the average, which means summing the values and dividing by the cardinality, here five. What we do: the master node, say, considers three mapper nodes and one reducer node. The
            • 13:00 - 13:30 master node divides the data among the mappers m1, m2, m3: it gives 10 and 20 to the first, 30 and 40 to the second, and 50 to the third. Each mapper then does a partial computation on
            • 13:30 - 14:00 its portion: it averages its own values. What comes out is an (average, count) pair: the first produces (15, 2), the second (35, 2), and the third (50, 1). In other words, the temporary local file
            • 14:00 - 14:30 system stores (15, 2), (35, 2), (50, 1): the output of the mappers, produced by the combiner. That is what the map function achieves. On the reduce side there
            • 14:30 - 15:00 is one reducer; it takes all the pairs and averages the whole thing. It computes 15 × 2 + 35 × 2 + 50 × 1, so the total sum is 150 and the total count is 5, which gives 150 / 5 = 30. That is, it is
            • 15:00 - 15:30 weighting each partial average by its count and summing up; that is exactly what the reducer does, so that
            • 15:30 - 16:00 is the whole computation. As the problem is pretty straightforward, you may not see what the big deal is; but if the number of values is very high, coming as a stream, then the chunks can be processed in parallel by the mappers and reduced by the single reducer. If we write the code for it, we can use any language;
            • 16:00 - 16:30 here we are using Python, or a Python-like notation; the language does not matter, and you can use any pseudo-code representation. So we have the mapper function, say mapper.py, in Python style; we are not giving much importance to the syntax, rather we give importance to
            • 16:30 - 17:00 the concept. We define map(l), where l is the list; we initialize sum = 0, and then for i in range(0, len(l)) we accumulate sum = sum
            • 17:00 - 17:30 + l[i]. Every mapper does this: it takes the chunk of data allocated to it by the master node. In our case mapper one has two values, mapper two has two values, and mapper
            • 17:30 - 18:00 three has one value. For every mapper we then compute the average as sum divided by len(l), which here is
            • 18:00 - 18:30 two. More strictly speaking, we should use floating-point division, because otherwise there may be integer division; so ideally we write sum * 1.0 / len(l), where len(l) is two in this case. Then we output this, say with a print
            • 18:30 - 19:00 function, to the local file system; the exact command may differ depending on the programming paradigm. The mapper basically emits these data, and they are stored in the local file system. So for every mapper we read the list of data assigned by the master node, initialize the sum, add
            • 19:00 - 19:30 up the values in the loop, compute the average (the multiplication by 1.0 is just to force a float division), and then emit or dump that value, along with the count, into the local file system, which the reducer will
            • 19:30 - 20:00 read. That is the mapper portion. If we look at the reducer portion, we define reduce(), and what it reads is
            • 20:00 - 20:30 whatever the mappers have dumped: each record gives a partial average value and a count. Here also we read each record, and initialize sum = 0;
            • 20:30 - 21:00 then for i in range(0, len(l)) we accumulate, and what it is doing,
            • 21:00 - 21:30 if you look at our previous slide, is calculating exactly those weighted values: sum accumulates average × count, and the count also accumulates, count = count + the length from each record.
            • 21:30 - 22:00 Finally we get average = sum × 1.0 / count (again multiplying by 1.0 to keep it a float, or you can typecast), and then we print the average. So what the reducer
            • 22:00 - 22:30 is doing: as there is only one reducer, it takes all the values from the local file systems and sums up the output from each mapper. In this case there are three, so it comes to (15 × 2 + 35 × 2 + 50 × 1) divided by the count, which is
            • 22:30 - 23:00 2 + 2 + 1 = 5. It calculates the average value and then writes the average back to the Google file system or Hadoop file system, based on whatever the requirement is. So although this may be a straightforward, simple problem, we see that a problem with inherent parallelism can be divided: in order
            • 23:00 - 23:30 to do the averaging, we take chunks of data and solve the problem in the MapReduce framework. If there is a huge volume of data, the master node divides it accordingly, the mappers do the partial computation, and the reducer reads their output and does the final computation. So this is again a simple example of the MapReduce framework. Next we see another problem:
            • 23:30 - 24:00 we want to compute the total and average salary of organization XYZ, grouped by gender, using MapReduce. The input is (name, gender, salary): in this case, say, the name is John, the gender is M (male),
            • 24:00 - 24:30 and the salary is some ten thousand units, maybe ten thousand dollars; the next record is Martha, gender F, salary fifteen thousand. What we want to find out, gender-wise (male and female), is the total and the average salary; the total divided by the cardinality gives the average
            • 24:30 - 25:00 salary. The output will be in that form. So let us see whether we can employ MapReduce for this particular problem. What we have is a tuple: (name,
            • 25:00 - 25:30 gender, salary). In the map phase, whatever the input data set is, we want to extract only part of it, because we are not interested in the name of the person; the name is not required for this
            • 25:30 - 26:00 query. We want to extract just two things: the gender, say M, and the salary, irrespective of the name. In other words, we
            • 26:00 - 26:30 can have a key-value pair, where the key is male or female and the value is the salary; or we can say we have a sort of dictionary structure holding key-value pairs, say dict1 and dict2, and for each one we can have
            • 26:30 - 27:00 an id with the total and average: dict1 and dict2, where the id in this case is male or female, and against it the total or average salary. So we compute the total and average for the two separately; say we use a dictionary structure,
            • 27:00 - 27:30 and the reducer basically does this. Let us see how to realize it. We have a set of (name, gender, salary) tuples and want to extract the gender and the salary from every tuple; if we have multiple mappers, each extracts these and dumps them as two kinds of dictionary
            • 27:30 - 28:00 entries, one keyed by M and one by F, and at the reducer side we calculate the total and the average salary for each. Again we write this as mapper.py, some Python-
            • 28:00 - 28:30 type code. I want to repeat that you can do this in any suitable coding language, and what we write may have some syntactical problems as actual Python, but it does not matter; conceptually we want to show how it works, and the actual syntax needs
            • 28:30 - 29:00 to be followed if you really want to implement it. So in the mapper, for each line in sys.stdin, what we do is the following:
            • 29:00 - 29:30 we consider that the fields are separated by a comma, so we split the line on the comma; then the name is line[0],
            • 29:30 - 30:00 the gender is line[1], and the salary is line[2]. While emitting the mapper-
            • 30:00 - 30:30 phase data into the local file system, we print something like "%s,%d" with
            • 30:30 - 31:00 the gender and the salary (the exact format syntax you need to check); in other words, we dump the gender and the salary, as you see we want to generate this (M, salary)
            • 31:00 - 31:30 pair, and likewise either M or F together with the salary. Then in the reducer phase, what we do is import sys,
            • 31:30 - 32:00 define a dictionary, call it dict_org, and then for each line in sys.stdin, what we
            • 32:00 - 32:30 do is read what the mappers emitted: the key-value pair with the gender and the salary value. We do not have the name in there, because this particular query does not require the name. So the gender is line[0] and the salary is line[1].
            • 32:30 - 33:00 If the key already exists, that means it has been seen before, then in dict_org
            • 33:00 - 33:30 we append the salary: our objective is to sum up the salaries
            • 33:30 - 34:00 by going on adding the salary values for the same gender. So in the reducer dictionary, keyed by gender, if the key (male or female) is already there, we go on appending the values; if it is not there, the else branch is the initialization: dict_org[gender]
            • 34:00 - 34:30 is initialized with an empty list, so the first time a gender comes
            • 34:30 - 35:00 in, the list is empty; we initialize it and then append the salary, meaning it starts with that salary value. Then the average is sum(dict_org[gender]) divided by len(dict_org[gender])
            • 35:00 - 35:30 — summing up and dividing, straightforward — and the total
            • 35:30 - 36:00 salary is simply sum(dict_org[gender]). Then we write the result back to the GFS or HDFS file system, separating
            • 36:00 - 36:30 the fields by a comma or a tab as the case may be, using the integer or float format as appropriate; so we output gender, total salary, and
            • 36:30 - 37:00 average salary. That is what we do in the final reduce phase. To quickly recap: in the mapping function we have three fields, name, gender, and salary, and our objective in the
            • 37:00 - 37:30 mapping function is for all the mappers to find, for each record, which gender it is (M or F) and keep the salary along with that gender; so the mappers emit exactly (gender, salary), and the reducer extracts that intermediate result and calculates the
            • 37:30 - 38:00 average and the total. That is the operation. This is typical Python-like code; I am not strictly calling it Python because there may be some syntactical issues, but you can implement it in anything. The idea is that we divide the problem into smaller parallel pieces via the mappers, and then in the second phase
            • 38:00 - 38:30 the reducer produces another key-value pair: the mapper maps key-value pairs from the input set to a set of intermediate key-value pairs; the reducer takes those key-value pairs and applies a function (in this case average or total) to produce another set of key-value pairs, and finally the result goes to the HDFS or GFS file system. So what we tried to see today is, for
            • 38:30 - 39:00 simple problems, how we can cast them into map and reduce. The number of mappers depends on the availability of resources and how the master node divides the work; the number of reducers depends on what type of function you want to compute. The master node divides the work among M mappers and R reducers, and the functionality of the problem is divided in such a way that it can be executed
            • 39:00 - 39:30 in two phases, and we can have a parallel implementation of this sort of paradigm. OK, thank you.
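The HDFS block-counting exercise from the lecture (64 MB blocks, three-way replication) can be checked with a short Python helper. The function name and interface are illustrative only, not part of any Hadoop API.

```python
import math

BLOCK_MB = 64      # HDFS block size from the lecture
REPLICATION = 3    # default three-way replication

def hdfs_blocks(file_sizes_mb, block_mb=BLOCK_MB, replication=REPLICATION):
    """Return (logical blocks, physical blocks including replicas)."""
    logical = sum(math.ceil(size / block_mb) for size in file_sizes_mb)
    return logical, logical * replication

# Files from the lecture: 64 KB (= 64/1024 MB), 65 MB, 127 MB
logical, physical = hdfs_blocks([64 / 1024, 65, 127])
print(logical, physical)  # → 5 15
```

As in the lecture: 1 + 2 + 2 = 5 logical blocks, and 5 × 3 = 15 blocks once the three replicas are counted.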
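The average-calculation example walked through in the lecture can be written as a self-contained Python sketch. As in the lecture, each mapper emits a (partial average, count) pair and a single reducer combines them; the in-memory driver below stands in for the local-file-system handoff and master-node scheduling described in the lecture, so it is a sketch of the logic, not a runnable Hadoop job.

```python
def mapper(chunk):
    """Emit (partial average, count) for this mapper's chunk of integers."""
    s = sum(chunk)
    return (s * 1.0 / len(chunk), len(chunk))  # * 1.0 forces float division, as in the lecture

def reducer(partials):
    """Weight each partial average by its count, then divide by the total count."""
    total = sum(avg * count for avg, count in partials)
    n = sum(count for _, count in partials)
    return total / n

# The master splits A = {10, 20, 30, 40, 50} across three mappers.
partials = [mapper([10, 20]), mapper([30, 40]), mapper([50])]
print(partials)           # [(15.0, 2), (35.0, 2), (50.0, 1)]
print(reducer(partials))  # → 30.0
```

This reproduces the lecture's numbers: (15 × 2 + 35 × 2 + 50 × 1) / (2 + 2 + 1) = 150 / 5 = 30.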
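The salary example can be sketched the same way. The mapper drops the name and emits (gender, salary) pairs; the reducer groups by gender and produces totals and averages, mirroring the dictionary-based reducer in the lecture. This is again an in-memory sketch (a real job would stream lines via sys.stdin under Hadoop Streaming), and the third record, Alex, is an extra illustrative datum not in the lecture.

```python
from collections import defaultdict

def mapper(lines):
    """Parse 'name,gender,salary' lines; emit (gender, salary) pairs.
    The name is dropped because the query does not need it."""
    out = []
    for line in lines:
        name, gender, salary = line.strip().split(",")
        out.append((gender, int(salary)))
    return out

def reducer(pairs):
    """Group salaries by gender; return {gender: (total, average)}."""
    by_gender = defaultdict(list)
    for gender, salary in pairs:
        by_gender[gender].append(salary)
    return {g: (sum(v), sum(v) / len(v)) for g, v in by_gender.items()}

records = ["John,M,10000", "Martha,F,15000", "Alex,M,20000"]
pairs = mapper(records)  # in practice, split across several mappers
print(reducer(pairs))    # {'M': (30000, 15000.0), 'F': (15000, 15000.0)}
```

Keeping all salaries in a per-gender list, as the lecture's dict_org does, makes both the total (sum) and the average (sum divided by length) available at the end of the reduce phase.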