Summary
In Lecture 23 of the Cloud Computing series, the focus is on a MapReduce tutorial. The session begins with a brief recap of the MapReduce programming model, originally developed by Google for processing large-scale search tasks. The lecture then shows how problems can be decomposed into MapReduce tasks, emphasizing parallel computing across numerous processors. Key topics include the fault-tolerant nature of the system, open-source implementations such as Hadoop, and practical examples of problem decomposition: an average-calculation problem and a salary-processing example, demonstrating how the framework handles large datasets efficiently.
Highlights
Introduction to solving problems using MapReduce. 🔍
Recapping the basics and historical context of MapReduce. 📜
Understanding the importance of parallel processing in dealing with large datasets. 📊
Detailed explanation of the map and reduce phases in the programming model. 🧩
Examples showing MapReduce application in averaging and salary computations. 💼
Key Takeaways
MapReduce simplifies processing huge datasets by parallel computing. 🗂️
Developed by Google, later adopted widely with open-source versions like Hadoop. 🤖
Fault tolerance in MapReduce ensures computations continue even with failures. 🔄
MapReduce divides tasks into map and reduce phases, enhancing parallel efficiency. ⚡
Practical code examples help understand MapReduce better, even in simple problems. 💡
Overview
The lecture starts by explaining how MapReduce is employed to handle problems involving vast amounts of data. The model, developed initially by Google, addresses large-scale data processing effectively. It leverages parallel computing to provide scalable and efficient solutions, often employing thousands of processors over data stored in systems like GFS and Bigtable.
Key features of MapReduce, such as its fault-tolerant design, are highlighted, allowing computations to progress even if some network or processor fails. The involvement of open-source implementations like Hadoop is discussed, which underlines the accessibility and adaptability of the MapReduce framework across various platforms, including cloud services like Amazon EC2.
Practical examples are given throughout the session to show how MapReduce can be applied to common problems. Examples include calculating the average from a dataset and analyzing staff salaries by gender. These case studies aim to demonstrate how task decomposition into map and reduce phases makes handling and processing large data more efficient and how MapReduce can transform simple tasks into scalable solutions.
Chapters
00:00 - 03:00: Introduction to MapReduce The chapter introduces the concept of MapReduce. It begins with a tutorial and indicates that previous discussions on MapReduce have already taken place. The focus of this chapter is to work through one or two practical problems or applications of MapReduce.
03:00 - 10:00: MapReduce Framework and Its Components This chapter provides an introduction to the MapReduce framework, which is designed to process large volumes of data efficiently. It covers the decomposition of problems into a MapReduce format and discusses the parallel processing capabilities that make it suitable for handling big data workflows. Originating from Google, MapReduce is now utilized in a variety of large-scale data processing applications.
10:00 - 15:00: Fault Tolerance and Parallelism The chapter begins with a quick recap before tackling one or two problems related to the MapReduce framework. It explains that this model, a programming construct developed by Google, was primarily designed to implement large-scale search tasks. These tasks involve processing massively scalable web data efficiently, which is stored using Bigtable and Google's File System (GFS) distributed storage.
15:00 - 23:00: Example: Word Count using MapReduce The chapter discusses the design objectives of Google in processing and generating large volumes of data. It introduces the concept of massively parallel computation using tens of thousands of processors. The chapter explores the inherent parallelism in data processing and how it can be effectively exploited.
23:00 - 29:00: Hadoop File System and Storage The chapter discusses the design of the Hadoop File System, emphasizing its fault-tolerant nature. It explains that the Hadoop system is engineered to continue processing and computation even if there is a failure in the processor or network. The chapter underlines that maintaining operation despite such failures is not just essential but a fundamental precondition for the system's design.
29:00 - 52:00: Example: Average Calculation using MapReduce The chapter titled 'Average Calculation using MapReduce' discusses the open-source implementation of MapReduce through Hadoop, developed by Yahoo. It is available on Amazon EC2 as pre-packaged AMIs. The chapter elaborates on the history of MapReduce as a parallel programming abstraction that is widely used in various applications for large-scale computations, utilizing thousands of processors.
52:00 - 63:00: Example: Salary Calculation using MapReduce The chapter titled 'Example: Salary Calculation using MapReduce' explains the fault-tolerant implementation of the system on the data, processor, and network sides, and describes the framework's two main phases: the map phase, which divides a problem and produces intermediate results, and the reduce phase, which reduces these intermediate results into the actual, final results.
63:00 - 65:30: Conclusion and Wrap-up The chapter discusses the concept of parallel efficiency in MapReduce, emphasizing its substantial benefits when handling large volumes of data. It highlights the inherent parallelism in processes, explaining the roles of mapper and reducer processors in assigning tasks based on specific problems.
Lecture 23: MapReduce Tutorial Transcription
00:00 - 00:30 Hi. Today we will discuss a tutorial on MapReduce. We have already discussed MapReduce, so today we will try to solve one or two problems, or rather try to see
00:30 - 01:00 how we can decompose a problem into a MapReduce problem and how to work on it. If you remember, the MapReduce paradigm is used for huge volumes of data where parallelization is possible. It was primarily developed by Google and later used in various
01:00 - 01:30 fields. What we will do is have a quick recap over a couple of slides before we take up one or two problems related to this MapReduce framework. As we discussed already, it is a programming model developed at Google; the basic objective was to implement large-scale search, processing massively scalable web data stored using Bigtable and the GFS distributed
01:30 - 02:00 file system, that is, the Google File System and Bigtable. That was the objective Google started with: it is designed for processing and generating large volumes of data by massively parallel computation, utilizing tens of thousands of processors at a time.
So there are a huge number of processors, to the tune of tens of thousands, and the question is: if there is inherent parallelism in the problem, can we exploit it?
02:00 - 02:30 The system is designed to be fault tolerant, which means it ensures progress of the computation even if a processor or the network fails. It should be fault tolerant to the extent that even if there is a failure in the processors or the network, the work should still go on. That was the basic assumption, or I would say the precondition, that was
02:30 - 03:00 taken up. There are several implementations, like Hadoop, an open-source implementation of MapReduce; incidentally it was developed at Yahoo, and it is available as pre-packaged AMIs on Amazon EC2, and so forth.
Apart from its history, if we look at it, it is a parallel programming abstraction used by many different parallel applications which carry out large-scale computation involving thousands of processors. It is, again, as we
03:00 - 03:30 have discussed, an underlying fault-tolerant implementation, both on the data side and on the processor and network side; everything is fault tolerant. The model is divided into two phases: a given problem is mapped into intermediate results in the map phase, and the reduce function, or reduce phase, reduces those intermediate results to the actual results.
03:30 - 04:00 In doing so, as we have seen in our earlier lecture, or earlier discussion on MapReduce, we can achieve substantial parallel efficiency when we are dealing with a large volume of data and there is inherent parallelism in the process. So, looking at it, there are M mapper processors and R reducer processors which are assigned work based on the problem.
04:00 - 04:30 There is a master controller, M mapper processors, and, say, R reducer processors which work on the problem. In some cases the mapper and reducer processors can sit on the same physical infrastructure; that means a node may sometimes act as a mapper and at a later stage act as a reducer. It is up to the capability of the developer,
04:30 - 05:00 that is, how you devise these map and reduce functions. The implementation is also based on whatever language the developer is working in; people work in Python, it can be C++ or another type of coding, so the coding part depends on what sort of problem and what environment you are working in.
05:00 - 05:30 But the primary philosophy is that there are M mappers and a set of reducers, with intermediate results in between. As we discussed earlier, each mapper reads one M-th of the input from the global file system, using locations given by the master; the master controller says which chunks you need to read. The map function consists of a transformation from one key-value pair to another key-value
05:30 - 06:00 pair: here (k1, v1) is mapped to (k2, v2). Each mapper writes its computation results to one file per reducer; typically, if there are R reducers, it prepares one file for each of the R reducers, and if there is one reducer it creates one file. Files are sorted by key and stored in the local file
06:00 - 06:30 system, where the outputs of the mappers are stored, and the master keeps track of the locations of these files. In the reduce phase, the master informs each reducer where the partial computations (because each mapper has done a partial computation of the whole process) have been stored on the local file systems of the respective mappers. Knowing, for the respective mappers, where the files for a particular reducer are stored, the reducer
06:30 - 07:00 makes remote procedure call requests to the mappers to fetch the files. Each reducer groups the results of the map step by key and performs a function f on the list of values corresponding to each key; that is, (k2, list of v2) is mapped to (k2, f(list of v2)). The function may be as simple as averaging, maybe a frequency or
07:00 - 07:30 count, or some complex function doing more operations. The results are written back to the Google File System, which takes care of them. Now, a MapReduce example: there are three mappers and two reducers, and the map function, in this case, as you may remember from our previous lecture, operates on a
07:30 - 08:00 huge volume of words, and what we want to do is a word count. Every mapper has a chunk of the data: this mapper has d1, d2, d3, this one has d4, ..., another d7, d8, etcetera. Every mapper does a partial count of words w1, w2, w3, and so forth, and there are two reducers,
08:00 - 08:30 so each mapper creates a file for both reducers. Reducer one is responsible for w1 and w2, while reducer two is for w3 and w4, and we do a word count. There is a map function where this partial counting is done, and a reduce function which is basically the summation of the counts for each word, say w1. That is how we divide this word-count problem into a MapReduce problem.
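The word-count decomposition above can be sketched as a small single-process simulation. This is a sketch, not Hadoop code: the three chunks, the w1..w4 vocabulary, and the partition rule (reducer one gets w1/w2, reducer two gets w3/w4) follow the lecture's example, while the function names are illustrative assumptions.

```python
from collections import Counter
from itertools import chain

# Hypothetical document chunks, one list of words per mapper (three mappers).
chunks = [
    ["w1", "w2", "w1"],        # mapper 1
    ["w2", "w3", "w3", "w4"],  # mapper 2
    ["w1", "w4"],              # mapper 3
]

def map_phase(chunk):
    """Each mapper emits a partial count of the words in its chunk."""
    return Counter(chunk)

def partition(word):
    """Key-based split: reducer 0 handles w1/w2, reducer 1 handles w3/w4."""
    return 0 if word in ("w1", "w2") else 1

def reduce_phase(pairs):
    """Each reducer sums the partial counts it receives for its keys."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

partial = [map_phase(c) for c in chunks]
# Route each (word, partial-count) pair to its reducer's input "file".
reducer_inputs = [[], []]
for word, count in chain.from_iterable(p.items() for p in partial):
    reducer_inputs[partition(word)].append((word, count))

result = {}
for inp in reducer_inputs:
    result.update(reduce_phase(inp))
print(result)  # {'w1': 3, 'w2': 2, 'w3': 2, 'w4': 2}
```

The partition function plays the role of the per-reducer output files the lecture describes: it decides which reducer is responsible for which keys.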
08:30 - 09:00 In the last lecture we have shown that this can give parallel efficiency in the system. So now we look at a couple of problems. The first is not exactly a MapReduce problem; it is just to go over the Hadoop file system (HDFS) or GFS (Google File System). Suppose the block size is 64 MB; if you remember, these file systems use a much larger block size than a normal file system, and another thing
09:00 - 09:30 was that there are three replicas of every instance of the data, which allows a fault-tolerant mode, and based on that the read and write operations happen. In this particular problem, if the HDFS block size is 64 MB, then we want to find out,
09:30 - 10:00 if there are three files of 64 KB, 65 MB, and 127 MB, how many blocks will be created by the HDFS framework. For the 64
10:00 - 10:30 KB file, one block will be created; for 65 MB we have two, because one block covers up to 64 MB; and 127 MB also needs two. So five in total. But in reality there are replicas, typically three, so the effective
10:30 - 11:00 block count will be 5 × 3 = 15 blocks. This is very straightforward, nothing
complex in it. If I have different file sizes, we can calculate this the same way. Again, this has nothing to do with MapReduce directly.
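The block arithmetic above can be checked with a few lines of code. This is a sketch using the example's numbers (64 MB blocks, replication factor 3, files of 64 KB, 65 MB, and 127 MB); the function names are illustrative.

```python
import math

BLOCK_SIZE_MB = 64
REPLICATION = 3

# File sizes from the example: 64 KB, 65 MB, 127 MB.
file_sizes_mb = [64 / 1024, 65, 127]

def blocks_for(size_mb):
    # HDFS allocates whole blocks; even a 64 KB file occupies one block entry.
    return max(1, math.ceil(size_mb / BLOCK_SIZE_MB))

logical = sum(blocks_for(s) for s in file_sizes_mb)  # 1 + 2 + 2 = 5
physical = logical * REPLICATION                     # 5 * 3 = 15
print(logical, physical)  # 5 15
```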
11:00 - 11:30 Nevertheless, the data is stored either in HDFS, in the open-source case, or in GFS if it is the Google File System, and this storage needs to be budgeted when you are working with large data sets: how much storage do you require to work on this type of data set. Now let us see a problem, again a very straightforward one, on the MapReduce
11:30 - 12:00 framework. Again, we may not appreciate it very much, because of the simplicity of the problem, but it is good for understanding the MapReduce framework. We want to write the pseudo-code, or code in any language, for the following: calculate
12:00 - 12:30 the average of a set of integers in MapReduce. A set of integers is being pumped into the system; it may be direct input from the keyboard or something similar, and we want to find the average of this set of integers. In this typical case I have the set of integers A = {10, 20, 30, 40, 50}.
12:30 - 13:00 So a set of integers is there and I want to compute the average; in other words, we want to sum it up and divide by the cardinality, the total divided by 5 here. In this case, say we consider three mapper nodes and one reducer.
13:00 - 13:30 What the master node does is divide the data among the mappers, say m1, m2, m3, the three mapper nodes, giving a portion of the data to each: say 10 and 20 to the first one, 30 and 40 to the second, and 50 to the third. Each mapper does a partial computation on
13:30 - 14:00 its portion: it averages its own values. What comes out is an (average, count) pair: the first one is (15, 2), the next is (35, 2) with cardinality two, and the last is (50, 1).
14:00 - 14:30 In other words, the temporary local file system stores (15, 2), (35, 2), and so on, which is basically the output of the mappers, produced via the combiner. This is what the map function wants to achieve.
14:30 - 15:00 On the reduce side there is one reducer; it takes all the partial results and does an averaging of the whole thing. In other words it takes 15 × 2 here, 35 × 2, and 50 × 1; the total sum is 150 and the total count is 5, so it computes 150 / 5, which is
15:00 - 15:30 30. It is basically 15 × 2 plus the other partial results weighted by their counts, divided by the total count. That is exactly what it does.
15:30 - 16:00 As this problem is pretty straightforward, or simple, you may not find it a big deal; but if the number of values is very high, coming in as a stream, then I can process them in parallel: these chunks are processed in parallel and reduced by one particular reducer. So if we write the code for it, you can
use any language to do that. Here
16:00 - 16:30 we are using Python, or a Python-type language; the language does not matter, and you can use any pseudo-code representation. So you have the mapper function, say mapper.py, which is something Python-like. We are not giving much importance to the syntax; rather we are giving importance to
16:30 - 17:00 the concept.
So: def map(l), where l is the list. What we do is initialize sum = 0; then, for i in range(0, len(l)), we set sum equal
17:00 - 17:30 to sum + l[i]. Every mapper does this:
every mapper takes the chunk of data which has been allocated to it by the master node. In our case, how much data does each have? Mapper one has two values, mapper two has two values, and mapper
17:30 - 18:00 three has one value, so it is two each for the first two. For every mapper we do this, and then the average is sum divided by len(l), in this case sum
18:00 - 18:30 divided by a length of two. Or, more strictly speaking, we should use a floating-point division, because otherwise there may be integer division; so we can ideally write sum * 1.0 divided by len(l), with len(l) in this case being two.
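The integer-division pitfall mentioned here is a Python 2 behavior; in Python 3 the / operator already performs true division. A quick illustration (the sample values are arbitrary, chosen so the two divisions differ):

```python
values = [10, 25]  # arbitrary sample where floor and true division differ

floor_avg = sum(values) // len(values)         # floor division: 17
trick_avg = sum(values) * 1.0 / len(values)    # the lecture's "* 1.0" trick: 17.5
true_avg = sum(values) / len(values)           # Python 3 true division: 17.5

print(floor_avg, trick_avg, true_avg)  # 17 17.5 17.5
```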
And then we output this; let us use the print
18:30 - 19:00 function to output this to the local file system. The exact command may be different if you are using a different programming language. So the mappers basically emit these data,
which are stored in the local file system. So what we are doing: for every mapper we read the list of data assigned to it by the master node, initialize the sum to zero, then
19:00 - 19:30 in the loop we calculate the sum and make an average out of it (the multiplication by 1.0 is nothing but to force a float division), and then the mapper emits, or dumps, that value into the local file system, which the reducer will
19:30 - 20:00 read. So this is the mapper portion. Now if we look at the reducer portion,
what we have is def reduce. What does it read? Whatever
20:00 - 20:30 the mappers have dumped: if you look at it, each mapper has emitted an average value and a count. Here also we read those particular values, and set sum = 0.
20:30 - 21:00 Then, for i in range(0, len(l)):
21:00 - 21:30 what is it doing? If we look at our previous slide, it is basically trying to calculate these values, accumulating sum = sum + avg[i] * count[i], and also count = count + count[i].
21:30 - 22:00 Finally we get the average: again multiply the sum by 1.0 to make it a float (or you can typecast), divide by count, and then print the average. So what the reducer
22:00 - 22:30 is doing is reading from the local file systems; as there is only one reducer, it takes all the values, and over all that data it goes on summing up the outputs from each mapper. In this case there are three, so it comes to 15 × 2 + 35 × 2 + 50 × 1, divided by the count, which is
22:30 - 23:00 2 + 2 + 1 = 5. So it calculates the average value and then writes the average back to the Google File System or Hadoop file system, as the requirement may be.
So although this may be a straightforward, simple thing, we see that I can divide a problem that has inherent parallelism: in order
23:00 - 23:30 to do an averaging, I have taken chunks of data and we have tried to solve it using the MapReduce framework.
If there is a huge volume of data, then the master node divides it among the mappers accordingly, they do the partial computation, and the reducer reads from them and does the final computation. So this is again a simple example of the MapReduce framework.
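The average walkthrough above can be condensed into a small, self-contained simulation. This is a sketch, not Hadoop code: the data, the three-way chunking, and the (average, count) intermediate pairs follow the lecture's example; the function names are illustrative assumptions.

```python
data = [10, 20, 30, 40, 50]

# The master assigns chunks to three mappers, as in the lecture: [10, 20], [30, 40], [50].
chunks = [data[0:2], data[2:4], data[4:5]]

def map_fn(chunk):
    """Each mapper emits (partial_average, count) for its chunk."""
    total = 0
    for x in chunk:
        total += x
    # Multiply by 1.0 to force float division (the lecture's Python 2 idiom).
    return (total * 1.0 / len(chunk), len(chunk))

def reduce_fn(pairs):
    """The single reducer recombines the weighted partial averages."""
    total, count = 0.0, 0
    for avg, n in pairs:
        total += avg * n   # 15*2 + 35*2 + 50*1 = 150
        count += n         # 2 + 2 + 1 = 5
    return total / count

intermediate = [map_fn(c) for c in chunks]  # [(15.0, 2), (35.0, 2), (50.0, 1)]
print(reduce_fn(intermediate))              # 30.0
```

Note that the reducer must weight each partial average by its count; simply averaging the three averages (15, 35, 50) would give the wrong answer when chunk sizes differ.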
Next we see another problem.
23:30 - 24:00 It says: I want to compute the total and average salary of some organization XYZ, grouped by gender, using MapReduce. The input is name, gender, and salary; in this case, say, the name is John, the gender is M (male),
24:00 - 24:30 and the salary is ten thousand units, maybe ten thousand dollars or something; the next one is Martha, gender F, salary fifteen thousand.
What we want to find is, gender-wise, in this case male and female, the total and the average salary; the total divided by the cardinality will be the average
24:30 - 25:00 salary. The output will be in that form. So we try to see whether we can employ MapReduce on this particular problem. What we have is a tuple of name,
25:00 - 25:30 gender, and salary. So what do we want to do in the map phase? Whatever the input data is (there are different tuples in the input data set), we want to extract only two fields, because we are not bothered about the name of the person; it is not required for this query, only the gender and
25:30 - 26:00 the salary. So we want to extract these two things from each tuple, say (M, salary), irrespective of the rest.
26:00 - 26:30 I can treat this as a key-value pair, where the key is male or female and the value is the salary. Or we can say we have a sort of dictionary structure holding key-value pairs, say dict1 and dict2, two such key-value structures,
26:30 - 27:00 where dict1 may hold the total and average for one gender and dict2 for the other. The id in this case is male or female, and against it we keep the total and average salary. So we compute the total and average for the two genders separately; say we consider a dictionary structure,
27:00 - 27:30 and the reducer basically does this computation. So we want to see how to realize this. Let us look at the problem: I have a set of (name, gender, salary) tuples and want to extract the gender and the salary from every tuple. If I have multiple mappers, each extracts those fields and dumps them as a particular dictionary-
27:30 - 28:00 type output, one with M and one with F, and at the reducer part we calculate the total and average salary. So again we look at
a mapper.py, or some Python-
28:00 - 28:30 type code. Again, I just want to repeat that you can do this in any coding language which is suitable, and whatever we write may have some syntactical problems with respect to actual Python, but that does not matter; conceptually we want to show how it works, and the actual syntax needs
28:30 - 29:00 to be followed if you really want to implement
this. So: for line in sys.stdin, what are we doing?
29:00 - 29:30 We are assuming the fields are separated by a comma, so we split the line on the comma.
29:30 - 30:00 I now separate out name = line[0] and the salary, sal = line[2].
While generating, or emitting, the mapper-
30:00 - 30:30 phase data into the local file system, we print something like "%s,%d" with
30:30 - 31:00 gender and salary. In other words (this syntax you need to check), we basically dump the gender and the salary, that is, M and the salary portion; as you see, we want to generate the M and the salary
31:00 - 31:30 portion, the key being either M or F and the value the salary portion. In the reducer phase,
what we do is: import sys,
31:30 - 32:00 and define, or call, dict_org, which is a dictionary. For line in sys.stdin, what
32:00 - 32:30 is it reading? It is reading basically the key-value pair with the gender as the key and the salary as the value; we do not have the name in there, because this particular query does not require the name. So gender = line[0] and sal = line[1].
32:30 - 33:00 Then, if the key
33:00 - 33:30 already exists, that means we have read this gender before and dict_org already has it. Our
33:30 - 34:00 objective is basically to sum up the salaries, going on adding the salary values for the same gender type. So if the key, the gender (male or female), is already in the reducer's dictionary of key-value pairs, I go on appending those salary values whenever I get one. If it is not there, the else branch is basically the initialization: dict_org[gender]
34:00 - 34:30 is initialized with a blank list. So the first
34:30 - 35:00 time a key comes, there is a blank list; it is initialized and the salary is appended, which means it is initialized with that salary. Then for each key, sal_avg = sum(dict_org[gender]) divided by len
35:00 - 35:30 (dict_org[gender]); it is summing up and dividing, straightforward. And the total
35:30 - 36:00 salary, total_sal, is simply sum(dict_org[gender]). Then we basically write the result back to the Google (GFS) or
36:00 - 36:30 HDFS file system, separating it by a comma or a tab as the case may be, using %d if it is an integer or %f if it is a float, and so on.
So we emit this as gender, total_sal, and sal_
36:30 - 37:00 avg. That is what we do at the final reduce phase. If we quickly have a look at what we are doing: in the mapping function we have three fields, name, gender, and salary, and our objective is that the
37:00 - 37:30 mappers individually find out which gender each record has, M or F, and keep the salary along with that key, so that each emits exactly (gender, salary); the reducer will then basically extract that intermediate result and calculate the
37:30 - 38:00 average and the total.
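The salary example can also be sketched as a small simulation. This is a sketch, not Hadoop streaming code: John and Martha with their salaries come from the lecture, while the two extra records (and the in-memory tuples standing in for the comma-separated stdin lines) are hypothetical additions so the per-gender grouping is visible.

```python
from collections import defaultdict

# (name, gender, salary) tuples; John and Martha are the lecture's values,
# the other two records are made up for illustration.
records = [
    ("john", "M", 10000),
    ("martha", "F", 15000),
    ("mike", "M", 12000),
    ("anna", "F", 9000),
]

def map_fn(record):
    """Drop the name; emit (gender, salary) as the key-value pair."""
    name, gender, salary = record
    return (gender, salary)

def reduce_fn(pairs):
    """Group salaries by gender, then compute (total, average) per group."""
    groups = defaultdict(list)
    for gender, salary in pairs:
        groups[gender].append(salary)
    return {g: (sum(s), sum(s) * 1.0 / len(s)) for g, s in groups.items()}

intermediate = [map_fn(r) for r in records]
print(reduce_fn(intermediate))
# {'M': (22000, 11000.0), 'F': (24000, 12000.0)}
```

The defaultdict plays the role of the lecture's dict_org: a missing gender key is initialized to a blank list on first use, and subsequent salaries are appended to it.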
That is the whole operation. This is typical Python-style code; I am not strictly saying Python, because there may be some syntactical issues, but you can implement it in anything. The idea is that I divide the problem into smaller parallel pieces handled by the mappers, and then in the second phase
38:00 - 38:30 the reducer puts it into another key-value pair. So the mapper goes from the input set to a set of key-value pairs; the reducer takes those key-value pairs and applies a function, in this case the average or the total, producing another set of key-value pairs; and finally the result goes to the HDFS or GFS file system. So what we tried to look at in today's session, using MapReduce functions on
38:30 - 39:00 simple problems, is how we can put a problem into map and reduce form; that the number of mappers is basically based on the availability of resources and how the master node divides the work; and that the number of reducers is also based on what type of function you want to apply. The master node divides the work into M mappers and R reducers, and the functionality of the problem is divided in such
39:00 - 39:30 a way that it can be executed in two phases, so that we can have a parallel implementation of this sort of paradigm. OK,
thank you