Summary
In this introductory lecture on the MapReduce programming paradigm, we explore its role in processing large-scale data efficiently in cloud computing environments. Originally developed by Google for its large-scale search engine, MapReduce has become an indispensable tool in scientific fields requiring large-scale data computation. The lecture delves into the core logic of MapReduce, explaining how it breaks tasks down using parallel processing and fault-tolerance mechanisms. By organizing data processing into mappers and reducers, MapReduce ensures efficient and scalable computing even when handling vast amounts of data.
Highlights
Google developed MapReduce to optimize search processes, but it's now used in various scientific applications. 📚
MapReduce splits tasks into manageable chunks, leveraging distributed systems for parallel computing. ⚙️
Fault tolerance and efficiency are key in MapReduce, ensuring processes continue smoothly despite system failures. 🔧
Hadoop's open-source MapReduce demonstrates its versatility and importance in cloud computing platforms. 🌍
MapReduce supports complex operations by transforming key-value pairs, which is crucial for data analysis and database management. 🔍
Key Takeaways
MapReduce is a powerful tool for handling large-scale data computations efficiently, utilizing parallel processing. 🚀
Originally developed by Google, MapReduce has wide applications beyond web searches, notably in scientific data computation. 🌐
The methodology focuses on fault tolerance and scalability, executing tasks via mappers and reducers, enhancing efficiency. 🔄
Hadoop is an open-source implementation of MapReduce, showcasing its flexibility and widespread adoption. 🐘
The paradigm is ideal for applications involving massive datasets, making database operations more efficient. 📊
Overview
In the fascinating world of cloud computing, MapReduce stands out as a revolutionary programming paradigm. Originally crafted by Google, it was their answer to the colossal task of processing extensive volumes of search data. Imagine thousands of processors working in parallel to sift through monumental datasets! That's the power of MapReduce, which transforms how we handle big data by breaking down tasks into smaller, manageable units.
The MapReduce framework essentially hinges on two critical operations: Mappers and Reducers. These operations utilize hundreds or even thousands of processors to collaboratively tackle massive data calculations. Mappers break down data into key-value pairs, while Reducers collect and combine this information, ensuring that all data processing is not only swift but also fault tolerant. This ensures that even if a processor fails, the system can seamlessly continue its operations.
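As a concrete miniature of that flow, here is a hedged Python sketch (the function names and toy data are invented for illustration, not taken from the lecture) that simulates mappers emitting key-value pairs, a hash partition routing keys to reducers, and reducers combining the results:

```python
from collections import defaultdict

def map_words(chunk):
    """Mapper: emit a (word, 1) key-value pair for every word in a chunk."""
    return [(word, 1) for word in chunk.split()]

def run_job(chunks, num_reducers=2):
    # Shuffle: route each key to a reducer partition, mimicking the
    # one-output-file-per-reducer layout described later in the lecture.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for chunk in chunks:                               # one mapper per chunk
        for key, value in map_words(chunk):
            partitions[hash(key) % num_reducers][key].append(value)
    # Reduce: combine all values that share a key (here, by summing).
    return {key: sum(values)
            for part in partitions
            for key, values in part.items()}

print(run_job(["cloud mapreduce cloud", "iit kharagpur cloud"]))
# {'cloud': 3, 'mapreduce': 1, 'iit': 1, 'kharagpur': 1} (order may vary)
```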
An exciting element of the MapReduce era is how it has made data processing scalable and efficient. With open-source implementations like Hadoop, initially developed at Yahoo, MapReduce's reach extends far beyond Google. Today it powers complex database queries, enabling swift data analysis across countless scientific and commercial applications. It is no surprise that it has become a cornerstone in the toolkit of modern data scientists and engineers alike.
Chapters
00:00 - 00:30: Introduction to Cloud Computing This introductory chapter continues the discussion from the previous lecture, focusing on data management and data storage in the cloud environment.
00:30 - 03:00: Introduction to MapReduce The chapter introduces MapReduce, a popular programming paradigm developed by Google. It highlights that MapReduce is used for large-scale search applications, such as search engines, and has been adapted for various scientific purposes.
03:00 - 05:00: MapReduce and Parallel Computation This chapter introduces the MapReduce programming model developed at Google. It highlights the model's significance both for efficiently handling large volumes of documents in contexts like Google Search and as a critical programming paradigm in the scientific realm, where its design solves various problems by exploiting parallel computation.
05:00 - 09:00: Parallel Efficiency in Computation The chapter titled 'Parallel Efficiency in Computation' discusses the implementation of large-scale text processing on massive web data using scalable systems like Bigtable and GFS distributed file systems. The focus is on the challenges and design strategies for processing enormous volumes of data efficiently, which is integral to managing the massively scalable web data.
09:00 - 15:00: Scalability in Parallel Computing The chapter titled 'Scalability in Parallel Computing' discusses the challenges and methods involved in processing large volumes of data using massively parallel computing systems. It specifically addresses the scenario where tens of thousands of processors are utilized simultaneously to analyze a vast pool of data. A significant problem highlighted is managing and analyzing huge datasets by efficiently counting or determining the frequency of specific elements (such as words) within the data. This serves as an illustration of scalability issues encountered in parallel computing environments.
15:00 - 19:00: MapReduce Model Explained The chapter discusses the concept of the MapReduce model and how it processes large volumes of data efficiently. It specifically addresses the scenario of searching for the frequency of occurrence of a term like 'IIT Kharagpur' within vast data sets stored in distributed file systems like HDFS, GFS, or architectures similar to Bigtable. Moreover, it highlights the fault-tolerant nature of MapReduce that ensures computational progress despite processor failures.
19:00 - 32:00: Application of MapReduce This chapter discusses the application of MapReduce, emphasizing the necessity of fault tolerance in systems with numerous processors and large volumes of data. It highlights Hadoop as an open-source implementation of MapReduce, originally developed at Yahoo. Hadoop became widely available as a pre-packaged option on Amazon's EC2 platform, illustrating how MapReduce can be applied in large-scale computing environments.
32:00 - 35:00: Conclusion This chapter titled 'Conclusion' discusses programming paradigms or platforms that have the capability to interact with cloud-based data stores. These data stores can be managed by systems like HDFS (Hadoop Distributed File System), GFS (Google File System), or by technologies such as Bigtable.
Introduction to MapReduce Transcription
00:00 - 00:30 Hello. We will continue our discussion on cloud computing. In our previous lecture we discussed data stores, that is, how to manage data in the cloud, giving an overview of the subject.
00:30 - 01:00 Now we would like to look at another programming paradigm, called MapReduce. It is a very popular paradigm that was primarily developed by Google but is now used for different scientific purposes. Google developed it for its large-scale search engine, primarily to search the huge volume
01:00 - 01:30 of documents that Google's search engine crawls. But it has since become an important programming paradigm for the scientific world, which exploits this philosophy to efficiently execute different types of scientific problems. So MapReduce is a programming model developed at Google; its primary objective
01:30 - 02:00 was to implement large-scale search and text processing on massively scalable web data stored in Bigtable and the GFS distributed file system. As we have seen, the data are stored in Bigtable and the GFS distributed file system; the question is how to process this massively scalable web data, the huge volume of data coming into play. MapReduce was designed for processing
02:00 - 02:30 and generating large volumes of data via massively parallel computation, utilizing tens of thousands of processors at a time. So I have a large pool of processors and a huge pool of data, and I want to do some analysis on it. How can I do it? One very popular problem is this: given a huge volume of data and a number of processors, how do we count the frequency of some of the words
02:30 - 03:00 in that huge volume of data? For example, I want to find out how many times "IIT Kharagpur" appears in a huge chunk of data stored in an HDFS, GFS, or Bigtable type of architecture. The computation should also be fault tolerant, ensuring progress
03:00 - 03:30 even if processors fail or the network fails, because with a huge number of processors and an underlying network, we must ensure fault tolerance. One example is Hadoop, an open-source implementation of MapReduce initially developed at Yahoo, which later became available as open source in pre-packaged AMIs on the Amazon EC2 platform.
So what we are looking at is trying to
03:30 - 04:00 provide a programming platform, or programming paradigm, that can interact with databases stored in this sort of cloud data store: it can be of the HDFS or GFS type, or managed by Bigtable, and so forth. If we
04:00 - 04:30 look again at parallel computing, as we have seen in previous lectures, there are different models of parallel computing, depending on the nature and evolution of multiprocessor computer architectures: the shared memory model and the distributed memory model are the two popular ones. Parallel computing was developed for compute-intensive scientific tasks, as we all know, and later found application in the database
04:30 - 05:00 arena, or database paradigm, as well. Initially it was more about running huge scientific tasks; later, it found many applications in the database domain too. We have also seen in an earlier lecture that there are three types of scenario: shared memory, shared disk, and shared nothing. So whenever we want to build a programming paradigm, or work on something, that can work
05:00 - 05:30 on this sort of parallel programming paradigm, where the data are stored in these cloud storages, we need to take care of which mechanism is in place: whether it is a shared memory, shared disk, or shared nothing configuration. This picture we have already seen in our earlier lectures, so we will not
05:30 - 06:00 repeat it: it shows the shared memory, shared disk, and shared nothing structures. But the perspective now is a little different: earlier it was more about storage; now we look at how programming can exploit this type of structure. As we have seen, shared memory suits servers with multiple CPUs, while shared nothing is a cluster of independent servers, each with its own hard
06:00 - 06:30 disk, connected by a high-speed network; and shared disk is a hybrid architecture in which independent server clusters share storage through a high-speed network, using storage such as NAS or SAN, with clusters connected to the storage via standard Ethernet, fast Fibre Channel, InfiniBand, and so forth. Now, whenever we do anything in parallel, whether parallel computing or parallel storage
06:30 - 07:00 and the like, what is at the back of our mind is efficiency. One of the major motivations for parallelism is efficiency; there may be other concerns, such as fault tolerance and resilience to failure, but primarily it should be efficient. First of all, the type of work we are doing should have inherent parallelism in it; if there is no inherent parallelism, then
07:00 - 07:30 it may not be fruitful to use a parallel architecture. There should be inherent parallelism; it should not be a strict sequence of operations that we then try to parallelize. If there is a chain of job one, job two, job three, job four, with only some parallelism in between, a fully parallel operation may not be possible. If a task takes time T on a uniprocessor, it should take T/p when executed on p processors,
07:30 - 08:00 ideally, if the parallelism is there and we assume no cost in dividing and distributing the work; T/p is the ideal case. Inefficiency is introduced into distributed computation by the need for synchronization among the processors: it is not that all processors share one clock; each may have its own, and they must synchronize. Otherwise, if you divide a job into two parts, where one part executes
08:00 - 08:30 now and the other a couple of hours later, it would have been better to execute both on one system. So synchronization among the processors is one important aspect; the overhead of message communication between processors is another; and imbalance in the distribution of work to the processors is a third, since the work may not divide equally. These are the different aspects that indirectly affect efficiency, or bring inefficiency into a parallel
08:30 - 09:00 implementation. The parallel efficiency of an algorithm can be defined as the uniprocessor time divided by p times the parallel time, T/(p·Tp). We say an algorithm is scalable if its parallel efficiency remains constant as the size of the data increases along with a corresponding increase in the number of processors. So when more data is coming, you go on deploying more processors,
09:00 - 09:30 or requesting more processors from the cloud, and the efficiency remains constant; the efficiency value does not change. Then we say it is scalable: if both data and processors increase, say linearly, the efficiency stays constant. Also, parallel efficiency increases with the size of the data for a fixed number
09:30 - 10:00 of processors: as the data grows with the processor count fixed, we effectively get more efficiency.
Now consider the example given in the book being referred to; you will find this sort of example, though not the same one, in other literature as well. Consider a very large collection of documents, say the web documents crawled from the entire internet.
10:00 - 10:30 It is pretty large, and it grows every day. The problem is to determine the frequency, that is, the total number of occurrences, of each word in this collection D. Thus, if there are n documents and m distinct words, we must determine m frequencies, one for each word. This is a simple problem, and
10:30 - 11:00 quite relevant for search engines and the like. We have two approaches. First: let each processor compute the frequencies for m/p of the words. If there are p processors and m frequencies to compute, I divide the words so that each processor gets m/p of them. For example, suppose I have ten processors and I
11:00 - 11:30 am looking for some ninety words, call them word one to word ninety. Every processor then handles its chunk of the table; if m is not exactly divisible by p, we make a slightly asymmetric division. At the end, the processors report their results together, through some system.
The other way: let each processor compute the frequency
11:30 - 12:00 of all m words across n/p documents. Say the total number of documents is ten thousand, I am looking for ninety words, and I have ten processors. In the first approach, ninety divided by ten gives nine words assigned to each processor on average, and each counts those. In the second approach, each processor computes the frequencies of all ninety words, but over n/p documents: with ten thousand documents
12:00 - 12:30 and p = ten processors, each processor takes some thousand documents and does the processing. Once the partial frequencies of the m words come out of the individual processors, we sum them up and aggregate, and that gives the result: all processors finish, followed by summing of their results.
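To make the two approaches concrete before analyzing them, here is a small Python sketch (the data and helper names are invented for illustration, not from the lecture):

```python
from collections import Counter

# Toy data: n documents (lists of words) and m words of interest.
documents = [["cloud", "mapreduce", "iit"], ["iit", "kharagpur"],
             ["mapreduce", "cloud", "cloud"], ["kharagpur", "iit"]]
words = ["cloud", "iit", "kharagpur", "mapreduce"]
p = 2  # simulated number of processors

# Approach 1: each "processor" is responsible for m/p of the words,
# but must scan all n documents to find them.
def approach_by_words(documents, words, p):
    chunk = max(1, len(words) // p)
    total = Counter()
    for i in range(0, len(words), chunk):      # one chunk per processor
        targets = set(words[i:i + chunk])
        for doc in documents:                  # every processor reads ALL docs
            total.update(w for w in doc if w in targets)
    return total

# Approach 2: each "processor" counts all m words over n/p documents;
# the partial counts are then summed in an aggregation step.
def approach_by_docs(documents, words, p):
    chunk = max(1, len(documents) // p)
    targets = set(words)
    partials = []
    for i in range(0, len(documents), chunk):  # one chunk per processor
        c = Counter()
        for doc in documents[i:i + chunk]:     # each processor reads n/p docs
            c.update(w for w in doc if w in targets)
        partials.append(c)
    return sum(partials, Counter())            # aggregate partial frequencies

print(approach_by_words(documents, words, p))
print(approach_by_docs(documents, words, p))   # same totals, fewer reads each
```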
12:30 - 13:00 Now, which approach will be more efficient under this parallel computing paradigm? Suppose parallel computing is implemented as a distributed memory model with a shared disk, so that each processor can access any document from the disk in parallel with no contention; this can be one of the implementation mechanisms. Now, for the time to read each word from
13:00 - 13:30 a document, let us assume that the time to read each word from the document equals the time to send the word to another processor via interprocessor communication, and call both c, to keep things simple. In an ideal case, or in real life, they would differ, but we adopt this scenario. For the first approach, assume the time to add to a running total of the frequencies is negligible; once
13:30 - 14:00 I find the frequencies of the m words, summing them up costs essentially nothing. Assume also that each word occurs f times per document on average. Then the time to compute all m frequencies with a single processor is n × m × f × c. With a single processor, that would have
14:00 - 14:30 been the total time. Now take the first approach, where each processor computes the frequencies for m/p of the words. Each processor still reads at most n × m × f words, so the parallel efficiency is calculated as (n·m·f·c) / (p · n·m·f·c) = 1/p. This is a very vanilla consideration: we
14:30 - 15:00 assume all words have roughly the same frequency, the time for any aggregation is negligible, and all read, write, and related operations cost c. Under these assumptions we get 1/p, so the efficiency falls as p increases.
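As a worked check of that first-approach result, under the stated assumptions (f occurrences of each word per document, cost c per word read or sent):

```latex
% Uniprocessor baseline: scan n documents, m words, f occurrences each
T_1 = n\,m\,f\,c
% Approach 1: each of the p processors still scans the entire collection
T_p = n\,m\,f\,c
\epsilon_1 = \frac{T_1}{p\,T_p} = \frac{n\,m\,f\,c}{p\,(n\,m\,f\,c)} = \frac{1}{p}
```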
15:00 - 15:30 It is not constant, so it is not scalable; that is the major problem. Although this approach is easy to conceptualize, there is a problem with its scalability. So this first option, letting each
15:30 - 16:00 processor compute the frequencies for m/p of the words, is not scalable. In the second approach, we instead divide the document set D across the processors: every processor computes the frequencies of all m words on its share, and then we aggregate. On the face of it, this looks like it could be more costly, since there is an aggregation step and we are clubbing, that is,
16:00 - 16:30 dividing the whole document set into different partitions; this could appear less efficient than the first approach. But let us see. The number of reads performed by each processor is (n/p) × m × f, and the time taken to read is (n/p) × m × f × c, because each processor holds
16:30 - 17:00 an n/p share of the data volume and computes over m words with f occurrences each at cost c. The time taken to write the partial frequencies of the m words in parallel to disk is c × m: once a processor is done, it needs to write its partial counts in parallel to the disk, and that comes to c × m. The time
17:00 - 17:30 taken to communicate the partial frequencies to the p − 1 other processors, and then locally add the p sub-vectors to generate 1/p of the final m-vector of frequencies, is p × (m/p) × c = m × c. We must communicate partial frequencies because no single processor holds the whole counts: each processor sends its partial frequencies to the p − 1 others
17:30 - 18:00 and locally adds the p sub-vectors to generate its 1/p share of the final m-vector of frequencies. If we adopt all of this for the second approach, the parallel efficiency comes out as 1 / (1 + 2p/(nf)).
18:00 - 18:30 If you work through it manually, consulting the book, it is not a difficult deduction; going step by step, it is pretty easy. Now there is an interesting phenomenon: the term we have here is 1 + 2p/(nf), and in this case p is many, many times less than nf, so the
18:30 - 19:00 efficiency of the second approach is higher than that of the first. When p is many times less than nf, the term 2p/(nf) tends toward zero, the expression tends toward one, and the efficiency is much higher than in the first approach (the slide says "first approach" here, but that is a typo; it should read "second").
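The corresponding calculation for the second approach, collecting the three per-processor costs just described (a sketch following the lecture's cost model):

```latex
% Per processor: read n/p documents, write m partial counts to disk,
% exchange and sum partial vectors with the other processors
T_p = \frac{n\,m\,f\,c}{p} + m\,c + m\,c
\epsilon_2 = \frac{n\,m\,f\,c}{p\,T_p}
          = \frac{n\,m\,f\,c}{n\,m\,f\,c + 2\,p\,m\,c}
          = \frac{1}{1 + 2p/(nf)} \longrightarrow 1 \quad (p \ll nf)
```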
In the first approach, each processor
19:00 - 19:30 reads many more words than it needs to, resulting in wasted time. What we did in the first approach was divide the m words into chunks across the processors: as in our example, if m is ninety and the number of processors p is ten, each processor
19:30 - 20:00 is responsible for nine words, but it still searches the whole document collection, reading many documents where there is no hit, no success. In the second approach, by contrast,
every read is useful: it results in a computation and contributes to the final result. So the second approach is scalable,
20:00 - 20:30 and the efficiency remains constant as both n and p increase, potentially proportionally. In other words, if my data load increases and I proportionally increase the number of processors, the efficiency remains constant. In this second
20:30 - 21:00 case, the efficiency also tends to one for fixed p and gradually increasing n: if the number of processors is fixed and we gradually increase n, that is, the data load, the efficiency basically approaches one. With this context, or with this background, namely that
21:00 - 21:30 computing results individually and then aggregating is the more efficient pattern, let us look at the MapReduce model. It is a parallel programming abstraction used by many different parallel applications that carry out large-scale computations involving thousands of processors, and it leverages a common underlying fault-tolerant implementation.
21:30 - 22:00 There are two phases in MapReduce: the map operation and the reduce operation. A configurable number of m mapper processors and r reducer processors are assigned to work on the problem, and the computation is coordinated by a single master process.
22:00 - 22:30 So what we have now are different mapper processors and different reducer processors: the whole process is divided into two parts, with m mapper processors and r reducer processors. When the data arrive, the mappers
22:30 - 23:00 perform some execution, and then, depending on the type of problem, the output goes to different reducers, which carry out their own execution; the reducer generates a more aggregated result. To repeat: it is a parallel programming abstraction used by many parallel applications that carry out large-scale computations
23:00 - 23:30 involving thousands of processors; this is where the application comes into play. It is a two-phase process, one phase being the map operation and the other the reduce operation, with a configurable number of m mapper processors and r reducer processors, meaning you can provision more mappers and reducers as needed. In the map phase, each mapper reads approximately
23:30 - 24:00 1/m of the input from the global file system: not the whole dataset D, but a chunk of it. The map operation consists of transforming one set of key-value pairs into another set of key-value pairs: map (k1, v1) → (k2, v2).
24:00 - 24:30 Each mapper writes its computational results into one file per reducer: for every reducer it produces one file. If there are reducers r1, r2, and r3, a mapper mi creates three files, one per corresponding reducer. The files are sorted by key and stored
24:30 - 25:00 in the local file system, and the master keeps track of the locations of these files. There is a MapReduce master, which takes care of the file locations: each mapper produces one file for every reducer, and the master records where the files are stored on the local disks. In the reduce phase, the master informs the reducers
25:00 - 25:30 where the partial computations have been stored on the local file systems of the respective mappers. That means in the reduce phase each reducer consults the master, which tells it where the files corresponding to each mapper are stored. The reducer then makes remote procedure calls to the mappers to fetch those files.
25:30 - 26:00 The mapper's output sits somewhere on disk, while the reducer may run in a different setting, on different VMs and so on; ideally the two are not geographically far apart, or things will not work well. Nevertheless, the reducer works on the particular data produced by the mappers. Each reducer groups the results of the map step
26:00 - 26:30 by key, applying a key-value function f: (k2, v2) → (k2, f(v2)). Here the aggregation function comes into play: recalling our problem, for every key, that is, every word, we want to calculate the frequency, so the functional model is summing up the frequencies. It can be different for different kinds of problems. The result is another set of key-value pairs.
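To illustrate the two phases end to end, here is a minimal word-count mapper and reducer in Python, in the style of streaming MapReduce; it is a sketch of the idea, not Google's or Hadoop's actual code:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: transform input text into (word, 1) pairs, (k1,v1) -> (k2,v2)."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    """Reduce phase: group pairs by key and apply f = sum, (k2,v2) -> (k2,f(v2))."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    chunk = ["iit kharagpur cloud", "cloud mapreduce cloud"]
    print(list(reducer(mapper(chunk))))
    # [('cloud', 3), ('iit', 1), ('kharagpur', 1), ('mapreduce', 1)]
```

The sort before grouping mirrors the lecture's point that mapper output files are sorted by key before the reducers consume them.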
26:30 - 27:00 The final results are written back to the GFS file system, the Google File System. As a MapReduce example, suppose there are three mappers and two reducers. The map function in our case takes the data D with its set of words w1, w2, …, wn, and for each wi produces the count over the portion of the data that the mapper
27:00 - 27:30 holds. Every wi gets counted: if document d1 has w1, w2, and w4, and d2 has other words, the mapper counts them accordingly. Every mapper does this and stores its output in an intermediate space that the reducers read, generating one file for each reducer; a particular mapper generates a particular file for a
27:30 - 28:00 given reducer. As there are two reducers, every mapper generates two files, and the reducers in turn accumulate those. One reducer might report w1 as seven and w2 as fifteen, while w3 and w4 are the other
two, handled elsewhere. So the reducers reduce the data from
28:00 - 28:30 the outputs of the mappers, taking their input from the mappers' output. The MapReduce model is fault tolerant, and there are different ways to achieve this. One is the heartbeat message: at every fixed time period, each worker reports whether it is alive. If communication exists but there is no progress, the master duplicates those tasks and assigns them to processors that have already
28:30 - 29:00 completed their work, or to free processors.
If a mapper fails, the master reassigns the key-value work designated to it to another worker node for re-execution: on failure, the work is simply re-executed. If a reducer fails, only the remaining tasks need to be reassigned to another node, since the completed tasks have already been written back to the Google File System; completed tasks are
29:00 - 29:30 already in the Google File System, so only the remaining tasks need reassignment. Now, suppose we want to calculate the efficiency of MapReduce. A general computation task on a volume of data D takes wD time on a uniprocessor, covering the time to read the data from disk, perform the computation, and write back to disk; the time to read or write one word from or to disk is c. The computation task is decomposed into map and reduce stages: in the map stage, the mapping time is
29:30 - 30:00 cm·D, producing σD of output data; in the reduce stage, the reduce time is cr·σD, and the data produced at the output is σμD. This is not that difficult: the mapping time is what each mapper spends over the data D, the data produced is the mappers' output, and the reducers' time is calculated
30:00 - 30:30 with cr over that output, and finally we have the reducer output. Considering no overheads in decomposing the task into map and reduce stages, we have the relationship wD = cm·D + cr·σD: forgetting the decomposition overhead, the two stage times sum to the uniprocessor time. Now, if we have p processors
30:30 - 31:00 that serve as both mappers and reducers, irrespective of the phase, then we have additional overheads: each mapper writes to its local disk, followed by each reducer remotely reading from that disk. For analysis purposes, take the time to read a word locally or remotely to be the same. The time to read the data from disk for each mapper, if the number of
31:00 - 31:30 processors is p, is wD/p. The data produced by each mapper is σD/p, so the time required to write back to the disk after the computation is cσD/p. Similarly, the data read by each reducer from its partition on each of the p mappers is (σD/p)/p,
31:30 - 32:00 that is, σD/p² from each mapper. Calculating in this way, the parallel efficiency of the MapReduce implementation comes out as 1 / (1 + (2c/w)·σ). That is the parallel efficiency we obtain here.
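Assembling the pieces (a sketch of the derivation under the lecture's assumptions, with σ the map output ratio):

```latex
% No-overhead decomposition of the uniprocessor time
wD = c_m D + c_r \sigma D \quad\Rightarrow\quad w = c_m + c_r\,\sigma
% With p processors: compute wD/p each, mappers write \sigma D/p locally,
% reducers remotely read \sigma D/p in total (\sigma D/p^2 per mapper)
p\,T_p = wD + c\,\sigma D + c\,\sigma D
\epsilon_{MR} = \frac{wD}{p\,T_p} = \frac{1}{1 + (2c/w)\,\sigma}
```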
Turning now to applications of MapReduce, there are several types. One is indexing a large collection of documents,
32:00 - 32:30 which was primarily one of the major motivations for Google and is an important aspect of web search as well as of handling structured data. The map task consists of emitting a (word, document-id) pair for each word: if a word w1 appears in document dk, the mapper emits (w1, dk) for every such dk. So I can build a sort of
32:30 - 33:00 index: the reduce step groups the pairs by word and creates the index entries. There are also applications of relational operations using MapReduce: executing SQL statements, relational joins, or GROUP BY on large sets of data, combining the advantages of a parallel database with the large-scale fault tolerance we want to exploit. We can implement functions such as the GROUP BY clause, so some sorts of relational operations can be executed.
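To ground the indexing application, here is a small sketch in the same style as the earlier word-count example (the helper names are hypothetical), building an inverted index with one map and one reduce step:

```python
from itertools import groupby

# Map: emit a (word, doc_id) pair for each word in each document.
def index_mapper(docs):
    for doc_id, text in docs.items():
        for word in text.split():
            yield (word, doc_id)

# Reduce: group the pairs by word to form each index entry (posting list).
def index_reducer(pairs):
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sorted({doc_id for _, doc_id in group}))

docs = {"d1": "cloud mapreduce", "d2": "iit kharagpur cloud"}
print(dict(index_reducer(index_mapper(docs))))
# {'cloud': ['d1', 'd2'], 'iit': ['d2'], 'kharagpur': ['d2'], 'mapreduce': ['d1']}
```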
33:00 - 33:30 With this, we come to the end of today's talk. What we have tried to do here is give you an overview of the MapReduce paradigm: how a problem can be divided into a set of parallel executions on mapper nodes, which create intermediate
33:30 - 34:00 results, and a set of reducer nodes, which take this data and create the final results. Among the interesting points: the mapper creates a data file for every reducer, so the data is laid out per reducer and each reducer knows where its data is.
Over and above this, there is a master controller,
34:00 - 34:30 the MapReduce master, which knows where the data is stored by the mappers and how the reducers will read it. Not only that: if a mapper node fails, it knows how to reallocate the work, and if a reducer node fails, how to reassign the yet-to-be-executed operations, and so forth. With this, we will stop our discussion here.