Unveiling the MapReduce Paradigm

Introduction to MapReduce

Estimated read time: 1:20

    Summary

    In this introductory lecture on the MapReduce programming paradigm, initially developed by Google, we explore its role in processing large-scale data efficiently in cloud computing environments. Originally designed for Google's extensive search engine capabilities, MapReduce has become an indispensable tool in various scientific fields requiring large-scale data computation. The lecture delves into the core logic of MapReduce, explaining how it breaks down tasks effectively using parallel processing and fault tolerance mechanisms. By optimizing data processing through mappers and reducers, MapReduce ensures efficient and scalable computing even when handling vast amounts of data.

      Highlights

      • Google developed MapReduce to optimize search processes, but it's now used in various scientific applications. 📚
      • MapReduce splits tasks into manageable chunks, leveraging distributed systems for parallel computing. ⚙️
      • Fault tolerance and efficiency are key in MapReduce, ensuring processes continue smoothly despite system failures. 🔧
      • Hadoop's open-source MapReduce demonstrates its versatility and importance in cloud computing platforms. 🌍
      • MapReduce supports complex operations by transforming one set of key-value pairs into another, which is crucial for data analysis and database management. 🔍

      Key Takeaways

      • MapReduce is a powerful tool for handling large-scale data computations efficiently, utilizing parallel processing. 🚀
      • Originally developed by Google, MapReduce has wide applications beyond web searches, notably in scientific data computation. 🌐
      • The methodology focuses on fault tolerance and scalability, executing tasks via mappers and reducers, enhancing efficiency. 🔄
      • Hadoop is an open-source implementation of MapReduce, showcasing its flexibility and widespread adoption. 🐘
      • The paradigm is ideal for applications involving massive datasets, making database operations more efficient. 📊

      Overview

      In the fascinating world of cloud computing, MapReduce stands out as a revolutionary programming paradigm. Originally crafted by Google, it was their answer to the colossal task of processing extensive volumes of search data. Imagine thousands of processors working in parallel to sift through monumental datasets! That's the power of MapReduce, which transforms how we handle big data by breaking down tasks into smaller, manageable units.

        The MapReduce framework essentially hinges on two critical operations: Mappers and Reducers. These operations utilize hundreds or even thousands of processors to collaboratively tackle massive data calculations. Mappers break down data into key-value pairs, while Reducers collect and combine this information, ensuring that all data processing is not only swift but also fault tolerant. This ensures that even if a processor fails, the system can seamlessly continue its operations.
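
        To make the mapper/reducer split concrete, here is a minimal, self-contained Python sketch of the word-count flow described above. It is an illustration only, not Google's or Hadoop's actual API; the function names (map_fn, reduce_fn, map_reduce) and the in-memory grouping step standing in for the distributed shuffle are assumptions made for this example.

        from collections import defaultdict

        def map_fn(doc_id, text):
            # Map: turn (doc_id, text) into intermediate (word, 1) key-value pairs.
            for word in text.split():
                yield word, 1

        def reduce_fn(word, counts):
            # Reduce: aggregate every count emitted for the same word.
            return word, sum(counts)

        def map_reduce(documents):
            # Group intermediate pairs by key (the "shuffle"), then reduce each group.
            groups = defaultdict(list)
            for doc_id, text in documents.items():
                for word, count in map_fn(doc_id, text):
                    groups[word].append(count)
            return dict(reduce_fn(w, c) for w, c in groups.items())

        docs = {"d1": "iit kharagpur cloud computing",
                "d2": "cloud computing and mapreduce at iit kharagpur"}
        print(map_reduce(docs))   # e.g. {'iit': 2, 'kharagpur': 2, 'cloud': 2, ...}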

          An exciting element of the MapReduce story is how it has turned large-scale data processing into a scalable, efficient exercise. With open-source implementations such as Hadoop, initially developed at Yahoo, MapReduce's reach extends far beyond Google. Today it powers complex database queries, enabling swift data analysis across countless scientific and commercial applications. It's no surprise that it has become a cornerstone in the toolkit of modern data scientists and engineers alike.

            Chapters

            • 00:00 - 00:30: Introduction to Cloud Computing In this introductory chapter on Cloud Computing, the continuation of the discussion from the previous lecture is emphasized, focusing on data management and data storage in the cloud environment.
            • 00:30 - 03:00: Introduction to MapReduce The chapter introduces MapReduce, a popular programming paradigm developed by Google. It highlights that MapReduce is used for large-scale search applications, such as search engines, and has been adapted for various scientific purposes.
            • 03:00 - 05:00: MapReduce and Parallel Computation This chapter introduces the programming model of MapReduce, which was developed at Google. It highlights the significance of this model for both the efficient handling of large volumes of documents in contexts like Google Search, and for its broader application as a critical programming paradigm in the scientific realm. MapReduce's design allows it to effectively solve various scientific problems by exploiting parallel computation.
            • 05:00 - 09:00: Parallel Efficiency in Computation The chapter titled 'Parallel Efficiency in Computation' discusses the implementation of large-scale text processing on massive web data using scalable systems like Bigtable and GFS distributed file systems. The focus is on the challenges and design strategies for processing enormous volumes of data efficiently, which is integral to managing the massively scalable web data.
            • 09:00 - 15:00: Scalability in Parallel Computing The chapter titled 'Scalability in Parallel Computing' discusses the challenges and methods involved in processing large volumes of data using massively parallel computing systems. It specifically addresses the scenario where tens of thousands of processors are utilized simultaneously to analyze a vast pool of data. A significant problem highlighted is managing and analyzing huge datasets by efficiently counting or determining the frequency of specific elements (such as words) within the data. This serves as an illustration of scalability issues encountered in parallel computing environments.
            • 15:00 - 19:00: MapReduce Model Explained The chapter discusses the concept of the MapReduce model and how it processes large volumes of data efficiently. It specifically addresses the scenario of searching for the frequency of occurrence of a term like 'IIT Kharagpur' within vast data sets stored in distributed file systems like HDFS, GFS, or architectures similar to Bigtable. Moreover, it highlights the fault-tolerant nature of MapReduce that ensures computational progress despite processor failures.
            • 19:00 - 32:00: Application of MapReduce This chapter discusses the application of MapReduce, emphasizing the necessity of fault tolerance in systems with numerous processors and large volumes of data. It highlights Hadoop as an open-source implementation of MapReduce, originally developed at Yahoo. Hadoop became widely available as a pre-packaged option on Amazon's EC2 platform, illustrating how MapReduce can be applied in large-scale computing environments.
            • 32:00 - 35:00: Conclusion This chapter titled 'Conclusion' discusses programming paradigms or platforms that have the capability to interact with cloud-based data stores. These data stores can be managed by systems like HDFS (Hadoop Distributed File System), GFS (Google File System), or by technologies such as Bigtable.

            Introduction to MapReduce Transcription

            • 00:00 - 00:30 Hello. We will continue our discussion on cloud computing. In our previous lecture we discussed data stores, that is, how to manage data in the cloud, giving an overview
            • 00:30 - 01:00 of those topics. Now we would like to look at another programming paradigm, called MapReduce, a very popular paradigm that was primarily laid out by Google but is now used for different scientific purposes. Google originally developed it for its large-scale search engines, primarily to search over the huge
            • 01:00 - 01:30 volume of documents that Google's search engine crawls, but it has become an important programming paradigm for the scientific world, which exploits this philosophy to efficiently execute different types of scientific problems. So MapReduce is a programming model developed at Google; its primary objective
            • 01:30 - 02:00 was to implement large-scale search and text processing on massively scalable web data stored using Bigtable and the GFS distributed file system. As we have seen, the data are stored in Bigtable and the GFS distributed file system, so the question is how to process this massively scalable web data, the huge volume of data that comes into play. It was designed for processing
            • 02:00 - 02:30 and generating large volumes of data via massively parallel computation, utilizing tens of thousands of processors at a time. So I have a large pool of processors and a huge pool of data, and I want to do some analysis on it; how can I do it? One very popular problem is this: given a huge volume of data and a number of processors, how do I do some sort of counting, say counting the frequency of some of the words
            • 02:30 - 03:00 in that huge volume of data? For example, I want to find out how many times IIT Kharagpur appears in a huge chunk of data that is primarily stored in an HDFS, GFS, or Bigtable type of architecture. And it should be fault tolerant, ensuring progress of the computation even if a processor
            • 03:00 - 03:30 fails or the network fails, because there are a huge number of processors and underlying networks, so we need to ensure fault tolerance. One example is Hadoop, an open-source implementation of MapReduce initially developed at Yahoo, which then became available open source in pre-packaged AMIs on the Amazon EC2 platform. So what we are looking at is trying to
            • 03:30 - 04:00 provide a programming platform or paradigm that can interact with databases stored in this sort of cloud data store, whether of the HDFS or GFS type or managed by Bigtable, and so on. So if we
            • 04:00 - 04:30 look again at parallel computing, as we have seen in previous lectures, there are different models of parallel computing depending on the nature and evolution of multiprocessor computer architectures: the shared memory model and the distributed memory model are the two popular ones. Parallel computing was developed for compute-intensive scientific tasks and, as we all know, later found application in the database
            • 04:30 - 05:00 arena as well. Initially it was more about executing huge scientific tasks, and later it found a lot of application in the database domain too. We have also seen in an earlier lecture that there are three types of scenarios: shared memory, shared disk, and shared nothing. So whenever we want a programming paradigm that can work
            • 05:00 - 05:30 on this sort of parallel architecture, where the data are stored in these kinds of cloud storage, we need to take care of what sort of mechanism is underneath, whether it is a shared memory, shared disk, or shared nothing configuration. This is a picture we have already seen in earlier lectures, so we will not
            • 05:30 - 06:00 repeat it: shared memory, shared disk, and shared nothing structures. But the perspective we are taking now is a little different; earlier it was more about storage, whereas now we are looking at how a program can exploit this type of structure. As we have seen, shared memory is suitable for servers with multiple CPUs; shared nothing is a cluster of independent servers, each with its own hard
            • 06:00 - 06:30 disk, connected by a high-speed network; and shared disk is a hybrid architecture in which an independent server cluster shares storage through a high-speed network, using storage such as NAS or SAN, with the clusters connected to the storage via standard Ethernet, fast Fibre Channel, InfiniBand, and so on. So whenever we do any parallel computing or parallel storage
            • 06:30 - 07:00 type of thing, what is at the back of our mind is efficiency. One of the major reasons to parallelize is efficiency; there may be other aspects such as fault tolerance and resilience to failures, but primarily it should be efficient. Now, first of all, the type of work we are doing should have inherent parallelism in it; if there is no inherent parallelism, then
            • 07:00 - 07:30 it may not be fruitful to use a parallel architecture. It should not be a strict sequence of operations, job one, job two, job three, job four, that you then try to parallelize; there may be some parallelism in between, but without it a parallel operation may not be possible. Ideally, if a task takes time T on a uniprocessor, it should take T/p if executed on p processors,
            • 07:30 - 08:00 assuming the parallelism is there and that there is no cost in dividing and distributing the work; T/p is the ideal condition we can hope for. Inefficiency is introduced into distributed computation by the need for synchronization among the processors. The processors may have individual clocks, and you need to synchronize where things stand; otherwise, if you divide a job into two parts where one executes
            • 08:00 - 08:30 now and the other executes after a couple of hours, it would have been better to execute both on one system. So synchronization between the processors is one important aspect; the overhead of message communication between the processors is another; imbalance in the distribution of work to the processors is yet another, since the work may not be equally divided. These are the different aspects that indirectly affect efficiency, or bring inefficiency into a parallel
            • 08:30 - 09:00 implementation. The parallel efficiency of an algorithm can be defined as T / (p × Tp), where T is the uniprocessor time and Tp is the time taken on p processors. We say an algorithm is scalable if its parallel efficiency remains constant as the size of the data increases along with a corresponding increase in the number of processors. So what happens when more data comes in? You go on deploying more processors,
            • 09:00 - 09:30 or you go on requesting more processors from the cloud, and your efficiency remains constant; the efficiency value does not change, and then we say it is scalable. If I increase both, say, linearly, then the efficiency stays constant. Parallel efficiency increases with the size of the data for a fixed number
            • 09:30 - 10:00 of processors; if the number of processors is fixed and the data grows, we effectively get more efficiency. Now consider the example from the book we are following; you will also find this sort of example, though not exactly the same, in other literature. Consider a very large collection of documents, say the web documents crawled from the entire internet.
            • 10:00 - 10:30 It is pretty large and it is growing every day. The problem is to determine the frequency, that is, the total number of occurrences, of each word in this collection. So I want to determine the frequency of occurrence of each word in this document collection D. Thus, if there are n documents and m distinct words, we need to determine m frequencies, one for each word. This is a simple problem, and maybe
            • 10:30 - 11:00 one that is quite relevant for search engines and the like. We have two approaches. In the first, each processor computes the frequencies for m/p words: if there are p processors and m frequencies to calculate, I divide the words so that each processor gets m/p of them. For example, suppose I have ten processors and I
            • 11:00 - 11:30 am looking for some ninety words; then every processor handles its own chunk of roughly nine words (if the division is not exact you have to make a slightly asymmetric split), and at the end they report their results together through some system. The other way is to let each processor compute the frequencies
            • 11:30 - 12:00 of all m words across n/p documents. Say the total number of documents is ten thousand, the number of words I am looking for is ninety, and I have ten processors. In the first approach, ninety divided by ten gives nine words assigned to each processor on average, and each counts those. In the second approach, each processor computes the frequencies of all ninety words, but over only n/p documents; with ten thousand documents
            • 12:00 - 12:30 and p = 10 processors, each takes a thousand documents and does the processing. Once the frequencies of the m words computed by the individual processors come out, I sum them up and aggregate to get the result; the counting is followed by all processors summing their results. Now, which approach will be more efficient under this parallel
            • 12:30 - 13:00 computing paradigm? We need to work that out. Here, parallel computing is implemented as a distributed memory model with a shared disk, so that each processor is able to access any document from the disk in parallel with no contention; this can be one of the implementation mechanisms. Now, let us assume that the time
            • 13:00 - 13:30 to read each word from a document equals the time to send the word to another processor via inter-processor communication, and that both equal c. This keeps things simple; in a real case they would differ, but we assume this scenario. Also assume that the time to add to a running total of the frequencies is negligible, so the summing up is negligible once
            • 13:30 - 14:00 I have the frequencies of the m words. Each word occurs f times in a document on average, so for the sake of our calculation assume each word occurs some f times on average. The time to compute all m frequencies with a single processor is then n × m × f × c. This is the time it would take if I had a single processor.
            • 14:00 - 14:30 Now take the first approach, where each processor computes the frequencies of m/p words. Each processor still has to read essentially all n × m × f words, since it must scan every document to find its assigned words, so the parallel efficiency works out as (n m f c) / (p × n m f c) = 1/p. This is a very vanilla calculation: we
            • 14:30 - 15:00 take it that all words have roughly the same frequency, that any aggregation takes negligible time, and that every read, write, and communication operation costs c. Under these assumptions we get 1/p, so the efficiency falls with increasing p: if we increase p, the efficiency falls.
            • 15:00 - 15:30 It is not constant, so it is not scalable. That is the major problem: though the approach is easy to conceptualize, it has a scalability problem. So this first approach, letting each processor compute the frequencies of m/p words,
            • 15:30 - 16:00 is not scalable. In the second approach we do not divide the m words among the processors; instead we divide the document collection D, every processor computes the frequencies of all m words over its partition, and then we aggregate. At first glance this looks like it could be more costly, since there is an aggregation step and you are clubbing
            • 16:00 - 16:30 the results, that is, dividing the whole document set into different partitions and processing them separately, so it might seem less efficient than the first approach, but let us see. The number of reads performed by each processor is (n/p) × m × f, and the time taken to read is (n/p) × m × f × c, because each processor handles
            • 16:30 - 17:00 an n/p share of the data and reads each of the m × f word occurrences per document at cost c. The time taken to write the partial frequencies of the m words in parallel to disk is c × m; once you are done, you need to write the results back to disk in parallel, and that comes to c × m. The time
            • 17:00 - 17:30 taken to communicate the partial frequencies to the other p - 1 processors, and then locally add the p sub-vectors to generate 1/p of the final m-vector of frequencies, is p × (m/p) × c = m × c. We need to communicate partial frequencies because no single processor has the complete frequencies; each processor sends its partial frequencies to the p - 1
            • 17:30 - 18:00 other processors and locally adds the p sub-vectors to generate its 1/p share of the final m-vector of frequencies. If we put all of this together for the second approach, the parallel efficiency comes out as 1 / (1 + 2p/(nf)).
            • 18:00 - 18:30 If you work it out manually, or consult the book, it is not a very difficult derivation; it is pretty easy if you go step by step. (A small numerical check of both approaches is given after this transcription.) Now this is an interesting phenomenon: the term we have here is 1 + 2p/(nf), and in practice p is many, many times smaller than nf,
            • 18:30 - 19:00 so the efficiency of the second approach is higher than that of the first. If p is many times smaller than nf, the whole expression tends towards one, and it can be seen that the efficiency is much higher than in the first approach. In the first approach each processor
            • 19:00 - 19:30 reads many more words than it needs to, resulting in wasted time. In the first approach we divided the m words into chunks across the processors; in our example, with m = 90 words and p = 10 processors, everybody is
            • 19:30 - 20:00 getting nine words, but each processor still scans the whole document collection and reads many documents where there is no hit, no success, which hurts efficiency. In the second approach every read is useful, as it results in a computation that contributes to the final result, so it is scalable.
            • 20:00 - 20:30 Moreover, the efficiency remains constant as both n and p increase proportionally. What we have seen is that if my data load increases, I increase the number of processors; if I increase the data and the processors proportionally, my efficiency remains constant. In the second case, efficiency tends to one for fixed p and
            • 20:30 - 21:00 gradually increasing n: if the number of processors is fixed and we gradually increase n, that is, we increase the data load while keeping the number of processors fixed, the efficiency basically approaches one. With this context, with this background showing that
            • 21:00 - 21:30 computing partial results individually and then aggregating is the more efficient route, we now look at the MapReduce model. It is a parallel programming abstraction used by many different parallel applications that carry out large-scale computations involving thousands of processors, and it leverages a common underlying fault-tolerant implementation.
            • 21:30 - 22:00 There are two phases in MapReduce: the map operation and the reduce operation. A configurable number of M mapper processors and R reducer processors are assigned to work on the problem, and the computation is coordinated by a single master process.
            • 22:00 - 22:30 So what we have now is a set of mapper processors and a separate set of reducer processors; the whole process is divided into these two parts. There are M mapper processors and a set of reducer processors. When the data arrives at a mapper, it
            • 22:30 - 23:00 performs some computation, and then, based on the type of problem, the intermediate output goes to the different reducers, which carry out their part of the execution; the reducer generates a more aggregated result. Again, MapReduce is a parallel programming abstraction used by many parallel applications that carry out large-scale computations
            • 23:00 - 23:30 involving thousands of processors; this is where the applications come into play. It is a two-phase process: one phase is the map operation and the other is the reduce operation, with a configurable number of M mapper processors and R reducer processors, meaning you can choose how many mappers and reducers to use. Looking at the map phase, each mapper reads approximately
            • 23:30 - 24:00 1/M of the input from the global file system, that is, not the whole data D but a chunk of it. The map operation consists of transforming one set of key-value pairs into another set of key-value pairs: map (k1, v1) → (k2, v2).
            • 24:00 - 24:30 Each mapper writes its computational results in one file per reducer; basically, for every reducer it produces a file. If there are reducers R1, R2, and R3, a mapper Mi creates three files, one corresponding to each reducer. (A toy simulation of this per-reducer partitioning is sketched after the transcription.) The files are sorted by key and stored in the local file system, and the master keeps
            • 24:30 - 25:00 track of the locations of these files. There is a MapReduce master that takes care of the locations of the files; each mapper produces one file for every reducer, and the master records where the files are stored on the local disks. In the reduce phase, the master informs the reducers where the partial computations have been stored
            • 25:00 - 25:30 on the local file systems of the respective mappers. That means in the reduce phase each reducer consults the master, which tells it where its related files are stored for every mapper. The reducer then makes remote procedure calls to the mappers to fetch the files; so the reducer in turn makes a remote procedure call to the
            • 25:30 - 26:00 mapper. The mapper's output sits on some disk, and the reducer may be on a different machine, with different types of VMs running; ideally they are not far apart, not geographically distributed, otherwise things will not work well. Nevertheless, the reducer works on the particular data produced by the mappers. Each reducer groups the results of the map step that share the same key and applies a function
            • 26:00 - 26:30 f to them: (k2, v2) → (k2, f(v2)). Here the aggregation function comes into play; in other words, if we remember our problem, for every key, that is, every word, we want to calculate the frequency, so the function is simply summing up the counts. It can be different for different types of problems. So the reducer produces another set of key-value pairs, and the
            • 26:30 - 27:00 final results are written back to the GFS file system, the Google File System. Now a MapReduce example: suppose there are three mappers and two reducers. The map function in our case takes the data D, which contains the set of words w1, w2, ..., wn, and produces, for every word wi, the count of its occurrences in the portion of the data that the mapper
            • 27:00 - 27:30 is holding. For every wi it counts the occurrences; for instance, document d1 contains w1, w2, and w4, document d2 contains others, and the mapper counts these. Every mapper does this and then stores the result in an intermediate space from which the reducers read; it generates one file for every reducer, so a particular part of the output goes into the particular file for a
            • 27:30 - 28:00 given reducer. As there are two reducers, every mapper generates two files, and each reducer in turn accumulates those. For example, one reducer handles w1 and w2 and reports w1 as 7 and w2 as, say, 15, while w3 and w4 go to the other reducer. So the reducers reduce the data,
            • 28:00 - 28:30 getting their input from the mappers' output. The MapReduce model is fault tolerant, and there are different ways to look at this. One is heartbeat messages: at every fixed time period a worker reports whether it is alive. If communication exists but there is no progress, the master duplicates those tasks and assigns them to processors that have already
            • 28:30 - 29:00 completed their work or to free processors. If a mapper fails, the master reassigns the key-value work designated to it to another worker node for re-execution; on failure the work is simply re-executed. If a reducer fails, only the remaining tasks need to be reassigned to another node, since the completed tasks have already been written back to the Google File System; the completed tasks are already in GFS, so only the remaining
            • 29:00 - 29:30 tasks need to be reassigned. Now, if you want to calculate the efficiency of MapReduce: a general computation task on a volume of data D takes wD time on a uniprocessor, covering the time to read the data from disk, perform the computation, and write back to disk, where the time to read or write one word from or to disk is c. The computation task is decomposed into map and reduce stages: in the map stage, the mapping time is
            • 29:30 - 30:00 cm·D, producing ρD of output data; in the reduce stage, the reduce time is cr·ρD and the data produced at the output is μρD. This is not that difficult: the mapping time reflects how much of D each mapper processes, the data produced comes from that particular mapper, the reducers' time is calculated
            • 30:00 - 30:30 from cr, and finally we have the reducer output. Considering no overheads in decomposing the task into map and reduce stages, we can write the relationship wD = cm·D + cr·ρD; that is, if we ignore the decomposition overhead, the uniprocessor time is simply the sum of the two stages. Now if we have P processors
            • 30:30 - 31:00 that serve as both mappers and reducers, irrespective of the phase, then each of the P processors sometimes acts as a mapper and sometimes as a reducer, and we have additional overheads: each mapper writes to its local disk, followed by each reducer remotely reading from that disk. For analysis purposes, let us take the time to read a word locally or remotely to be the same. The time for each mapper to read and process its share of the data is then wD/P, if the number of
            • 31:00 - 31:30 processors is P. The data produced by each mapper is ρD/P, so the time required to write it back to disk is cρD/P, because once you have read and computed you have to write back that much data. Similarly, the data read by each reducer from its partition at each of the P mappers is (ρD/P)/P, that is,
            • 31:30 - 32:00 ρD/P² per mapper, or ρD/P in total, again taking time cρD/P. If we calculate it this way, the parallel efficiency of the MapReduce implementation comes out as 1 / (1 + 2cρ/w). (The algebra is written out after this transcription.) That is the parallel efficiency we get here. Now, there are several types of applications of MapReduce; one is indexing a large collection of documents,
            • 32:00 - 32:30 which was one of the major motivations for Google, and an important aspect of web search as well as of handling structured data. The map task consists of emitting a (word, document id) pair for each word in a document, roughly (dk, [w1, ..., wn]) → [(wi, dk)], so every word wi gets mapped to the ids of the documents dk it appears in, giving some sort of
            • 32:30 - 33:00 index; the reduce step groups the pairs by word and creates an index entry for each word. (A minimal inverted-index sketch also follows this transcription.) There are also applications of MapReduce in relational operations: executing SQL statements, relational joins, or GROUP BY over large sets of data. We want to exploit the advantages of parallel databases and large-scale fault tolerance, and we can implement that type of function, such as a GROUP BY clause, in this way, so
            • 33:00 - 33:30 that we can execute some sorts of relational operations. With this we come to the end of today's talk. What we have tried to do here is give you an overview of the MapReduce paradigm: how a problem can be divided into a set of parallel executions, with mapper nodes that create intermediate
            • 33:30 - 34:00 results and a set of reducer nodes that take this data and create the final results. Some of the interesting points are that the mapper creates a data file for every reducer, so the data is laid out per reducer and the reducer knows where its data is. Over and above that, there is a master controller,
            • 34:00 - 34:30 the MapReduce master, which knows where the data is stored by each mapper and how the reducers will read it; not only that, it also handles how to reallocate work if a mapper node fails and, if a reducer node fails, how to reassign the yet-to-be-executed operations, and so forth. With this we will stop our
            • 34:30 - 35:00 lecture for today. Thank you.
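
            A small numerical check of the two word-frequency approaches discussed above (roughly 10:00 - 20:00), written as a Python sketch. The values of p, m, and n come from the lecture's example; f, the average number of occurrences of each word per document, is not given in the lecture, so the value below is an assumption chosen purely for illustration.

            # Parallel efficiency of the two word-frequency approaches, per the lecture's model.
            p = 10        # processors (from the lecture's example)
            m = 90        # distinct words (not needed by either formula; listed for completeness)
            n = 10_000    # documents (from the lecture's example)
            f = 10        # assumed average occurrences of each word per document

            # Approach 1: each processor handles m/p words but still scans every document.
            eff_1 = 1 / p

            # Approach 2: each processor handles n/p documents for all m words,
            # then partial frequency vectors are exchanged and summed.
            eff_2 = 1 / (1 + 2 * p / (n * f))

            print(f"approach 1 efficiency: {eff_1:.3f}")    # 0.100, falls as p grows
            print(f"approach 2 efficiency: {eff_2:.5f}")    # ~0.99980, close to 1, scalable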
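
            The map/reduce mechanics described around 23:00 - 28:00, with each mapper writing one file per reducer and each reducer pulling its file from every mapper, can also be sketched as a single-process simulation. The "files" here are in-memory lists, the hash-based partition function is an assumption standing in for whatever partitioning the real system uses, and the master's bookkeeping is reduced to a plain list of mapper outputs.

            from collections import defaultdict

            NUM_REDUCERS = 2

            def partition(key):
                # Decide which reducer a key belongs to (assumed hash partitioning;
                # Python's built-in hash is salted per run, so the split can vary between runs).
                return hash(key) % NUM_REDUCERS

            def run_mapper(docs):
                # Each mapper writes one "file" (here: a list) per reducer.
                files = {r: [] for r in range(NUM_REDUCERS)}
                for text in docs:
                    for word in text.split():
                        files[partition(word)].append((word, 1))
                return files

            def run_reducer(reducer_id, mapper_outputs):
                # The reducer fetches its file from every mapper and aggregates by key.
                totals = defaultdict(int)
                for files in mapper_outputs:
                    for word, count in files[reducer_id]:
                        totals[word] += count
                return dict(totals)

            mappers = [["w1 w2 w4", "w1 w3"], ["w2 w2 w3"], ["w1 w4 w4"]]   # three mappers
            outputs = [run_mapper(docs) for docs in mappers]                # "map phase"
            for r in range(NUM_REDUCERS):                                   # "reduce phase"
                print(f"reducer {r}:", run_reducer(r, outputs))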
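
            The MapReduce efficiency quoted around 31:30 - 32:00 can be written out as a short derivation. This follows the cost model as stated in the lecture, with ρ denoting the fraction of data emitted by the map stage; the intermediate algebra is a reconstruction consistent with the quoted result, not a quotation of the slides.

            % Uniprocessor cost, ignoring decomposition overhead:
            %   wD = c_m D + c_r \rho D
            % With P processors acting as both mappers and reducers:
            %   compute time per processor:                    wD / P
            %   each mapper writes \rho D / P to local disk:   c \rho D / P
            %   each reducer reads \rho D / P remotely:        c \rho D / P
            \[
              \eta_{\mathrm{MR}}
              = \frac{wD}{P\left(\frac{wD}{P} + \frac{c\rho D}{P} + \frac{c\rho D}{P}\right)}
              = \frac{1}{1 + \frac{2c\rho}{w}}
            \]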
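
            Lastly, the indexing application mentioned from 32:00 onward maps each (document id, text) pair to (word, document id) pairs and groups them by word in the reduce step. A minimal sketch under the same assumptions as the earlier examples (hypothetical function names, in-memory shuffle):

            from collections import defaultdict

            def index_map(doc_id, text):
                # Emit a (word, doc_id) pair for every distinct word in the document.
                for word in set(text.split()):
                    yield word, doc_id

            def index_reduce(word, doc_ids):
                # Group the ids of the documents containing the word into one index entry.
                return word, sorted(doc_ids)

            def build_inverted_index(documents):
                groups = defaultdict(list)
                for doc_id, text in documents.items():
                    for word, d in index_map(doc_id, text):
                        groups[word].append(d)
                return dict(index_reduce(w, ds) for w, ds in groups.items())

            docs = {"d1": "cloud computing at iit kharagpur",
                    "d2": "mapreduce for cloud data"}
            print(build_inverted_index(docs))   # e.g. {'cloud': ['d1', 'd2'], ...}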