Introduction to the Hadoop MapReduce Framework

3 December 2017

The Hadoop MapReduce framework is a big data processing framework that consists of the MapReduce programming model and the Hadoop Distributed File System (HDFS). MapReduce is the parallel programming model for processing huge amounts of data. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

MapReduce applications specify the input/output locations and supply map and reduce functions via implementations of the appropriate Hadoop interfaces, such as Mapper and Reducer. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable, etc.) and its configuration to the JobTracker. The JobTracker then assumes responsibility for distributing the software/configuration to the slaves, scheduling and monitoring tasks, and providing status and diagnostic information to the job client.

The MapReduce framework operates exclusively on <key, value> pairs, so it views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. Computation occurs both on data stored in the file system and on structured data in a database.

Following are the MapReduce components:

Job:

A job contains the compiled binary that holds our mapper and reducer functions, the implementations of those functions, and some configuration information that drives the job.
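As a rough sketch, a job in the newer `org.apache.hadoop.mapreduce` API is assembled in a driver class like the one below. The `WordCountMapper` and `WordCountReducer` classes are hypothetical word-count-style implementations (sketched later in this post), not part of Hadoop itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job name is part of the configuration
        job.setJarByClass(WordCountDriver.class);        // jar holding the compiled binary

        job.setMapperClass(WordCountMapper.class);       // our map function (hypothetical class)
        job.setReducerClass(WordCountReducer.class);     // our reduce function (hypothetical class)

        job.setOutputKeyClass(Text.class);               // output key/value types
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job and wait
    }
}
```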

Input Format:

Determines how files are parsed into the MapReduce pipeline. Different inputs have different formats; for example, images use a binary format, and there are also database and record-based input formats.
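The input format is selected on the job object. A minimal sketch, assuming the `job` built in the driver above, showing three input formats that ship with Hadoop:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatExamples {
    // Plain text: each record arrives as <byte offset, line of text>; this is the default.
    static void useTextInput(Job job) {
        job.setInputFormatClass(TextInputFormat.class);
    }

    // Tab-separated text: each record arrives as <first field, rest of the line>.
    static void useKeyValueTextInput(Job job) {
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }

    // Binary container files of already-serialized key/value pairs.
    static void useSequenceFileInput(Job job) {
        job.setInputFormatClass(SequenceFileInputFormat.class);
    }
}
```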

MapReduce Phases

Split Phase (Input Splits):

Input data is divided into input splits based on the InputFormat. Each input split equates to a map task, and the map tasks run in parallel. In this phase, data stored in HDFS is split and sent to the mappers. The default input format is the text format, which breaks the input up line by line.
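By default each split corresponds roughly to one HDFS block, which in turn determines how many map tasks the framework launches. A small sketch, assuming the file-based input formats above, of bounding the split size from the driver (the 64 MB / 128 MB values are just illustrative):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    static void configureSplits(Job job) {
        // Bounding the split size changes how many map tasks are created:
        // smaller splits mean more (shorter) map tasks, larger splits mean fewer.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB lower bound
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB upper bound
    }
}
```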

Map Phase

Transforms the input splits into key/value pairs based on user-defined code. Every record in a split is passed to the mapper, and all of the data the mapper emits is organized by key.
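A minimal sketch of a mapper in the word-count style used by the Hadoop tutorial; it matches the hypothetical `WordCountMapper` referenced in the driver above:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: <byte offset, line of text> from TextInputFormat.
// Output: <word, 1> for every token found in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate key/value pair
        }
    }
}
```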

Combiner Phase:

It is a local reducer that runs after the map phase but before the shuffle and sort phase. It is an optimization, as it saves network bandwidth by running the reduce locally on each node. Generally it is the same reduce code, just executed after the map phase instead of after shuffle and sort.

A combiner is like a mini reducer function that allows us to perform a local aggregation of the map output before it is transferred to the reduce phase. Basically, it is used to optimize network bandwidth usage during a MapReduce job by cutting down the amount of data that is transferred from a mapper to the reducer.
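A combiner is registered on the job; a small sketch, reusing the hypothetical `WordCountReducer` from the other examples, since summing counts is commutative and associative and therefore safe to run locally first:

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerExample {
    static void configure(Job job) {
        // Reuse the reducer as the combiner: partial sums computed on the map
        // side produce the same final result while shipping far less data.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
    }
}
```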

Shuffle and Sort Phase:

Moves the map outputs to the reducers and sorts them by key. It is carried out on the data nodes.

Shuffle: partitions and groups the data by key.

Sort: sorts the data and sends it to the reducers.

The shuffle and sort phase takes up most of the network bandwidth and uses the data nodes to shuffle and sort the data.
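The decision of which reducer receives a given key is made by a Partitioner. A minimal sketch of a custom partitioner that mirrors what Hadoop's default HashPartitioner effectively does (hash the key modulo the number of reduce tasks, so identical keys always land on the same reducer):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer receives a given intermediate key during the shuffle.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

If used, it would be registered in the driver with `job.setPartitionerClass(WordPartitioner.class)`.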

Reduce Phase (Reducers):

It aggregates the key/value pairs based on user-defined code. It receives the sorted data and reduces it to a result. A reducer function receives an iterator over the input values for each key; it combines these values together, returning a single output value.
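A minimal sketch of the matching reducer, again in the word-count style, corresponding to the hypothetical `WordCountReducer` named in the driver and combiner examples:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: <word, [1, 1, 1, ...]> — all values for one key, grouped by shuffle and sort.
// Output: <word, total count>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();       // combine the iterated values into one result
        }
        total.set(sum);
        context.write(word, total);   // emit a single aggregated value per key
    }
}
```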

(input) <k1, v1> -> map -> list(<k2, v2>) -> shuffle/sort -> <k2, list(v2)> -> reduce -> list(<k3, v3>) -> (output)

Figure: Hadoop MapReduce Framework (Hadoop in Action)

Joins using MapReduce Framework

There are three types of joins that can be used to join tables in MapReduce: reduce-side joins, map-side joins, and memory-backed joins.

Map-Side Join

A map-side join performs the join before the data reaches the map function. It has strong prerequisites that must be satisfied before data can be joined on the map side:

  • Data should be partitioned and sorted in a particular way.
  • Each input data set should be divided into the same number of partitions.
  • Each input must be sorted on the same key.
  • All the records for a particular key must reside in the same partition.

Reduce Side Join

A reduce-side join occurs on the reducer side and is also called a repartitioned join or repartitioned sort-merge join; it is the most commonly used join type. Because this join is performed on the reduce side, the data has to go through the shuffle and sort phase, which incurs network overhead. To keep it simple, these are the pieces needed for a reduce-side join (a sketch follows the list):

  • Data source refers to the source data files, typically extracted from an RDBMS.
  • A tag is attached to every record to name its source, so that the source can be identified at any given point, whether in the map or the reduce phase; why this is required is covered later.
  • The group key is the column used as the join key between the two data sources.
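A rough sketch of the pattern, under the assumption of two comma-separated inputs (hypothetically named "customers" and "orders") whose first field is the group key: the mapper tags each record with its source file, and the reducer receives all tagged records for one key and emits the joined combinations.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: tags every record with its data source so the reducer can tell
// which side of the join it came from.
class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Use the input file name (e.g. "customers" or "orders") as the tag.
        String source = ((FileSplit) context.getInputSplit()).getPath().getName();
        String[] fields = line.toString().split(",", 2);
        String groupKey = fields[0];
        context.write(new Text(groupKey), new Text(source + "\t" + line));
    }
}

// Reducer: receives every tagged record for one group key (the shuffle and
// sort phase guarantees this) and emits the joined combinations.
class JoiningReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text groupKey, Iterable<Text> taggedRecords, Context context)
            throws IOException, InterruptedException {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Text tagged : taggedRecords) {
            String[] parts = tagged.toString().split("\t", 2);
            if (parts[0].startsWith("customers")) {   // hypothetical source name
                left.add(parts[1]);
            } else {
                right.add(parts[1]);
            }
        }
        for (String l : left) {
            for (String r : right) {
                context.write(groupKey, new Text(l + "," + r));  // inner-join output
            }
        }
    }
}
```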

Memory Backed Join

A memory-backed join is used for small tables that can fit in the memory of the data nodes. Among these joins, the reduce-side join is the efficient one, as it joins the tables based on keys that are shuffled and sorted before reaching the reducer. Hadoop sends identical keys to the same reducer, so by default the data is already organized for the join.
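A common way to implement a memory-backed join is the distributed cache: the small table is shipped to every node (for example via `job.addCacheFile(new URI("/lookup/departments.csv"))` in the driver; the path is illustrative) and loaded into a hash map once per mapper, so the join happens entirely on the map side. A hedged sketch, assuming a comma-separated key,value lookup file:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Loads the small cached table into memory in setup(), then joins each
// input record against it while mapping (no shuffle needed for the join).
public class MemoryBackedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        for (URI cacheFile : context.getCacheFiles()) {
            // Cached files are localized into the task's working directory.
            String name = new Path(cacheFile.getPath()).getName();
            try (BufferedReader reader = new BufferedReader(new FileReader(name))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",", 2);   // key,value per line
                    lookup.put(fields[0], fields[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",", 2);
        String joined = lookup.getOrDefault(fields[0], "UNKNOWN");
        context.write(new Text(fields[0]), new Text(fields[1] + "," + joined));
    }
}
```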

Map-side join and its advantages

A map-side join is a process in which two data sets are joined by the mapper.

The advantages of using a map-side join in MapReduce are as follows:

  • Map-side join helps minimize the cost incurred for sorting and merging in the shuffle and reduce stages.
  • Map-side join also helps improve the performance of the task by decreasing the time to finish it.

References

Apache Hadoop

Gautam, N. "Analyzing Access Logs Data using Stream Based Architecture." Master's thesis, North Dakota State University, 2018.
