Hadoop 2 was a complete overhaul of Hadoop 1. With Hadoop 2.0, the Apache Software Foundation introduced MapReduce 2.0 (MRv2), also known as Apache Hadoop YARN: a sub-project of Hadoop that separates the resource management and processing components.
The main differences are categorized below:
Daemons

Daemons | Hadoop 1 | Hadoop 2 |
---|---|---|
HDFS | NameNode, Secondary NameNode, DataNode | NameNode, Secondary NameNode, DataNode |
Processing | MR1 (JobTracker, TaskTracker) | MR2 (YARN: ResourceManager, NodeManager) |
Web UI Default Ports
Services | Hadoop 1 | Hadoop 2 |
---|---|---|
HDFS : NameNode | 50070 | 50070 |
MapReduce 1 : Job Tracker | 50030 | -- |
YARN : Resource Manager | -- | 8088 |
YARN : MapReduce Job History Server | -- | 19888 |
Directory Structure
Files | Hadoop 1 | Hadoop 2 |
---|---|---|
user commands | $HADOOP_HOME/bin | $HADOOP_HOME/bin |
admin commands (start-* and stop-* scripts) | $HADOOP_HOME/bin | $HADOOP_HOME/sbin |
configuration Files | $HADOOP_HOME/conf | $HADOOP_HOME/etc/hadoop |
jar files | $HADOOP_HOME/lib | $HADOOP_HOME/share/hadoop (jar files live in component-specific sub-directories: common, hdfs, mapreduce, yarn) |
Start/Stop Scripts
Task | Hadoop 1 | Hadoop 2 |
---|---|---|
start HDFS | $HADOOP_HOME/bin/start-dfs.sh or $HADOOP_HOME/bin/hadoop-daemon.sh start namenode | $HADOOP_HOME/sbin/start-dfs.sh or $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode |
start Map Reduce | $HADOOP_HOME/bin/start-mapred.sh | $HADOOP_HOME/sbin/start-yarn.sh |
start everything | $HADOOP_HOME/bin/start-all.sh | $HADOOP_HOME/sbin/start-all.sh |
Configuration Files
Task | Hadoop 1 | Hadoop 2 |
---|---|---|
Core | $HADOOP_HOME/conf/core-site.xml | $HADOOP_HOME/etc/hadoop/core-site.xml |
HDFS | $HADOOP_HOME/conf/hdfs-site.xml | $HADOOP_HOME/etc/hadoop/hdfs-site.xml |
MapReduce | $HADOOP_HOME/conf/mapred-site.xml | $HADOOP_HOME/etc/hadoop/mapred-site.xml |
YARN | -- | $HADOOP_HOME/etc/hadoop/yarn-site.xml |
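As an illustration of the new YARN configuration, a Hadoop 2 cluster typically needs a mapred-site.xml that points MapReduce at YARN and a yarn-site.xml that enables the shuffle service. A minimal sketch (the two property names are the standard ones; everything else in a real cluster, such as ResourceManager addresses, is omitted here):

```xml
<!-- $HADOOP_HOME/etc/hadoop/mapred-site.xml : run MapReduce jobs on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- $HADOOP_HOME/etc/hadoop/yarn-site.xml : auxiliary shuffle service for MR2 -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```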
MapReduce: old API vs new API (introduced in Hadoop 0.20, carried forward into 1.x and 2.x)
Feature | Old API | New API |
---|---|---|
Mapper & Reducer | Mapper and Reducer are interfaces (they still exist in the new API for backward compatibility) | Mapper and Reducer are classes, so a method with a default implementation can be added later without breaking existing subclasses |
Package | org.apache.hadoop.mapred | org.apache.hadoop.mapreduce |
Communication with the MapReduce system | The JobConf, OutputCollector, and Reporter objects are used to communicate with the MapReduce system | A single Context object is used to communicate with the MapReduce system |
Mapper & Reducer execution control | Mappers can control execution by providing a MapRunnable, but no equivalent exists for reducers | Both mappers and reducers can control the execution flow by overriding the run() method |
Job Control | Done through the JobClient class (which does not exist in the new API) | Done through the Job class |
Job Configuration | Done through the JobConf object, an extension of the Configuration class (java.lang.Object, extended by org.apache.hadoop.conf.Configuration, extended by org.apache.hadoop.mapred.JobConf) | Done through the Configuration class, via some of the helper methods on Job |
Output file name | Both map and reduce outputs are named part-nnnnn | Map outputs are named part-m-nnnnn, and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero). |
reduce() method passes values | reduce() method passes values as a java.lang.Iterator | reduce() method passes values as a java.lang.Iterable |
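The last row is easiest to see with plain JDK types: an old-API reduce() receives its values as a java.lang.Iterator and must loop manually, while a new-API reduce() receives a java.lang.Iterable and can use the enhanced for loop. A minimal sketch using only the JDK (the method names sumOldStyle and sumNewStyle are illustrative, not Hadoop APIs):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceStyles {
    // Old-API style: values arrive as an Iterator, so iteration is manual.
    static int sumOldStyle(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // New-API style: values arrive as an Iterable, so the enhanced
    // for loop works directly on the value collection for a key.
    static int sumNewStyle(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> counts = Arrays.asList(1, 2, 3); // e.g. counts for one key
        System.out.println(sumOldStyle(counts.iterator())); // prints 6
        System.out.println(sumNewStyle(counts));            // prints 6
    }
}
```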