Apache Hadoop

distributed data processing framework

Apache Hadoop is a collection of open-source software tools that speeds up the processing of very large amounts of data by distributing the work across multiple computers. Its core consists of:

  1. a storage part: the Hadoop Distributed File System (HDFS). HDFS splits files into blocks, replicates those blocks across the computers in the cluster, and presents them behind a single namespace, abstracting away the distributed nature of the storage so that data can be read, written and moved around the cluster more easily (a minimal sketch follows this list).
  2. a processing part: Hadoop MapReduce (Hadoop's implementation of the MapReduce programming model). This is used to define and distribute the transformations in the intended data processing pipeline; it runs in three phases (a word-count sketch follows this list).
    1. Map: the Map phase organises the input data into key:value pairs according to a user-defined map function, and
    2. Shuffle: a process called Shuffle then uses those keys to sort and redistribute the pairs across the nodes of the cluster, so that all values sharing a key end up on the same node. (Hadoop's related notion of data locality refers to scheduling computation on the nodes that already hold the relevant data, rather than shipping the data to the computation.)
    3. Reduce: the Reduce phase then applies a user-defined transformation to the grouped values, perhaps just a mathematical operation like addition, with each node working only on the keys it holds, like specialisation in economics. The results are then gathered and finally written back to HDFS.
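
To a client, an HDFS file sits behind a single path, however many machines its blocks are spread across. As a rough illustration, here is a minimal sketch of writing and then reading a small file through Hadoop's Java FileSystem API; the file path is hypothetical, and the configuration is assumed to point at an HDFS cluster via fs.defaultFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS (e.g. in core-site.xml) points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt"); // hypothetical HDFS path

        // Write: HDFS splits and replicates the blocks behind the scenes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read it back as if it were a single local file.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```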

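The standard illustration of the Map, Shuffle and Reduce phases is a word count. The sketch below follows the shape of the widely used Hadoop WordCount example: the mapper emits (word, 1) pairs, the shuffle groups every pair sharing a word onto the same reducer, and the reducer adds the counts up and writes the totals back to HDFS. The input and output paths supplied on the command line are assumptions.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the shuffle has already grouped all counts for a word together;
    // sum them and write the total back out to HDFS.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // assumed HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // assumed output path; must not exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```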

The distributed nature of Hadoop means that processing:

  • can be done in parallel rather than only sequentially, waiting for one step to finish before another begins.
  • is far more fault-tolerant, because data blocks are replicated and failed tasks can be rerun on other nodes, and so the processing has higher availability.


HDFS is also used as the storage layer for other projects, such as Apache HBase, Apache's NoSQL wide-column store.


Reading from and writing to disk for each transformation means that Hadoop MapReduce is slower for most pipelines than Apache Spark, which keeps intermediate data in RAM rather than on disk and also runs within the Hadoop ecosystem. Spark may, however, cost more in resources such as memory and CPU.
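
For comparison, below is a minimal sketch of the same word count using Spark's Java API (assuming Spark 2 or later, submitted with spark-submit so that the master is supplied externally); the intermediate results of flatMap, mapToPair and reduceByKey stay in memory rather than being written to disk between stages. The input and output paths are assumptions.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read from HDFS; subsequent RDDs are kept in memory between stages.
        JavaRDD<String> lines = sc.textFile(args[0]);   // assumed HDFS input path
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile(args[1]);                 // assumed HDFS output path
        sc.stop();
    }
}
```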

Available to work alongside HDFS and Hadoop MapReduce are other Apache tools in the Hadoop ecosystem, such as:

  • YARN (Yet Another Resource Negotiator), which manages the cluster's resources and schedules jobs across it.
  • Hive, a data warehousing tool which also translates SQL queries into MapReduce or Spark jobs.
  • Pig, which makes writing MapReduce programs easier than writing them in Python, C++ or Java directly.


As of 2013, Hadoop adoption had become widespread: more than half of the Fortune 50 companies used Hadoop.