Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications. Hadoop can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.
Hadoop and big data
Hadoop runs on clusters of commodity servers and can scale up to support thousands of hardware nodes and massive amounts of data. It uses a namesake distributed file system that's designed to provide rapid data access across the nodes in a cluster, plus fault-tolerant capabilities so applications can continue to run if individual nodes fail. Consequently, Hadoop became a foundational data management platform for big data analytics uses after it emerged in the mid-2000s.
Hadoop was created by computer scientists Doug Cutting and Mike Cafarella, initially to support processing in the Nutch open source search engine and web crawler. After Google published technical papers detailing its Google File System (GFS) and MapReduce programming framework in 2003 and 2004, respectively, Cutting and Cafarella modified earlier technology plans and developed a Java-based MapReduce implementation and a file system modeled on Google's.
Components of Hadoop
Core Hadoop, including HDFS, MapReduce, and YARN, is part of the foundation of Cloudera's platform. All platform components have access to the same data stored in HDFS and participate in shared resource management via YARN.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the underlying file system of a Hadoop cluster. It provides scalable, fault-tolerant, rack-aware data storage designed to be deployed on commodity hardware. Several attributes set HDFS apart from other distributed file systems. Data in a Hadoop cluster is broken down into smaller units (called blocks) and distributed throughout the cluster. Each block is duplicated twice (for a total of three copies), with the two replicas stored on two nodes in a rack somewhere else in the cluster.
MapReduce is a framework tailor-made for processing large datasets in a distributed fashion across multiple machines. The core of a MapReduce job can be, err, reduced to three operations: map an input data set into a collection of pairs, shuffle the resulting data (transfer data to the reducers), then reduce over all pairs with the same key. The top-level unit of work in MapReduce is a job. Each job is composed of one or more map or reduce tasks.
Yet Another Resource Negotiator (YARN)
YARN (Yet Another Resource Negotiator) is the framework responsible for assigning computational resources for application execution. YARN uses some very common terms in uncommon ways. For example, when most people hear “container”, they think Docker. In the Hadoop ecosystem, it takes on a new meaning: a Resource Container (RC) represents a collection of physical resources. It is an abstraction used to bundle resources into distinct, allocatable units.