This topic describes the Hadoop ecosystem in detail.
The Hadoop ecosystem is neither a programming language nor a single service; it is a platform of frameworks for solving big data problems. Hadoop is best known for MapReduce and for its distributed file system, HDFS (originally called NDFS).
HDFS stands for Hadoop Distributed File System and is the core storage component of the Hadoop ecosystem. HDFS is Hadoop's primary storage system and distributes data across the nodes of a cluster.
HDFS provides scalable, fault-tolerant, reliable and cost-efficient storage for big data. It can store many types of large data sets (structured, semi-structured and unstructured).
HDFS is a distributed file system that runs on commodity hardware. It stores data across multiple nodes and maintains metadata (a record of where the data is stored). HDFS is the default storage layer in most Hadoop installations.
Users interact with HDFS through shell-like commands. HDFS has two core components: the NameNode and the DataNode. The NameNode stores the metadata, while the DataNodes store the actual data blocks.
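The metadata/data split can be pictured with a small sketch (plain Python with made-up names, purely illustrative; real HDFS is far more involved):

```python
# Toy illustration of the NameNode/DataNode split (not real HDFS code).
# The "NameNode" keeps only metadata: which blocks make up a file and
# which "DataNodes" hold replicas of each block. DataNodes hold the bytes.

class ToyNameNode:
    def __init__(self):
        self.file_to_blocks = {}    # file path -> list of block ids
        self.block_locations = {}   # block id  -> list of datanode names

    def add_file(self, path, blocks, replicas):
        self.file_to_blocks[path] = blocks
        for b in blocks:
            self.block_locations[b] = replicas

    def locate(self, path):
        """Return the (block, datanodes) pairs a client would read from."""
        return [(b, self.block_locations[b]) for b in self.file_to_blocks[path]]

nn = ToyNameNode()
nn.add_file("/logs/app.log", ["blk_1", "blk_2"], ["dn1", "dn2", "dn3"])
print(nn.locate("/logs/app.log"))
```

A client first asks the NameNode where a file's blocks live, then reads the bytes directly from the DataNodes.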
MapReduce is Hadoop's programming model and the core processing component of the ecosystem, as it supplies the processing logic. It is a software framework that helps in writing applications that process large data sets.
MapReduce programs run parallel algorithms across the distributed Hadoop environment, improving the speed and reliability of the cluster through parallel processing. A MapReduce job has two phases: the Map phase and the Reduce phase.
Each phase takes key-value pairs as input and produces key-value pairs as output. The programmer supplies two functions: a map function and a reduce function. The map function takes a set of data and converts it into tuples (key/value pairs).
The reduce function takes the map output as its input, combines the tuples that share a key, and produces the aggregated value for each key.
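The two phases can be sketched in plain Python with the classic word-count example (an illustration of the model, not Hadoop's actual Java API):

```python
from itertools import groupby
from operator import itemgetter

# Map phase: each input line becomes a set of (key, value) tuples.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Reduce phase: tuples sharing a key are combined into one result.
def reduce_fn(key, values):
    return (key, sum(values))

lines = ["big data big deal", "big data"]
mapped = [pair for line in lines for pair in map_fn(line)]
# Between map and reduce, the framework shuffles and sorts by key.
mapped.sort(key=itemgetter(0))
counts = [reduce_fn(k, [v for _, v in grp])
          for k, grp in groupby(mapped, key=itemgetter(0))]
print(counts)   # [('big', 3), ('data', 2), ('deal', 1)]
```

In a real cluster the shuffle/sort step runs across machines; here it is just a local sort.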
Hadoop Streaming -
The Hadoop Streaming utility lets developers write MapReduce code in languages other than Java. It is a generic API that allows Mappers and Reducers to be written in any language, such as C, C++, Perl or Python. Mappers and Reducers receive their input on stdin and emit their output on stdout as (key, value) pairs. Streaming is a good fit for text processing.
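A minimal Streaming-style mapper and reducer in Python exchange tab-separated key-value lines; the sketch below factors the logic into testable functions (in a real Streaming job each stage would read `sys.stdin` and print to stdout):

```python
# Streaming mapper: one "word<TAB>1" line per word.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# Streaming reducer: Hadoop sorts mapper output by key, so equal keys
# arrive consecutively and can be summed with a running total.
def reducer(sorted_lines):
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rsplit("\t", 1)
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

# Demonstrate both stages on in-memory lines (stand-in for stdin):
mapped = sorted(mapper(["big data big", "data"]))
print(list(reducer(mapped)))   # ['big\t2', 'data\t2']
```

The framework provides the sort between the stages; the reducer only needs to handle consecutive runs of equal keys.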
Apache Hive is an open-source system for querying and analyzing large datasets stored in Hadoop. Hive reads, writes and manages large data sets in a distributed environment through an SQL-like interface, using a language called Hive Query Language (HQL) that is similar to SQL.
Hive automatically translates HiveQL queries into MapReduce jobs that execute on Hadoop. The Hive command-line interface is used to run HQL statements. Hive scales well to very large batch workloads, though it is not designed for low-latency, real-time queries.
HiveQL supports all primitive SQL data types. Hive's main components are -
- Metastore – stores the metadata.
- Driver – manages the lifecycle of a HiveQL statement.
- Query compiler – compiles HiveQL into a directed acyclic graph (DAG).
- Hive server – provides a Thrift interface and a JDBC/ODBC server.
Apache Pig is a high-level language platform for analyzing and querying large datasets stored in HDFS. Pig has two parts: Pig Latin, the language, and the Pig runtime, its execution environment.
A Pig Latin script is conceptually similar to SQL: it loads the data, applies the required filters and dumps the data in the required format. Pig requires a Java runtime environment to execute programs.
Apache Pig's features include extensibility, optimization opportunities and the ability to handle all kinds of data. Pig also offers good price/performance and high availability.
Sqoop imports data from external sources into Hadoop ecosystem components such as HDFS, HBase or Hive, and also exports data from Hadoop back to external sources. Each map task is a subtask that imports part of the data into the Hadoop ecosystem.
When a job is submitted, it is mapped into map tasks, each of which brings a chunk of the data into HDFS. On export, the map tasks instead write chunks of data from HDFS to the structured destination; combined, the chunks form the whole data set at the destination.
Sqoop works with relational databases such as Teradata, Netezza, Oracle and MySQL. Its features include direct import to ORC files, efficient data analysis, fast data copying, importing sequential datasets from mainframes, and parallel data transfer. Sqoop provides bidirectional data transfer between Hadoop and relational databases.
Oozie is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work (UOW); in effect, it acts as a scheduler for Hadoop jobs.
The Oozie framework is fully integrated with the Apache Hadoop stack and YARN, and supports Hadoop jobs for Apache MapReduce, Pig, Hive and Sqoop. Oozie is scalable and can manage the timely execution of workflows in a Hadoop cluster.
Oozie is flexible: jobs can easily be started, stopped, suspended and rerun. Oozie provides if-then-else branching and control flow within Hadoop jobs.
There are two kinds of Oozie jobs -
- Oozie Workflow – a sequential set of actions to be executed. It stores and runs workflows composed of Hadoop jobs, e.g. MapReduce, Pig and Hive.
- Oozie Coordinator – jobs that are triggered when data becomes available to them. A coordinator runs workflow jobs based on predefined schedules and data availability.
HBase is an open-source, scalable, non-relational distributed database, i.e. a NoSQL database, built on top of HDFS. HBase was designed to store structured data in tables that can have billions of rows and millions of columns.
HBase supports all types of data: structured, semi-structured and unstructured. It provides real-time read and write access to data in HDFS. HBase was designed to run on top of HDFS and provide Bigtable-like capabilities.
It is accessible through a Java API and has ODBC and JDBC drivers. HBase has two components: the HBase Master and the Region Servers. The HBase Master is not part of the actual data storage; it negotiates load balancing across all Region Servers.
Region Servers are the worker nodes that handle read, write, update and delete requests from clients.
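HBase's logical data model (row key → column family → column qualifier → value) can be pictured as nested maps; a toy sketch in Python, not the real HBase client API:

```python
# Toy sketch of HBase's logical data model as nested dicts:
#   table[row_key][column_family][qualifier] = value
# A real HBase cell is also versioned with a timestamp, and rows are
# kept sorted by row key so they can be split into regions.
from collections import defaultdict

def make_table():
    return defaultdict(lambda: defaultdict(dict))

users = make_table()                      # table and values are made up
users["row-001"]["info"]["name"] = "Ada"
users["row-001"]["info"]["email"] = "ada@example.com"
users["row-001"]["metrics"]["logins"] = 42

# Reads address a single cell by (row, family, qualifier):
print(users["row-001"]["info"]["name"])   # Ada
```

Sorting rows by key is what lets HBase partition a huge table into contiguous regions served by different Region Servers.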
Flume efficiently collects, aggregates and moves large amounts of data from its origin into HDFS. Flume is a distributed, reliable, available and fault-tolerant service that allows data to flow from a source into the Hadoop environment.
Using Flume, data from multiple servers can be moved into Hadoop immediately. Flume also helps transfer online streaming data from sources such as network traffic, social media, email messages and log files into HDFS. Flume is a real-time loader for streaming data into Hadoop.
Mahout is renowned for machine learning. It is an open-source framework for creating scalable machine learning algorithms and a data mining library. Machine learning algorithms build self-learning systems that improve from data without being explicitly programmed.
Mahout performs collaborative filtering, clustering and classification. It provides data science tools to automatically find meaningful patterns in big data sets stored in HDFS, and is used for predictive analytics and other advanced analysis.
Mahout algorithms are -
- Classification - It learns from existing categorizations and assigns unclassified items to the best category.
- Clustering - It organizes items into naturally occurring groups, so that items in the same group are similar to each other.
- Collaborative filtering - It mines user behavior and makes product recommendations.
- Frequent itemset mining - It analyzes which objects are likely to appear together.
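The collaborative-filtering idea can be illustrated in plain Python (this is a sketch of the technique, not Mahout's API; users, items and ratings are invented): recommend items favored by users whose rating vectors are similar.

```python
from math import sqrt

# Toy user-based collaborative filtering. Ratings are made up.
ratings = {
    "alice": {"hadoop": 5, "hive": 4},
    "bob":   {"hadoop": 5, "hive": 5, "pig": 4},
    "carol": {"pig": 2, "hive": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm

def recommend(user):
    """Score unseen items by similarity-weighted ratings of other users."""
    seen = set(ratings[user])
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * r
    return max(scores, key=scores.get) if scores else None

print(recommend("alice"))   # 'pig' (bob is most similar and rated pig highly)
```

Mahout runs this kind of computation at scale across a cluster; the logic above is the single-machine essence of it.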
YARN stands for Yet Another Resource Negotiator. YARN is a core part of Hadoop (since version 2), and it can also act as a standalone resource manager for other distributed frameworks.
YARN is the framework responsible for providing the computational resources needed to execute applications. YARN consists of two important elements: the Resource Manager and the Node Manager.
Resource Manager -
One Resource Manager is assigned per cluster and acts as its master. The Resource Manager knows where the slave nodes are located and how many resources they have, and it runs several services.
The most important of these services is the Resource Scheduler, which decides how to assign resources. The Resource Manager does this through its Scheduler and Applications Manager components.
Node Manager -
Several Node Managers can be assigned to one cluster; the Node Manager is the slave side of the infrastructure. Each Node Manager periodically sends a heartbeat to the Resource Manager.
The Node Manager runs the tasks that the YARN scheduler assigns to its node, and it reports CPU, memory, disk and network usage to the Resource Manager, which uses that information to decide where to direct new tasks.
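The scheduling idea can be sketched as follows (illustrative Python, not YARN's real scheduler; node names and numbers are invented): the Resource Manager places a container request on a node whose reported free resources can fit it.

```python
# Toy resource scheduler: each NodeManager heartbeat reports free
# memory (MB) and vcores; the scheduler picks a node that fits the
# request, preferring the most free memory. Not real YARN code.
nodes = {
    "node1": {"mem": 4096, "vcores": 2},
    "node2": {"mem": 8192, "vcores": 4},
    "node3": {"mem": 1024, "vcores": 1},
}

def schedule(request_mem, request_vcores):
    fits = [n for n, free in nodes.items()
            if free["mem"] >= request_mem and free["vcores"] >= request_vcores]
    if not fits:
        return None                            # request must wait
    chosen = max(fits, key=lambda n: nodes[n]["mem"])
    nodes[chosen]["mem"] -= request_mem        # reserve the resources
    nodes[chosen]["vcores"] -= request_vcores
    return chosen

print(schedule(2048, 1))   # 'node2' (most free memory)
print(schedule(7000, 1))   # None (no node has 7000 MB free now)
```

Real YARN schedulers (Capacity, Fair) add queues, priorities and locality preferences on top of this basic fitting step.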
HCatalog is a table and storage management layer for Hadoop. HCatalog enables users of different data processing tools, such as Pig and MapReduce, to easily read and write data on the grid.
Users can load tables directly with Pig or MapReduce without needing to redefine input schemas. HCatalog exposes the tabular data of its metastore to other Hadoop applications, enabling tools other than Hive to access Hive's metastore tables.
Users need not worry about where or in what format their data is stored. HCatalog's table abstraction presents a relational view of data in the Hadoop Distributed File System (HDFS) to users.
HCatalog can display data from RCFile, text or sequence files in a tabular view, and it provides APIs so that external systems can access table metadata.
Apache Spark is both a programming model and a computing framework for real-time data analytics in a distributed computing environment. It performs in-memory computation to speed up data processing relative to MapReduce, which is a big reason for its popularity. Spark is an alternative to MapReduce that lets workloads execute in memory instead of on disk.
Using in-memory computing, Spark workloads typically run 10 to 100 times faster than disk-based execution. Spark can also be used independently of Hadoop. Spark supports SQL, which helps overcome a shortcoming of core Hadoop, and its programming environment offers interactive shells for Scala, Python and R.
Apache Drill -
Apache Drill processes large-scale data, including structured and semi-structured data. It is a low-latency distributed query engine designed to scale to several thousand nodes and query petabytes of data.
Drill has a specialized memory management system that eliminates garbage collection and optimizes memory allocation and usage. Apache Drill can drill into almost any kind of data. Drill is open source and works well with Hive, allowing developers to reuse their existing Hive deployments.
The main power of Apache Drill lies in combining a variety of data stores in a single query. Its features include extensibility, flexibility, decentralized metadata and dynamic schema discovery.
Ambari is a management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. This includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.
Ambari provides a consistent, secure platform for operational control. Its features include simplified installation, configuration and management; centralized security setup; high extensibility and customizability; and full visibility into cluster health.
Apache ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. ZooKeeper manages and coordinates the various services in a distributed environment.
It saves a lot of time by handling synchronization, configuration maintenance, grouping and naming. ZooKeeper is fast for workloads where reads are more common than writes, and it maintains a record of all transactions.