Sunday, May 24, 2020

Hadoop Ecosystem - Simple Terms

Hadoop is a framework that enables distributed processing of large data sets across clusters of computers. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

There are four major elements of Hadoop, i.e. HDFS, YARN, MapReduce, and Hadoop Common:
  1. HDFS: Hadoop Distributed File System (Distributed Data Storage)
  2. YARN: Yet Another Resource Negotiator
  3. MapReduce: Programming based Data Processing
  4. Hadoop Common: Shared Libraries and Utilities used by the other modules

The rest of the ecosystem:
  1. Spark: In-Memory Data Processing
  2. PIG, HIVE: Query based Processing of Data Services
  3. HBase: NoSQL Database
  4. Mahout, Spark MLlib: Machine Learning Algorithm Libraries
  5. Solr, Lucene: Searching and Indexing
  6. Zookeeper: Managing Cluster, Synchronization Tool
  7. Oozie: Job Scheduling, Workflow Scheduler System
  8. Sqoop: Structured Data Importing and Exporting Utility
  9. Flume: Data Ingestion Tool for unstructured and semi-structured data in HDFS
  10. Ambari: Tool for Managing and Securing Hadoop clusters
  11. Avro: RPC and Data Serialization Framework



What is HDFS:

HDFS is a distributed file system specially designed for storing huge datasets on commodity hardware.

HDFS is the primary storage component of the Hadoop ecosystem. It is responsible for storing large data sets of structured or unstructured data across various nodes, and it maintains the metadata about that data in the form of log files.

It’s a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.

HDFS consists of two core components, i.e. Name Node & Data Node:
  • Name Node is the prime node. It holds the metadata (data about data) and needs comparatively fewer resources than the Data Nodes, which store the actual data.
  • Data Nodes run on commodity hardware in the distributed environment, which makes Hadoop cost effective.
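
To picture the split between the Name Node and the Data Nodes, here is a minimal Python sketch. It is a toy model and not the HDFS API: the class name `ToyNameNode`, the tiny block size, the replication factor of 3 and the node names are all assumptions made up for this example.

```python
from itertools import cycle


class ToyNameNode:
    """Toy model: the Name Node keeps only metadata, the Data Nodes keep the bytes."""

    def __init__(self, data_nodes, block_size=4, replication=3):
        self.data_nodes = list(data_nodes)
        self.block_size = block_size        # bytes per block (tiny, for illustration)
        self.replication = replication      # number of copies kept of each block
        self.metadata = {}                  # file name -> [(block_id, [node names])]
        # Stands in for the disks of the Data Nodes: node name -> {block_id: bytes}
        self.storage = {node: {} for node in self.data_nodes}
        self._start = cycle(range(len(self.data_nodes)))

    def put(self, name, data):
        """Split data into blocks and place each block on several distinct nodes."""
        entries = []
        copies = min(self.replication, len(self.data_nodes))
        for offset in range(0, len(data), self.block_size):
            block_id = f"{name}#{offset // self.block_size}"
            first = next(self._start)
            nodes = [self.data_nodes[(first + k) % len(self.data_nodes)]
                     for k in range(copies)]
            for node in nodes:
                self.storage[node][block_id] = data[offset:offset + self.block_size]
            entries.append((block_id, nodes))
        self.metadata[name] = entries

    def get(self, name, failed=()):
        """Rebuild a file from any live replica of each block."""
        parts = []
        for block_id, nodes in self.metadata[name]:
            live = [node for node in nodes if node not in failed]
            if not live:
                raise IOError(f"all replicas of {block_id} are lost")
            parts.append(self.storage[live[0]][block_id])
        return b"".join(parts)
```

In this sketch, a file stays readable as long as at least one replica of every block sits on a live node, which is the basic idea behind the fault tolerance mentioned above.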

What is MapReduce:

Hadoop MapReduce provides data processing.
MapReduce is a software framework for easily creating applications that transform big data sets (stored in HDFS) into manageable ones.

MapReduce programs are parallel in nature, so they are very useful for analyzing large-scale data using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.


MapReduce works by breaking the processing into two phases:
  • Map phase
  • Reduce phase

Map() performs sorting and filtering of the data and organizes it into groups. Map generates its result as key-value pairs, which are later processed by the Reduce() method.

Reduce(), as the name suggests, does the summarization by aggregating the mapped data.
In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
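
The two phases can be sketched with the classic word count example. This is plain Python running on one machine, so it only mimics the flow of a MapReduce job. The function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are assumptions for illustration and not part of Hadoop.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key; in Hadoop the framework does this between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single, smaller result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster each line (or block) could be mapped on a different machine.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

For example, `word_count(["big data", "Big cluster"])` counts "big" twice and the other two words once each.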

What is YARN: (Yet Another Resource Negotiator)

YARN provides resource management. It is called the Operating System of Hadoop, as it is responsible for managing and monitoring workloads.
YARN helps to manage resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.

YARN consists of three major components, i.e.
  • Resource Manager
  • Node Manager
  • Application Master (sometimes called Application Manager)

The Resource Manager has the privilege of allocating resources to the applications in the system.
Node Managers handle the allocation of resources such as CPU, memory and bandwidth on each machine, and report back to the Resource Manager.
The Application Master works as an interface between the Resource Manager and the Node Managers and negotiates resources as per the requirements of the application.
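
A rough way to picture this division of work is the toy scheduler below. It is not the YARN API: `ToyResourceManager`, `run_application`, the node names and the memory sizes are made up for this sketch, and real YARN schedulers take far more into account than free memory.

```python
class ToyResourceManager:
    """Toy model of the Resource Manager: it tracks free memory per node."""

    def __init__(self, node_memory_mb):
        # node name -> free memory in MB, as the Node Managers would report it
        self.free = dict(node_memory_mb)

    def allocate(self, app_id, memory_mb):
        """Grant a container on the node with the most free memory, or None if none fits."""
        node = max(self.free, key=self.free.get)
        if self.free[node] < memory_mb:
            return None
        self.free[node] -= memory_mb
        return (app_id, node, memory_mb)

    def release(self, container):
        """Give the memory of a finished container back to its node."""
        _, node, memory_mb = container
        self.free[node] += memory_mb


def run_application(rm, app_id, task_memory_mb):
    """Application side: negotiate one container per task with the Resource Manager."""
    granted, pending = [], []
    for memory_mb in task_memory_mb:
        container = rm.allocate(app_id, memory_mb)
        if container is None:
            pending.append(memory_mb)   # would be retried once resources free up
        else:
            granted.append(container)
    return granted, pending
```

Tasks that do not fit anywhere simply stay pending in this sketch until some container is released.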

What is HIVE:

Hive is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
Hive uses a language called HiveQL (HQL) for reading and writing large data sets.

Hive performs three main functions: data summarization, query, and analysis.

Hive automatically translates SQL-like HiveQL queries into MapReduce jobs, which then execute on Hadoop. Hive also supports most of the common SQL data types, which makes query processing easier.

It is highly scalable and supports both batch processing and interactive querying.
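
To get a feel for how a SQL-like query can turn into map and reduce steps, here is a small Python sketch. The query, the `employees` table and its columns are invented for the example, and the code only mimics the idea. It is not what Hive actually generates.

```python
from collections import defaultdict

# The kind of query one might write in HiveQL (hypothetical table and columns).
QUERY = "SELECT dept, AVG(salary) FROM employees GROUP BY dept"

employees = [
    {"name": "asha", "dept": "eng", "salary": 100},
    {"name": "ben", "dept": "eng", "salary": 120},
    {"name": "chen", "dept": "ops", "salary": 80},
]


def map_row(row):
    """Map step: the GROUP BY column becomes the key, the aggregated column the value."""
    return row["dept"], row["salary"]


def reduce_group(dept, salaries):
    """Reduce step: apply the aggregate function (AVG here) to one group."""
    return dept, sum(salaries) / len(salaries)


def run(rows):
    groups = defaultdict(list)
    for key, value in map(map_row, rows):
        groups[key].append(value)
    return dict(reduce_group(dept, salaries) for dept, salaries in groups.items())
```

Calling `run(employees)` gives one average salary per department, which is what the query in `QUERY` asks for.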

What is PIG:

Apache Pig is a high-level language platform for analyzing and querying huge datasets that are stored in HDFS. Pig was originally developed at Yahoo and uses the Pig Latin language, a query-based language that resembles SQL. It loads the data, applies the required filters and dumps the data in the required format. To execute programs, Pig requires a Java runtime environment.

How does Pig work?
  1. First, the load command loads the data.
  2. At the backend, the compiler converts the Pig Latin script into a sequence of MapReduce jobs.
  3. Over this data, we perform various functions like joining, sorting, grouping, filtering, etc.
  4. Finally, you can dump the output on the screen or store it in an HDFS file.
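
The load, transform and dump flow described above can be imitated in a few lines of Python. This is only a sketch of the data flow, not Pig Latin and not how Pig runs internally. The sample data and the helper names `load`, `filter_rows`, `group_sum` and `dump` are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up sample input; with Pig this would be a file in HDFS.
RAW = "user,country,amount\nann,IN,30\nbob,US,5\ncat,IN,20\n"


def load(text):
    """Step 1: load the data into rows."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_rows(rows, min_amount):
    """Step 3a: keep only the rows that pass the filter."""
    return [row for row in rows if int(row["amount"]) >= min_amount]


def group_sum(rows, key, field):
    """Step 3b: group by one column and sum another."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += int(row[field])
    return dict(totals)


def dump(result):
    """Step 4: print the output on the screen."""
    for key in sorted(result):
        print(key, result[key])
```

A typical call chains the steps, for example `dump(group_sum(filter_rows(load(RAW), 10), "country", "amount"))`.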

What is Apache Spark:

Apache Spark unifies all kinds of Big Data processing under one umbrella.
Apache Spark is lightning fast. It gives good performance for both batch and stream processing.
It handles processing-intensive tasks like batch processing, iterative and real-time processing, graph processing, visualization, machine learning, etc., using built-in libraries.

It does this with the help of a DAG scheduler, a query optimizer, and a physical execution engine.
Spark offers over 80 high-level operators, which make it easy to build parallel applications. It has various libraries, like MLlib for machine learning, GraphX for graph processing, SQL and DataFrames, and Spark Streaming.
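
One idea behind the DAG scheduler is that transformations are only recorded at first and get executed when a result is actually requested. The sketch below imitates that in plain Python. `ToyDataset` is a made-up class for this post, not the Spark API, and it runs on a single machine.

```python
class ToyDataset:
    """Records transformations lazily and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)      # the recorded plan: (kind, function) pairs

    def map(self, fn):
        """Transformation: nothing is computed yet, the step is only recorded."""
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        """Transformation: also only recorded."""
        return ToyDataset(self._source, self._steps + (("filter", predicate),))

    def plan(self):
        """Show the recorded chain of steps (a very simple, linear 'DAG')."""
        return [kind for kind, _ in self._steps]

    def collect(self):
        """Action: now the whole chain is executed in one pass over the data."""
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)
```

For example, `ToyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)` does no work until `collect()` is called on it.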

One can write Spark applications using SQL, R, Python, Scala, and Java. Scala is the native language of Spark.
One can run Spark in its standalone cluster mode, on Hadoop YARN, on Mesos, or on Kubernetes.

Difference between Spark & Hadoop MapReduce:
Spark is best suited for real-time data or real-time processing, whereas
Hadoop MapReduce is best suited for structured data or batch processing; hence many companies use both side by side.

Spark uses in-memory data processing, which makes it faster than Hadoop MapReduce.

What is Mahout:

Mahout adds machine learning capability to a system or application. It provides libraries for collaborative filtering, clustering, and classification, which are all machine learning techniques. It lets us invoke these algorithms as needed through its own libraries.

Apache HBase:

It’s a NoSQL database that supports all kinds of data and is thus capable of handling anything stored in a Hadoop database. It provides capabilities similar to Google’s BigTable, so it is able to work on Big Data sets effectively.

At times when we need to search for or retrieve a few occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a fault-tolerant way of storing sparse data.

Solr, Lucene:
These are two services that perform searching and indexing with the help of Java libraries. Lucene is a Java-based library that also provides a spell check mechanism, and Solr is built on top of Lucene.

  • Solr is Highly Scalable, Reliable and Fault Tolerant.
  • It provides Distributed Indexing, Automated Failover and Recovery, Load-Balanced Query, Centralized Configuration and much more.
  • Solr provides Matching Capabilities like phrases, wildcards, grouping, joining and much more.
  • Solr takes advantage of Lucene’s near Real-Time indexing. It enables you to see your content when you want to see it.
  • You can query Solr using HTTP GET and receive the result in JSON, binary, CSV and XML.
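
As a small illustration of the last point, the sketch below builds the URL for an HTTP GET query against Solr's `select` handler. The host, the port and the collection name `articles` are assumptions for the example. The helper `solr_select_url` is made up and only builds the URL, it does not send the request.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, query, rows=10, fmt="json"):
    """Build a query URL: q is the search query, rows the page size, wt the response format."""
    params = {"q": query, "rows": rows, "wt": fmt}
    return f"{base_url.rstrip('/')}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical local Solr instance and collection.
url = solr_select_url("http://localhost:8983", "articles", 'title:"big data"')
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client. Changing `fmt` to `xml` or `csv` asks for the response in that format instead.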

Zookeeper: 
Coordination and synchronization among the resources and components of Hadoop used to be a huge issue, which often resulted in inconsistency.
Zookeeper overcame these problems by providing synchronization, inter-component communication, grouping, and maintenance.

Oozie:

Oozie simply performs the task of a scheduler: it schedules jobs and binds them together as a single unit. There are two kinds of jobs, i.e. Oozie Workflow jobs and Oozie Coordinator jobs.

Oozie Workflow jobs are jobs that need to be executed in a sequentially ordered manner, whereas
Oozie Coordinator jobs are those that are triggered when some data or external stimulus is given to them.
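
The difference between the two kinds of jobs can be sketched in Python. Real Oozie jobs are defined in XML files, so this is only a toy model. The step names, the input path format and both helper functions are invented for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A workflow as "step -> steps it depends on" (hypothetical step names).
workflow = {
    "import_data": set(),
    "clean_data": {"import_data"},
    "aggregate": {"clean_data"},
    "export_report": {"aggregate"},
}


def execution_order(steps):
    """Workflow job: run the steps in an order that respects their dependencies."""
    return list(TopologicalSorter(steps).static_order())


def coordinator_should_trigger(available_inputs, required_inputs):
    """Coordinator job: start the workflow only once all required data has arrived."""
    return set(required_inputs) <= set(available_inputs)
```

In this picture, the coordinator decides when a run starts and the workflow decides in which order the steps of that run execute.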

Hadoop - Simple Use Case:

Hope this helps!!!!

Arun Manglick
