Arun Manglick - Big Data

This post is about how to ingest, prepare, and transform data using Azure Databricks and Azure Data Factory.

ETL/ELT workflows (extract-transform-load and extract-load-transform) allow businesses to ingest data in various forms and shapes from different on-prem/cloud data sources, transform and shape that data, and gain actionable insights from it to make important business decisions.

With the general availability of Azure Databricks comes support for doing ETL/ELT with Azure Data Factory.

This integration allows you to formulate ETL/ELT workflows using data factory pipelines that do the following:

Ingest data at large scale using 70+ on-prem/cloud data sources
Prepare and transform (clean, sort, merge, join, etc.) the ingested data in Azure Databricks as a Notebook activity step in data factory pipelines.
Monitor and manage your E2E workflow
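
To make the Notebook activity step a little more concrete, here is a minimal sketch of what such an activity definition could look like, built as a Python dictionary and printed as JSON. The layout (type "DatabricksNotebook", notebookPath, baseParameters) follows the Data Factory activity format as I understand it, so do verify it against the current documentation. The activity name, notebook path, linked service name, and parameters are all made up for illustration.

```python
import json


def notebook_activity(name, notebook_path, linked_service, parameters=None):
    """Build a Data Factory style definition of a Databricks Notebook activity."""
    return {
        "name": name,
        "type": "DatabricksNotebook",
        "linkedServiceName": {
            "referenceName": linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            # Values passed to the notebook at run time.
            "baseParameters": dict(parameters or {}),
        },
    }


# All names below are hypothetical placeholders.
activity = notebook_activity(
    name="PrepareAndTransform",
    notebook_path="/Shared/prepare_and_transform",
    linked_service="AzureDatabricksLinkedService",
    parameters={"input_path": "raw/sales", "output_path": "curated/sales"},
)

print(json.dumps(activity, indent=2))
```

An activity like this would sit inside the "activities" list of a pipeline, after the copy step that ingests the data.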

What is Databricks:

Databricks is a company founded by the original creators of Apache Spark.
Databricks grew out of the AMPLab project at the University of California, Berkeley, which created Apache Spark, an open-source distributed computing framework written in Scala.
Databricks is an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring the data through machine learning models. Recently added to Azure, it is the latest big data tool for the Microsoft cloud.

What is Azure Databricks:

Azure Databricks is an Apache Spark-based Analytics Platform optimized for the Microsoft Azure cloud services platform.

Azure Databricks is a fast, easy, and collaborative Apache Spark-based Analytics Service.

Design/Flow:

For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed near real-time using Kafka, Event Hub, or IoT Hub.
This data lands in a Data Lake for long-term persistent storage, in Azure Blob Storage or Azure Data Lake Storage.
Prepare and Transform (clean, sort, merge, join, etc.) the ingested data in Azure Databricks as a Notebook activity step in data factory pipelines.
Now, as part of your analytics workflow, use Azure Databricks to read data from multiple data sources such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse.
Finally, the data is turned into breakthrough insights using Apache Spark.
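
The prepare and transform step in the flow above (clean, join, aggregate) can be sketched in a few lines. This is plain Python on made-up records, only to show the shape of the work. In Azure Databricks the same steps would normally be written against Spark DataFrames inside a notebook, and the field names here are assumptions.

```python
def clean(rows):
    """Drop rows without a customer id and normalise the country code."""
    return [
        {**row, "country": row["country"].strip().upper()}
        for row in rows
        if row.get("customer_id") is not None
    ]


def join(orders, customers):
    """Attach the customer name to every order that has a matching customer."""
    by_id = {customer["customer_id"]: customer for customer in customers}
    return [
        {**order, "name": by_id[order["customer_id"]]["name"]}
        for order in orders
        if order["customer_id"] in by_id
    ]


def total_by_country(rows):
    """Aggregate the order amounts per country, sorted by country code."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0) + row["amount"]
    return dict(sorted(totals.items()))


# Hypothetical raw data as it might land in the data lake.
raw_orders = [
    {"customer_id": 1, "country": " in ", "amount": 120},
    {"customer_id": 2, "country": "us", "amount": 80},
    {"customer_id": None, "country": "us", "amount": 999},
    {"customer_id": 1, "country": "IN", "amount": 30},
]
customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Bob"},
]

prepared = join(clean(raw_orders), customers)
insights = total_by_country(prepared)
print(insights)
```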

Role of Apache Spark in Azure Databricks:

Azure Databricks builds on the capabilities of Apache Spark by providing a zero-management cloud platform that includes:

Fully managed Spark clusters
An interactive workspace for exploration and visualization
A platform for powering your favorite Spark-based applications

Azure Databricks comprises the complete open-source Apache Spark cluster technologies and capabilities. Spark in Azure Databricks includes the following components: Spark SQL and DataFrames, Streaming, MLlib, GraphX, and the Spark Core API.

Hope this helps!!

Arun Manglick

Hadoop is a framework that enables the processing of large data sets distributed across clusters of machines. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common.

HDFS: Hadoop Distributed File System (Distributed Data Storage)
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Rest of the ecosystem:

Spark: In-Memory Data Processing
PIG, HIVE: Query based Processing of Data Services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning Algorithm Libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing Cluster, Synchronization Tool
Oozie: Job Scheduling - Workflow Scheduler System
Sqoop: Structured Data Importing and Exporting Utility
Flume: Data Ingestion Tool for unstructured and semi-structured data in HDFS
Ambari: Tool for Managing and Securing Hadoop clusters
Avro: RPC and Data Serialization Framework

What is HDFS:

HDFS is a distributed filesystem that runs on commodity hardware.

HDFS is specially designed for storing huge datasets on commodity hardware.

HDFS is the primary component of the Hadoop ecosystem. It is responsible for storing large data sets of structured or unstructured data across various nodes and for maintaining the metadata in the form of log files.

It is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for big data.

HDFS consists of two core components: the Name Node and the Data Nodes.

The Name Node is the prime node. It holds the metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which store the actual data.
Data Nodes run on commodity hardware in the distributed environment, which makes Hadoop cost effective.
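
The split between metadata and actual data can be shown with a toy model. This is only a sketch under my own assumptions: the file is cut into fixed-size blocks, each block is copied to more than one Data Node, and the Name Node remembers only where the blocks live. The tiny block size, the replication count of 2, and the round-robin placement are simplifications for illustration and not how real HDFS chooses nodes.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they are."""

    def __init__(self):
        self.block_map = {}  # file name -> list of (block id, [data node names])


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


def put(name_node, data_nodes, filename, data, block_size=4, replication=2):
    """Split data into blocks and store each block on `replication` nodes."""
    placements = []
    for offset in range(0, len(data), block_size):
        number = offset // block_size
        block_id = f"{filename}#{number}"
        chunk = data[offset:offset + block_size]
        # Toy placement: round-robin over the data nodes.
        targets = [
            data_nodes[(number + copy) % len(data_nodes)]
            for copy in range(replication)
        ]
        for node in targets:
            node.blocks[block_id] = chunk
        placements.append((block_id, [node.name for node in targets]))
    name_node.block_map[filename] = placements


def get(name_node, data_nodes, filename):
    """Ask the name node where the blocks are, then read them from data nodes."""
    by_name = {node.name: node for node in data_nodes}
    parts = [
        by_name[holders[0]].blocks[block_id]
        for block_id, holders in name_node.block_map[filename]
    ]
    return b"".join(parts)


name_node = NameNode()
data_nodes = [DataNode(f"dn{i}") for i in range(3)]
put(name_node, data_nodes, "log.txt", b"hello hdfs!")
print(name_node.block_map["log.txt"])
print(get(name_node, data_nodes, "log.txt"))
```

Note that the Name Node never touches the file contents, which is why it needs far fewer resources than the Data Nodes.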

What is MapReduce:

Hadoop MapReduce provides data processing.

MapReduce is a software framework for easily creating applications that transform big data sets (stored in HDFS) into manageable ones.

MapReduce programs are parallel in nature, and are therefore very useful for analyzing large-scale data using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.

MapReduce works by breaking the processing into two phases:

Map phase
Reduce phase

Map() filters the input data and organizes it into groups. It generates its result as key-value pairs, which are sorted by key and later processed by the Reduce() method.

Reduce(), as the name suggests, summarizes the mapped data by aggregating it.

Put simply, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
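
The two phases are easiest to see with the classic word count. The sketch below runs in a single Python process, so it only imitates the data flow and not the parallelism. The function names and the in-between shuffle step (grouping the values by key) are my own naming for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key so each key reaches a single reduce call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single tuple."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    # Keys are processed in sorted order, as in the sort step before reduce.
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


counts = word_count(["big data big insight", "data lake"])
print(counts)
```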

What is YARN: (Yet Another Resource Negotiator)

YARN provides resource management. It is often called the operating system of Hadoop, as it is responsible for managing and monitoring workloads.

YARN manages resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.

It consists of three major components:

Resource Manager
Node Manager
Application Master

The Resource Manager has the privilege of allocating resources to the applications in the system.
Node Managers handle the allocation of resources such as CPU, memory, and bandwidth on each machine and report back to the Resource Manager.
The Application Master works as an interface between the Resource Manager and the Node Managers and negotiates resources as required.
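
Here is a rough sketch of how these three roles interact, with memory as the only resource. It is a toy model and not the YARN API: the class names, the negotiate_containers function, and the "pick the node with the most free memory" rule are assumptions made for illustration.

```python
class NodeManager:
    """Tracks the free memory of one machine."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Decides which node gets each requested container."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        """Return the chosen node name, or None when nothing fits."""
        candidates = [node for node in self.nodes if node.free_mb >= memory_mb]
        if not candidates:
            return None
        # Toy policy: the node with the most free memory wins.
        node = max(candidates, key=lambda candidate: candidate.free_mb)
        node.free_mb -= memory_mb
        return node.name


def negotiate_containers(resource_manager, requests_mb):
    """Application side: ask the resource manager for one container per request."""
    return [resource_manager.allocate(memory_mb) for memory_mb in requests_mb]


resource_manager = ResourceManager(
    [NodeManager("node-a", 4096), NodeManager("node-b", 2048)]
)
granted = negotiate_containers(resource_manager, [2048, 2048, 2048, 1024])
print(granted)
```

The last request gets nothing because both machines are full, which is the kind of situation a real scheduler handles by queueing the request.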

What is HIVE:

Hive is an open-source data warehouse system for analyzing and querying large datasets stored in Hadoop files.
Hive uses a language called HiveQL (HQL) for reading and writing large data sets.

Hive performs three main functions: data summarization, query, and analysis.

Hive automatically translates SQL-like HiveQL queries into MapReduce jobs that execute on Hadoop. Hive also supports most common SQL data types, which makes query processing easier.

It is highly scalable and supports both batch processing and interactive queries over large data sets.
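
To get a feel for how a SQL-like query maps onto map and reduce steps, the sketch below runs the same summarization twice: once as a GROUP BY query and once as a hand-written map/reduce. Python's built-in sqlite3 is used purely as a stand-in for Hive here, and the sales table and its columns are made up. This is not how Hive is invoked, only the idea behind the translation.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sales records: (country, amount).
rows = [("IN", 120), ("US", 80), ("IN", 30)]


def summarize_with_sql(records):
    """Declarative version: let the engine work out how to group and sum."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE sales (country TEXT, amount INTEGER)")
    connection.executemany("INSERT INTO sales VALUES (?, ?)", records)
    result = connection.execute(
        "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
    ).fetchall()
    connection.close()
    return result


def summarize_with_map_reduce(records):
    """Same result spelled out as map (emit key/value) and reduce (sum per key)."""
    groups = defaultdict(list)
    for country, amount in records:  # map: emit (country, amount)
        groups[country].append(amount)
    return [(country, sum(amounts)) for country, amounts in sorted(groups.items())]


print(summarize_with_sql(rows))
print(summarize_with_map_reduce(rows))
```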

What is PIG:

Apache Pig is a high-level language platform for analyzing and querying huge datasets that are stored in HDFS. As a component of the Hadoop ecosystem, Pig uses the Pig Latin language, a query-based language that plays a role similar to SQL. It loads the data, applies the required filters, and dumps the data in the required format. To execute programs, Pig requires a Java runtime environment.

Pig was originally developed at Yahoo.

How does Pig work?

First, the load command loads the data.
At the backend, the compiler converts the Pig Latin script into a sequence of MapReduce jobs.
Over this data, we perform various functions like joining, sorting, grouping, filtering etc.
Now, you can dump the output on the screen or store it in an HDFS file.
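
The load, transform, and dump steps above can be mirrored in plain Python. Each function carries a comment with a roughly equivalent Pig Latin statement. The sample data, the file name sales.csv, and the field names are hypothetical, and the Python only imitates the data flow, not the MapReduce compilation.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a file stored in HDFS.
RAW = "alice,IN,120\nbob,US,80\ncarol,IN,30\ndave,US,5\n"


def load(text):
    # Pig Latin: sales = LOAD 'sales.csv' USING PigStorage(',') AS (name, country, amount:int);
    return [
        (name, country, int(amount))
        for name, country, amount in csv.reader(io.StringIO(text))
    ]


def filter_min_amount(rows, minimum):
    # Pig Latin: big = FILTER sales BY amount >= 10;
    return [row for row in rows if row[2] >= minimum]


def group_by_country(rows):
    # Pig Latin: grouped = GROUP big BY country;
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return dict(groups)


def dump(groups):
    # Pig Latin: DUMP grouped;
    for country in sorted(groups):
        print(country, groups[country])


grouped = group_by_country(filter_min_amount(load(RAW), 10))
dump(grouped)
```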

What is Apache Spark:

Apache Spark unifies all kinds of Big Data processing under one umbrella.

Apache Spark is lightning fast. It gives good performance for both batch and stream processing.

It handles processing-intensive tasks such as batch processing, iterative and real-time processing, graph processing, visualization, and machine learning using built-in libraries.

It does this with the help of a DAG scheduler, a query optimizer, and a physical execution engine.

Spark offers over 80 high-level operators that make it easy to build parallel applications. It has various libraries, such as MLlib for machine learning, GraphX for graph processing, Spark SQL and DataFrames, and Spark Streaming.

One can write Spark applications using SQL, R, Python, Scala, and Java. Scala is the native language of Spark.

One can run Spark in standalone cluster mode, on Hadoop (YARN), on Mesos, or on Kubernetes.
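
One idea behind the DAG scheduler mentioned above is that transformations are only recorded as a plan, and nothing runs until a result is actually requested. The toy class below imitates that behaviour in plain Python. LazyDataset and its methods are invented for this sketch and are not the Spark API, which is far more sophisticated (partitioning, optimization, distributed execution).

```python
class LazyDataset:
    """Records transformations and runs them only when collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, function):
        return LazyDataset(self._source, self._steps + (("map", function),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def plan(self):
        """The recorded chain of steps, in execution order."""
        return [kind for kind, _ in self._steps]

    def collect(self):
        """Action: push every item through the recorded steps in one pass."""
        results = []
        for item in self._source:
            keep = True
            for kind, function in self._steps:
                if kind == "map":
                    item = function(item)
                elif not function(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


seen = []


def square(value):
    seen.append(value)  # record when the function really runs
    return value * value


dataset = LazyDataset(range(6)).map(square).filter(lambda value: value % 2 == 0)
seen_before_action = list(seen)  # still empty: nothing has executed yet
result = dataset.collect()
print(dataset.plan(), seen_before_action, result)
```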

Difference between Spark and Hadoop MapReduce:

Spark is best suited for real-time data and real-time processing, whereas Hadoop MapReduce is best suited for batch processing of large stored data sets; hence most companies use the two side by side.

Spark uses in-memory data processing, which makes it faster than Hadoop MapReduce.

What is Mahout:

Mahout brings machine learning capabilities to a system or application. It provides libraries for collaborative filtering, clustering, and classification, which are core machine learning techniques. It lets us invoke these algorithms as needed through its own libraries.

Apache HBase:

It is a NoSQL database that supports all kinds of data and is thus capable of handling anything in a Hadoop database. It provides capabilities similar to Google’s BigTable and is thus able to work on big data sets effectively.

At times we need to search for or retrieve a small amount of data in a huge database, and the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a fault-tolerant way of storing sparse data.
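
The quick lookup of something small in a huge table works because rows are kept sorted by row key. The toy table below sketches that idea: a point lookup by key, and a scan over a key prefix. ToyTable, the "family:qualifier" column naming, and the sample keys are assumptions for illustration. Real HBase adds regions, versions, and persistence on HDFS.

```python
import bisect


class ToyTable:
    """Rows sorted by key; each row stores only the columns it actually has."""

    def __init__(self):
        self._keys = []  # sorted row keys
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point lookup by row key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, prefix):
        """Return all rows whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            matches.append((key, dict(self._rows[key])))
        return matches


table = ToyTable()
table.put("user#42", "info:name", "Asha")
table.put("user#42", "info:city", "Pune")
table.put("user#7", "info:name", "Bob")  # sparse: this row has no city column
table.put("order#1", "data:total", "150")

print(table.get("user#42"))
print(table.scan("user#"))
```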

Solr, Lucene:

These two services perform the task of searching and indexing with the help of Java libraries. Lucene is a Java-based library that also provides a spell-check mechanism, and Solr is built on top of Lucene.

Solr is Highly Scalable, Reliable and Fault Tolerant.
It provides Distributed Indexing, Automated Failover and Recovery, Load-Balanced Query, Centralized Configuration and much more.
Solr provides Matching Capabilities like phrases, wildcards, grouping, joining and much more.
Solr takes advantage of Lucene’s near Real-Time indexing. It enables you to see your content when you want to see it.
You can query Solr using HTTP GET and receive the result in JSON, binary, CSV and XML.
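
As a small illustration of the HTTP GET style of querying, the sketch below only builds the request URL and does not send it. It assumes a Solr server on localhost:8983 and a collection called "articles", both made up. The select handler and the q, rows, and wt parameters are the standard ones as far as I know, but do check them against your Solr version.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, query, rows=10, response_format="json"):
    """Build the URL for a Solr select query (the URL is not requested here)."""
    params = urlencode({"q": query, "rows": rows, "wt": response_format})
    return f"{base_url}/solr/{collection}/select?{params}"


# Hypothetical server address and collection name.
url = solr_select_url("http://localhost:8983", "articles", 'title:"big data"', rows=5)
print(url)
```

Switching response_format to "xml" or "csv" would ask for the other output formats mentioned above.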

Zookeeper:

Coordination and synchronization among the resources and components of Hadoop used to be a huge management issue, and it often resulted in inconsistency.

Zookeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance.

Oozie:

Oozie simply performs the task of a scheduler: it schedules jobs and binds them together as a single unit. There are two kinds of jobs, i.e., Oozie Workflow jobs and Oozie Coordinator jobs.

Oozie Workflow jobs are those that need to be executed in a sequentially ordered manner, whereas

Oozie Coordinator jobs are those that are triggered when some data or external stimulus is given to them.
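
The two kinds of jobs can be sketched as follows: a workflow is a set of actions with an order to respect, and a coordinator is a check that decides when to start. This is plain Python using the standard graphlib module (Python 3.9+), not Oozie's XML definitions, and the action names and input paths are hypothetical.

```python
from graphlib import TopologicalSorter


def workflow_order(dependencies):
    """Workflow: return the actions in an order that respects dependencies.

    dependencies maps each action to the set of actions that must finish first.
    """
    return list(TopologicalSorter(dependencies).static_order())


def coordinator_should_trigger(required_inputs, available_inputs):
    """Coordinator: start the workflow only when all required data has arrived."""
    return set(required_inputs) <= set(available_inputs)


# Hypothetical three-step workflow.
dependencies = {
    "ingest": set(),
    "transform": {"ingest"},
    "report": {"transform"},
}
order = workflow_order(dependencies)
print(order)

required = ["/data/sales/2024-01-01"]
print(coordinator_should_trigger(required, []))
print(coordinator_should_trigger(required, ["/data/sales/2024-01-01"]))
```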

Hadoop - Simple Use Case: