Sunday, June 7, 2020

Azure Data Factory & Azure Databricks

This post covers how to ingest, prepare, and transform data using Azure Databricks and Azure Data Factory.

ETL/ELT workflows (extract, transform, load and extract, load, transform) let businesses ingest data in various forms and shapes from different on-prem and cloud data sources, transform and shape that data, and gain actionable insights to make important business decisions.

With the general availability of Azure Databricks comes support for doing ETL/ELT with Azure Data Factory.

This integration allows you to build ETL/ELT workflows using Data Factory pipelines that do the following:
  1. Ingest data at large scale from 70+ on-prem/cloud data sources.
  2. Prepare and transform (clean, sort, merge, join, etc.) the ingested data in Azure Databricks as a Notebook activity step in Data Factory pipelines.
  3. Monitor and manage your end-to-end (E2E) workflow.
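
As a rough illustration of step 2, here is a minimal Python sketch that assembles the JSON definition of a pipeline containing a single Databricks Notebook activity. The activity type `DatabricksNotebook` and the `notebookPath` / `baseParameters` properties follow the Data Factory pipeline JSON format; the pipeline name, notebook path, linked service name, and parameter names are made up for this example, and the helper `build_notebook_pipeline` is hypothetical, not part of any SDK.

```python
import json


def build_notebook_pipeline(pipeline_name, notebook_path, linked_service, parameters):
    """Return a pipeline definition (as a dict) with one Databricks Notebook activity."""
    activity = {
        "name": "PrepareAndTransform",  # illustrative activity name
        "type": "DatabricksNotebook",
        "linkedServiceName": {
            "referenceName": linked_service,  # name of an Azure Databricks linked service
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            # Values passed to the notebook as parameters
            "baseParameters": dict(parameters),
        },
    }
    return {"name": pipeline_name, "properties": {"activities": [activity]}}


# All names and paths below are placeholders.
pipeline = build_notebook_pipeline(
    pipeline_name="IngestPrepareTransform",
    notebook_path="/Shared/prepare_and_transform",
    linked_service="AzureDatabricksLinkedService",
    parameters={"input_path": "raw/sales", "output_path": "curated/sales"},
)
print(json.dumps(pipeline, indent=2))
```

The resulting JSON could be pasted into the Data Factory authoring UI or deployed with your tool of choice; a real pipeline would normally place a Copy activity for ingestion ahead of the notebook step.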

What is Databricks: 

  • Databricks is a company founded by the original creators of Apache Spark. 
  • Databricks grew out of the AMPLab project at the University of California, Berkeley, which created Apache Spark, an open-source distributed computing framework written largely in Scala.
  • Databricks is a cloud-based data engineering tool used for processing and transforming massive quantities of data and for exploring that data through machine learning models. Recently added to Azure, it is the latest big data tool for the Microsoft cloud.

What is Azure Databricks: 

Azure Databricks is an Apache Spark-based Analytics Platform optimized for the Microsoft Azure cloud services platform.

Azure Databricks is a fast, easy, and collaborative Apache Spark-based Analytics Service.

Design/Flow:
  1. For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Kafka, Event Hubs, or IoT Hub.
  2. This data lands in a data lake for long-term persistent storage, in Azure Blob Storage or Azure Data Lake Storage.
  3. Prepare and transform (clean, sort, merge, join, etc.) the ingested data in Azure Databricks as a Notebook activity step in Data Factory pipelines.
  4. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse.
  5. Finally, Apache Spark is used to turn this data into insights.
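
To make the "clean, sort, merge, join" wording in step 3 concrete, here is a small plain-Python stand-in that runs without a cluster. It is only a sketch: in an actual Databricks notebook the same steps would typically be expressed on Spark DataFrames (for example with `dropna`, `join`, and `orderBy`), and the sample records, field names, and helper functions below are invented for illustration.

```python
def clean(rows, required):
    """Drop rows that are missing any of the required fields."""
    return [r for r in rows if all(r.get(k) not in (None, "") for k in required)]


def join(left, right, key):
    """Inner-join two lists of dicts on a shared key."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical ingested data
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 40.0},
    {"order_id": 2, "customer_id": "c2", "amount": None},  # removed by clean()
    {"order_id": 3, "customer_id": "c1", "amount": 15.5},
]
customers = [
    {"customer_id": "c1", "region": "West"},
    {"customer_id": "c2", "region": "East"},
]

# Clean, join, then sort by amount (ascending)
prepared = sorted(
    join(clean(orders, ["customer_id", "amount"]), customers, "customer_id"),
    key=lambda r: r["amount"],
)
for row in prepared:
    print(row)
```

Here the order with the missing amount is dropped, the remaining orders pick up the customer's region, and the result is ordered by amount.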

Role of Apache Spark in Azure Databricks: 

Azure Databricks builds on the capabilities of Apache Spark by providing a zero-management cloud platform that includes:

  • Fully managed Spark clusters
  • An interactive workspace for exploration and visualization
  • A platform for powering your favorite Spark-based applications

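Azure Databricks comprises the complete open-source Apache Spark cluster technologies and capabilities. Spark in Azure Databricks includes the following components:

  • Spark SQL and DataFrames: the Spark module for working with structured data
  • Streaming: real-time data processing and analysis
  • MLlib: the machine learning library
  • GraphX: graphs and graph computation
  • Spark Core API: support for R, SQL, Python, Scala, and Java
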
Hope this helps!!

Arun Manglick