Introduction
This is my
first article related to Azure. This article is dedicated to my father Late
Subal Chandra Das, who always inspire me to do something new. I missed you lot
Baba.
This article
is related to the general architecture of Azure Data Lake. Hope it will be a
good foundation to start with Azure Data Lake. The article is a representation
of my understanding with Azure Data Lake.
In coming
days we are going to be more advanced with it. Hope it will be informative.
What is Data Lake
Before jump
into Azure Data Lake, we have to understand the concept behind Data Lake.
A data lake
is a storage repository that holds a vast amount of raw data in its native
format until it is needed. While a hierarchical data warehouse stores data in
files or folders, a data lake uses a flat architecture to store data. Each data
element in a lake is assigned a unique identifier and tagged with a set of
extended metadata tags. When a business question arises, the data lake can be
queried for relevant data, and that smaller set of data can then be analyzed to
help answer the question.
A data lake,
on the other hand, maintains data in their native formats and handles the three
Vs of big data (Volume, Velocity and Variety) while providing tools for
analysis, querying, and processing. Data Lake eliminates all the restrictions
of a typical data warehouse system by providing unlimited space, unrestricted
file size, schema on read, and various ways to access data (including
programming, SQL-like queries, and REST calls).
With the
emergence of Hadoop (including HDFS and YARN), the benefits of data lake –
previously available only to the most resource-rich companies like Google,
Yahoo, and Facebook – became a practical reality for just about anyone. Now,
organizations who had been generating and gathering data on a large scale but
had struggled to store and process them in a meaningful way, have more options.
Feature of Azure Data Lake
Azure Data
Lake is a new kind of data lake bock from Microsoft Azure. The features that it
offers are mentioned below.
• The ability to store and analyze data
of any kind and size.
• Multiple access methods including
U-SQL, Spark, Hive, HBase, and Storm.
• Built on YARN and HDFS.
• Dynamic scaling to match your
business priorities.
• Enterprise-grade security with Azure
Active Directory.
• Managed and supported with an
enterprise-grade SLA.
Parts of Azure Data Lake
Broadly the
Azure Data Lake is classified into three parts
Azure Data Lake Store
The Data Lake
store provides a single repository where organizations upload data of just
about infinite volume. The store is designed for high-performance processing
and analytics from HDFS applications and tools, including support for low
latency workloads. In the store, data can be shared for collaboration with
enterprise-grade security.
Azure Data Lake analytics
Data Lake
analytics is a distributed analytics service built on Apache YARN that
compliments the Data Lake store. The analytics service can handle jobs of any
scale instantly with on-demand processing power and a pay-as-you-go model
that’s very cost effective for short term or on-demand jobs. It includes a
scalable distributed runtime called U-SQL, a language that unifies the benefits
of SQL with the expressive power of user code.
Azure HDInsight
Azure
HDInsight is a full stack Hadoop Platform as a Service from Azure. Built on top
of Hortonworks Data Platform (HDP), it provides Apache Hadoop, Spark, HBase,
and Storm clusters.
References:
Hope you like
it.
Posted
by: MR. JOYDEEP DAS