
Feb 11, 2019

Azure Data Lake Storage Gen2: 10 Things You Need to Know

Posted by Melissa Coates

Azure Data Lake Storage (ADLS) Gen2 reached general availability on February 7, 2019. This post will help you understand its advantages and what you need to know to get started. If you would like to become more familiar with the concepts of a data lake, please also check out our eBook: Data Lakes in a Modern Data Architecture.

10 Things to Know about Azure Data Lake Storage Gen2

1. The data lake story in Azure is unified with the introduction of ADLS Gen2

Prior to the introduction of ADLS Gen2, when we wanted cloud storage in Azure for a data lake implementation, we needed to decide between Azure Data Lake Storage Gen1 (formerly known as Azure Data Lake Store) and Azure Storage (specifically blob storage). This involved weighing the business and technical requirements against the features available in order to decide which service to use. While ADLS Gen1 offers important optimizations for analytic workloads and more granular security (see section 3 for details), Azure Storage has built-in features like geo-redundancy, hot/cool/archive tiers, additional metadata, and broader regional availability which are very compelling. In the past, we either accepted some trade-offs or stored the data twice in certain situations.

The new ADLS Gen2 service is built upon Azure Storage as its foundation. When the hierarchical namespace (HNS) property is enabled (see section 2 for details), an otherwise standard general-purpose v2 storage account becomes ADLS Gen2. For this reason, you will not see ADLS Gen2 listed in Azure as its own service. Since ADLS Gen1 is its own service, this shift has been confusing for many people. There are a couple of ways to verify if ADLS Gen2 is enabled for a storage account:

When viewing the Azure Storage account, if the file system service is displayed this indicates that ADLS Gen2 is supported:

Azure Data Lake Storage account

Or, when viewing the Azure Storage account configuration properties, if the hierarchical namespace (HNS) is enabled, this indicates that ADLS Gen2 is supported:

Azure Data Lake Configuration Properties
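The same check can be scripted. The following is a minimal sketch, assuming you already have the storage account description as a dictionary shaped like the Azure Resource Manager JSON, where the hierarchical namespace flag appears as `properties.isHnsEnabled`. The helper name and the sample account names are hypothetical and used only for illustration.

```python
def is_adls_gen2(account: dict) -> bool:
    """Return True when an account description shows a general-purpose v2
    storage account with the hierarchical namespace (HNS) enabled."""
    props = account.get("properties", {})
    return account.get("kind") == "StorageV2" and props.get("isHnsEnabled") is True


# Hypothetical account descriptions (assumed shape, not real accounts).
sample_gen2 = {
    "name": "examplelake",
    "kind": "StorageV2",
    "properties": {"isHnsEnabled": True},
}
sample_blob = {
    "name": "exampleblob",
    "kind": "StorageV2",
    "properties": {},  # flag absent: flat namespace
}

print(is_adls_gen2(sample_gen2))
print(is_adls_gen2(sample_blob))
```

An absent flag is treated the same as a disabled one, which matches how a standard storage account without the hierarchical namespace would typically be described.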

Key takeaway: When we need a data lake in Azure for an analytics project, we will no longer need to make a choice between multiple independent services. Azure Storage, with the hierarchical namespace enabled, is now the service of choice for building a data lake on Azure cloud storage.

---------------------------------------------------------------------------------------

2. ADLS Gen2 converges the worlds of object storage and hierarchical file storage

Fundamentally, ADLS Gen2 is seeking to take advantage of file system benefits without giving up the type of scalability and cost-effectiveness available with an object store:

Azure Data Lake File Storage
Note that full feature support for ADLS Gen2 is still evolving, as discussed in Section 4. The following diagram represents the longer-term vision:

Azure Data Lake System

The dark blue shading represents new features introduced with ADLS Gen2.

The three new areas depicted above include:

(1) File System. There is a terminology difference with ADLS Gen2. The concept of a container (from blob storage) is referred to as a file system in ADLS Gen2.

(2) Hierarchical Namespace. The hierarchical namespace (HNS), coupled with the DFS endpoint, is what enables the performance and security improvements, which are discussed in Section 3.

(3) DFS Endpoint and File System Driver. ADLS Gen2 utilizes the ABFS driver, which is part of Apache Hadoop. For connectivity to ADLS Gen2, the ABFS driver utilizes the DFS endpoint to invoke performance and security optimizations.

  • ABFS = Azure Blob File System
  • DFS = Distributed File System 

Azure Data Lake Connectivity

At the time of this writing, only the ABFS driver is supported within storage accounts enabled for ADLS Gen2 (i.e., storage accounts with the hierarchical namespace enabled). Full interoperability between the object store model and the file system model is on the roadmap. The evolution of feature support is discussed further in Section 4.
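To make the file system, account, and DFS endpoint pieces concrete, here is a small sketch of how an ABFS address is typically put together: `abfs[s]://<file system>@<account>.dfs.core.windows.net/<path>`. The helper names, the account `examplelake`, and the file system `raw` are hypothetical.

```python
from urllib.parse import urlsplit

DFS_SUFFIX = ".dfs.core.windows.net"  # DFS endpoint host suffix


def abfs_uri(file_system: str, account: str, path: str = "", secure: bool = True) -> str:
    """Build an ABFS URI; 'abfss' is the TLS-secured variant of the scheme."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{file_system}@{account}{DFS_SUFFIX}/{path.lstrip('/')}"


def parse_abfs_uri(uri: str) -> dict:
    """Split an ABFS URI back into file system, account, and path."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri}")
    return {
        "file_system": parts.username,              # the part before '@'
        "account": parts.hostname.split(".")[0],    # first label of the host name
        "path": parts.path.lstrip("/"),
    }


uri = abfs_uri("raw", "examplelake", "sales/2019/02")
print(uri)
print(parse_abfs_uri(uri))
```

Note how the file system (the container, in blob storage terms) sits in front of the `@`, while the account and the DFS endpoint form the host name.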

Key takeaway: The longer-term vision (depicted in the image above), which includes full interoperability between the object store model and the file system model, will allow us to store the data once and access it multiple ways depending on the use case.

--------------------------------------------------------------------------------------- 

3. ADLS Gen2 has significant performance and security advantages for analytical workloads

Both the object store model (such as Azure blob storage) and the hierarchical file system model (ADLS Gen1 and Gen2) are compatible with HDFS (Hadoop Distributed File System). This is achieved with drivers that implement server-side HDFS semantics to translate into remote storage APIs, allowing ADLS Gen2 to behave very similarly to native HDFS. However, there are important distinctions between object storage and hierarchical file system storage in terms of performance and security.

With object storage, folders are virtual only. Although it appears like we can create folders in object storage, they are just mimicked within the URI string (or sometimes metadata is used as an alternative). Although that might initially seem trivial, it has the following implications:

(1) Query Performance. When sending a query that is only retrieving a subset of data, with a hierarchical file system like ADLS Gen2 it is possible to leverage partition scans for data pruning (predicate pushdown). This can improve query performance dramatically for compute engines that understand how to take advantage of partition scans.

Azure Data Lake Query Performance
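The idea behind partition pruning can be sketched in a few lines. This is only an illustration of the concept, not how any particular compute engine implements it. It assumes directories are named with `key=value` segments (a common convention, but an assumption here), and the paths shown are hypothetical.

```python
def partition_values(path: str) -> dict:
    """Extract key=value segments from a directory path."""
    segments = path.strip("/").split("/")
    return dict(seg.split("=", 1) for seg in segments if "=" in seg)


def prune(directories: list, **wanted: str) -> list:
    """Keep only the directories whose partition values match the filter,
    so the files in every other directory are never opened."""
    return [
        d for d in directories
        if all(partition_values(d).get(k) == v for k, v in wanted.items())
    ]


directories = [
    "sales/year=2018/month=12",
    "sales/year=2019/month=01",
    "sales/year=2019/month=02",
]

# A query filtered to February 2019 only needs to scan one directory.
selected = prune(directories, year="2019", month="02")
print(selected)
```

With real directories, the engine can skip entire branches of the hierarchy based on the filter, which is where the savings come from.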

(2) Data Load Performance. Sometimes it is necessary to rename files or relocate files from one directory to another.

With the object store driver, directory operations are not handled as efficiently. If the Temp directory shown in the below image held 10,000 files, relocating them to their permanent directory would involve 10,000 rename operations and 10,000 delete operations, resulting in 20,000 calls.

Conversely, with a file system like ADLS Gen2, when connecting through the DFS endpoint this is a metadata-only operation. This results in significantly improved performance for the data load, particularly at higher data volumes.

Azure Data Lake Data Load Performance

In addition to improving performance, metadata-only operations are ultimately more cost-effective because fewer compute engine resources are required.
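The arithmetic from the 10,000-file example can be reproduced with a toy model. This is a simplified sketch, not the actual driver logic: the flat namespace is modeled as a dictionary keyed by the full object name, the hierarchical namespace as a nested dictionary, and the `Sales` target directory is hypothetical.

```python
def flat_rename_calls(blobs: dict, src: str, dst: str) -> int:
    """Relocate a virtual folder in a flat namespace; return the number of calls."""
    calls = 0
    for name in [n for n in blobs if n.startswith(src + "/")]:
        blobs[dst + name[len(src):]] = blobs[name]  # rename: write under the new name
        calls += 1
        del blobs[name]                             # delete the old object
        calls += 1
    return calls


def hierarchical_rename_calls(tree: dict, src: str, dst: str) -> int:
    """Relocate a real directory: one metadata update, regardless of file count."""
    tree[dst] = tree.pop(src)
    return 1


blobs = {f"Temp/file{i}.csv": b"" for i in range(10_000)}
tree = {"Temp": {f"file{i}.csv": b"" for i in range(10_000)}}

flat_calls = flat_rename_calls(blobs, "Temp", "Sales")
hns_calls = hierarchical_rename_calls(tree, "Temp", "Sales")
print(flat_calls, hns_calls)
```

The flat model grows linearly with the number of files, while the directory rename stays constant, which is why the difference becomes more pronounced at higher data volumes.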

(3) Data Consistency via Atomic Operations. Continuing with the previous example of 10,000 files to be moved, the object store driver does not support atomic operations. If a failure occurred, the data could remain in an inconsistent state. Conversely, a file system like ADLS Gen2 does support atomic operations, via the DFS endpoint, which improves data consistency because the entire operation will succeed or fail as a unit.
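A small simulation shows what an inconsistent state looks like. This sketch uses the same toy model of storage (plain dictionaries), with a simulated failure partway through the move. The function names and the `Sales` directory are hypothetical.

```python
def move_per_file(blobs: dict, src: str, dst: str, fail_after: int) -> None:
    """Move files one at a time; raise after `fail_after` files to mimic an outage."""
    moved = 0
    for name in [n for n in blobs if n.startswith(src + "/")]:
        if moved == fail_after:
            raise ConnectionError("simulated failure")
        blobs[dst + name[len(src):]] = blobs.pop(name)
        moved += 1


def move_directory(tree: dict, src: str, dst: str, fail: bool) -> None:
    """Move a directory as a single operation: it either happens or it does not."""
    if fail:
        raise ConnectionError("simulated failure")
    tree[dst] = tree.pop(src)


blobs = {f"Temp/file{i}.csv": b"" for i in range(10)}
tree = {"Temp": {f"file{i}.csv": b"" for i in range(10)}}

for attempt in (
    lambda: move_per_file(blobs, "Temp", "Sales", fail_after=4),
    lambda: move_directory(tree, "Temp", "Sales", fail=True),
):
    try:
        attempt()
    except ConnectionError:
        pass  # the caller is left with whatever state the failure produced

in_temp = sum(n.startswith("Temp/") for n in blobs)
in_sales = sum(n.startswith("Sales/") for n in blobs)
print(in_temp, in_sales, sorted(tree))
```

After the failure, the per-file move has left files split across both locations, whereas the single directory operation has left the original directory fully intact.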

(4) Granular Security at the Directory and File Level. The hierarchical file system of ADLS Gen2 (and Gen1) is POSIX-compliant. Access control lists (ACLs) can be defined at the directory and file level to define granular security, which offers much-needed flexibility for controlling data-level security. 

Granular Security at the Directory and File Level

See section 6 for additional details about managing security in ADLS Gen2.

Key takeaway: Enabling the hierarchical namespace for an Azure Storage account, along with usage of the ABFS driver for connectivity, is what facilitates file system optimizations which affect performance, data consistency, and security.

 --------------------------------------------------------------------------------------- 

4. Feature support in ADLS Gen2 is still evolving

Although ADLS Gen2 is designated as generally available, there are still quite a few planned features which have not been introduced yet to the service. As is the norm with technology vendors, Microsoft introduces features to market as quickly as possible then iterates to the point of maturity. The initial focus for ADLS Gen2 is supporting the modern data warehouse and advanced analytics scenarios.

Some of the features and functionality not yet released at the time of this writing (mid-February 2019) include:

  • Full interoperability between the Blob Storage API (wasb[s]://) and the new ADLS Gen2 API (abfs[s]://) on an Azure Storage account that has the hierarchical namespace enabled (we cannot yet use both endpoints interchangeably, which will be helpful for backwards compatibility)
  • Usage of the Blob Storage API (wasb[s]://) on an Azure Storage account that has the hierarchical namespace enabled (the Blob Storage API has been disabled)
  • Web-based data explorer (workaround: utilize Azure Storage Explorer 1.6.0 or newer)
  • Many of the built-in Azure Storage features such as snapshots, soft delete, storage tiers (such as hot/cool/archive), lifecycle management, and immutable properties
  • Full PowerShell support for data management operations (i.e., for the data plane)
  • SDKs and misc. APIs (.NET SDK, Python, CLI, etc.)
  • Direct connectivity from Power BI or Azure Analysis Services (workaround: Power BI Dataflows)
  • Full support for logging, auditing, and file system metrics, including Azure Monitor support
  • Integration with Azure Data Lake Analytics (U-SQL)
  • Integration with Azure Data Catalog
  • Destination support from other Azure services such as Azure Stream Analytics, Azure Event Hubs Capture
  • Support from various partners and third parties

Please verify current feature support utilizing these two resources:

Key takeaway: Full interoperability (as depicted in the diagram in section 2) is a critical capability which is still evolving. When it arrives, it will provide significant flexibility to land the data using whichever endpoint is preferred (i.e., to support an unchanged or legacy application or service) and use the new endpoint for analytical processing to gain the performance advantages.

 --------------------------------------------------------------------------------------- 

5. ADLS Gen2 is the underlying storage for Power BI Dataflows

Power BI Dataflows are a new capability targeted towards reusable, self-service data preparation. The results of queries prepared in the web-based Power Query Online are written to ADLS Gen2. The objective is that the queries and data preparation are handled once and the output is then consumed by numerous Power BI datasets.

Dataflows can be fully managed by Power BI, in which case the ADLS Gen2 account is present but only visible via the Power BI Dataflows user interface. Alternatively, the ‘bring your own storage’ scenario (depicted below) is appropriate for organizations who wish to interact with the data in the data lake via additional tools and compute engines beyond Power BI:

Azure Data Lake ALS gen2

Key takeaway: The storage service behind Power BI Dataflows is ADLS Gen2 and can be an important part of the self-service business intelligence strategy.

  ---------------------------------------------------------------------------------------

6. There are two levels of security in ADLS Gen2

The two levels of security applicable to ADLS Gen2 were also in effect for ADLS Gen1. Even though this is not new, it is worth calling out the two levels of security because it’s a very fundamental piece to getting started with the data lake and it is confusing for many people just getting started.

Azure Data Lake Security

(1) Role-Based Access Control (RBAC). RBAC includes built-in Azure roles such as reader, contributor, owner or custom roles. Typically, RBAC is assigned for two reasons. One is to specify who can manage the service itself (i.e., update settings and properties for the storage account). Another reason is to permit use of the built-in data explorer tools, which require reader permissions.

(2) Access Control Lists (ACLs). Access control lists specify exactly which data objects a user may read, write, or execute (execute is required to browse the directory structure). ACLs are POSIX-compliant, thus familiar to those with a Unix or Linux background.

POSIX does not operate on a security inheritance model, which means that access ACLs are specified for every object. The concept of default ACLs is critical for new files within a directory to obtain the correct security settings, but it should not be thought of as inheritance. Because of the overhead of assigning ACLs to every object, and because there is a limit of 32 ACL entries for every object, it is extremely important to manage data-level security in ADLS Gen1 or Gen2 via Azure Active Directory groups.
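The difference between default ACLs and true inheritance is easier to see in a toy model. The following is a deliberately simplified sketch: it ignores owners, masks, and umask, and it is not the service's API. The `Node` class, helper functions, and the `analysts-group` principal are all hypothetical.

```python
from dataclasses import dataclass, field

MAX_ACL_ENTRIES = 32  # per-object limit cited above


@dataclass
class Node:
    access_acl: dict = field(default_factory=dict)   # principal -> "rwx"-style string
    default_acl: dict = field(default_factory=dict)  # template applied to new children
    children: dict = field(default_factory=dict)


def set_acl(acl: dict, principal: str, perms: str) -> None:
    """Add or update one ACL entry, enforcing the per-object limit."""
    if principal not in acl and len(acl) >= MAX_ACL_ENTRIES:
        raise ValueError("ACL entry limit reached")
    acl[principal] = perms


def create_child(parent: Node, name: str, is_dir: bool = True) -> Node:
    """Copy the parent's default ACL onto the child once, at creation time."""
    child = Node(access_acl=dict(parent.default_acl))
    if is_dir:
        child.default_acl = dict(parent.default_acl)
    parent.children[name] = child
    return child


def can_read(root: Node, path: str, principal: str) -> bool:
    """Execute is needed on every directory traversed; read on the target."""
    node = root
    for part in path.strip("/").split("/"):
        if "x" not in node.access_acl.get(principal, "---"):
            return False
        node = node.children[part]
    return "r" in node.access_acl.get(principal, "---")


root = Node()
set_acl(root.access_acl, "analysts-group", "r-x")
set_acl(root.default_acl, "analysts-group", "r-x")
create_child(root, "2018")                          # picks up r-x from the default ACL
set_acl(root.default_acl, "analysts-group", "---")  # later change to the default ACL
create_child(root, "2019")                          # picks up --- instead

print(can_read(root, "2018", "analysts-group"))
print(can_read(root, "2019", "analysts-group"))
```

The existing `2018` directory keeps the permissions it received when it was created, even after the parent's default ACL changes. Assigning a single group entry rather than many individual users also keeps each object well under the entry limit.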

Key takeaway: Via RBAC and ACLs, there is quite a bit of flexibility for defining security for ADLS Gen2.

---------------------------------------------------------------------------------------

7. Planning for ADLS Gen2 involves multiple levels

There are quite a few considerations when planning for a data lake, particularly if you have numerous data ingestion patterns, different data usage patterns, various types of users, and several tools/languages. Some organizations seek to implement one global data lake, while others utilize a multi-lake approach.

With the introduction of ADLS Gen2, there is one additional level to plan for that was not present previously in ADLS Gen1: the file system. A file system in ADLS Gen2 is the equivalent of a container in the blob service. The levels to be considered during planning include:

  • Account
  • File system(s) within an account
  • Directory structure within a file system

Azure Data Lake Storage Account

A few considerations:

  • Region and geo-replication are account-level properties. If there are multiple data residency requirements and/or different geo-replication requirements, that will need to be satisfied with multiple storage accounts. Alternatively, if you have specific compute engines (like HDInsight or Azure Databricks) which reside in a specific region, the best performance will be gained when the ADLS Gen2 account resides in the same region.
  • The hierarchical namespace is enabled at the account level. Should there be use cases which have no need for the benefits of the hierarchical namespace, that data should reside in a different storage account.
  • Immutable policies and shared access policies are set at the container level for blob storage (so we can expect them to apply at the file system level for an ADLS Gen2-enabled account). Should there be different policies required, that may justify separate file systems.
  • For ACLs, the root in ADLS Gen1 was at the account level, whereas the root in ADLS Gen2 is at the file system level.
  • Power BI Dataflows, discussed in section 5, will require one or more file systems in its integration with the Common Data Model.

Key takeaway: There may be use cases, permissions boundaries, or cost considerations (see section 8) that cause you to consider segregating data beyond one data lake. The file system is a new level which has its own set of properties and should be accounted for when planning.

---------------------------------------------------------------------------------------

8. Pricing for ADLS Gen2 is almost as economical as object storage

Object storage, such as Azure blob storage, is known for being highly economical. Microsoft is releasing ADLS Gen2 at the same storage price as Azure blob storage (i.e., block blob pricing). Following is a simple storage cost pricing example: 

Data Size | Redundancy | Cost
1 terabyte (1,000 GB) | Locally redundant storage (LRS) | ~$18 / month (USD); equates to $.0184 per GB per month
1 terabyte (1,000 GB) | Read-access geo-redundant storage (RA-GRS) | ~$46 / month (USD); equates to $.046 per GB per month


You only pay for the storage that you use; there is not the concept of reserving a specific size.
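The storage figures above are simple multiplication, sketched here so you can plug in your own volumes. The per-GB rates are the ones quoted in this post and will change over time. The function name is hypothetical.

```python
# Per-GB monthly rates quoted in this post (USD); check current pricing before relying on them.
LRS_PER_GB = 0.0184
RA_GRS_PER_GB = 0.046


def monthly_storage_cost(gigabytes: float, rate_per_gb: float) -> float:
    """Estimated monthly storage cost in USD, rounded to cents."""
    return round(gigabytes * rate_per_gb, 2)


lrs = monthly_storage_cost(1_000, LRS_PER_GB)
ra_grs = monthly_storage_cost(1_000, RA_GRS_PER_GB)
print(lrs, ra_grs)
```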

The transaction costs (measured in batches of 10,000 operations and in 4MB sizes) are indeed higher for accounts which have the hierarchical namespace enabled. Following is a highly simplified transaction cost pricing example:

Transaction Type | Tier | Blob Storage Cost | ADLS Gen2 Cost
Write operations (per 10,000) | Hot | $.05 | $.13
Read operations (per 10,000) | Hot | $.0004 | $.0052
Metadata storage (per GB) | Hot | N/A | $.0658 per month


Please refer to the official documentation for more complete pricing details. The FAQs section for ADLS Gen2 pricing has an excellent practical example which contrasts pricing for the flat namespace (i.e., block blob storage) and the hierarchical namespace (i.e., ADLS Gen2).
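For a rough feel of the transaction cost difference, here is a minimal sketch using only the hot-tier rates quoted in the table above. It assumes every operation counts as exactly one billable transaction (the 4MB sizing rule and metadata storage are not modeled), and the workload of 1 million writes and 10 million reads is hypothetical.

```python
# Hot-tier rates per 10,000 operations as quoted in this post (USD).
RATES_PER_10K = {
    "blob": {"write": 0.05, "read": 0.0004},
    "adls_gen2": {"write": 0.13, "read": 0.0052},
}


def transaction_cost(service: str, writes: int, reads: int) -> float:
    """Estimated transaction cost in USD for a given number of operations."""
    rates = RATES_PER_10K[service]
    total = writes / 10_000 * rates["write"] + reads / 10_000 * rates["read"]
    return round(total, 4)


# Hypothetical monthly workload.
blob = transaction_cost("blob", writes=1_000_000, reads=10_000_000)
gen2 = transaction_cost("adls_gen2", writes=1_000_000, reads=10_000_000)
print(blob, gen2)
```

Even with the higher rates, the totals remain small in absolute terms for a workload of this size, which is consistent with the takeaway below.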

Key takeaway: The transaction and metadata storage costs are higher when the hierarchical namespace is enabled for a storage account, while the storage costs are equivalent. Although the transaction costs are still exceedingly economical, workloads that will never take advantage of the hierarchical namespace (HNS) features should reside in a storage account that does not have the HNS enabled.

---------------------------------------------------------------------------------------

9. Azure Data Lake Analytics and U-SQL have an uncertain future

The initial Azure services supported by ADLS Gen2 via the ABFS driver include:

  • Azure Databricks
  • Azure HDInsight
  • Azure Data Factory
  • Azure SQL Data Warehouse (PolyBase)

Third party partner support is emerging as well.

Considering that U-SQL within Azure Data Lake Analytics (ADLA) is not one of the initial services to be supported by ADLS Gen2, that says something about where we should be placing our bets. Microsoft has not announced the future roadmap for ADLA, but we are observing that open source technologies such as Spark appeal to a wider customer base vs. proprietary tools and languages.

I suspect that U-SQL will become supported on ADLS Gen2 once interoperability with the blob API is introduced. However, since first class connectivity from ADLA is not supported with the new ABFS driver in ADLS Gen2 (thus not receiving any of the additional performance benefits as discussed in section 3), we would encourage any customers to be cautious in choosing to use ADLA on future projects.

Key takeaway: Currently there is not a serverless (pay per use) way to execute queries against ADLS Gen2. Azure Databricks and HDInsight are currently the preferred methods for direct querying capabilities.

---------------------------------------------------------------------------------------

10. ADLS Gen1 will be supported for quite some time

All signs indicate that ADLS Gen1 will not be deprecated anytime soon. If you have a large implementation on ADLS Gen1, there is no cause for concern.

If you do wish to migrate from ADLS Gen1 to ADLS Gen2, there are several upgrade strategies. Following are a few key considerations:

  • Migrating data via Azure Data Factory is currently the easiest way to do a one-time data migration, as there is not currently a migration tool available.
  • Defer migration plans if you use PowerShell modules to manage your data lake, or you utilize services that are not yet supported with ADLS Gen2 (for instance, landing data from Azure Stream Analytics).
  • If you have any files in ADLS Gen1 larger than 5TB, they will need to be separated into multiple files before migration.
  • All tool connectivity will need to change from the adl:// addressing scheme to utilize abfs[s]:// connectivity, the new REST APIs, and/or the new SDKs.
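The addressing change in the last bullet can be sketched as a small translation helper, which may be handy when updating connection strings in bulk. It assumes Gen1 addresses of the form `adl://<account>.azuredatalakestore.net/<path>`. The function name, the account names, and the target file system `raw` are hypothetical, and you would choose the target file system as part of your own migration plan.

```python
from urllib.parse import urlsplit

GEN1_SUFFIX = ".azuredatalakestore.net"  # ADLS Gen1 host suffix
GEN2_SUFFIX = ".dfs.core.windows.net"    # ADLS Gen2 DFS endpoint suffix


def gen1_to_gen2_uri(adl_uri: str, gen2_account: str, file_system: str) -> str:
    """Rewrite an adl:// address as an abfss:// address, keeping the path."""
    parts = urlsplit(adl_uri)
    if parts.scheme != "adl" or not (parts.hostname or "").endswith(GEN1_SUFFIX):
        raise ValueError(f"not an ADLS Gen1 URI: {adl_uri}")
    # Gen2 adds a file system level that did not exist in Gen1.
    return f"abfss://{file_system}@{gen2_account}{GEN2_SUFFIX}{parts.path}"


new_uri = gen1_to_gen2_uri(
    "adl://examplegen1.azuredatalakestore.net/sales/2019/02/data.csv",
    gen2_account="examplelake",
    file_system="raw",
)
print(new_uri)
```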

Key takeaway: Migration from ADLS Gen1 is not urgent whatsoever. Brand new implementations should utilize ADLS Gen2 if there are no feature gaps.

---------------------------------------------------------------------------------------

If you're exploring the best Azure solutions for your firm's needs, BlueGranite would love to help. Contact us today for more information.


About The Author

Melissa Coates

Melissa is a Principal Architect with BlueGranite. Her main focus is on client project delivery of data management and analytics solutions. Melissa is a Data Platform MVP and volunteers with the Charlotte BI Group in North Carolina. To learn more about BI, data warehousing, and data lake development, please also visit Melissa’s personal blog at www.sqlchick.com.
