
Feb 11, 2019

Azure Data Lake Storage Gen2: 10 Things You Need to Know

Posted by Melissa Coates

Updated on October 1, 2019

Azure Data Lake Storage (ADLS) Gen2 reached general availability on February 7, 2019, and has continued to evolve and mature since then. This post will help you understand its advantages and what you need to know to get started. If you would like to become more familiar with the concepts of a data lake, please also check out our eBook: Data Lakes in a Modern Data Architecture.

10 Things to Know about Azure Data Lake Storage Gen2

1. The data lake story in Azure is unified with the introduction of ADLS Gen2

Prior to the introduction of ADLS Gen2, when we wanted cloud storage in Azure for a data lake implementation, we needed to decide between Azure Data Lake Storage Gen1 (formerly known as Azure Data Lake Store) and Azure Storage (specifically blob storage). This involved weighing the business and technical requirements against the features available in order to decide which service to use. While ADLS Gen1 offers optimizations that are important for analytic workloads, along with more granular security (see section 3 for details), Azure Storage has built-in features like geo-redundancy, hot/cool/archive tiers, additional metadata, and broader regional availability, which are very compelling. In the past, we either accepted some trade-offs or stored the data twice in certain situations.

The new ADLS Gen2 service is built upon Azure Storage as its foundation. When the hierarchical namespace (HNS) property is enabled (see section 2 for details), an otherwise standard, general purpose V2, storage account becomes ADLS Gen2. For this reason, you will not see ADLS Gen2 listed in Azure as its own service – since ADLS Gen1 is its own service, this shift has been confusing for many people. There are a couple of ways to verify if ADLS Gen2 is enabled for a storage account:

When viewing the Azure Storage account, if the file system service is displayed this indicates that ADLS Gen2 is supported:

[Image: Azure Data Lake Storage account]

Or, when viewing the Azure Storage account configuration properties, if the hierarchical namespace (HNS) is enabled, this indicates that ADLS Gen2 is supported:

[Image: Azure Data Lake configuration properties]
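The same check can be scripted. The following is a minimal sketch that assumes the account properties have already been exported as JSON, and that the export carries `kind` and `isHnsEnabled` fields (these field names and the sample values are assumptions for illustration, so verify them against the output of whichever tool you use):

```python
import json


def is_adls_gen2(account: dict) -> bool:
    """Return True when a general purpose V2 account reports HNS as enabled."""
    return account.get("kind") == "StorageV2" and account.get("isHnsEnabled") is True


# Hypothetical account description; the name and values are made up.
sample = json.loads(
    '{"name": "examplelake", "kind": "StorageV2", "isHnsEnabled": true}'
)

print(is_adls_gen2(sample))
```

An account description without the `isHnsEnabled` field is treated as a standard storage account rather than ADLS Gen2.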

Key takeaway: When we need a data lake in Azure for an analytics project, we no longer need to choose between multiple independent services. Azure Storage, with the hierarchical namespace enabled, is now the service of choice for building a data lake using Azure cloud storage.

---------------------------------------------------------------------------------------

2. ADLS Gen2 converges the worlds of object storage and hierarchical file storage

Fundamentally, ADLS Gen2 is seeking to take advantage of file system benefits without giving up the type of scalability and cost-effectiveness available with an object store:

[Image: Azure Data Lake file storage]

Note that full feature support for ADLS Gen2 is still evolving, as discussed in section 4. The following diagram represents the longer-term vision:

[Image: Azure Data Lake Storage diagram]

The dark blue shading represents new features introduced with ADLS Gen2.

The three new areas depicted above include:

(1) File System. There is a terminology difference with ADLS Gen2. The concept of a container (from blob storage) is referred to as a file system in ADLS Gen2.

(2) Hierarchical Namespace. The hierarchical namespace (HNS), coupled with the DFS endpoint, is what enables the performance and security improvements, which are discussed in section 3.

(3) DFS Endpoint and File System Driver. ADLS Gen2 utilizes the ABFS driver, which is part of Apache Hadoop. For connectivity to ADLS Gen2, the ABFS driver utilizes the DFS endpoint to invoke performance and security optimizations.

  • ABFS = Azure Blob File System
  • DFS = Distributed File System 
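To make the two endpoints concrete, here is a small sketch of how the addresses are typically composed. The account name `examplelake`, the file system `raw`, and the helper function names are hypothetical and used only for illustration:

```python
def blob_endpoint(account: str) -> str:
    """Object store (blob) endpoint for a storage account."""
    return f"https://{account}.blob.core.windows.net"


def dfs_endpoint(account: str) -> str:
    """DFS endpoint used for hierarchical namespace operations."""
    return f"https://{account}.dfs.core.windows.net"


def abfs_uri(account: str, file_system: str, path: str, secure: bool = True) -> str:
    """Build the URI that the ABFS driver uses to address a file or directory."""
    scheme = "abfss" if secure else "abfs"  # abfss = ABFS over TLS
    return f"{scheme}://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


print(blob_endpoint("examplelake"))
print(dfs_endpoint("examplelake"))
print(abfs_uri("examplelake", "raw", "/sales/2019/01.csv"))
```

Note that the file system (the container) appears before the `@` in the ABFS URI, while the account name is part of the host.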


Key takeaway: The longer-term vision (depicted in the image above), which includes full interoperability between the object store model and the file system model, will allow us to store the data once and access it multiple ways depending on the use case. This is referred to as multi-protocol access.

--------------------------------------------------------------------------------------- 

3. ADLS Gen2 has significant performance and security advantages for analytical workloads

Both the object store model (such as Azure blob storage) and the hierarchical file system model (ADLS Gen1 and Gen2) are compatible with HDFS (Hadoop Distributed File System). This is achieved with drivers that implement server-side HDFS semantics to translate into remote storage APIs, allowing ADLS Gen2 to behave very similarly to native HDFS. However, there are important distinctions between object storage and hierarchical file system storage in terms of performance and security.

With object storage, folders are virtual only. Although it appears like we can create folders in object storage, they are just mimicked within the URI string (or sometimes metadata is used as an alternative). Although that might initially seem trivial, it has the following implications:

(1) Query Performance. When sending a query that is only retrieving a subset of data, with a hierarchical file system like ADLS Gen2 it is possible to leverage partition scans for data pruning (predicate pushdown). This can improve query performance dramatically for compute engines that understand how to take advantage of partition scans.
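The idea of pruning by directory can be sketched in a few lines. This is a simplified model, not how any particular compute engine is implemented; the `key=value` directory layout and the file names are assumptions chosen for illustration:

```python
from typing import Dict, List

# Hypothetical lake layout: one directory per year/month partition.
LAKE: Dict[str, List[str]] = {
    "sales/year=2018/month=12": ["part-0.parquet", "part-1.parquet"],
    "sales/year=2019/month=01": ["part-0.parquet"],
    "sales/year=2019/month=02": ["part-0.parquet", "part-1.parquet"],
}


def partition_values(directory: str) -> Dict[str, str]:
    """Parse key=value path segments, e.g. 'year=2019' -> {'year': '2019'}."""
    return dict(seg.split("=", 1) for seg in directory.split("/") if "=" in seg)


def prune(lake: Dict[str, List[str]], **wanted: str) -> List[str]:
    """Return only the files whose directory matches every requested partition value."""
    files: List[str] = []
    for directory, names in lake.items():
        values = partition_values(directory)
        if all(values.get(key) == value for key, value in wanted.items()):
            files.extend(f"{directory}/{name}" for name in names)
    return files


selected = prune(LAKE, year="2019", month="02")
print(selected)
```

Only the files under the matching directory are returned, so a query filtered on year and month never has to open the other partitions.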


(2) Data Load Performance. Sometimes it is necessary to rename files or relocate files from one directory to another.

With the object store driver, directory operations are not handled as efficiently, because an object store has no true rename: each object must be copied to its new name and then deleted. If a Temp directory held 10,000 files, relocating them to their permanent directory would involve 10,000 copy operations and 10,000 delete operations, resulting in 20,000 calls.

Conversely, with a file system like ADLS Gen2, when connecting through the DFS endpoint this is a metadata-only operation. This results in significantly improved performance for the data load, particularly at higher data volumes.

In addition to improving performance, metadata-only operations are ultimately more cost-effective because fewer compute engine resources are required.
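The difference in call counts for the 10,000-file example can be illustrated with a toy model. This is only a sketch: the flat store is modeled as a dictionary keyed by full object name, the hierarchical store as nested dictionaries, and each relocated object is assumed to cost one call to write it under the new name plus one call to delete the old one. The function names are hypothetical:

```python
def move_flat(store: dict, src_prefix: str, dst_prefix: str) -> int:
    """Flat namespace: every object under the prefix is rewritten, then deleted."""
    calls = 0
    for key in [k for k in store if k.startswith(src_prefix)]:
        store[dst_prefix + key[len(src_prefix):]] = store[key]  # write under new name
        calls += 1
        del store[key]  # remove the old object
        calls += 1
    return calls


def move_hierarchical(root: dict, src: str, dst: str) -> int:
    """Hierarchical namespace: one directory entry is re-pointed (metadata only)."""
    root[dst] = root.pop(src)
    return 1


flat = {f"temp/file{i}.csv": b"" for i in range(10_000)}
tree = {"temp": {f"file{i}.csv": b"" for i in range(10_000)}}

flat_calls = move_flat(flat, "temp/", "final/")
tree_calls = move_hierarchical(tree, "temp", "final")
print(flat_calls, tree_calls)
```

The flat model scales with the number of files, while the hierarchical model stays at a single operation regardless of how many files the directory holds.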

(3) Data Consistency via Atomic Operations. Continuing with the previous example of 10,000 files to be moved, the object store driver does not support atomic operations. If a failure occurred, the data could remain in an inconsistent state. Conversely, a file system like ADLS Gen2 does support atomic operations, via the DFS endpoint, which improves data consistency because the entire operation will succeed or fail as a unit.
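A toy simulation shows why this matters. In the sketch below, a failure is injected partway through a file-by-file move (the `TransientError` class, the function names, and the `fail_after` parameter are all hypothetical), and the single-step directory move is modeled as either happening or not happening:

```python
class TransientError(Exception):
    """Simulated failure, e.g. a dropped connection."""


def move_file_by_file(store: dict, src: str, dst: str, fail_after=None) -> None:
    """Flat namespace: move objects one at a time; a failure leaves a partial result."""
    moved = 0
    for key in sorted(k for k in store if k.startswith(src)):
        if fail_after is not None and moved == fail_after:
            raise TransientError("simulated failure mid-move")
        store[dst + key[len(src):]] = store.pop(key)
        moved += 1


def move_directory(tree: dict, src: str, dst: str, fail: bool = False) -> None:
    """Hierarchical namespace: a single step that either completes or does nothing."""
    if fail:
        raise TransientError("simulated failure before the rename")
    tree[dst] = tree.pop(src)


store = {f"temp/file{i}.csv": b"" for i in range(5)}
try:
    move_file_by_file(store, "temp/", "final/", fail_after=2)
except TransientError:
    pass
in_temp = sum(k.startswith("temp/") for k in store)
in_final = sum(k.startswith("final/") for k in store)
print(in_temp, in_final)  # files are now split across both directories

tree = {"temp": {f"file{i}.csv": b"" for i in range(5)}}
try:
    move_directory(tree, "temp", "final", fail=True)
except TransientError:
    pass
print(sorted(tree))  # still only the original directory
```

After the injected failure, the flat store holds some files in each location and a downstream reader would see incomplete data, whereas the directory-level move leaves the original directory fully intact.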

(4) Granular Security at the Directory and File Level. The hierarchical file system of ADLS Gen2 (and Gen1) is POSIX-compliant. Access control lists (ACLs) can be defined at the directory and file level to define granular security, which offers much-needed flexibility for controlling data-level security. 


See section 6 for additional details about managing security in ADLS Gen2.

Key takeaway: Enabling the hierarchical namespace for an Azure Storage account, along with usage of the ABFS driver for connectivity, is what facilitates file system optimizations which affect performance, data consistency, and security.

 --------------------------------------------------------------------------------------- 

4. Feature support in ADLS Gen2 is still evolving

Although ADLS Gen2 is designated as generally available, there are still quite a few planned features which are being introduced over time. As is the norm with technology vendors, Microsoft introduces features to market as quickly as possible then iterates to the point of maturity. The initial focus for ADLS Gen2 is supporting the modern data warehouse and advanced analytics scenarios.

Key takeaway: Multi-protocol data access (as depicted in the diagram in section 2) is a critical capability which is still evolving. When it arrives, it will provide significant flexibility to land the data using whichever endpoint is preferred (e.g., to support an unchanged or legacy application or service) and use the new endpoint for analytical processing to gain the performance advantages.

 --------------------------------------------------------------------------------------- 

5. ADLS Gen2 is the underlying storage for Power BI Dataflows

Power BI dataflows are a new capability targeted towards reusable, self-service data preparation. The output from queries prepared in the web-based Power Query Online is written to ADLS Gen2. The objective is that the queries and data preparation are handled once and the results are then consumed by numerous Power BI datasets.

Dataflows can be fully managed by Power BI, in which case the ADLS Gen2 account is present but only visible via the Power BI dataflows user interface. Alternatively, the ‘bring your own storage’ scenario (depicted below) is appropriate for organizations that wish to interact with the data in the data lake via additional tools and compute engines beyond Power BI:

[Image: ‘bring your own storage’ scenario for Power BI dataflows with ADLS Gen2]

Key takeaway: The storage service behind Power BI dataflows is ADLS Gen2, which means the data lake can be an important part of the self-service business intelligence strategy.

  ---------------------------------------------------------------------------------------

6. There are two levels of security in ADLS Gen2

The two levels of security applicable to ADLS Gen2 were also in effect for ADLS Gen1. Even though this is not new, the two levels are worth calling out because they are fundamental to working with the data lake and are confusing for many people who are just getting started.


(1) Role-Based Access Control (RBAC). RBAC includes built-in Azure roles such as reader, contributor, owner or custom roles. Typically, RBAC is assigned for two reasons. One is to specify who can manage the service itself (i.e., update settings and properties for the storage account). Another reason is to permit use of the built-in data explorer tools, which require reader permissions.

(2) Access Control Lists (ACLs). Access control lists specify exactly which data objects a user may read, write, or execute (execute is required to browse the directory structure). ACLs are POSIX-compliant, thus familiar to those with a Unix or Linux background.
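The role of the execute permission is easiest to see in a small model. The sketch below is a deliberate simplification that looks only at named entries for one principal and ignores owners, masks, and superusers; the paths, the `analysts` group, and the helper names are hypothetical:

```python
from typing import Dict, List

# Hypothetical ACL entries: path -> {principal: "rwx"-style permission string}.
ACLS: Dict[str, Dict[str, str]] = {
    "/": {"analysts": "--x"},
    "/sales": {"analysts": "--x"},
    "/sales/2019.csv": {"analysts": "r--"},
    "/hr": {"analysts": "---"},
    "/hr/salaries.csv": {"analysts": "r--"},
}


def has(acls: Dict[str, Dict[str, str]], path: str, principal: str, bit: str) -> bool:
    """True when the principal's entry on the path contains the given permission bit."""
    return bit in acls.get(path, {}).get(principal, "---")


def ancestors(path: str) -> List[str]:
    """All directories from the root down to the parent of the path."""
    parts = path.strip("/").split("/")[:-1]
    dirs = ["/"]
    for i in range(len(parts)):
        dirs.append("/" + "/".join(parts[: i + 1]))
    return dirs


def can_read(acls: Dict[str, Dict[str, str]], path: str, principal: str) -> bool:
    """Reading a file needs execute on every ancestor directory and read on the file."""
    traversable = all(has(acls, d, principal, "x") for d in ancestors(path))
    return traversable and has(acls, path, principal, "r")


print(can_read(ACLS, "/sales/2019.csv", "analysts"))
print(can_read(ACLS, "/hr/salaries.csv", "analysts"))
```

The second file has a read entry for the group, but it is still unreachable because the group lacks execute on the `/hr` directory.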

POSIX does not operate on a security inheritance model, which means that access ACLs are specified for every object. The concept of default ACLs is critical for new files within a directory to obtain the correct security settings, but it should not be thought of as inheritance. Because of the overhead of assigning ACLs to every object, and because there is a limit of 32 ACL entries for every object, it is extremely important to manage data-level security in ADLS Gen1 or Gen2 via Azure Active Directory groups.
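The difference between default ACLs and inheritance can be sketched with a toy model. In this simplified sketch (which ignores umask, masks, and owners), a child receives a copy of the parent's default ACL at creation time only; the `Node` class, the group names, and the file names are hypothetical:

```python
class Node:
    """A directory or file with an access ACL and, for directories, a default ACL."""

    def __init__(self, access_acl, default_acl=None):
        self.access_acl = dict(access_acl)
        self.default_acl = dict(default_acl or {})
        self.children = {}

    def create_child(self, name, is_dir=False):
        # The child's access ACL is a COPY of the parent's default ACL taken
        # at creation time. Child directories also receive the default ACL.
        child = Node(self.default_acl, self.default_acl if is_dir else None)
        self.children[name] = child
        return child


raw = Node(access_acl={"engineers": "rwx"}, default_acl={"analysts": "r--"})

old_file = raw.create_child("jan.csv")
raw.default_acl["analysts"] = "---"  # the default ACL is changed later
new_file = raw.create_child("feb.csv")

print(old_file.access_acl)
print(new_file.access_acl)
```

The file created before the change keeps the permissions it was given at creation, which is why changing a default ACL does not retroactively secure existing files; their access ACLs must be updated individually.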

Fortunately, the ACLs for both directories and files are enforced regardless of which multi-protocol access point is used to access the data.

Key takeaway: Via RBAC and ACLs, there is quite a bit of flexibility for defining security for ADLS Gen2.

---------------------------------------------------------------------------------------

7. Planning for ADLS Gen2 involves multiple levels

There are quite a few considerations when planning for a data lake, particularly if you have numerous data ingestion patterns, different data usage patterns, various types of users, and several tools/languages. Some organizations seek to implement one global data lake, while others utilize a multi-lake approach.

With the introduction of ADLS Gen2, there is one additional level to plan for that was not present previously in ADLS Gen1: the file system. A file system in ADLS Gen2 is the equivalent of a container in the blob service. The levels to be considered during planning include:

  • Account
  • File system(s) within an account
  • Directory structure within a file system


A few considerations:

  • Region and geo-replication are account-level properties. If there are multiple data residency requirements and/or different geo-replication requirements, that will need to be satisfied with multiple storage accounts. Alternatively, if you have specific compute engines (like HDInsight or Azure Databricks) which reside in a specific region, the best performance will be gained when the ADLS Gen2 account resides in the same region.
  • The hierarchical namespace is enabled at the account level. Should there be use cases which have no need for the benefits of the hierarchical namespace, that data should reside in a different storage account.
  • Immutable policies and shared access policies are set at the container level for blob storage (so we can expect them to apply at the file system level for an ADLS Gen2-enabled account). Should there be different policies required, that may justify separate file systems.
  • For ACLs, the root in ADLS Gen1 was at the account level, whereas the root in ADLS Gen2 is at the file system level.
  • Power BI dataflows, discussed in section 5, will require one or more file systems in their integration with the Common Data Model.

Key takeaway: There may be use cases, permissions boundaries, or cost considerations (see section 8) that cause you to consider segregating data beyond one data lake. The file system is a new level which has its own set of properties and should be accounted for when planning.

---------------------------------------------------------------------------------------

8. Pricing for ADLS Gen2 is almost as economical as object storage

Object storage, such as Azure blob storage, is known for being highly economical. With respect to the direct storage cost, Microsoft has released ADLS Gen2 at the same price as Azure blob storage (i.e., block blob pricing). You only pay for the storage that you use; there is not the concept of reserving a specific size.

However, the transaction costs are somewhat higher for storage accounts which have the hierarchical namespace enabled. Transaction costs are usually measured in batches of 10,000.
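The arithmetic behind batch-based pricing is straightforward. The sketch below uses placeholder prices that are NOT real Azure prices, and the function and constant names are hypothetical; substitute the current figures from the official pricing page:

```python
def transaction_cost(operations: int, price_per_10k: float) -> float:
    """Cost of a number of storage operations priced per batch of 10,000."""
    return operations / 10_000 * price_per_10k


# Placeholder prices for illustration only; NOT real Azure prices.
FLAT_PRICE_PER_10K = 0.05
HNS_PRICE_PER_10K = 0.065

monthly_operations = 2_000_000
flat_cost = transaction_cost(monthly_operations, FLAT_PRICE_PER_10K)
hns_cost = transaction_cost(monthly_operations, HNS_PRICE_PER_10K)

print(f"flat namespace:         {flat_cost:.2f}")
print(f"hierarchical namespace: {hns_cost:.2f}")
```

Running the same workload through both price points makes it easy to judge whether the hierarchical namespace premium matters for a given transaction volume.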

Please refer to the official documentation and the online pricing calculator for more complete pricing details. The FAQs section for ADLS Gen2 pricing has an excellent practical example which contrasts pricing for the flat namespace (i.e., block blob storage) and the hierarchical namespace (i.e., ADLS Gen2).


Key takeaway: The transaction and metadata storage costs are higher when the hierarchical namespace is enabled for a storage account, while the storage costs are equivalent. Although the transaction costs are still exceedingly economical, workloads that will never take advantage of the hierarchical namespace (HNS) features should reside in a storage account that does not have the HNS enabled.

---------------------------------------------------------------------------------------

9. Azure Data Lake Analytics and U-SQL have an uncertain future

The initial Azure services supported by ADLS Gen2 via the ABFS driver include:

  • Azure Databricks
  • Azure HDInsight
  • Azure Data Factory
  • Azure SQL Data Warehouse (PolyBase)

Third party partner support is emerging as well.

Considering that U-SQL within Azure Data Lake Analytics (ADLA) is not one of the initial services to be supported by the optimized ABFS driver, that says something about where we should be placing our bets. Microsoft has not announced the future roadmap for ADLA, but we are observing that open source technologies such as Spark appeal to a wider customer base vs. proprietary tools and languages.

We would encourage any customers to be cautious in choosing to use ADLA on future projects.

Key takeaway: Currently there is not a serverless (pay per use) way to execute queries against ADLS Gen2. Azure Databricks and HDInsight are currently the preferred methods for direct querying capabilities.

---------------------------------------------------------------------------------------

10. ADLS Gen1 will be supported for quite some time

All signs indicate that ADLS Gen1 will not be deprecated anytime soon. If you have a large implementation on ADLS Gen1, there is no cause for immediate concern.

If you do wish to migrate from ADLS Gen1 to ADLS Gen2, there are several upgrade strategies. Following are a few key considerations:

  • Migrating data via Azure Data Factory is currently the easiest way to do a one-time data migration, as there is not currently a migration tool available.
  • If you have any files in ADLS Gen1 larger than 5TB, they will need to be separated into multiple files before migration.
  • Any references which utilize the adl:// addressing scheme will need to be changed to utilize abfs[s]:// connectivity, the new REST APIs, and/or the new SDKs.
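The addressing change in the last bullet can be sketched as a simple URI rewrite. This assumes the Gen1 paths follow the `adl://<account>.azuredatalakestore.net/<path>` form and that the target Gen2 account and file system are already known; the function name, account names, and file system name below are hypothetical:

```python
from urllib.parse import urlparse


def adl_to_abfss(adl_uri: str, account: str, file_system: str) -> str:
    """Rewrite an ADLS Gen1 adl:// URI to an ADLS Gen2 abfss:// URI.

    The Gen2 account and file system must be supplied, because Gen1 has
    no file system level to map from.
    """
    parsed = urlparse(adl_uri)
    if parsed.scheme != "adl":
        raise ValueError(f"not an adl:// URI: {adl_uri}")
    return f"abfss://{file_system}@{account}.dfs.core.windows.net{parsed.path}"


old_uri = "adl://oldlake.azuredatalakestore.net/sales/2019/jan.csv"
new_uri = adl_to_abfss(old_uri, account="newlake", file_system="raw")
print(new_uri)
```

Because the ACL root moves from the account level in Gen1 to the file system level in Gen2, choosing the target file system for each path is a planning decision rather than a mechanical one.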

Key takeaway: Migration from ADLS Gen1 is not urgent, but you should migrate if it is practical to do so. Brand new implementations should utilize ADLS Gen2 if there are no feature gaps.

---------------------------------------------------------------------------------------

If you're exploring the best Azure solutions for your firm's needs, BlueGranite would love to help. Contact us today for more information.


About The Author

Melissa Coates

Melissa is a Principal Architect with BlueGranite. Her main focus is on client project delivery of data management and analytics solutions. Melissa is a Data Platform MVP and volunteers with the Charlotte BI Group in North Carolina. To learn more about BI, data warehousing, and data lake development, please also visit Melissa’s personal blog at www.sqlchick.com.
