As a follow-up to my blog post Azure Data Lake Store Gen2 is GA, I wanted to give some pointers on using ADLS Gen2 as well as blob storage, as it can get a bit confusing with all the options that are available.
The major features that are missing from ADLS Gen2 are premium tier, soft delete, page blobs, append blobs, and snapshots. The major features that are in preview are archive tier, lifecycle management, and diagnostic logs. Check out all the missing features at Known issues with Azure Data Lake Storage Gen2.
Note that under the covers, ADLS Gen2 uses Azure Blob Storage and is simply a layer over blob storage that provides additional features (i.e. hierarchical file system, better performance, enhanced security, Hadoop compatible access).
Multi-protocol access (MPA) is the big feature that was made available this past November, which allows both the Azure Blob Storage API and Azure Data Lake Storage Gen2 API to access data on a single storage account.
The tips in this blog are focused on interacting with storage via three manually-focused options: the Containers blade in the Azure portal, Storage Explorer (preview) in the portal, and the desktop Azure Storage Explorer.
My top tips:
- For blob storage, you organize a set of files/blobs under a container. In the Azure portal this is located in “Containers” under “Blob service”. Containers are called “Blob Containers” in both the portal and desktop Storage Explorer. For ADLS Gen2, you also use containers, located in the portal in “Containers” under “Data Lake Storage”. They are called “Blob Containers” in the desktop Storage Explorer but “File Systems” in the portal Storage Explorer
- In the Azure portal, for blob storage, you can upload/access files by going to the storage account and choosing Containers (under “Blob service”) or by using the Storage Explorer (preview) in the portal. For ADLS Gen2, Containers (under “Data Lake Storage”) has no functionality except to create a container (click “File system”), but you can use the portal Storage Explorer. However, that is limited (you can’t upload files or change access tiers), so you should use the desktop Azure Storage Explorer to upload files or change access tiers
- There are two types of storage performance tiers: Premium and Standard. The Premium performance tier can’t be changed to the Standard performance tier and vice versa, so this is locked in when you create the storage account
- The Premium performance tier is not yet available for ADLS Gen2, supports only locally redundant storage (LRS), and does not support page blobs (only block and append blobs)
- There are three types of storage access tiers: Hot, Cool, and Archive. You can change access tiers with the Standard performance tier, but not with the Premium performance tier. Only block blobs support access tiers
- When creating a storage account, you will be asked for the Account kind and you should use the default of General purpose v2 (StorageV2) unless you want to create block blobs or append blobs with the premium performance tier in which case you should choose Block Blob (BlockBlobStorage). Note that BlockBlobStorage accounts don’t currently support tiering to hot, cool, or archive access tiers
- A storage access tier can be set for each file. If it is not set, the file defaults to the access tier (Hot or Cool) of the storage account, which acts as the default tier for any file without an explicitly set tier. The Archive access tier can only be set at the file level and not on the account
- The only premium tier available with a General purpose v2 storage account is the premium tier for page blobs (e.g., unmanaged disks)
- For blob storage, you can specify the access tier for a file (hot, cool, or archive) when uploading via the portal, but not when using the desktop Storage Explorer. For ADLS Gen2, there is not a way to upload files via the portal and you also can’t specify the access tier when uploading via the desktop Storage Explorer
- For blob storage, to change an access tier in the Azure portal, go to the container under the storage account, choose the file, and click “Change tier”. Alternatively, in the portal Storage Explorer or the desktop Storage Explorer, right-click the file and choose “Change Access Tier”. For ADLS Gen2, you must use desktop Storage Explorer to change the access tier
- For blob storage, you can specify that the blob type of a file is block, page, or append when uploading the file via the portal or with desktop Storage Explorer. Once a file has been created, its blob type cannot be changed. ADLS Gen2 only supports block blob type
- Data in an Archive tier blob cannot be read until the blob is rehydrated to the Cool or Hot tier. The “standard” rehydration process can take up to 15 hours to complete. There is a priority retrieval (called “high”) that takes less than an hour (see Azure Archive Storage expanded capabilities: faster, simpler, better). You specify the rehydrate priority (standard or high) when choosing to switch from the archive tier in the portal. The option to choose the high priority retrieval is not available in the desktop Storage Explorer and is not available anywhere for ADLS Gen2
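The access tier rules in the tips above can be summarized in a few lines of Python. This is only a sketch of the rules as described in this post, not a call into any Azure SDK: the function names `effective_tier` and `needs_rehydration` are made up for illustration, and nothing here talks to a storage account.

```python
from typing import Optional

# Tiers an account can use as its default (Archive is file-level only).
ACCOUNT_DEFAULT_TIERS = {"Hot", "Cool"}
# Tiers that can be set explicitly on an individual block blob.
BLOB_TIERS = {"Hot", "Cool", "Archive"}


def effective_tier(account_default: str, blob_tier: Optional[str] = None) -> str:
    """Return the tier a file ends up in (hypothetical helper).

    An explicitly set tier wins; otherwise the file falls back to the
    account's default access tier.
    """
    if account_default not in ACCOUNT_DEFAULT_TIERS:
        raise ValueError("account default tier must be Hot or Cool")
    if blob_tier is None:
        return account_default
    if blob_tier not in BLOB_TIERS:
        raise ValueError(f"unknown blob tier: {blob_tier}")
    return blob_tier


def needs_rehydration(tier: str) -> bool:
    """Archive data must be rehydrated to Hot or Cool before it can be read."""
    return tier == "Archive"


if __name__ == "__main__":
    # No explicit tier: the file takes the account default.
    print(effective_tier("Cool"))
    # Explicit Archive tier on a file in a Hot account.
    tier = effective_tier("Hot", "Archive")
    print(tier, needs_rehydration(tier))
```

Keeping the account default and the per-file tier as separate inputs mirrors the way the portal treats them: the account setting is only a fallback, and Archive never appears as an account-level choice.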
Note that this blog focuses on doing things manually, as opposed to using tools such as the Blob REST API, ADLS Gen2 REST API, Azure PowerShell, Azure CLI, storage client libraries such as Python (ADLS) or Python (Blob), or azcopy, which may support more features. But not all tools support all options. For example, when copying a file to storage, you can specify the storage access tier of each file when using the Blob REST API (via the x-ms-access-tier header on the Copy Blob operation) or when using azcopy cp (via the “block-blob-tier” option), but when using the Copy Data activity in Azure Data Factory you can’t specify the access tier, so it defaults to the access tier of the account. See also Connecting to Azure Data Lake Storage Gen2 from PowerShell using REST API – a step-by-step guide.
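To make the two tier-setting options mentioned above concrete, here is a minimal Python sketch that only builds the request headers for a Copy Blob call and the argument list for an azcopy cp command. It does not send anything. The helper names `copy_blob_headers` and `azcopy_cp_args`, the example URLs, and the default `api_version` value are my own assumptions for illustration, and a real request would also need authorization, which is left out here.

```python
from typing import Dict, List, Optional

VALID_TIERS = {"Hot", "Cool", "Archive"}


def copy_blob_headers(
    source_url: str,
    access_tier: Optional[str] = None,
    api_version: str = "2019-12-12",  # assumed value; use whatever version you target
) -> Dict[str, str]:
    """Build headers for a Copy Blob request (hypothetical helper).

    Authorization headers are intentionally omitted from this sketch.
    """
    headers = {
        "x-ms-version": api_version,
        "x-ms-copy-source": source_url,
    }
    if access_tier is not None:
        if access_tier not in VALID_TIERS:
            raise ValueError(f"unknown access tier: {access_tier}")
        # Without this header the copied blob takes the account's default tier.
        headers["x-ms-access-tier"] = access_tier
    return headers


def azcopy_cp_args(source: str, destination: str, tier: Optional[str] = None) -> List[str]:
    """Build the argument list for `azcopy cp` (hypothetical helper)."""
    args = ["azcopy", "cp", source, destination]
    if tier is not None:
        if tier not in VALID_TIERS:
            raise ValueError(f"unknown access tier: {tier}")
        args.append(f"--block-blob-tier={tier}")
    return args


if __name__ == "__main__":
    # Placeholder URLs, not real storage accounts.
    src = "https://example.invalid/source/file.csv"
    dst = "https://example.invalid/target/file.csv"
    print(copy_blob_headers(src, access_tier="Cool"))
    print(" ".join(azcopy_cp_args(src, dst, tier="Archive")))
```

The argument list from `azcopy_cp_args` could be handed to `subprocess.run` once azcopy is installed and you are signed in. Building it as a list rather than a single string avoids quoting problems with URLs that contain SAS tokens.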
The documents I found most helpful for learning about the features of ADLS Gen2 are:
- Introduction to Azure Storage
- Storage account overview (helpful table)
- Azure Blob storage: hot, cool, and archive access tiers (helpful table)
- Understanding block blobs, append blobs, and page blobs
- Performance tiers for block blob storage
- Rehydrate blob data from the archive tier
- Manage the Azure Blob storage lifecycle
- Azure Storage redundancy (helpful table)
An excellent lab to learn the features of ADLS Gen2 can be found at Large-Scale Data Processing with Azure Data Lake Storage Gen2.