I was recently fortunate enough to join the Azure Advisors group for SQL vNext. I was particularly looking forward to using PolyBase and Native R.
I have also been fortunate to work for a company that invested in an APS (Analytics Platform System), which runs SQL Server 2012 Parallel Data Warehouse. As most people know, this is not an edition of SQL Server you can buy over the counter; it ships exclusively with the APS.
Now anyone who has worked on an APS (not to be confused with its predecessor, the PDW ... there is a subtle difference) will know that it ships with built-in HDInsight functionality and, more specifically, PolyBase.
I am writing this article because I found it strange that all the use cases for PolyBase in the Azure Advisors group revolved around using PolyBase and vanilla T-SQL to bring structured and unstructured data sets together. Yes, that is what the product is being sold as.
I have been using PolyBase with Azure on our APS for about 12 months now, and I think the big use case everyone is overlooking is its ability to replace most SME backup and disaster recovery needs without requiring a new language or skill set.
Something to consider when reading our use case for PolyBase is the cost of an APS. It is a substantial investment for any company, and prior to the advent of Azure SQL Data Warehouse, the recommended approach to disaster recovery was to purchase two appliances: an impractical scenario. I will post another article in the near future discussing our testing on the Azure SQL Data Warehouse private preview; it makes the acquisition of an APS that much more appealing.
Use Case for PolyBase
Some contextual background on our infrastructure: we have two remote offices situated approximately 1,800 km apart in South Africa, a country not known for its blazing internet connectivity. Both sites have their own appliance, and in any given month we transfer about 50TB of data between the two sites.
Configuration of Appliances:
- Site 1 - 2/3 DELL Appliance, No Hadoop Region
- Site 2 - 1/3 DELL Appliance, No Hadoop Region
Most of this data moved between the two sites over a dedicated fibre link, and in the rare case of a looming deadline, it involved a hard drive and someone on a plane. Neither site had a dedicated DR site; the two production environments relied on each other for DR. This was obviously a less than ideal configuration, so we started investigating alternatives. Enter PolyBase and Azure.
With the inclusion of a 5Gbps ExpressRoute connection to Azure, we are now able to ship nearly 50TB of data a month to Azure Blob Storage using simple T-SQL (CREATE EXTERNAL TABLE AS SELECT), compressing the data with the gzip codec, all in a single statement. The second site can then re-import that data (given they have the FORMAT) using a simple CREATE TABLE AS SELECT from the "external" table, which is in fact just a collection of flat files sitting in Azure Blob Storage. The beauty is that PolyBase generates the metadata to create a table-like representation of the data.
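For readers who have not seen PolyBase on an APS, the export step looks roughly like the sketch below. All object names, the container URL, and the field terminator are hypothetical, and it assumes the appliance's Hadoop connectivity level and storage account key have already been configured on the appliance.

```sql
-- Hypothetical names throughout; assumes Azure blob connectivity
-- is already configured on the appliance.

-- Point PolyBase at an Azure Blob Storage container.
CREATE EXTERNAL DATA SOURCE AzureBlobDS
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://backups@mystorageaccount.blob.core.windows.net'
);

-- Delimited text, gzip-compressed on write.
CREATE EXTERNAL FILE FORMAT GzipDelimited
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|'),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

-- Export: a single CETAS statement writes the table out to blob
-- storage as compressed flat files, in parallel across the nodes.
CREATE EXTERNAL TABLE dbo.FactSales_Ext
WITH (
    LOCATION = '/FactSales/',
    DATA_SOURCE = AzureBlobDS,
    FILE_FORMAT = GzipDelimited
)
AS SELECT * FROM dbo.FactSales;
```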
Essentially, this has allowed us to store all our data in Azure Blob Storage and recreate it in the APS within a matter of minutes. We are able to move approximately 100GB of data from the on-premises appliance to Azure in 10 minutes, and re-import it on another appliance.
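The re-import on the second appliance is symmetrical. In this sketch the external table name, the target table, and its distribution column are all hypothetical; it assumes the second site has created the same external data source, file format, and external table definition (the FORMAT) over the files the first site exported.

```sql
-- Hypothetical names; assumes dbo.FactSales_Ext is an external table
-- defined over the gzip flat files the other site wrote to blob storage.
-- CTAS reads the files through PolyBase and materialises them as a
-- local, hash-distributed table on the appliance.
CREATE TABLE dbo.FactSales
WITH (DISTRIBUTION = HASH(SaleKey))
AS
SELECT * FROM dbo.FactSales_Ext;
```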
The speed we get over ExpressRoute, the relatively low cost of Azure Storage, and the ease of accessing this data from any site using essentially standard T-SQL make this a no-brainer for us. It is all thanks to PolyBase.
Added to this, with the advent of Azure SQL Data Warehouse we can now create a data warehouse in minutes and reload all our data from the existing blob storage using the exact same T-SQL we use to transfer data for the on-premises solution.
It really has been a fantastic journey, and I think the future of on-prem, cloud, structured and unstructured is extremely exciting and will only make the APS more compelling. It should be all the more exciting now that PolyBase is coming to SQL vNext, which means we could see a much greater level of integration, automation and data movement through Azure between all our data stores.
I will try to post some more in-depth technical articles on how we automated this process through PowerShell, if anyone is interested.