Polybase Optimization Question

xsevensinzx
What are some good approaches to optimizing large data loads of compressed CSV files in this data flow:

Data Source -> Blob Storage -> Polybase -> External Tables -> Azure Data Warehouse Tables

One of the biggest hurdles here is that you have loads of documents to filter through when querying external tables into Azure tables. This means you're reading every document, whereas with common ETL tools like SSIS you can filter each file as it's loaded. For example, I can take a 2 GB document and filter data from it into a staging table within 10 seconds using conditional splits. Something similar can be done with BULK INSERT and T-SQL.
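For what it's worth, the on-premises pattern I mean looks roughly like this (table and file names are made up):

-- Load one raw file into a staging heap, then filter on the way into
-- the real staging table -- the T-SQL analog of a conditional split.
BULK INSERT dbo.RawStage
FROM 'D:\incoming\extract_2017022301.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK);

INSERT INTO dbo.Staging (CustomerId, EventDate, Metric1)
SELECT CustomerId, EventDate, Metric1
FROM dbo.RawStage
WHERE EventDate >= '20170223';   -- keep only the slice I actually need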

There are a number of approaches I'm thinking about. One is to try to filter the data ahead of time and parallel-load it based on a partition key I define on disk. Another would be to have two blob storage accounts and/or containers and use one for staging and the other for archiving: only push what I need from archive to staging to load incrementally, then delete the files from staging once loaded.
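Roughly what I have in mind for the staging/archive split (account, container, and object names are placeholders, and the database scoped credential is assumed to already exist):

-- External data source pointed at the staging container only, so
-- PolyBase never scans the archived files.
CREATE EXTERNAL DATA SOURCE StagingBlob
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://staging@mystorageacct.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

CREATE EXTERNAL FILE FORMAT GzipCsv
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ','),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

CREATE EXTERNAL TABLE ext.Events (
    CustomerId INT,
    EventDate  DATETIME2,
    Metric1    DECIMAL(18, 4)
)
WITH (LOCATION = '/events/', DATA_SOURCE = StagingBlob, FILE_FORMAT = GzipCsv);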

The biggest hurdle is that most of this is extremely easy to solve on-premises, incrementally, with SSIS or stored-procedure-based ETL solutions. I'm still diving into the complexities of Data Factory, the APIs, and various other third-party tools, many of which seem to be centered on on-premises solutions pushing/filtering data to the cloud, which I'm trying to avoid.
Grant Fritchey
A couple of things you can do to enhance performance on the data load. You can break the files into 60 chunks to coincide with the 60 processes that do the data loading (although I'm told PolyBase does this internally now, our testing still showed pre-splitting to be superior). You can also try to avoid compressed files, although that has implications for uploading, since uncompressed files take longer to transfer. Since you're using PolyBase, you're already using the fastest method, and the one that will go parallel.

Also, make sure you're loading the table with the optimal distribution, either round-robin or hash, out of the gate. If the table isn't used in joins, or it's for staging or temp work, then round-robin is best. Likewise, if you don't have a good hash key that will evenly distribute the data, I'd suggest round-robin.

Finally, avoid small data loads, because the rows will land in the delta rowgroup first. Basically, you want each load to be at least 102,400 rows per distribution, or 6.144 million rows total, because there are 60 distributions under the covers. The advice we've gotten from Microsoft is to do big loads, because readers and writers increase with scale.
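A rough sketch of the two distribution choices, with invented names:

-- Staging: round-robin heap, so the load doesn't pay for hashing
-- or index maintenance on the way in.
CREATE TABLE dbo.Stage_Events
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT * FROM ext.Events;

-- Final table: hash-distributed on a column with enough distinct,
-- evenly spread values to avoid skew across the 60 distributions.
CREATE TABLE dbo.Fact_Events
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.Stage_Events;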

That's about all I've got from my notes.

----------------------------------------------------
The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood...
Theodore Roosevelt

The Scary DBA
Author of: SQL Server Query Performance Tuning and SQL Server Execution Plans
Product Evangelist for Red Gate Software
xsevensinzx
Thanks for the feedback.

I was able to talk with a PolyBase team member today. We went over some of the optimization approaches that have seemed to work so far.

Chunks

The data is already in hourly chunks. We ran tests loading 23 hours at a time and saw some performance increases that way, moving the data from one container to another after each load so PolyBase doesn't read every file in the directory on the next one. I can try 60-file chunks to see what happens.

Hash

We hashed on our customer ID, which seemed to work well. Previously we had nothing specified, and it defaulted to round-robin based on the query plans I saw.
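One way to sanity-check the hash key, in case it helps anyone else: DBCC PDW_SHOWSPACEUSED returns one row per distribution, so large differences in the row counts mean the key is skewed.

DBCC PDW_SHOWSPACEUSED('dbo.Fact_Events');   -- table name is just an example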

Model

The PolyBase rep we talked to said one optimization would be to optimize the model. Seems like a no-brainer, but we padded the fields a great deal just to get the data into an internal table. On top of that, we were confused by truncation errors on the normal model, when in fact the problem was just that PolyBase does not support header rows (why!!??!). We have some very large nvarchar fields, 3,000+ characters long (we don't control the data source), whose lengths can vary. We also have very wide files, with 40+ fields in the data.

The PolyBase rep said that whether a value has 1 character or 4,000, if the field is declared with a length of 4,000 then PolyBase is going to treat every record as if it has 4,000 characters (as expected). So unless you know for sure you need that much, optimize, optimize, optimize.
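The fix on our side was essentially just tightening the external table definition (the lengths here are illustrative, not our real schema):

-- Before: everything padded to NVARCHAR(4000) just to get the data in.
-- After: sized to what the source can actually send.
CREATE EXTERNAL TABLE ext.Events (
    CustomerId   INT,
    CustomerName NVARCHAR(200),
    Notes        NVARCHAR(3000),   -- the one genuinely large field
    Metric1      DECIMAL(18, 4)
)
WITH (LOCATION = '/events/', DATA_SOURCE = StagingBlob, FILE_FORMAT = GzipCsv);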

When I tested ignoring the larger fields and copying only most of the metric-based fields, we saw significant improvements. The moment I included just one large field, performance dropped 3x. Once I right-sized the larger ones and selected only the fields I needed, as opposed to slamming everything over, the copy time went from 20 minutes to 1:30 per 23-hour chunk at 400 DWU.
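The copy itself just names the columns instead of using SELECT * (again simplified):

CREATE TABLE dbo.Stage_Events
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT CustomerId, EventDate, Metric1   -- metric fields only; skip the wide nvarchars
FROM ext.Events;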

I'm going to start testing higher DWUs for loading only and try your suggestions. Hopefully I can get this faster per chunk.
xsevensinzx
Oh, one more thing, because you probably know more about this.

I assume SELECT INTO versus INSERT INTO behaves similarly in Azure Data Warehouse as it does in SQL Server when it comes to performance?
Grant Fritchey
xsevensinzx - Friday, February 24, 2017 9:50 PM
Oh, one more thing, because you probably know more about this.

I assume SELECT INTO versus INSERT INTO behaves similarly in Azure Data Warehouse as it does in SQL Server when it comes to performance?

I don't know. That's a great question.


xsevensinzx
Yeah, performance is much worse with INSERT INTO versus SELECT INTO. Now I have to decide whether or not to constantly create new tables for staging only.
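One pattern I'm considering, for anyone following along (object names invented): build the new staging table with CTAS each load, then swap it in with RENAME OBJECT rather than truncating and re-inserting into a persistent table.

CREATE TABLE dbo.Stage_Events_New
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT CustomerId, EventDate, Metric1 FROM ext.Events;

-- Swap the freshly built table in and drop the old one.
RENAME OBJECT dbo.Stage_Events TO Stage_Events_Old;
RENAME OBJECT dbo.Stage_Events_New TO Stage_Events;
DROP TABLE dbo.Stage_Events_Old;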