PolyBase Optimization Question

  • What are some good approaches to optimizing large data loads of compressed CSV files in this data flow:

    Data Source -> Blob Storage -> Polybase -> External Tables -> Azure Data Warehouse Tables

    One of the biggest hurdles here is the sheer number of files you end up reading through when querying the external tables into the Azure tables. PolyBase reads every file under the external table's location, whereas with common ETL tools like SSIS you can filter each file as it's loaded. For example, I can take a 2 GB file and filter data from it into a staging table within 10 seconds using conditional splits. Something similar can be done with BULK INSERT and T-SQL.

    There are a number of approaches I'm thinking about. One is to filter the data ahead of time and parallel load it based on a partition key I define on disk. Another would be to have two blob storage accounts and/or containers, one for staging and the other for archiving: only push what I need from archive to staging for each incremental load, then delete the files from staging once they're loaded (I've sketched roughly what I mean below).

    The biggest hurdle is the fact that most of this is extremely easy to solve incrementally on-premises with SSIS or stored-procedure-based ETL solutions. I'm still diving into the complexities of Data Factory, the APIs and various other third-party tools, a lot of which seem to be centered around on-premises solutions pushing/filtering data to the cloud, which is what I'm trying to avoid.
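
    For the staging-container idea, here's a minimal sketch of what I have in mind. All of the names (StagingBlob, GzipCsv, dbo.SalesStagingExt, the columns) are placeholders, and it assumes the database-scoped credential for the storage account already exists:

    -- Point PolyBase at a "staging" container that only holds the files
    -- for the current incremental load; archive lives in another container.
    CREATE EXTERNAL DATA SOURCE StagingBlob
    WITH (
        TYPE = HADOOP,
        LOCATION = 'wasbs://staging@mystorageaccount.blob.core.windows.net',
        CREDENTIAL = BlobStorageCredential  -- assumed to already exist
    );

    CREATE EXTERNAL FILE FORMAT GzipCsv
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"'),
        DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
    );

    -- PolyBase reads every file under LOCATION, so keeping only the files
    -- to be loaded in this container keeps each load's scan small.
    CREATE EXTERNAL TABLE dbo.SalesStagingExt
    (
        CustomerId   INT,
        EventTimeUtc DATETIME2(3),
        MetricValue  DECIMAL(18, 4)
    )
    WITH (
        DATA_SOURCE = StagingBlob,
        LOCATION = '/2017/02/24/',
        FILE_FORMAT = GzipCsv
    );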

  • A couple of things you can do to enhance performance on the data load. You can break the files up into 60 chunks to coincide with the 60 processes that do the data loading (although I'm told PolyBase now does this split internally, our testing still showed the manual split as superior). You can also try to avoid compressed files, although that has implications for the upload, since uncompressed files take longer to transfer.

    Since you're using PolyBase, you're already using the fastest load method, and the one that will go parallel. Also, make sure you're loading the table with the optimal distribution, either round-robin or hash, out of the gate. If the table isn't used in joins, or it's for staging or temp work, then round-robin is best. Likewise, if you don't have a good hash key that will evenly distribute the data, I'd suggest round-robin. There's a rough sketch of the two options below.

    Also, avoid small data loads, because they'll go into the columnstore delta store first. Basically, you want the load to arrive in chunks of at least 102,400 rows per distribution, or 6.144 million rows per batch, because there are 60 distributions under the covers. The advice we've got from Microsoft is to do big loads, because the readers and writers increase as you scale.
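
    Something like this, purely as an illustration (table and column names are placeholders, and the external table is assumed to already be defined over your blob files):

    -- Round-robin: fine for staging/temp tables, or when no column hashes evenly.
    CREATE TABLE dbo.SalesStage
    WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
    AS
    SELECT CustomerId, EventTimeUtc, MetricValue
    FROM dbo.SalesStagingExt;  -- hypothetical external table over the blob files

    -- Hash: better for large tables that are joined or aggregated on the key,
    -- provided the key spreads the rows evenly across the 60 distributions.
    CREATE TABLE dbo.FactSales
    WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
    AS
    SELECT CustomerId, EventTimeUtc, MetricValue
    FROM dbo.SalesStage;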

    That's about all I've got from my notes.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for the feedback.

    I was able to talk with a PolyBase team member today. We went over some of the optimization approaches I've taken so far that seem to have worked.

    Chunks

    The data is already in hourly chunks. We were running tests loading 23 hours at a time and saw some performance increases doing it that way, moving the data from one container to another after each load so PolyBase doesn't read all the files in the directory on the next load. I can try 60-file chunks to see what happens.

    Hash

    We hashed on our customer ID, which seemed to work well. Prior to that we had nothing specified and it defaulted to round-robin, based on the query plans I saw.
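
    As a quick sanity check on the hash key, one way to look at rows per distribution is something like the following (table name is a placeholder); a heavily skewed ROWS column would suggest the key isn't distributing evenly:

    -- Rows and space used per distribution for the hashed table.
    DBCC PDW_SHOWSPACEUSED("dbo.FactSales");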

    Model

    The PolyBase rep we talked to said one optimization would be to optimize the model. It seems like a no-brainer, but we had padded the fields a great deal just to get the data into an internal table. That, and the fact that we got confused by truncation errors on the normal model, when in fact the problem was just that PolyBase does not support header rows (why!!??!). We have some very large nvarchar fields that are 3,000+ in length (we don't control the data source) and whose contents can vary. We also have very wide files, with 40+ fields in the data.

    The PolyBase rep said that no matter whether a value has 1 character or 4,000 characters, if the field is defined with a length of 4,000 then PolyBase is going to treat every record as if it has 4,000 characters (as expected). So, unless you know for sure you need that much, optimize, optimize, optimize.

    When I tested the results of ignoring the larger fields and copying only most of the metric-based fields, we saw significant improvements. The moment I included just one long field, performance dropped 3x. So, once I right-sized those larger fields and selected only the fields I needed, as opposed to slamming everything over, the time to copy went from 20 minutes to 1 minute 30 seconds per 23-hour chunk at DWU 400.
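
    In rough terms, the change looked like the sketch below (column names and widths are made up for illustration, reusing the hypothetical StagingBlob data source and GzipCsv format from my earlier sketch): size the external table's columns to what the source can actually send, and only pull the needed columns through the CTAS.

    CREATE EXTERNAL TABLE dbo.SalesRawExt
    (
        CustomerId   INT,
        EventTimeUtc DATETIME2(3),
        MetricValue  DECIMAL(18, 4),
        CustomerName NVARCHAR(200),   -- was padded to NVARCHAR(4000) before
        Notes        NVARCHAR(3000)   -- genuinely variable, so it keeps its width
    )
    WITH (
        DATA_SOURCE = StagingBlob,
        LOCATION = '/2017/02/24/',
        FILE_FORMAT = GzipCsv
    );

    -- Only the metric-based fields come across; the wide text columns stay behind.
    CREATE TABLE dbo.SalesMetricsStage
    WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
    AS
    SELECT CustomerId, EventTimeUtc, MetricValue
    FROM dbo.SalesRawExt;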

    I'm going to start testing higher DWUs for loading only and try your suggestions. Hopefully I can get this faster per chunk.
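
    For the scale-up test, the plan is roughly the following, run against the master database (the database name and DWU levels are just examples):

    -- Scale up for the load window, then back down afterwards.
    ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW1000');
    -- ...run the PolyBase load...
    ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW400');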

  • Oh, one more thing, because you probably know more about this than I do.

    I assume SELECT INTO versus INSERT INTO behaves similarly in Azure SQL Data Warehouse to the way it does in SQL Server when it comes to performance?

  • xsevensinzx - Friday, February 24, 2017 9:50 PM

    Oh, one more thing, because you probably know more about this than I do.

    I assume SELECT INTO versus INSERT INTO behaves similarly in Azure SQL Data Warehouse to the way it does in SQL Server when it comes to performance?

    I don't know. That's a great question.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Yeah, the performance is much worse with INSERT INTO versus SELECT INTO. Now I have to decide whether or not to constantly create new tables for staging only.
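
    For anyone who finds this later, the pattern I'm weighing looks roughly like this (names are placeholders, carried over from my hypothetical sketches above). As I understand it, CTAS is the SQL DW analogue of SELECT INTO and is the parallel, minimally logged path, whereas inserting into an existing table is the slower one:

    -- Drop and recreate the staging table each load via CTAS.
    IF OBJECT_ID('dbo.SalesMetricsStage') IS NOT NULL
        DROP TABLE dbo.SalesMetricsStage;

    CREATE TABLE dbo.SalesMetricsStage
    WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
    AS
    SELECT CustomerId, EventTimeUtc, MetricValue
    FROM dbo.SalesRawExt;

    -- The alternative: loading into a table that already exists, which has
    -- been noticeably slower in my testing.
    INSERT INTO dbo.SalesMetricsPersistent (CustomerId, EventTimeUtc, MetricValue)
    SELECT CustomerId, EventTimeUtc, MetricValue
    FROM dbo.SalesRawExt;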
