• Thanks for the feedback.

    I was able to talk with a Polybase team member today. We went over some of my optimization approaches thus far that seemed to have worked.

    Chunks

    The data is already in hourly chunks. We were running tests with 23 hours at a time and saw some performance gains from loading 23 hours per batch, then moving the data from one container to another so Polybase doesn't read every file in the directory on each load. I can try 60-file chunks to see what happens.
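
    To make the chunking idea concrete, here is a minimal Python sketch that groups hourly files into fixed-size batches, so each load only has to point at one batch. The file names and the `chunk_files` helper are hypothetical and only for illustration; they are not our actual pipeline, and the container moves themselves are left out.

```python
from typing import Iterator, List


def chunk_files(files: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` files."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(files), size):
        yield files[start:start + size]


# Hypothetical hourly file names covering 46 hours of data.
hourly_files = [f"events_{hour:03d}.csv" for hour in range(46)]

# 23 files per batch, matching the 23-hour loads described above.
batches = list(chunk_files(hourly_files, 23))
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} files, {batch[0]} .. {batch[-1]}")
```

    Switching to 60-file chunks would just mean passing 60 as the size.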

    Hash

    We hashed on our customer ID, which seemed to work well. Previously, we had not specified a distribution, and it defaulted to round robin, based on the query plans I saw.
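
    For anyone wondering why the hash matters, here is a simplified Python sketch of the difference between hash and round-robin placement. It assumes the 60 distributions that SQL Data Warehouse uses per table; the `crc32` hash is only a stand-in (the engine's real hash function is internal), and the rows and customer IDs are made up.

```python
import zlib
from collections import defaultdict

DISTRIBUTIONS = 60  # each table is spread across 60 distributions


def hash_distribution(customer_id: str) -> int:
    # crc32 is a stand-in; the engine's actual hash function is not public.
    return zlib.crc32(customer_id.encode("utf-8")) % DISTRIBUTIONS


def round_robin_distribution(row_number: int) -> int:
    # Simplified model: rows are dealt out to distributions in turn.
    return row_number % DISTRIBUTIONS


# Hypothetical rows: (customer_id, metric)
rows = [(f"cust-{i % 5}", i) for i in range(100)]

hash_placement = defaultdict(set)
rr_placement = defaultdict(set)
for row_number, (customer_id, _metric) in enumerate(rows):
    hash_placement[customer_id].add(hash_distribution(customer_id))
    rr_placement[customer_id].add(round_robin_distribution(row_number))

for customer_id in sorted(hash_placement):
    print(customer_id,
          "hash ->", len(hash_placement[customer_id]), "distribution(s);",
          "round robin ->", len(rr_placement[customer_id]), "distribution(s)")
```

    With the hash, every row for a given customer lands in the same distribution, so joins and aggregations on customer ID need less data movement than with round robin.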

    Model

    The Polybase rep we talked to said one optimization would be to optimize the model. Seems like a no-brainer, but we padded the fields a great deal just to get the data into an internal table. Part of that was because we got confused by the truncation errors on the normal model when, in fact, it's just that Polybase does not support headers (why!!??!). We have some very large nvarchar fields that are 3000+ in length (we don't control the data source) and that can vary. We also have very wide files, with 40+ fields in the data.

    The Polybase rep said that whether a value has 1 character or 4,000 characters, if the field is declared with a length of 4,000 then Polybase is going to treat every record as if it has 4,000 characters (as expected). So, unless you know for sure you need that much, optimize, optimize, optimize.
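
    A quick way to see how much the padding costs is to add up the declared widths. This Python sketch does that for two made-up models; the column names and lengths are hypothetical (not our real schema), and it only counts 2 bytes per declared nvarchar character, ignoring any per-row overhead.

```python
# Declared column lengths in characters (hypothetical schema, not our real model).
padded_model = {"customer_id": 50, "event_type": 4000, "payload": 4000, "notes": 4000}
tuned_model = {"customer_id": 20, "event_type": 50, "payload": 3500, "notes": 500}

BYTES_PER_NVARCHAR_CHAR = 2  # nvarchar(n) allows up to 2 bytes per declared unit of length


def declared_row_bytes(model: dict) -> int:
    """Worst-case bytes per row if every nvarchar column is treated as full length."""
    return sum(length * BYTES_PER_NVARCHAR_CHAR for length in model.values())


padded = declared_row_bytes(padded_model)
tuned = declared_row_bytes(tuned_model)
print(f"padded model: {padded} bytes per row")
print(f"tuned model:  {tuned} bytes per row")
print(f"reduction:    {padded / tuned:.1f}x")
```

    Even with one large field left at 3,500, trimming the others cuts the declared row size to roughly a third in this example.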

    When I tested ignoring the larger fields and copying mostly just the metric-based fields, we saw significant improvements. The moment I included just one large field, performance got 3x worse. So, once I optimized those larger ones and selected only the fields I needed, as opposed to slamming everything over, the time to copy went from 20 minutes to 1:30 per 23 hours of data at 400 DWU.
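
    For reference, the arithmetic behind those numbers, as a small Python sketch. It assumes "1:30" means 1 minute 30 seconds; the `to_seconds` helper is just for illustration.

```python
def to_seconds(minutes: int, seconds: int = 0) -> int:
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


before = to_seconds(20)      # 20 minutes per 23-hour chunk
after = to_seconds(1, 30)    # assumption: "1:30" is 1 minute 30 seconds
hours_per_chunk = 23

speedup = before / after
print(f"speedup: {speedup:.1f}x")
print(f"data hours loaded per minute, before: {hours_per_chunk / (before / 60):.2f}")
print(f"data hours loaded per minute, after:  {hours_per_chunk / (after / 60):.2f}")
```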

    I'm going to start testing higher DWUs for loading only and try your suggestions. Hopefully I can start getting this faster per chunk.