Partition vs Distribution - Parallel Data Warehouse MPP

  • Hi,

    I have an MPP PDW hardware scenario of 4 Compute nodes, each having 8 Distributions, totalling to 32 distributions.

    I have a table that grows at the rate of 3 million per day.

    My question is, if I partition my table on Date, I believe that REPLICATE is a better performant design than HASH Distribution, because - Partition is done at a higher level, and Distribution is done within EACH partition.  So, it is advisable to Replicate a 3 million mini-table, than Hash Distributing it across Compute nodes.

    please share your thoughts

    thank you

  • Well, partitioning by date is good if your queries take advance of the partition keys. In your case, that would be queries that filter on date. It's also good to use partitioning like this if you are needing to switch data for example.

    That being said, when you query the table with a partition and using the partitions within the query, this will improve query performance because it's going to read only the partitions of data it needs. Thus, if those datasets are hashed, it will perform worse than replicate simply because hashing is better for extremely large datasets where replicate is better for smaller datasets. The only issue you will run into is replicate needs to be preloaded before you query it.

    So, everything you said sounds good to me for both maintenance of the table and performance.

  • thank you.

    Are Azure SQL Data Warehouse and APS (Analytical Platform System) built on same MPP PDW architecture (nodes, distributions etc) ?

    If so, while subscribing to Azure, can we opt for a variation of nodes / distributions etc depending on our budget affordability and processing requirements?

    thanks

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply