Tabular Partitioning Approach

  • The back-end data is a single Hive record set. That record set includes columns that were joined within Hive from Hive tables that I don't have access to. Anyway, that record set will probably change every 10 minutes or so (meaning, there will be new rows added every 10 minutes or so). There will eventually be millions of rows in that record set.

    I am wondering, what is the best approach to partitioning the tabular model, from an architectural perspective?

  • imani_technology (4/26/2016)


    The back-end data is a single Hive record set. That record set includes columns that were joined within Hive from Hive tables that I don't have access to. Anyway, that record set will probably change every 10 minutes or so (meaning, there will be new rows added every 10 minutes or so). There will eventually be millions of rows in that record set.

    I am wondering, what is the best approach to partitioning the tabular model, from an architectural perspective?

    I'd say that if you are seeing data volumes that would require you to think about partitioning, then it might be worth looking at SSAS Multidimensional instead. There are various reasons for this (storage notwithstanding): SSAS Tabular does not support parallel processing of partitions, and queries against partitioned facts can actually be slower (http://blogs.prodata.ie/post/Be-Careful-Partitioning-Analysis-Services-%28Tabular%29.aspx). I don't know that much about Hive, but if your source isn't also partitioned then you will hit processing problems as well.


    I'm on LinkedIn

  • We tried using Multidimensional against our Hadoop/Hive data. It didn't work. The problem was the linked server, which made performance unacceptably slow. That's why we are using Tabular: Tabular doesn't require a linked server object to connect to Hadoop/Hive, so we get better performance than with Multidimensional.

  • By the way, thanks for the article!

  • imani_technology (4/27/2016)


    We tried using Multidimensional against our Hadoop/Hive data. It didn't work. The problem was the linked server, which made performance unacceptably slow. That's why we are using Tabular: Tabular doesn't require a linked server object to connect to Hadoop/Hive, so we get better performance than with Multidimensional.

    You can use an ODBC driver in SSIS to get your data across to a staging area and then process it from there. No linked server required, and you're not processing from a big table each time (it would be fairly simple to get an incremental load using this method; a rough sketch follows at the end of this post). http://m.sqlmag.com/blog/use-ssis-etl-hadoop


    I'm on LinkedIn
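
    A minimal sketch of that kind of incremental pull, assuming a Hive ODBC DSN, a load-timestamp column and a SQL Server staging/watermark table (all of the names below are placeholders, not anything confirmed in this thread):

        # incremental_stage.py - pull only rows newer than the last watermark
        # from Hive (via ODBC) into a SQL Server staging table.
        import pyodbc

        HIVE_DSN   = "DSN=HiveProd"                       # hypothetical Hive ODBC DSN
        SQL_CONN   = ("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=MySqlBox;Database=Staging;Trusted_Connection=yes;")
        BATCH_SIZE = 10_000

        with pyodbc.connect(SQL_CONN) as sql_cn:
            sql_cur = sql_cn.cursor()
            # Last successfully loaded timestamp, kept in a small control table.
            sql_cur.execute("SELECT MAX(load_ts) FROM dbo.etl_watermark")
            watermark = sql_cur.fetchone()[0] or "1900-01-01 00:00:00"

            with pyodbc.connect(HIVE_DSN, autocommit=True) as hive_cn:
                hive_cur = hive_cn.cursor()
                # Parameter binding support varies by Hive ODBC driver, so the
                # watermark is inlined here (str(datetime) gives a Hive-friendly
                # 'YYYY-MM-DD HH:MM:SS' literal).
                hive_cur.execute(
                    f"SELECT id, measure1, measure2, load_ts "
                    f"FROM fact_source WHERE load_ts > '{watermark}'")

                sql_cur.fast_executemany = True           # pyodbc 4.0.19+
                while True:
                    rows = hive_cur.fetchmany(BATCH_SIZE)
                    if not rows:
                        break
                    sql_cur.executemany(
                        "INSERT INTO dbo.fact_staging "
                        "(id, measure1, measure2, load_ts) VALUES (?, ?, ?, ?)",
                        rows)
                sql_cn.commit()

    The staging table can then be the source the tabular partition processes from, so each run only touches the rows that arrived since the previous watermark.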

  • Good idea, but we have new data going into Hadoop/Hive every 10 minutes or so. Are you saying we should use SSIS every 10 minutes or so as well? How would that affect performance?

  • imani_technology (4/27/2016)


    Good idea, but we have new data going into Hadoop/Hive every 10 minutes or so. Are you saying we should use SSIS every 10 minutes or so as well? How would that affect performance?

    I don't know what your architecture is like, but every 10 minutes (though that's pushing it a bit) could be feasible. I think at this point it's about managing client expectations: there is new data every 10 minutes, but do they really need that latency?


    I'm on LinkedIn

  • Actually, new data will probably be constantly flowing into Hadoop. The 10-minute latency is just something the business set up as a requirement.

  • imani_technology (4/27/2016)


    Actually, new data will probably be constantly flowing into Hadoop. The 10-minute latency is just something the business set up as a requirement.

    Okay, so assuming your facts don't change over time (they shouldn't; they're not technically "facts" otherwise, but that's a discussion for another time...), I would probably set something up that does the following:

    1. Create a partition scheme in SSAS that gives the best query performance (for example, if users mostly focus on a single month in the latest year, partition monthly for the current year and open it up to bigger partitions the further back you go).

    2. Create a partition for the current day - make sure none of the other partitions overlap this.

    3. With each new data load, do a process incremental/process index on the current day partition.

    4. At the end of the day, programmatically merge this partition with the latest larger partition and create a new one for the next day's data.

    You'll have to write some funky dynamic XMLA to achieve this (a rough sketch of generating it appears after this post), although I have seen a similar thing done in a script task in SSIS before, so if you're a .NET person then perhaps that'll be easier for you.

    This is all assuming that the business that you work for need to see new data presented to them in a presentation layer every 10 minutes, 24 hours a day, 7 days a week.


    I'm on LinkedIn
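
    For steps 3 and 4, a rough sketch of what that dynamic XMLA generation could look like, assuming ProcessAdd is what is meant by "process incremental" and that the model exposes the usual DatabaseID/CubeID/MeasureGroupID/PartitionID values; every name below is a placeholder, not taken from this thread:

        # build_xmla.py - generate the "funky dynamic XMLA" for the current-day
        # partition.  Real IDs come from the deployed model (visible in SSMS
        # under each object's properties).
        from datetime import date

        DB_ID   = "MyTabularModel"
        CUBE_ID = "Model"
        MG_ID   = "FactTable"

        def process_add(partition_id: str) -> str:
            """XMLA Process command that appends newly staged rows to one partition.
            The partition's source query (or an out-of-line binding) must return
            only rows not yet loaded, otherwise ProcessAdd duplicates them."""
            return f"""
        <Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
          <Type>ProcessAdd</Type>
          <Object>
            <DatabaseID>{DB_ID}</DatabaseID>
            <CubeID>{CUBE_ID}</CubeID>
            <MeasureGroupID>{MG_ID}</MeasureGroupID>
            <PartitionID>{partition_id}</PartitionID>
          </Object>
        </Process>"""

        def merge_into(source_partition: str, target_partition: str) -> str:
            """XMLA MergePartitions command used at the end of the day."""
            return f"""
        <MergePartitions xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
          <Sources>
            <Source>
              <DatabaseID>{DB_ID}</DatabaseID>
              <CubeID>{CUBE_ID}</CubeID>
              <MeasureGroupID>{MG_ID}</MeasureGroupID>
              <PartitionID>{source_partition}</PartitionID>
            </Source>
          </Sources>
          <Target>
            <DatabaseID>{DB_ID}</DatabaseID>
            <CubeID>{CUBE_ID}</CubeID>
            <MeasureGroupID>{MG_ID}</MeasureGroupID>
            <PartitionID>{target_partition}</PartitionID>
          </Target>
        </MergePartitions>"""

        today    = date.today()
        intraday = f"Fact_{today:%Y%m%d}"    # current-day partition
        monthly  = f"Fact_{today:%Y%m}"      # larger partition it merges into

        # Every 10 minutes: append the newly staged rows.
        print(process_add(intraday))
        # End of day: fold the day partition into the monthly one.
        print(merge_into(intraday, monthly))

    Either script could be executed with Invoke-ASCmd, an Analysis Services Execute DDL Task in SSIS, or AMO from a .NET script task, whichever fits the existing scheduling setup.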
