One skinny table or many wide tables?

  • This is driving me nuts, and we can't generate enough data to really test for sure - we need about 30 billion rows and it would take about a week to generate. So, I'm looking for theory on this question...

    I have a crap load of telemetry data to deal with, and we're trying to build a "generic" tool so we can provide a web interface to query the data, create dashboards, that kind of thing. My users are pretty basic, but I want to show them how to generate intelligence from their data instead of just graphing it. A typical query today is to pull multiple data series, graph them, and then just make decisions based on the visual look of the graph - I want my users to be smarter than that. If they are looking for outliers or statistically interesting stuff, they shouldn't have to graph all of history and zoom in and out on it for a day; they should be able to ask the database to show them the answer and get it right away.

    Problem one: we have only one, very simple use case to work with. A user queries the database and pulls multiple telemetry items for a time period.

    Problem two: the existing software has resulted in a certain view of the data, which is culturally well understood, but that view doesn't represent the data for what it really is... it complicates a simple situation. People will still want to be able to view the data in the old way, though, so we can't eliminate the ability to do that.

    I have multiple systems which spew telemetry all day and night, possibly for the next few years. Each system reports things in different ways. For example we have one system that spits out a single number 10 times per second, and another system that spits out 6 related numbers 10 times per second. Maintaining the relationship is wicked important, so if a system produces packets of 6 numbers each, we need to make sure those packets can always be reassembled so you can see the 6 numbers side by side.

    Here's an example of what we have now...

    create table HeaterTelemetry (
        HEATER_PACKETS0 uniqueidentifier,  --the packet ID
        FK_T_LOG_FILES uniqueidentifier,   --the source log file ID
        TLM_TIME datetime,                 --the time of the packet
        Temperature float                  --the packet data
    )

    create table AxisTelemetry (
        AXIS_PACKETS0 uniqueidentifier,
        FK_T_LOG_FILES uniqueidentifier,
        TLM_TIME datetime,
        X_AXIS float,
        Y_AXIS float,
        Z_AXIS float,
        X_ANGLE float,
        Y_ANGLE float,
        Z_ANGLE float
    )

    And there are about 200 tables like that, one for each telemetry source, even though there are only 9 or 10 types of telemetry sources... if we have two heaters, we have two different tables.

    So, I was originally against this idea, but I think I want to roll all this up into one table that looks like:

    create table SkinnyTableTest (
        LogFileId int not null,
        [Time] datetime not null,
        TargetId int not null,
        MnemonicId int not null,
        StringValue varchar(200),
        FloatValue float
    )

    There are metadata tables behind this, so you can look up the TargetId by the name of the target in another table. MnemonicId is now required because we're putting multiple telemetry sources in the same table - it's basically the data that was represented by the table names in the old schema. Telemetry sources might produce character data such as "I received a command: CMDNAME" while other sources produce floating point values like the measured temperature or position of something, so I do need both StringValue and FloatValue - I'm against the idea of storing everything as a string because most of this data will be subject to numerical analysis.
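
    To make that concrete, the metadata side would be something along these lines - the table and column names here are just illustrative, not the real schema:

    create table Targets (
        TargetId int not null primary key,
        TargetName varchar(100) not null     --e.g. 'Heater 1'
    )

    create table Mnemonics (
        MnemonicId int not null primary key,
        MnemonicName varchar(100) not null   --e.g. 'Temperature', 'X_AXIS'
    )

    --pull one named series for a time window
    select s.[Time], s.FloatValue
    from SkinnyTableTest s
    join Targets t on t.TargetId = s.TargetId
    join Mnemonics m on m.MnemonicId = s.MnemonicId
    where t.TargetName = 'Heater 1'
    and m.MnemonicName = 'Temperature'
    and s.[Time] >= '20120801' and s.[Time] < '20120802'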

    I know that articles have been written about this choice, but I am having a major problem finding them! We can't do a reasonable A/B test because we can't generate enough data to show what's really going to happen down the road when we might have 30 billion data points.

    My question to you all is: how do I figure out which way to go? Do I refuse to make a decision until I have better end user requirements? Do I build both schemas and plan to dump one down the road when we figure it out? I'm looking for ideas here...

  • Seems like you should wait the week it will take to generate the 30 billion rows and then see what happens. Also, make sure you use NEWID() if you're going to make those uniqueidentifier columns into indexes.

  • AlBran LVP (8/17/2012)


    Seems like you should wait the week it will take to generate the 30 billion rows and then see what happens. Also, make sure you use NEWID() if you're going to make those uniqueidentifier columns into indexes.

    Those GUIDs come in from the telemetry machines, they aren't generated by SQL Server.

  • Are you using them as clustered indexes?

  • You might look up the EAV (Entity-Attribute-Value) model. There are lots of good opinions and articles about the complexities and pitfalls of that paradigm. Tough to optimize, join complexity can get out of hand quickly, and declarative RI is very difficult, just to mention a few.

    Why not just take the 10 different types and aggregate them into 10 tables, one per type? The rows within each type would all share the same attributes, and it would still result in a simpler design.
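
    Just as a rough sketch of what one of those tables might look like - names and types here are only placeholders, not your real schema - a single heater table could carry an identifier for which machine produced the row:

    create table HeaterTelemetry (
        TargetId int not null,            --which heater produced this row
        PacketId uniqueidentifier,        --the packet ID from the telemetry machine
        LogFileId uniqueidentifier,       --the source log file ID
        TLM_TIME datetime,
        Temperature float
    )

    Every heater then lands in the same table instead of getting its own, and the same pattern repeats for the other 8 or 9 types.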

    Are your users querying this while the data is being updated, or is the reporting done from a copy somewhere? Concurrency might be an issue once the data is aggregated.


    And then again, I might be wrong ...
    David Webb

  • Never having dealt with quite that much data, I can't say for sure, but I would lean towards multiple tables. Either way, I would also look into partitioning for performance.

    Kimberly Tripp has done a number of good articles & blogs on both partitioning and indexing. Here is a good one on partitioning.

    http://www.sqlmag.com/blog/sql-server-questions-answered-28/sql-server-2008-r2/partitioned-tables-viewsndashwhy-142158

    Kenneth Fisher
    I was once offered a wizards hat but it got in the way of my dunce cap.
    For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
    Link to my Blog Post --> www.SQLStudies.com

  • David Webb-200187 (8/17/2012)


    Why not just take the 10 different types and aggregate them into 10 tables, one per type? The rows within each type would all share the same attributes, and it would still result in a simpler design.

    That's what I'd really like to do, but I'm not sure if it's actually possible. A table for each type of machine would be perfect. Getting away from the massive number of tables is what I want.

  • Check out the EAV stuff I mentioned before, or better yet, try to reproduce a complex current query with the one-table design - it may become obvious how quickly the complexity multiplies.
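
    For example, just putting the six axis values from one packet back side by side takes a pivot in the one-table design. This is only a sketch against the SkinnyTableTest posted above, and it assumes a Mnemonics lookup table holds the names:

    select s.LogFileId, s.[Time], s.TargetId,
        max(case when m.MnemonicName = 'X_AXIS'  then s.FloatValue end) as X_AXIS,
        max(case when m.MnemonicName = 'Y_AXIS'  then s.FloatValue end) as Y_AXIS,
        max(case when m.MnemonicName = 'Z_AXIS'  then s.FloatValue end) as Z_AXIS,
        max(case when m.MnemonicName = 'X_ANGLE' then s.FloatValue end) as X_ANGLE,
        max(case when m.MnemonicName = 'Y_ANGLE' then s.FloatValue end) as Y_ANGLE,
        max(case when m.MnemonicName = 'Z_ANGLE' then s.FloatValue end) as Z_ANGLE
    from SkinnyTableTest s
    join Mnemonics m on m.MnemonicId = s.MnemonicId
    group by s.LogFileId, s.[Time], s.TargetId

    Compare that to a plain select from AxisTelemetry in the current design, and that's before any of the statistical work gets layered on top.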

    As suggested above, partitioning is probably something you should consider. Having a separate reporting environment would be a good idea also. Do your users have to have near real-time data, or is yesterday's or one-hour-old data good enough?
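
    A date-based partition on the time column is the usual approach. A rough sketch follows - the boundary dates are placeholders and everything is mapped to PRIMARY just for illustration (and keep in mind table partitioning needs Enterprise Edition):

    create partition function pfTlmTime (datetime)
    as range right for values ('20120701', '20120801', '20120901')

    create partition scheme psTlmTime
    as partition pfTlmTime all to ([PRIMARY])

    --the skinny table (or each per-type table) would then be created on psTlmTime([Time])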

    In your current scenario, your I/O is scattered across a lot of tables. In a new scenario, those I/Os would be funneled into fewer objects so careful placement of data, indexes and logs would be in order to make sure the disks don't become a bottleneck.

    Just out of curiosity, how many rows per second does this system update? Does it run 24X7? What hardware are you running it on? Do you have a test environment to try this on? How big is the database? What disk system are you using (locally attached or SAN)?


    And then again, I might be wrong ...
    David Webb

  • Thanks David, that's exactly what I needed. I'll look into that information and do some more testing.

  • You should be fine with a single table provided you cluster the table by:

    ( [Time], MnemonicId ) --if I understand MnemonicId correctly, as the original telemetry data source
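
    For example, against the test table posted above (the index name is just a placeholder):

    create clustered index CL_SkinnyTableTest
    on SkinnyTableTest ( [Time], MnemonicId )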

    Uniqueidentifiers are much slower than ints, so avoid them if you can (and it's easy enough based on what you've specified so far). They also make horrible index keys, since they are essentially random.

    You may indeed need to partition, into historic and current data, if the current data gets too fragmented. Not likely but it's possible.

    SQL DBA, SQL Server MVP (07, 08, 09)
    A socialist is someone who will give you the shirt off *someone else's* back.
