Best way to deal with millions of very short rows?

  • I'm fairly confident in my SQL design skills, but I thought I'd better throw my thoughts "out there" to check that this is the best way to deal with my new employer's database. It needs a bit of tweaking, as it was built quickly to cope with the company's growth (isn't that always the case?).

    Scenario: My workplace provides about 3,000 systems to clients, each with between 8 and 20 sensors. When the sensors are triggered, the controller sends data back to the server (at most every 5 minutes).

    The sensor data is loaded into tables: one table PER system, with the rows taking the following approximate format, where the dataItem values are what was sent from the controller:

    Table: system_x

    custom timestamp value[day+time](int) | systemID(smallint) | dataItem1(tinyint) | dataItem2(tinyint) | dataItem3(tinyint) | dataItem4(tinyint)

    (And yes, I know a systemID column in a table that is already named after the system is redundant; that has to be normalised out! 🙂 )

    To date, we have 350,000,000+ rows across roughly 3,000 tables, with the biggest table at a shade over 1,600,000 rows. Performance isn't currently an issue (the server is beefy!), but with new sensors being developed, plans for real-time reporting and more customers, we could see substantial growth in the data. The idea of having 5,000+ tables in the future doesn't appeal either... I would rather tackle this now!

    The data is displayed on charts for the clients: we don't do any analysis of it, so it's a write-once, read-many situation.

    After doing some calculations on the database size, it appears that almost half of the 7.5GB of data is consumed by row headers (4 bytes each) and NULL bitmaps (a minimum of 3 bytes per row, as far as I understand). It certainly doesn't seem optimal.

    My Plan

    Normalise data older than X days into a new table (the system has to continue logging into the existing tables for now, due to the way the system works).

    Table: tblArchivedData

    dateReceived (date) | systemID (smallint) | dataForTheDay (varbinary) = {custom timestamp | dataItem1 | dataItem2 | dataItem3 | dataItem4}..., with the curly-braced section repeated for each record in the original table that contains data for that particular day and system.

    From some quick observations and back-of-the-envelope maths, I reckon I can trim 2GB from the database size without loss of data, and the new table would have only 1–5% of the row count of the originals (350m rows down to 3.5m rows). Retrieving a day's chart would only require finding the matching date and systemID and reading out the varbinary (perhaps for the adjacent days as well). We only have SQL Server Web Edition, but we could always partition manually by feeding data into a table per month/quarter/year should the data REALLY grow!
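
    Roughly, the archive table and a day's retrieval would look something like the sketch below (the varbinary(MAX) sizing and the choice of clustered key are my assumptions, not settled decisions):

    CREATE TABLE tblArchivedData
    (
        dateReceived date NOT NULL,
        systemID smallint NOT NULL,
        dataForTheDay varbinary(MAX) NOT NULL, -- packed {custom timestamp|dataItem1..4} records for the day
        CONSTRAINT PK_tblArchivedData PRIMARY KEY CLUSTERED (systemID, dateReceived)
    )

    -- Fetching a day's chart (plus the adjacent days) becomes a narrow seek on the clustered key
    SELECT dateReceived, dataForTheDay
    FROM tblArchivedData
    WHERE systemID = @systemID
      AND dateReceived BETWEEN DATEADD(DAY, -1, @chartDate) AND DATEADD(DAY, 1, @chartDate)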

    With the above in mind:

    1) Are there any better solutions?

    2) Are there any pitfalls of this method I might find later down the road?

    Many thanks in advance if you have taken the time and read all the way to this point 🙂

  • No need to worry about the total number of rows stored across all your tables; performance is a function of the tables included in a particular query, the quality of the query, and the quality of the indexing strategy.

    Having said that, a 1.6 million-row table - as stated, the largest one - is a tiny table.

    The point here is: how are the queries hitting those tables? What answers do the users want?

    If user requirements translate into accessing large tables by index, then the number of rows stored in such a table is rarely a factor affecting performance.

    Do user queries need to read all 1.6 million rows in your table?

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • Hi, thanks for your feedback.

    Yes, the tables are probably quite small compared to some of the other databases out there; we're not pushing SQL Server to its extremes, that's for certain!

    Once the data is inserted, the only queries are SELECTs that retrieve a single system's data between two of our custom date values (the int values) representing the day and time (a custom epoch). This means returning rows WHERE the datetime is between X and Y.

    A day with average activity is approximately 600 rows; sometimes people view a whole week (~4,200 rows). We may build longer-range reports in the future. The date range selected is continuous at the moment, but I can't rule out that in the future we might want to do more complex things, like retrieving data from weekends only.

    Admittedly, these queries will always be quick on an indexed field. 🙂
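
    For context, the retrieval is essentially just this shape (customTimestamp is a placeholder name here for our custom int field, which I'm assuming is the indexed column):

    -- Hypothetical example against one per-system table; in reality there is one such table per system
    SELECT customTimestamp, dataItem1, dataItem2, dataItem3, dataItem4
    FROM system_x
    WHERE customTimestamp BETWEEN @rangeStart AND @rangeEnd -- ~600 rows for a day, ~4,200 for a week
    ORDER BY customTimestamp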

    I suppose this is more of an issue for long-term administration than performance: a table for every system is not ideal. From some back-of-the-envelope maths, we could reduce the logging system to one table covering the last 24/48 hours (replacing the existing 3,000 tables) to maintain compatibility with the log-receiving software, with a further archive table of approximately 3,500,000 rows, indexed by system and date.

    It may be helpful to know that we retrieve the information using PHP to generate a "Gantt chart" of sorts that shows time against each sensor's activity. Currently each chart is generated using static HTML divs, but in the future we plan to render it in an "infinitely scrollable" JavaScript timeline widget, where we would pre-fetch the data just off to the left and right of the chart and feed it to the chart as JSON on the fly.

    Another thought: would it be faster to retrieve one record for the next/previous day and unpack it using PHP, rather than looping through a record set of X rows to build the data to pass to the chart?

    Again, thanks for the constructive feedback.

  • The archive table is even less normalized than the current format. Doing it this way will definitely reduce your options in the future.

    For example, it may not be a requirement now, but what if, 6 months down the road, client demand means you need to provide an average value for each sensor for each day?

    How would you do that in this design? It would be an extremely difficult and expensive operation.

    You don't do any analysis now, but that can change. Right now the only option your clients would have is to have the interface parse it out so they could import it into their own systems for analysis. Or force the front-end developers to compensate for this design.

    Your company may be better served with one "Systems" table rather than thousands of system_xxxx tables. It sounds like they needed to support systems with varying numbers of sensors and didn't understand the correct way to do that (or they decided to go with the poor man's partitioned table, but the amount of per-customer maintenance that would require makes that unlikely).

    Some important questions:

    Are there system "types"?

    For example, type 1 has sensors A, B, and C, while type 2 has those AND sensors D, E, and F.

    Type 3 has sensors X,Y and Z.

    x number of systems are type 1, y number of systems are type 2, etc.

    How many different types of sensors do you have for all systems? Is every sensor unique?

    Can a single system have more than one sensor of the same type? (Are there "types" of sensors?) Do some sensors have more than one datapoint?

    Do the datatypes the sensors store vary? (Some floating point values, some tinyints, etc.)

    Is that epoch timestamp necessary? Or can it be replaced with a smalldatetime? (accurate to 1 minute, 4 bytes of storage)

    In the current format, is it actually "DataItem1, DataItem2", or were you simply filling in the blanks for more specific datapoints?

    I would think a more normalized design would make things more flexible for you. Row count is not the only metric to consider when looking into how well it would perform. In most cases a more normalized approach would use less storage and/or perform better. (3NF is considered to be the minimum that a database should adhere to in most cases)

    Also, storage is cheap nowadays. A 7.5GB database is still quite small. Twice that is still small. The additional storage overhead of having the database be 50% "row headers" is offset by the fact that having that data there improves performance and flexibility by much more than that. That's not to say the current design can't be improved, because it most certainly can.

  • Thanks clubbavich,

    You're quite right, I'm un-normalising the data. Bad me!

    Each system has a number of sensors (as few as 1, as many as 20) of 2 different types (with more planned).

    Regardless of sensor type, each time we get data it comes as 4 tinyints showing the various properties at that time. That looks like the plan for the near- to mid-term future.

    We do not have table partitioning in our version of SQL Server (it's Web Edition), but we could look at splitting by date rather than by system, so our tables would be named tblSensorData2012Q1, tblSensorData2012Q2, etc., with UNION queries when a retrieval spans more than one quarter. Once a quarter has finished, its index can be defragged and the table effectively becomes SELECT-only. If we get a lot more data coming in, we could shrink the time-frame covered by each table.
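
    What I have in mind is essentially a local partitioned view over the quarterly tables, along the lines of the sketch below (the view name is just illustrative; with a CHECK constraint on the date column in each quarterly table, the optimiser can skip the quarters that fall outside the requested range):

    CREATE VIEW vwSensorData
    AS
    SELECT * FROM tblSensorData2012Q1
    UNION ALL
    SELECT * FROM tblSensorData2012Q2
    -- ...add each new quarter's table to the view as it comes into use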

    I did a comparison of table sizes before and after the weekend. We are adding approximately 400,000 rows a day based on current usage, so I'm looking at reaching 500,000,000 records in total over the next year (roughly 41% growth), assuming no new sensors, no extra systems and no real-time reporting.

    I also realised that no index defragging has ever been done... some of those tables are 80%+ fragmented (with over 100,000 rows). However, that's a task for another day... 🙂
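
    For the record, the defrag plan is roughly the sketch below (the 30%/1,000-page thresholds are just the usual rules of thumb, and system_x stands in for whichever table needs attention):

    -- Find the worst offenders in the current database
    SELECT OBJECT_NAME(ips.object_id) AS table_name, i.name AS index_name, ips.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
    JOIN sys.indexes i ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    WHERE ips.avg_fragmentation_in_percent > 30 AND ips.page_count > 1000
    ORDER BY ips.avg_fragmentation_in_percent DESC

    -- Then, per table, rebuild (or REORGANIZE for the lighter-touch option)
    ALTER INDEX ALL ON system_x REBUILD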

  • Consider the following tables, and tell me how well this would work for you:

    --It's likely this table already exists in your database, but I'm creating it again here as an example
    CREATE TABLE systems
    (
        system_id smallint IDENTITY (1,1),
        other_data_about_systems varchar(50) NULL,
        CONSTRAINT PK_Systems
            PRIMARY KEY CLUSTERED (system_id)
            WITH (IGNORE_DUP_KEY = OFF)
    )

    --This table will store your different sensor types
    CREATE TABLE sensortypes
    (
        sensor_type_id tinyint IDENTITY (1,1),
        sensor_type_desc varchar(50) NOT NULL,
        CONSTRAINT PK_sensortypes
            PRIMARY KEY CLUSTERED (sensor_type_id)
            WITH (IGNORE_DUP_KEY = OFF)
    )

    --This table describes each individual sensor
    CREATE TABLE sensors
    (
        sensor_id int IDENTITY (1,1), /* How you will uniquely identify the sensor.
            Alternatively you could create a composite key using this field (as a tinyint) and the system_id field.
            That would make things a bit more complicated to code, but saves a little space in this example (1 byte per row) */
        system_id smallint
            REFERENCES systems(system_id),
        sensor_type_id tinyint
            REFERENCES sensortypes(sensor_type_id),
        CONSTRAINT PK_sensors PRIMARY KEY CLUSTERED (sensor_id)
            WITH (IGNORE_DUP_KEY = OFF)
    )

    CREATE TABLE sensor_data_new
    (
        sensor_id int
            REFERENCES sensors(sensor_id),
        data_timestamp smalldatetime NOT NULL, /* accurate to 1 minute.
            See http://msdn.microsoft.com/en-us/library/ms186724.aspx to choose the best type for your requirements.
            I HIGHLY recommend that you store this timestamp information in GMT, so you aren't affected by daylight savings time */
        sensor_datapoint_a tinyint NOT NULL,
        sensor_datapoint_b tinyint NOT NULL,
        sensor_datapoint_c tinyint NOT NULL,
        sensor_datapoint_d tinyint NOT NULL,
        /* The assumption here is that future sensor types will continue to use the same format for their data.
            If the datapoints are different between sensor types (sensor_datapoint_a represents X for one type of sensor,
            and represents Y for another), I would recommend having separate tables for the data for each sensor type.
            Not only does that provide a logical form of partitioning, but it would also make the table columns more descriptive
            (instead of generic names, the columns can be named for the specific datapoint stored). */
        CONSTRAINT PK_sensor_data PRIMARY KEY CLUSTERED (sensor_id, data_timestamp)
        /* To reduce fragmentation, it would be better to put data_timestamp as the first part of the clustered index.
            However, given that your queries will be focusing on retrieving data for a specific sensor for a given time, the index above is probably best */
            WITH (IGNORE_DUP_KEY = OFF)
    )

    You could create sensor_data_new as sensor_data_2012Q1, sensor_data_2012Q2, sensor_data_pre2012, etc. for archival purposes, but keep in mind that would definitely require more manual maintenance: each and every quarter, you would need to create a new table, move the data over, and update the view.

    Another approach would be a sensor_data_archive table (inserted into monthly, indexes rebuilt directly afterwards, with the inserts occurring during off-peak time), a sensor_data_last_x_months table (inserted into weekly/daily, indexes rebuilt/reorganized after the inserts, also during off-peak time, probably before the archive table), and sensor_data_new being constantly written to.

    With this, you create the tables once, modify the view once, create the agent jobs, and you have no additional manual tasks to perform on a regular basis.
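
    As a rough sketch (the three-month cut-off is just an example value), the monthly agent job would be little more than:

    BEGIN TRANSACTION

    INSERT INTO sensor_data_archive (sensor_id, data_timestamp, sensor_datapoint_a, sensor_datapoint_b, sensor_datapoint_c, sensor_datapoint_d)
    SELECT sensor_id, data_timestamp, sensor_datapoint_a, sensor_datapoint_b, sensor_datapoint_c, sensor_datapoint_d
    FROM sensor_data_last_x_months
    WHERE data_timestamp < DATEADD(MONTH, -3, GETUTCDATE())

    DELETE FROM sensor_data_last_x_months
    WHERE data_timestamp < DATEADD(MONTH, -3, GETUTCDATE())

    COMMIT TRANSACTION

    -- Rebuild the archive's indexes while still in the off-peak window
    ALTER INDEX ALL ON sensor_data_archive REBUILD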

    Compared to the current design, it may end up with slightly less overhead per row than what you've got now, but it's most certainly more normalized and easier to query.

  • Yes, that would work well. We might not normalise each sensor just yet, but we will certainly be heading in that direction, depending on growth and sensor data.

    We will stick with our own custom timestamp, as the system works and we don't want to change the firmware on our controllers just yet 😉 In the future, though, I imagine we may want to make the shift.

    We have systems internationally, so we are going through everything else in the system and writing a PHP layer to handle timezones, storing everything in UTC with a separate field for the time zone of each system, user, etc. (Three of us had a 4-hour meeting to flesh out all the scenarios! We have decided to standardise before we start making major changes and before we get more than our current handful of international users.)

    I imagine that once manual table partitioning becomes hard to manage, we will upgrade our SQL Server to an edition that supports it. This interim solution at least avoids the problems of both "One Big Table" and "So Many Tables To Keep Optimised"!

    In the meantime, I reorganised a few indexes today (freeing 400MB), with many more to do (or to remove, as they are unused). We have very few relational integrity constraints, and a lot of the fields have inappropriate datatypes for what they hold!

    At least, having sounded out my ideas on the largest problem (both in terms of data and mentally), it is now better resolved in my head!

    Many thanks for the help 🙂
