• The archive table is even less normalized than the current format. Designing it this way will reduce your options in the future.

    For example, it may not be a requirement now, but what if, six months down the road, client demand requires you to provide a daily average value for each sensor?

    How would you do that in this design? It would be an extremely difficult and expensive operation.
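    For contrast, in a normalized layout that daily average is a single GROUP BY. A minimal sketch using SQLite, with an assumed readings table (the table and column names here are hypothetical, not your actual schema):

```python
import sqlite3

# Hypothetical normalized layout: one row per sensor reading.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        system_id    INTEGER NOT NULL,
        sensor_id    INTEGER NOT NULL,
        reading_time TEXT    NOT NULL,   -- ISO-8601 timestamp
        value        REAL    NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2024-01-01 08:00:00", 20.0),
        (1, 10, "2024-01-01 20:00:00", 22.0),
        (1, 10, "2024-01-02 08:00:00", 18.0),
    ],
)

# Daily average per sensor: trivial here, painful in the archive format.
rows = conn.execute("""
    SELECT system_id,
           sensor_id,
           date(reading_time) AS day,
           AVG(value)         AS avg_value
    FROM readings
    GROUP BY system_id, sensor_id, day
    ORDER BY day
""").fetchall()

print(rows)  # [(1, 10, '2024-01-01', 21.0), (1, 10, '2024-01-02', 18.0)]
```

    With one row per reading, the same query also extends naturally to weekly or monthly rollups by changing the date expression.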

    You don't do any analysis now, but that can change. Right now the only options your clients would have are to have the interface parse the data out so they can import it into their own systems for analysis, or to force the front-end developers to compensate for this design.

    Your company may be better served with one "Systems" table rather than thousands of system_xxxx tables. It sounds like they needed to support systems with varying numbers of sensors and didn't understand the correct way to do that. (Or they decided to go with the poor man's partitioned table, but the amount of per-customer maintenance that would require makes that unlikely.)
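    As a sketch of what collapsing the thousands of system_xxxx tables into a shared design could look like (all table and column names here are assumptions, not your actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per system instead of one table per system; a varying sensor
# count becomes rows in a child table rather than varying columns.
conn.executescript("""
    CREATE TABLE systems (
        system_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE sensors (
        sensor_id INTEGER PRIMARY KEY,
        system_id INTEGER NOT NULL REFERENCES systems(system_id),
        label     TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO systems VALUES (4217, 'system_4217')")
conn.executemany(
    "INSERT INTO sensors (system_id, label) VALUES (?, ?)",
    [(4217, 'A'), (4217, 'B'), (4217, 'C')],
)

# A system with more sensors is just more rows -- no new table, no DDL.
n = conn.execute(
    "SELECT COUNT(*) FROM sensors WHERE system_id = 4217"
).fetchone()[0]
print(n)  # 3
```

    The payoff is that queries, indexes, and maintenance jobs are written once, instead of once per system_xxxx table.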

    Some important questions:

    Are there system "types"?

    For example, type 1 has sensors A, B, and C, while type 2 has those AND sensors D, E, and F.

    Type 3 has sensors X, Y, and Z.

    x number of systems are type 1, y number of systems are type 2, etc.

    How many different types of sensors do you have for all systems? Is every sensor unique?

    Can a single system have more than one sensor of the same type? (Are there "types" of sensors?) Do some sensors have more than one datapoint?

    Do the datatypes the sensors store vary? (Some floating point values, some tinyints, etc.)

    Is that epoch timestamp necessary? Or can it be replaced with a smalldatetime? (accurate to 1 minute, 4 bytes of storage)
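    A rough sketch of that trade-off (a bigint epoch is 8 bytes versus 4 for smalldatetime; the row count below is made up for illustration):

```python
from datetime import datetime, timezone

EPOCH_BYTES = 8          # bigint Unix timestamp
SMALLDATETIME_BYTES = 4  # SQL Server smalldatetime, minute precision

# Truncating to minute precision is the only information lost.
ts = 1_700_000_000                      # an arbitrary example epoch value
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
to_minute = dt.replace(second=0, microsecond=0)
print(dt.isoformat(), "->", to_minute.isoformat())

rows = 100_000_000                      # hypothetical row count
saved_gib = rows * (EPOCH_BYTES - SMALLDATETIME_BYTES) / 2**30
print(f"{saved_gib:.2f} GiB saved")     # roughly 0.37 GiB per 100M rows
```

    Whether that loss of sub-minute precision is acceptable depends on how often the sensors actually report.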

    In the current format, is it actually "DataItem1, DataItem2", or were you simply filling in the blanks for more specific datapoints?
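    If the answers to the "type" questions above come back yes, the sensor set can hang off the system type rather than off each individual system. A sketch under that assumption (all names hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Systems reference a type; the type owns the list of sensors.
conn.executescript("""
    CREATE TABLE system_types (
        type_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE type_sensors (
        type_id     INTEGER NOT NULL REFERENCES system_types(type_id),
        sensor_code TEXT    NOT NULL,
        PRIMARY KEY (type_id, sensor_code)
    );
    CREATE TABLE systems (
        system_id INTEGER PRIMARY KEY,
        type_id   INTEGER NOT NULL REFERENCES system_types(type_id)
    );
""")
conn.executemany("INSERT INTO system_types VALUES (?, ?)",
                 [(1, "type 1"), (2, "type 2")])
# Type 2 has everything type 1 has, plus D, E, and F.
conn.executemany("INSERT INTO type_sensors VALUES (?, ?)",
                 [(1, "A"), (1, "B"), (1, "C"),
                  (2, "A"), (2, "B"), (2, "C"),
                  (2, "D"), (2, "E"), (2, "F")])
conn.execute("INSERT INTO systems VALUES (100, 2)")

# Which sensors does system 100 carry?  Follow system -> type -> sensors.
codes = [r[0] for r in conn.execute("""
    SELECT ts.sensor_code
    FROM systems s
    JOIN type_sensors ts ON ts.type_id = s.type_id
    WHERE s.system_id = 100
    ORDER BY ts.sensor_code
""")]
print(codes)  # ['A', 'B', 'C', 'D', 'E', 'F']
```

    Adding a new system type is then a couple of inserts, not a schema change.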

    I would think a more normalized design would make things more flexible for you. Row count is not the only metric to consider when evaluating how well a design will perform. In most cases a more normalized approach will use less storage and/or perform better. (3NF is generally considered the minimum a database should adhere to.)

    Also, storage is cheap nowadays. A 7.5 GB database is still quite small; twice that is still small. Even if "row headers" accounted for 50% of the storage, that overhead is offset by the fact that having the data structured this way improves performance and flexibility by far more than it costs. That's not to say the current design can't be improved, because it most certainly can.