Best way to efficiently store huge amounts of data

  • Would be interested to hear some of the best ways to approach this situation.

    I've recently started as a DBA at a company that has lots of developers, but has never had a DBA or database developer. They have recently started developing products which receive huge amounts of data very quickly (300 transactions a second), which needs to be stored and queried for 5 years. They had already created the database before I came along and are now looking to test it and see if improvements can be made.

    They have implemented the insert process for the main bulk of the transactions as follows. In an effort to keep the solution "generic", so that they can adapt it to fit multiple solutions, they have the following process flow.

    A top-level products table receives the product line information: an Id, product type, etc.

    Each product has attributes associated with it. A lookup is done on the product attributes table to get the attributes just for that product type.

    Then a row is inserted into the productdetails table for each product attribute, in a cursor loop: for each attribute type, the corresponding value of that attribute for the product is inserted into the table.

    Summary:

    Product header inserted.

    Select all attribute types from the attributes table based on the product type.

    For each attribute type, insert into the productdetails table the product's value for that attribute.
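
    To illustrate, here is roughly what that flow looks like in T-SQL. This is purely illustrative: the table, column and variable names (Products, ProductAttributes, ProductDetails, @ProductId and so on) are made up for the example, not our actual schema.

        -- Hypothetical illustration of the row-by-row flow described above
        DECLARE @ProductId INT = 1, @ProductTypeId INT = 7, @AttributeTypeId INT;

        -- 1. Product header inserted
        INSERT INTO dbo.Products (ProductId, ProductTypeId)
        VALUES (@ProductId, @ProductTypeId);

        -- 2. Cursor over the attribute types defined for this product type
        DECLARE attr_cur CURSOR LOCAL FAST_FORWARD FOR
            SELECT AttributeTypeId
            FROM dbo.ProductAttributes
            WHERE ProductTypeId = @ProductTypeId;

        OPEN attr_cur;
        FETCH NEXT FROM attr_cur INTO @AttributeTypeId;

        -- 3. One separate INSERT per attribute -- this is the row-by-row part
        WHILE @@FETCH_STATUS = 0
        BEGIN
            INSERT INTO dbo.ProductDetails (ProductId, AttributeTypeId, AttributeValue)
            VALUES (@ProductId, @AttributeTypeId, 'value for this attribute'); -- per-attribute value lookup omitted

            FETCH NEXT FROM attr_cur INTO @AttributeTypeId;
        END

        CLOSE attr_cur;
        DEALLOCATE attr_cur;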

    The idea is that if a new product comes along with a whole new set of unique attributes, then they just add the new attribute types to the attributes table and the information can be stored.

    To me this seems very inefficient, as one product with lots of details is being split into, on average, 10 inserts into one table.

    They predict there will be about 9 billion transactions a year, so on average that is going to be around 90 billion rows a year in one table!! This has to be queried too.

    I'm thinking it's going to fall on its face!

    I realise this is a very open question with not much information, but any pointers on the best way to improve this would be great!

  • What you are describing is what is known as EAV (Entity-Attribute-Value). This is a horrible design in almost every sense of the word. There are a few cases where this type of pattern is OK, but in general avoid it like the plague. You are absolutely right that it is going to fall on its face.

    Check out this article explaining a generic EAV system; it sounds somewhat similar to what you have going on. http://www.simple-talk.com/opinion/opinion-pieces/bad-carma/

    It doesn't take a rocket surgeon to figure out that once you have 20-30 attributes, performance is a joke when you have to create queries with dynamic SQL and 20-30 self-joins. The best thing that can be done for performance is a complete rewrite. That isn't always possible, of course, but you are in for a long battle to get it straightened out.
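
    To make the join problem concrete, this is roughly the kind of query that falls out once you want even three attributes back as columns (hypothetical names again; now imagine it with 20-30 joins built via dynamic SQL):

        -- One self-join of the detail table per attribute you want back as a column
        SELECT p.ProductId,
               a1.AttributeValue AS Colour,   -- AttributeTypeId 1 = Colour (made-up ids)
               a2.AttributeValue AS Size,     -- AttributeTypeId 2 = Size
               a3.AttributeValue AS [Weight]  -- AttributeTypeId 3 = Weight
        FROM dbo.Products p
        JOIN dbo.ProductDetails a1 ON a1.ProductId = p.ProductId AND a1.AttributeTypeId = 1
        JOIN dbo.ProductDetails a2 ON a2.ProductId = p.ProductId AND a2.AttributeTypeId = 2
        JOIN dbo.ProductDetails a3 ON a3.ProductId = p.ProductId AND a3.AttributeTypeId = 3;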

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (9/28/2012)


    What you are describing is what is known as EAV (Entity-Attribute-Value). This is a horrible design in almost every sense of the word. There are a few cases where this type of pattern is OK, but in general avoid it like the plague. You are absolutely right that it is going to fall on its face.

    Check out this article explaining a generic EAV system; it sounds somewhat similar to what you have going on. http://www.simple-talk.com/opinion/opinion-pieces/bad-carma/

    It doesn't take a rocket surgeon to figure out that once you have 20-30 attributes, performance is a joke when you have to create queries with dynamic SQL and 20-30 self-joins. The best thing that can be done for performance is a complete rewrite. That isn't always possible, of course, but you are in for a long battle to get it straightened out.

    +1

    I had the misfortune to work on one of these EAV systems some years ago. Everything bad you read about these happened - poor performance, integrity issues, spiralling query complexity ... avoid if possible. Rewriting will be your best option.

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Thanks guys. That article was scary reading. Very familiar, and even some of the table names mentioned are the same as the ones the devs have deployed.

    I really need to understand the application more before coming up with a solution, but I am glad others agree that the current deployment is completely wrong!

    Thanks.

  • This will be an interesting one!

    I can understand why EAV is not popular, but what other setup will let folks enter new attributes without defining new table columns? I know one application I've encountered uses this setup so that new attributes and their values can be defined without a DBA having to update tables. What other method will work in this situation if EAVs are not used?

  • Perhaps it's just the way I've been using them but I've had no such problems with a properly constructed EAV. Certainly, I've not had to do a 20 or 30 table self join.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/1/2012)


    Perhaps it's just the way I've been using them but I've had no such problems with a properly constructed EAV. Certainly, I've not had to do a 20 or 30 table self join.

    Unfortunately the OP has already alluded to at least 10 self-joins. One of the big challenges in this type of system is RI. Foreign keys and constraints can be a serious PITA to deal with. I have also seen them have real challenges with datatypes, because the value columns will just be a varchar(max) and then you have to cast everything into the correct datatype. I have seen other implementations where there was a datatype column to indicate what the datatype of the value column is.
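
    As a sketch of that datatype-column variant (hypothetical names, and TRY_CONVERT needs SQL Server 2012+): the "type" lives in the data instead of the schema, so nothing stops garbage text landing in a supposedly numeric attribute, and every reader has to do the cast itself.

        -- Values stored as varchar(max); a sibling column claims the real type
        SELECT pd.ProductId,
               pd.AttributeValue,                                             -- raw text
               TRY_CONVERT(DECIMAL(18, 2), pd.AttributeValue) AS NumericValue -- NULL if the text is garbage
        FROM dbo.ProductDetails pd
        JOIN dbo.AttributeTypes att ON att.AttributeTypeId = pd.AttributeTypeId
        WHERE att.DataType = 'decimal';  -- the only thing "enforcing" decimal-ness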

    I agree that properly constructed they can be useful and even preferred in some cases. The system described by the OP does not sound like that to me. It sounds like the beginning of a nightmare. That is just my $.02

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (10/1/2012)


    Jeff Moden (10/1/2012)


    Perhaps it's just the way I've been using them but I've had no such problems with a properly constructed EAV. Certainly, I've not had to do a 20 or 30 table self join.

    Unfortunately the OP has already alluded to at least 10 self-joins. One of the big challenges in this type of system is RI. Foreign keys and constraints can be a serious PITA to deal with. I have also seen them have real challenges with datatypes, because the value columns will just be a varchar(max) and then you have to cast everything into the correct datatype. I have seen other implementations where there was a datatype column to indicate what the datatype of the value column is.

    I agree that properly constructed they can be useful and even preferred in some cases. The system described by the OP does not sound like that to me. It sounds like the beginning of a nightmare. That is just my $.02

    I'll typically use SQL_Variant for such things. Although DRI is a bit of a pain (I agree that FKs really aren't practical here), the problem that properly written and used EAVs (and NVPs) solve usually makes it worth it, if you do indeed allow such a thing. There are places, especially in reporting, where they pop up and most people don't even think of the tables as an EAV or NVP.
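
    A minimal sketch of the sql_variant flavour (the table and names here are made up for illustration): each value keeps its original base type, which SQL_VARIANT_PROPERTY can report back, so numeric values stay numeric instead of being strings.

        CREATE TABLE dbo.ProductNVP          -- hypothetical name/value-pair table
        (
            ProductId      INT          NOT NULL,
            AttributeName  VARCHAR(100) NOT NULL,
            AttributeValue SQL_VARIANT  NULL,
            CONSTRAINT PK_ProductNVP PRIMARY KEY (ProductId, AttributeName)
        );

        INSERT INTO dbo.ProductNVP (ProductId, AttributeName, AttributeValue)
        VALUES (1, 'Pages',     CAST(352 AS INT)),
               (1, 'UnitPrice', CAST(9.99 AS DECIMAL(9, 2)));

        -- The stored base type travels with each value
        SELECT AttributeName,
               AttributeValue,
               SQL_VARIANT_PROPERTY(AttributeValue, 'BaseType') AS BaseType
        FROM dbo.ProductNVP;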

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • EAV "can" be very fast and efficient if you use it for single-table storage. But that then requires EXTRAORDINARY effort on the application side to handle virtually all of the data processing needs and even when you do that it is useful for a pretty narrow class of apps. There is at least one Web2.0 company doing this, and I have another client using EAV to very good effect as well in a system that uses EAV constructs for a few tables.

    If you are going to do pure EAV storage and jump through the required hoops on the application side to manage the data, it is honestly a waste to spend the money for SQL Server. Use one of the many free storage systems available.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • One minor point ... even if you decide to use what they've established, you can make the inserts quite a bit more efficient by ditching the cursor loop. The same result can be achieved in a set-based manner. Every cursor that is instantiated has a performance cost and that cost will add up if you're looking at 300 transactions a second.
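
    For illustration, and assuming the incoming values can be staged set-based somewhere such as a temp table (all names here are hypothetical, matching the made-up schema sketched earlier in the thread), the entire cursor loop collapses to one statement:

        -- One insert covers every attribute of the product in a single pass,
        -- instead of one insert per attribute inside a cursor
        INSERT INTO dbo.ProductDetails (ProductId, AttributeTypeId, AttributeValue)
        SELECT @ProductId, pa.AttributeTypeId, iv.AttributeValue
        FROM dbo.ProductAttributes pa
        JOIN #IncomingValues iv ON iv.AttributeTypeId = pa.AttributeTypeId
        WHERE pa.ProductTypeId = @ProductTypeId;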

    └> bt



    Forum Etiquette: How to post data/code on a forum to get the best help

  • bteraberry (10/2/2012)


    One minor point ... even if you decide to use what they've established, you can make the inserts quite a bit more efficient by ditching the cursor loop. The same result can be achieved in a set-based manner. Every cursor that is instantiated has a performance cost and that cost will add up if you're looking at 300 transactions a second.

    Agreed. I would imagine that a table-valued parameter would be the simple solution here for an efficient, batched, single-hit insert, along the lines of the sketch below.
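
    A sketch of that idea, with made-up type and procedure names: the application sends every attribute row in one parameter, so one call does a single set-based insert.

        -- Table type describing the rows the application will send
        CREATE TYPE dbo.ProductDetailRows AS TABLE
        (
            AttributeTypeId INT          NOT NULL,
            AttributeValue  VARCHAR(MAX) NULL
        );
        GO

        CREATE PROCEDURE dbo.InsertProductDetails
            @ProductId INT,
            @Details   dbo.ProductDetailRows READONLY
        AS
        BEGIN
            SET NOCOUNT ON;

            -- Single round trip, single set-based insert
            INSERT INTO dbo.ProductDetails (ProductId, AttributeTypeId, AttributeValue)
            SELECT @ProductId, AttributeTypeId, AttributeValue
            FROM @Details;
        END;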

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks guys, all very interesting stuff. I'm getting more into how this application is implemented and it's becoming scary. As Sean Lange mentioned, the developers have implemented the value columns as varchar(max), and there is a separate table listing the datatype of every specific column, so that it is known what data is being inserted.

    I have also found that there is a parent/child relationship for some types of data. One "product", let's say, can have multiple child products, which get read from one table and inserted into another. When the product data is inserted into the database, the parent is inserted first, then all child records for that item are inserted into the same table, but with a parent id column set to the parent.

    The next thing that happens is that each specific attribute of each child is inserted into another table. So, for example, the values type = "addrLine1" and value = "123 5th avenue" are inserted; then type = "addrLine2" and value = "New York" is inserted as another row of the same table, with the product id of the child. The "type" column is a foreign key to the only table the developers want to maintain, by adding new attribute types when a different product comes along.
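
    To make that concrete (illustrative names and values only, not our real schema), one child's address alone turns into a pile of rows like this:

        -- Each attribute of each child is its own row, hence its own insert
        INSERT INTO dbo.ProductDetails (ProductId, AttributeType, AttributeValue)
        VALUES (101, 'addrLine1', '123 5th avenue'),
               (101, 'addrLine2', 'New York');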

    All this does is lead to the one table receiving about 10 or more rows (and therefore 10 or more inserts) for just one item! This is bad enough as it is, but the child records can run into the millions for just one product! So what could already be a million single inserts actually gets multiplied by the number of attribute inserts for each child!

    For example, 17,000 products were meant to be added to the database yesterday; each had thousands, and some had millions, of child items. These 17,000 transactions took 4 hours to complete!!

    Once I get the code out, I'll post it and see what you think! It's quite shocking.

  • Maddave (10/3/2012)


    Thanks guys, all very interesting stuff. I'm getting more into how this application is implemented and it's becoming scary. As Sean Lange mentioned, the developers have implemented the value columns as varchar(max), and there is a separate table listing the datatype of every specific column, so that it is known what data is being inserted.

    I have also found that there is a parent/child relationship for some types of data. One "product", let's say, can have multiple child products, which get read from one table and inserted into another. When the product data is inserted into the database, the parent is inserted first, then all child records for that item are inserted into the same table, but with a parent id column set to the parent.

    The next thing that happens is that each specific attribute of each child is inserted into another table. So, for example, the values type = "addrLine1" and value = "123 5th avenue" are inserted; then type = "addrLine2" and value = "New York" is inserted as another row of the same table, with the product id of the child. The "type" column is a foreign key to the only table the developers want to maintain, by adding new attribute types when a different product comes along.

    All this does is lead to the one table receiving about 10 or more rows (and therefore 10 or more inserts) for just one item! This is bad enough as it is, but the child records can run into the millions for just one product! So what could already be a million single inserts actually gets multiplied by the number of attribute inserts for each child!

    For example, 17,000 products were meant to be added to the database yesterday; each had thousands, and some had millions, of child items. These 17,000 transactions took 4 hours to complete!!

    Once I get the code out, I'll post it and see what you think! It's quite shocking.

    Quite a bit of unfortunate stuff going on here. Good luck with a) getting it to perform/scale and b) getting them to do a major refactor to allow a) to actually happen. Oh, and drop me a line if you need help with either of those!! 😎

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Maddave (10/3/2012)


    Thanks guys, all very interesting stuff. I'm getting more into how this application is implemented and it's becoming scary. As Sean Lange mentioned, the developers have implemented the value columns as varchar(max), and there is a separate table listing the datatype of every specific column, so that it is known what data is being inserted.

    I have also found that there is a parent/child relationship for some types of data. One "product", let's say, can have multiple child products, which get read from one table and inserted into another. When the product data is inserted into the database, the parent is inserted first, then all child records for that item are inserted into the same table, but with a parent id column set to the parent.

    The next thing that happens is that each specific attribute of each child is inserted into another table. So, for example, the values type = "addrLine1" and value = "123 5th avenue" are inserted; then type = "addrLine2" and value = "New York" is inserted as another row of the same table, with the product id of the child. The "type" column is a foreign key to the only table the developers want to maintain, by adding new attribute types when a different product comes along.

    All this does is lead to the one table receiving about 10 or more rows (and therefore 10 or more inserts) for just one item! This is bad enough as it is, but the child records can run into the millions for just one product! So what could already be a million single inserts actually gets multiplied by the number of attribute inserts for each child!

    For example, 17,000 products were meant to be added to the database yesterday; each had thousands, and some had millions, of child items. These 17,000 transactions took 4 hours to complete!!

    Once I get the code out, I'll post it and see what you think! It's quite shocking.

    YIKES!!!!! :w00t::w00t::w00t:

    It sounds like you may be able to write the next chapter in the article I posted. As a few others have said in this thread, EAV is useful for extremely specific situations. It is by no means a replacement for relational data in an OLTP system. I hope you are able to talk some sense into these people. Your 17,000-product insert should take a few seconds... 4 hours is unbelievable!!!

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Out of interest, then: if I was selling books (ISBN, author, pages), wine (bottle size, ABV, vineyard, grape, acidity) and shirts (material, collar type, sleeve length), how would you recommend storing the attributes for those products so that they were both easy to insert and easy to extract?
