Best way to effienctly store hugh amounts of data

  • aaron.reese (10/3/2012)


    out of interest then; if I was selling books (ISBN, Author, Pages), wine (bottle size, ABV,Vinyard,grape,acidity) and shirts (material,collar type, sleeve length), how would you recommend storing the attributes for those products so that they were both easy to insert and easy to extract.

    If you're going to try to cram such disparate entities into a single table you need to have a very very good reason to do so. In your scenario I don't see any reason why you wouldn't have separate tables for the completely different items you're selling. You can still query in such a way as to combine different items from the different tables without sacrificing sensible design.

    └> bt



    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • Maddave (10/3/2012)


    Thanks guys, all very interesting stuff. I'm getting more in to how this application is implemented and it's becoming scary. As Sean Lange mentioned, the developers have implemented the columns as varchar(max) and there is a separate table with each datatype listed for every specific column, so that it is known what data is being inserted.

    I have also found that there is a parent/child relationship for some types of data so, for one "product", lets say, it can have multiple child products, which get read from a table and inserted into another table. When the product data is inserted into the databases, the parent is inserted first, then all child records for that item are inserted into the same table, but with a parent id column set to the parent.

    The next thing that happens is, each specific attribute for each child, is inserted into another table. So for example, the values, type = "addrLine1" and value = "123 5th avenue" are inserted, then, type = "addrLine2" and value = "New York" is inserted into another row of the same table, with the product id of the child. The "type" coluum is a foreign key of the only table the developers want to maintain, by adding new attributes when a different product comes along.

    All this does, is lead to the one table having about 10 or more rows, (therefore inserts too) for each attribute of just one item! This is bad enough as it is, but the child records could be in their millions for just one product! So what could potentially be a million single inserts, actually gets multiplied by the number of specific inserts for each child attributes!

    For example, 17000 products where meant to be added to the databases yersterday, each had 1000's and some had millions of child items. This 17,000 transactions took 4 hours to complete!!

    Once I get the code out, I'll post it and see what you think! It's quite shocking.

    The big question I have: Are you doing anything at all with the attribute data other than storing it and returning it back to the application? Are you using those attributes to filter queries, compile statistics, etc? If not, and if all you are doing is passing all the attributes back to the application, I would look at storing that data in an XML or JSON format in a single column in the parent table. (I prefer JSON since it is more efficient.) We have an application that works with Twitter data and we keep all the data that we want to present but that we never filter by in a JSON string because the attributes included with the tweet record are highly variable based on a number of factors. Queries are run based on attributes like username, date, etc, that are broken out into columns in the parent table and the rest of the data is conveyed via JSON to the application.

    If the attribute values are relevant to querying and whatnot, something I would look at is the use of sparse columns. Microsoft implemented that strategy in the database architecture that supports SharePoint. They put all their "things" into a single table and have a boatload of all possible attributes. Any individual item will only use a tiny fraction of the attributes and so the majority of the data in those columns is null. Without knowing the specifics of what your system is doing, it sounds as though your needs may be similar.

    Good luck.

    └> bt



    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • aaron.reese (10/3/2012)


    out of interest then; if I was selling books (ISBN, Author, Pages), wine (bottle size, ABV,Vinyard,grape,acidity) and shirts (material,collar type, sleeve length), how would you recommend storing the attributes for those products so that they were both easy to insert and easy to extract.

    1) As I have said before, if you do EAV "properly" (which is VERY difficult/tedious), it really can be an efficient and "proper" design for this type of varied data collections. What you currently have is neither efficient nor proper on a variety of fronts.

    2) Having a table per object type sold is unworkable - how many different types of objects does Amazon sell??

    3) Sparse Columns come to mind as a potential solution for what you are trying to do, especially if you "genericize" them and let the application logic know what column is what attribute for each type of object. Again, that gets "difficult" for the app, but will SIGNIFICANTLY reduce the total columns you would need and also reduce the number of times over the life of the app that you will need to add columns to the table (unless you just start out with some very high number such as 1000).

    For example, have 5 Attribute columns: Attribute1, Attribute2, Attribute3, Attribute4, Attribute5. For Books, Attribute1 would be ISBN, Attribute2 would be Author, etc. But for Wine Attribute1 would be bottle size and Attribute2 would be ABV, etc.

    I note that Sparse columns do carry some downsides of primarily being xml-ish in nature and have quite a few caveats, limitations and provisos. Also, if you did go with a generic set you might have few-enough of them to not really benefit from their overhead.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (10/4/2012)


    2) Having a table per object type sold is unworkable - how many different types of objects does Amazon sell??

    So if you're selling products you have to design a schema that would work for Amazon.com? What if you're only selling three different types of items?

    I disagree with your dogmatism. The correct answer is that "it depends". I've seen a lot of developers come up with very complex solutions to support a flexibility they don't really need and they end up wasting a lot of time in the process.

    └> bt



    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • bteraberry (10/4/2012)


    TheSQLGuru (10/4/2012)


    2) Having a table per object type sold is unworkable - how many different types of objects does Amazon sell??

    So if you're selling products you have to design a schema that would work for Amazon.com? What if you're only selling three different types of items?

    I disagree with your dogmatism. The correct answer is that "it depends". I've seen a lot of developers come up with very complex solutions to support a flexibility they don't really need and they end up wasting a lot of time in the process.

    It is easy to show points by taking things to the extreme. But in this case say you only sell 3 types of things and you code that. Now you sell a 4th thing. You will likely have to modify the database schema, load processes, application code to know about that additional product table. Then a 5th, etc. I imagine there are few entities of any size that sell things that don't add new products/product types (or at least want to)! 🙂

    And I thought my dogmatism was actually anti-dogmatism, with several proposed solutions and guidances about potential pitfalls. And I ABSOLUTELY agree with your statement that developers sometimes (often?) build solutions that are overly/unnecessarily complex (or use new/shiney "crap" to do so) and wind up with pain because of those decisions. Hell - I make quite a bit of money cleaning up those messes!! :hehe:

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • aaron.reese (10/3/2012)


    out of interest then; if I was selling books (ISBN, Author, Pages), wine (bottle size, ABV,Vinyard,grape,acidity) and shirts (material,collar type, sleeve length), how would you recommend storing the attributes for those products so that they were both easy to insert and easy to extract.

    This is an easy problem to run into... particularly if you want to help out, say, webcomic authors sell random stuff off their site. Two mugs, a few shirts, some comics... etc.

    These guys (and gals) will sell anything that makes sense to them, it's their business. How do you fit in Deelybobbers into that list when they fad again? Semi-generic makes sense here.

    To the question itself, it's a question of where do you want to do the processing? In the end, most of it will be on the front end. To accurately display and explain the information coming from such disparate entities the front end needs to have some kind of brain to deal with it.

    At the DB level, you primarily store the core information. Price, quantity/shipment, cost, etc. What's generic to each and every item. Then, for attributes that have no business affect, you EAV store (personal preference). The reason I do it this way is that the database does NOT care about this information, it just stores it for the front end.

    But I want to know how many medium vs. large shirts I sell!!! Well, that's great, you approach that backwards and get your medium shirt ProdIDs, Large Shirt Prod IDs, then drop them back to the main table and get your real information. Yes, it's painful, and requires a ton of ad-hoc querying, and you probably just blew out performance, but you'll get there.

    The other option is the generic column description above, where the front end re-interprets the information. I dislike these, as it's far too easy to end up with a mess if a single item is 'misplaced'. Also, most of the EAV data is only required when you're listing descriptive information on a single product, thus a single call to the EAV table with the ProductID and Type returns everything the front end needs to display the product's information in an intelligent manner. This is actually even WORSE for ad-hoc querying, because trying to decipher the generic columns for a particular product type usually has you going through reams of external documentation. Depending on indexes however it may be faster once you're done with the development... of an ad-hoc.

    However, a lot of this discussion isn't about variable product information, it's about storing things like state in an EAV for a client. That is poor use of the design. EAV has its place, and primarily (in my opinion) it's for data the database won't care about except in one-off requests for the front end's programming.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

Viewing 6 posts - 16 through 20 (of 20 total)

You must be logged in to reply to this topic. Login to reply