Antipathy for Entity Attribute Value data models

  • @eric, it depends on what you want to do with those contact details. If it is just a block of unstructured data that you want to eyeball whenever you reference the customers then fine.

    If you want to integrate it with separate telephone dialling systems and email targeting systems then I'd normalise it out.

    Ditto security. Separation of data to support different contact methods allows me to assign different security based on the sensitivity of the data. Email addresses tend to be less sensitive than telephone numbers.

    Frankly it is so easy to normalise your example that it simply wouldn't occur to me to put it in an EAV. You have disciplined data and you are going to store a lot of it if it is a contact system. I regard EAV as an option of last resort, not a starting point to be optimised later.

    The medical example early in this thread is one where EAV genuinely is useful. Once you get past 40, the sheer number of things that can go wrong with a human body becomes horribly apparent, and I'd sooner represent its ailments in EAV form than try to normalise the human body.

  • Peter Schott (1/21/2013)


    For Customer Contacts, why wouldn't I make a "Contact Type" table and just normalize this? Sure it can store just about anything, but since all contacts are going to be a string of some sort, I don't see the value in not going the one extra step and normalizing. With the existing model, I still have to either know what the two-character code represents or have to maintain a lookup table somewhere. I could also potentially run out of combinations or be forced to make new entries that don't make sense just so they're unique ("XX" - what does this mean again?). (And yes, this is highly unlikely, but it could potentially happen.)

    Regarding the article, I appreciated the examples in the article and the practical cases for them. Too many times we get someone modeling out an EAV config just because it seems expedient when it will more than likely cause issues later down the line. I've also heard "we'll prevent bad data through the code" too many times. That's never worked well in my experience, though hoping others have had better experience.

    I actually do have a Contact_Type table and a foreign key constraint. Contact_Type can have additional columns to indicate things like whether it's a supertype of phone number, email, etc., but there has never been a requirement to take it that far.
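
    A minimal sketch of the normalised design being described (all table and column names here are illustrative, not from any poster's actual schema; a Customer table is assumed to exist):

    ```sql
    -- Lookup table: one row per contact method. Extra columns can classify
    -- the method (phone-like, address-like, etc.) if ever needed.
    CREATE TABLE Contact_Type (
        Contact_Type_ID   tinyint     NOT NULL PRIMARY KEY,
        Contact_Type_Name varchar(30) NOT NULL UNIQUE  -- 'Phone', 'Email', 'Fax', ...
    );

    -- One row per customer per contact value.
    CREATE TABLE Customer_Contact (
        Customer_ID     int          NOT NULL REFERENCES Customer (Customer_ID),
        Contact_Type_ID tinyint      NOT NULL REFERENCES Contact_Type (Contact_Type_ID),
        Contact_Value   varchar(100) NOT NULL,
        Is_Primary      bit          NOT NULL DEFAULT 0,
        PRIMARY KEY (Customer_ID, Contact_Type_ID, Contact_Value)
    );
    ```

    With this shape, supporting a new contact method is a row in Contact_Type, not a schema change, while the foreign key still prevents undefined type codes.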

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Using the sql_variant data type for Value renders half of the article's concerns rubbish.

    Including data size calculations.

    Filtering by comparing to a datetime value will show only records having the datetime base type, not varchar values like '2012-01-21'.

    And if you do not know how to retrieve items with all 3 attributes having requested values, that indicates a weakness in your SQL skills, not in the EAV model.
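
    For what it's worth, the "all 3 attributes" retrieval alluded to here is classic relational division. One common sketch, against a hypothetical Entity_Attribute_Value table with parameters @v1..@v3:

    ```sql
    -- Return entities where attributes 1, 2 and 3 all hold the requested values.
    SELECT eav.Entity_ID
    FROM   Entity_Attribute_Value AS eav
    WHERE  (eav.Attribute_ID = 1 AND eav.Value = @v1)
       OR  (eav.Attribute_ID = 2 AND eav.Value = @v2)
       OR  (eav.Attribute_ID = 3 AND eav.Value = @v3)
    GROUP  BY eav.Entity_ID
    HAVING COUNT(DISTINCT eav.Attribute_ID) = 3;  -- all three conditions matched
    ```

    The HAVING clause keeps only entities for which every one of the three attribute/value predicates was satisfied.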

    To enforce data types for certain attributes you may simply use views (triggered views, if you use them for uploading data).

    You simply define a data type in the column definitions and only appropriate values may be inserted into those columns. No additional coding required.
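
    A sketch of the view-based approach being described, assuming an EAV table with a sql_variant Value column (table, view and attribute names are illustrative):

    ```sql
    -- Expose one attribute as a typed column. The base-type filter keeps out
    -- rows whose sql_variant value is not actually a datetime, and the CAST
    -- gives consumers an ordinary datetime column to work with.
    CREATE VIEW dbo.Entity_LastContacted
    AS
    SELECT Entity_ID,
           CAST(Value AS datetime) AS Last_Contacted
    FROM   dbo.Entity_Attribute_Value
    WHERE  Attribute_ID = 42  -- hypothetical 'LastContacted' attribute
      AND  SQL_VARIANT_PROPERTY(Value, 'BaseType') = 'datetime';
    ```

    An INSTEAD OF trigger on such a view could likewise reject uploads whose values fail the type check.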

    About locking.

    When you update an attribute in a traditional model you lock at least the row holding that item.

    Whole row. Including all attributes.

    If another process is trying to update another attribute it has to wait until the previous update is completed and the lock is released.

    And after that it will overwrite the whole row, including the just-updated attribute, setting it back to its previous value.

    Unless you take special precautions for this case in your code - talking about additional coding!!!
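
    The "special precautions" for that lost-update scenario are usually an optimistic concurrency check, along these lines (table and column names are illustrative; Row_Version is assumed to be a rowversion column read earlier by the application):

    ```sql
    -- Only apply the update if nobody else changed the row since we read it.
    UPDATE Customer
    SET    Phone       = @new_phone
    WHERE  Customer_ID = @customer_id
      AND  Row_Version = @row_version_read_earlier;

    IF @@ROWCOUNT = 0
        RAISERROR('Row was changed by another process; reload and retry.', 16, 1);
    ```

    If a concurrent writer got in first, zero rows are affected and the caller must reload and retry rather than silently clobbering the other update.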

    And XML.

    Oh, right XML!

    Best alternative!

    Hierarchical list of attributes with associated values.

    Nothing like EAV!

    And even better - attributes are named in a human language every time they are mentioned. Definitely less storage space than using integer ID's.

    And values are all strings, no way around.

    Especially good when you share the data with representative offices in Germany, Russia and Korea. Dates saved in local formats (even if you do not allow words for months, m/d/y is still an "only in America" format).

    To search against XML you need additional "XML index" - talking about duplicating data storage and extra CPU/IO load (to update the index you actually need to parse uploaded XML and rewrite the index pages).

    _____________
    Code for TallyGenerator

  • David.Poole (1/21/2013)


    @Eric, it depends on what you want to do with those contact details. If it is just a block of unstructured data that you want to eyeball whenever you reference the customers then fine.

    If you want to integrate it with separate telephone dialling systems and email targeting systems then I'd normalise it out.

    Ditto security. Separation of data to support different contact methods allows me to assign different security based on the sensitivity of the data. Email addresses tend to be less sensitive than telephone numbers.

    Frankly it is so easy to normalise your example that it simply wouldn't occur to me to put it in an EAV. You have disciplined data and you are going to store a lot of it if it is a contact system. I regard EAV as an option of last resort, not a starting point to be optimised later.

    The medical example early in this thread is one where EAV genuinely is useful. Once you get past 40, the sheer number of things that can go wrong with a human body becomes horribly apparent, and I'd sooner represent its ailments in EAV form than try to normalise the human body.

    I still don't see the practical benefit of separating phone / fax numbers from emails and twitter accounts. It seems like an arbitrary distinction that can change over time or depending on the context. The Contact_Type tells the application whether to enable an autodialer versus a hyperlink on the front end.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • The Anchor Modeling approach is interesting, but I imagine that an environment where many different entities each have hundreds or maybe even thousands of attributes added over time (where an EAV really shines) would become difficult to manage, since it looks like each attribute requires its own table.

  • I think Eric's example demonstrates the strength of an EAV design..you can add new attributes via data and not table structure.

    David, I see from the responses that your article on this topic has initiated some lively debate! Well done!

  • Eric M Russell (1/21/2013)


    I still don't see the practical benefit of separating phone / fax numbers from emails and twitter accounts.

    There were no Twitter accounts several years ago.

    A database designed using the traditional model back then needs to be redesigned to accommodate the new type of contact reference.

    In an EAV model it's a matter of adding a row to the list of contact types.

    People usually supply a single contact method (or two at most), yet you have to keep a row with all possible contact options, mostly populated with NULLs.

    P.S. Just in case: I'm not arguing with Eric; actually my post is in his support. 🙂

    _____________
    Code for TallyGenerator

  • Not sure what a traditional model would be when it comes to contacts. Seems to me that a table linking the contact info to a person/entity, a contact type, and the value (with maybe a listed/unlisted value as well) was a better design than one column per contact type or multiple tables for phone, email, fax, etc. The worst designs I've seen were those that had just a handful of columns in the model for storing these values - they just don't extend well. I never really considered this an EAV model because it still only stores communication-type values. We're not mixing in addresses, favorite colors, and such in the same table. I guess I'm just missing the point in this particular instance, though. 🙂

  • BillyJack (1/21/2013)


    I think Eric's example demonstrates the strength of an EAV design..you can add new attributes via data and not table structure.

    David, I see from the responses that your article on this topic has initiated some lively debate! Well done!

    Depending on how we look at it, the Customer_Contact table I presented above contains only one attribute (contacts). It's just flexible enough to accommodate multiple types of contacts.

    "Write a query that returns customer's name, primary method of contact, and last date/time of contact for all customers who have a late payment past 30 days."

    Given the above requirement, I don't want phone numbers, emails, and faxes all stored in separate tables. Also, if dealing with separate tables, I especially don't want a requirement from end users to start supporting text messaging.

    For the majority of contact types, a single table need contain only a type code and a single alpha-numeric value. The application or user can then decide what to do with the information. They just need to know the contact type they're dealing with for context.

    Of course mailing and home address should almost certainly be in a separate table, because an address is composed of multiple attributes (city, zip, geo location, etc.) that are commonly queried separately.
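
    Against a single Customer_Contact table of that shape, the quoted requirement can be sketched roughly as follows (the Customer, Contact_Type, Contact_Log and Payment tables and their columns are hypothetical):

    ```sql
    -- Name, primary contact method, and last contact date for customers
    -- with a payment more than 30 days overdue.
    SELECT c.Customer_Name,
           ct.Contact_Type_Name AS Primary_Contact_Method,
           MAX(cl.Contacted_At) AS Last_Contacted
    FROM   Customer         AS c
    JOIN   Customer_Contact AS cc ON cc.Customer_ID = c.Customer_ID
                                 AND cc.Is_Primary  = 1
    JOIN   Contact_Type     AS ct ON ct.Contact_Type_ID = cc.Contact_Type_ID
    LEFT   JOIN Contact_Log AS cl ON cl.Customer_ID = c.Customer_ID
    WHERE  EXISTS (SELECT 1
                   FROM   Payment AS p
                   WHERE  p.Customer_ID = c.Customer_ID
                     AND  p.Due_Date  < DATEADD(DAY, -30, GETDATE())
                     AND  p.Paid_Date IS NULL)
    GROUP  BY c.Customer_Name, ct.Contact_Type_Name;
    ```

    The point being argued: because all contact methods live in one table, the query does not change when a new contact type is added.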

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Sergiy (1/21/2013)


    And if you do not know how to retrieve items with all 3 attributes having requested values it indicates weakness of your SQL skills, not EAV model.

    It isn't a case of a DBA's skills; it's the case of people who have basic knowledge but whose primary tool is not SQL. Users of SAS Analytics, for example. I'm interested in designs that maximize the audience that can gain benefit from accessing the data.

    There is little point in coming up with a design that requires a data priesthood in order to get business value from data.

    Most of my work over the past 12 months has really emphasised the problems caused by poor data quality. Anything that opens the door to poor data quality has to be resisted at all costs in my environment. As data has to be shared with external organisations as part of how my organisation serves its customers, poor data quality has a direct effect on revenue.

    I'm not sure that your locking statement is true. I could be wrong, but I don't think SQL Server locks fields; it locks records, and anything that requires more than 5,000 locks results in lock escalation.

  • Good article, thanks. I'll pass it on to my team. We're stuck in the Third Circle of Hell implementing a massive EAV in a third party tool. Helps to know what you are up against.

    Tom Powell
    http://philergia.wordpress.com/

  • Turning to the Magento EAV model (http://www.magentocommerce.com/wiki/_media/doc/magento---sample_database_diagram.png?cache=cache), the EntityAttributeValue table is split into 5 different tables: EntityAttributeValue_datetime, EntityAttributeValue_int, EntityAttributeValue_decimal, EntityAttributeValue_text, EntityAttributeValue_varchar. This solution improves storage but increases the model's complexity.
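
    The per-type split described amounts to one value table per base type, along these lines (simplified sketch; only two of the five tables shown, and constraints on Entity/Attribute are omitted):

    ```sql
    -- One value table per base type keeps each value natively typed and
    -- compact, at the cost of a UNION (or per-type join) at query time.
    CREATE TABLE EntityAttributeValue_int (
        Entity_ID    int NOT NULL,
        Attribute_ID int NOT NULL,
        Value        int NOT NULL,
        PRIMARY KEY (Entity_ID, Attribute_ID)
    );

    CREATE TABLE EntityAttributeValue_datetime (
        Entity_ID    int NOT NULL,
        Attribute_ID int NOT NULL,
        Value        datetime NOT NULL,
        PRIMARY KEY (Entity_ID, Attribute_ID)
    );
    -- ... likewise _decimal, _text and _varchar
    ```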

    There is a bit of the EAV model in Microsoft CRM; it is a clear example of business-requirement flexibility, where entities are pre-defined but attributes can be managed by the end users.

    In my opinion EAV is a good & nice model with huge weaknesses: performance and complexity (as a colleague said above, it's too far away from the logical model).

    A nice discussion in this thread.

  • LOL! All of you guys who added a [Customer_Tweet] table to your database, are you going to add tables for supporting Hoots and Chirps too?

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I made an entire subsystem for EAV data, in order to hold configuration information to be used by the entire enterprise.

    There are stored procs, DTS packages, jobs, C# apps, web services and VB apps that access this data, which has been placed in lists. Those lists can be plain lists, lists that vary according to a second definable parameter, or tables that vary according to a definable parameter. Those 900 configuration tables can be abstracted down to 3 or 4 and the data normalized, since there are only keys, values (varchar(max)) and descriptions (also varchar(max)).

    We don't have a need to store binary information, and we use TVFs to populate table variables that are then used in our queries (so casting is minimized). Even taking the hit where nested case statements would normally be used, it is actually faster than the nested case statements themselves, and much cleaner and more maintainable. My goal is to have all our IFs and CASEs eventually use the configuration data so the entire enterprise is table driven.

    I guess the thing I'm most proud of is the index on varchar(max). Heh. Compute the SHA, persist it, and it's close enough when you say hash = @hash AND value = @value. Make sure there are indices on all the foreign keys and it's nothing but index seeks. It conforms to my 'all your shit in one sock' algorithm, which places the configuration data where you know it'll be.
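
    The varchar(max) "index" trick described above can be sketched with a persisted computed hash column (table and column names are hypothetical; HASHBYTES over a full varchar(max) assumes SQL Server 2016 or later, as earlier versions cap its input at 8,000 bytes):

    ```sql
    -- varchar(max) cannot be an index key, but a persisted hash of it can.
    CREATE TABLE Config_Value (
        Config_Key int          NOT NULL PRIMARY KEY,
        Value      varchar(max) NOT NULL,
        Value_Hash AS CAST(HASHBYTES('SHA1', Value) AS binary(20)) PERSISTED
    );

    CREATE INDEX IX_Config_Value_Hash ON Config_Value (Value_Hash);

    -- Seek on the hash, then re-check the base value to weed out collisions.
    SELECT Config_Key
    FROM   Config_Value
    WHERE  Value_Hash = CAST(HASHBYTES('SHA1', @value) AS binary(20))
      AND  Value      = @value;
    ```

    The trailing equality check on Value matters, for exactly the collision reason raised in the reply below this post.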

  • lnardozi 61862 (1/21/2013)


    I made an entire subsystem for EAV data, in order to hold configuration information to be used by the entire enterprise.

    There are stored procs, DTS packages, jobs, C# apps, web services and VB apps that access this data, which has been placed in lists. Those lists can be plain lists, lists that vary according to a second definable parameter, or tables that vary according to a definable parameter. Those 900 configuration tables can be abstracted down to 3 or 4 and the data normalized, since there are only keys, values (varchar(max)) and descriptions (also varchar(max)). We don't have a need to store binary information, and we use TVFs to populate table variables that are then used in our queries (so casting is minimized). Even taking the hit where nested case statements would normally be used, it is actually faster than the nested case statements themselves, and much cleaner and more maintainable. My goal is to have all our IFs and CASEs eventually use the configuration data so the entire enterprise is table driven. I guess the thing I'm most proud of is the index on varchar(max). Heh. Compute the SHA, persist it, and it's close enough when you say hash = @hash AND value = @value. Make sure there are indices on all the foreign keys and it's nothing but index seeks. It conforms to my 'all your shit in one sock' algorithm, which places the configuration data where you know it'll be.

    Hashing can accelerate searching for equality/inequality, but after an equality match you still have to compare the base data, as collisions will occur, and more frequently so as the size of your tables increases! In the end all it takes is exactly one collision to render a system useless.

    I made an 'object model' in SQL Server in the past, and it had tricks that SQL Server could not do at the time (ON DELETE SET NULL, for example) and was superior in cascading-delete flexibility. But as the amount of data put into the system grew, problems arose, many problems. Some of them I could today alleviate somewhat with better code, but the core of the problems remains: inefficient IO, locking, and lots of trigger/procedural code to enforce things. I have to say I went to extremes and put everything into the model, the whole database, including all the things I knew at design time. This is certainly not something I will ever repeat!

    I am all for metamodels for handling design-time unknowns, and/or for physically extending a set model with extra attributes when that is beneficial. Physical, simply because you can access the data where it otherwise would be modelled, and it is open to native indexes and constraints. This is more efficient in all aspects concerned (speed/correctness/locking/code execution).

    There might be one case where I would consider EAV over a physical table, and that is very wide tables that exhibit access patterns over a fraction of the attributes at a time. But even then I would try to structure it more, as it is very likely that some attributes are always accessed in groups, and thus a form of partitioning is likely more effective. In a sense EAV is a form of partitioning that is overdone in most instances!

