Name-value pair (EAV) model

  • I can see the flexibility that a rapidly changing data model demands. But I will argue strongly against abstracting the data model away from the database layer.

    All these conceptual thought games around changing data models, software changes, and end-user configurability just pass the decisions in a changing application to a person who is insulated from a relational view of the performance impact of the data model change they are contemplating.

    The result is unmanaged, incremental performance risk: development may be cheaper, but it is blind to the risks of the modelling decisions being made. If only I could force the decision makers to do the ongoing support for these applications.

  • I love how every time I read about EAVs there's always some quote about performance blah, blah. EAVs have their place and they are good at what they do. Hmm, let's take an example: start using Extended Properties, or querying sysproperties, or using the ::fn_listExtendedProperty function. Oh my god! Microsoft used an EAV! Did they use it throughout the whole design of SQL Server? No! EAVs serve their purpose; as long as you are not relying on this data for business logic, I think they're fine.

    When you need to display different descriptions for Products based on culture, you're doing an EAV. Even though you've defined the columns very specifically for Products, and in most minds it's a one-to-many design, its underlying concept is still EAV.

    I tend to use EAV data where it's applicable. For example, logging event data. I don't use that data for business logic, maybe some reports, but it's rarely used. It does, however, need to be flexible, or else you end up creating hundreds of tables to log different types of events. Just look at SQL Notification Services (hmm, I wonder why they're no longer going to support it?). It's too difficult and a maintenance nightmare.

    Please keep reading! There will come a point, after storing certain attributes, when they will become either their own tables or additional columns of a root table: once you realize the value of having that as an additional column, or once you find a way to name the column so that its meaning is universal to anyone using it (for example, changing SSN to NationalID so that it works in any country, or State to Region and County to District). EAVs make great sandboxes for data that can eventually become part of the standard model.

    I don't know; my EAV tables have always had two indexes and have always been fine. Not sure how others are implementing this. Obviously partitioning will help.
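    For reference, a minimal sketch of one possible two-index layout for such an EAV table (the table and index names here are illustrative, not from this post):

        CREATE TABLE EventLog
        (
            EntityID    int          NOT NULL,  -- the "E": which entity the value belongs to
            AttributeID int          NOT NULL,  -- the "A": which attribute is stored
            Value       varchar(255) NULL       -- the "V": the attribute value as text
        );

        -- Index 1: "all attributes of one entity"
        CREATE CLUSTERED INDEX CIX_EventLog_Entity
            ON EventLog (EntityID, AttributeID);

        -- Index 2: "all entities with a given attribute/value"
        CREATE NONCLUSTERED INDEX IX_EventLog_Attribute
            ON EventLog (AttributeID, Value);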

  • Cade,

    True - but not in this case. They actually used a TEXT field for the 'value' part...

    Mark

  • tymberwyld (3/24/2008)


    I love how every time I read about EAVs there's always some quote about performance blah, blah. EAVs have their place and they are good at what they do. ... Obviously partitioning will help.

    I agree... event logging and audit logs are a great example of EAVs. I also use them with great success in cross-tab reporting... the first step is frequently to put the data into some form of EAV as a derived table. If I really want things to fly, the EAV will be a temp table with the correct index. I've made several monthly reports that report 9 different aggregated attributes by hour of day by day of month for all days of the month. Millions of rows are the source of the data (an IVR application, in this case), and they are quickly aggregated into an EAV and then cross-tabbed with ease. And, unlike the PIVOT function, this approach can return more than one measure (if you want to call it that).
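    A minimal sketch of that cross-tab pattern (the #eav temp table, its columns, and the attribute names are hypothetical, not taken from the IVR report itself):

        -- Step 1: collapse the source rows into a keyed EAV temp table
        CREATE TABLE #eav
        (
            DayOfMonth int           NOT NULL,
            HourOfDay  int           NOT NULL,
            Attribute  varchar(30)   NOT NULL,
            Value      decimal(18,4) NOT NULL
        );

        CREATE CLUSTERED INDEX CIX_eav ON #eav (DayOfMonth, HourOfDay, Attribute);

        -- Step 2: classic CASE-based cross tab; unlike PIVOT, it can
        -- return several measures (a SUM and a COUNT here) in one pass
        SELECT  DayOfMonth,
                HourOfDay,
                SUM(CASE WHEN Attribute = 'CallsOffered'  THEN Value END) AS CallsOffered,
                SUM(CASE WHEN Attribute = 'CallsAnswered' THEN Value END) AS CallsAnswered,
                COUNT(CASE WHEN Attribute = 'CallsOffered' THEN 1 END)    AS OfferedSamples
        FROM    #eav
        GROUP BY DayOfMonth, HourOfDay
        ORDER BY DayOfMonth, HourOfDay;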

    We've also very successfully used them in "Customer Added 'Fields'"... not something you should go nuts with, but effective when the customer uses them properly.

    As has been said, they have their place and that's not "everywhere".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Naturally I cannot remember where I read the articles I would like to reference, but I did find the following:

    Academic-style paper

    Data Extraction and Ad Hoc Query of an Entity-Attribute-Value Database

    http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=61332

    See section 11, Discussion of drawbacks

    http://en.wikipedia.org/wiki/Entity-Attribute-Value_model

    See Downsides

    http://www.answers.com/topic/entity-attribute-value-model?cat=technology

    Article in favor of generic design with many mostly negative comments posted

    http://www.sswug.org/see/26210

    A posted four-table model of everything, with a few mostly negative but not overly specific disagreements

    http://discuss.joelonsoftware.com/default.asp?design.4.331499.15

    The basic EAV design works great when you have only one set of attribute-value pairs, and it works pretty well when you have a few sets of attribute-value pairs for a limited number of groups. But add another layer or two of grouping and increase the volume, and the model starts to break down as soon as your query complexity increases.

    If you are just considering using the model for a simple parameter table, then you will probably not have to deal with most of the issues, but if your needs are more complex you can quickly get into trouble.

    -- Mark D Powell --

  • Just like to thank everyone for an excellent discussion. Very informative.

    Jeff Roughgarden, Ph.D., MCDBA, MCSD

  • Thanks for the article. I'm interested in understanding different design schemes. My question may have already been answered, but I'm not sure.

    For the first time, I'm considering an EAV setup, and I would be very interested to hear opinions. My staff need to do a "risk assessment", but you can also think of it as a "survey" problem. I think there is a very good chance that the questions in the assessment/survey will vary over time as we refine our assessment process. Here's a subset of some of the questions in the risk assessment to give you a sense of what I am talking about:

    > Risk of a fall

    > Risk to client or caregivers from pets, livestock, etc.

    > home has odors (gas, rotting food, etc.)

    > client has skin condition

    > client has access to adequate medical supplies

    > client can recall recent events

    The answer to "every" question (at least, this is what is claimed to be true not only now but also in the future) is "none, minor, moderate, severe, or n/a". Also note that the questions/assessment categories are grouped together. For example, there are questions under the categories of: physical condition of the client, availability of shelter, hazards, sanitation of the home, etc.

    Each answer has a point/percentage implication for the final "risk assessment" for that client. I'm not sure yet of the formulas, but I think the application may have to do things like: "for every question in this category of questions (say, for all the "hazards" questions), give 5 points for every "minor" answer, 10 for every "moderate" answer, and 15 for every "severe" answer. Then add it all up."

    Taking all that in, I was thinking that the following type of schema might make sense. But I'm nervous about it in light of this article. Note that this is bare-bones. I'm not trying to define the schema completely. I'm just exploring the idea of the basic set-up.

    Table: QuestionList

    qID - number for each question

    qStartDt - when question becomes valid

    qStopDt - when question is no longer valid

    CategoryCd - link to another "lookup" table that lets us group questions according to a category

    qDescription - the actual question (examples are given above)

    Table: Assessments (the main table, with EAV)

    ClientID - link to the client getting the assessment. I would expect a set of qID records for every client getting assessed.

    qID - link to above table

    AnswerCd - link to another "lookup" table that gives us the answers listed above (minor, moderate, etc). That lookup table would also include info for calculations.

    This is a simplified version of what I would do, but the basic idea is there. It seems to me that this would be easy to query, easy to run formulas against, and not such a bad performer. Also, since the answers all follow the same format, I can do good data validation/restriction. But I do not have much experience with this type of design, especially with the amount of usage the design could potentially get.
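    A minimal T-SQL sketch of that idea (the data types, the AnswerLookup table with its Score column, and all names beyond those listed above are assumptions for illustration, not part of the proposal):

        CREATE TABLE QuestionList
        (
            qID          int          NOT NULL PRIMARY KEY,
            qStartDt     datetime     NOT NULL,  -- when question becomes valid
            qStopDt      datetime     NULL,      -- NULL = still valid
            CategoryCd   char(4)      NOT NULL,  -- FK to a category lookup
            qDescription varchar(200) NOT NULL   -- the actual question
        );

        CREATE TABLE AnswerLookup
        (
            AnswerCd   char(4)     NOT NULL PRIMARY KEY,  -- 'NONE', 'MINR', 'MODR', 'SEVR', 'NA'
            AnswerDesc varchar(30) NOT NULL,
            Score      int         NOT NULL               -- e.g. minor = 5, moderate = 10, severe = 15
        );

        CREATE TABLE Assessments
        (
            ClientID int     NOT NULL,
            qID      int     NOT NULL REFERENCES QuestionList (qID),
            AnswerCd char(4) NOT NULL REFERENCES AnswerLookup (AnswerCd),
            PRIMARY KEY (ClientID, qID)  -- add an assessment date if clients are reassessed
        );

        -- Total risk score per client: sum the points behind each answer
        SELECT a.ClientID, SUM(al.Score) AS RiskScore
        FROM Assessments a
        JOIN AnswerLookup al ON al.AnswerCd = a.AnswerCd
        GROUP BY a.ClientID;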

    Which brings me to this note: I'm not sure of the usage for this data. I would think that worst-case would put us at about 4 million records after 10 years, but that is only if the program ends up getting much wider usage than my little agency.

    Thoughts?

  • Aren't you just normalizing the question type out of the client-question-response?

    Consider finding a client that:

    has no risk of a fall

    has low risk to client or caregivers from pets, livestock, etc.

    has high home odors (gas, rotting food, etc.)

    has no skin condition

    has low access to adequate medical supplies

    has no recall of recent events

    Would you look for clients by their responses across surveys?

    If they didn't answer that question, should they still be eligible to be returned when your search was only looking for clients with no risk of a fall?

    I would not be scared of this type of normalisation, with client, question, and client-question-response covered by appropriate indexes and only a couple of million responses.
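    For what it's worth, a sketch of what those covering indexes might look like against the Assessments table sketched above (the index name and the question id are illustrative):

        -- "All responses for one client" is already covered by the
        -- clustered primary key on (ClientID, qID).

        -- "All clients with a given response to a given question",
        -- e.g. clients with no risk of a fall:
        CREATE NONCLUSTERED INDEX IX_Assessments_Question_Answer
            ON Assessments (qID, AnswerCd)
            INCLUDE (ClientID);

        SELECT ClientID
        FROM Assessments
        WHERE qID = 1            -- hypothetical "risk of a fall" question
          AND AnswerCd = 'NONE';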

    If you were proposing to have client as a normalised type out of, say, person, I would still probably not be scared.

    But if Question, Person, and Survey were normalised types out of some entity table, and person-has-a-survey, survey-has-a-question, and person-takes-a-survey-question were normalised types out of some entity-relationship table, then I would start to squirm.

    If you know that the client takes a survey (not a supervisor or an operator), and that the survey is the only thing with questions you need to search for clients by, then denormalising the survey-question-response into database foreign key relationships increases the scalability possibilities, specifically for the "clients that have no risk of a fall" type of search, and carries little if any risk to the flexibility of the application.

    Anyone else?

  • David: Thanks for your reply!

    To answer one of your questions, "Would you look for clients by their responses across surveys?" - At the moment, I don't see that happening for particular questions, but you never know. At the moment, the main purpose of the survey would be to define a total risk level for the client. If the client's risk level is, say, "high", then the investigator/user would know to do certain actions to help the client. I also imagine the users wanting annual summary stats such as: how many clients last year had a risk level of low? med? high?

    You do not have to explain, but I thought I would let you know that I didn't understand this part of your reply: "But if Question, Person, and Survey were normalised types out of some entity table, and person-has-a-survey, survey-has-a-question, and person-takes-a-survey-question were normalised types out of some entity-relationship table, then I would start to squirm."

    As for this part: "If you know that the client takes a survey (not a supervisor or an operator)...", I'm not sure I understood your full point, but I can comment on this one part. The clients themselves will never actually be filling out the "survey". Instead, the operator (our investigators/staff) will decide if a survey/assessment is required. If so, then a complete survey will likely be set up for that client, by which I mean a stored proc will likely be called to generate a record for every active question and attach it to that client. The default answer for each question would be null or "not applicable", and the user would just change the answers for the questions that apply. For now, I think that set-up would allow me to create the most user-friendly interface.
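    A hedged sketch of such a stored proc against the tables sketched earlier (the procedure name and the 'NA' default are assumptions):

        CREATE PROCEDURE dbo.CreateAssessment
            @ClientID int
        AS
        BEGIN
            SET NOCOUNT ON;

            -- One row per currently active question, defaulted to "not applicable"
            -- (assumes an 'NA' row exists in AnswerLookup)
            INSERT INTO Assessments (ClientID, qID, AnswerCd)
            SELECT @ClientID, q.qID, 'NA'
            FROM QuestionList q
            WHERE q.qStartDt <= GETDATE()
              AND (q.qStopDt IS NULL OR q.qStopDt > GETDATE());
        END;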

    Thanks again for your interest.

  • David: I've been thinking more about your reply. Particularly, I've been caught on your first sentence: "Aren't you just normalizing the question type out of the client-question-response?"

    That's such a good point. So, is my proposal good normalization or EAV? At what point is something EAV vs good normalization? I understand that the answer relates to the content/subject of the data. But I think it is worth thinking about.

  • I would suggest that EAV is where the types are taken out of the database entirely, so that only the application context can decipher what entity type id x means.

    Which is fine when the application doesn't want to 'view' the data sorted by something other than the order that IS in the data model (e.g. clientid order). Getting the answers for a customer doesn't need the question types in the schema, and it has the advantage that if the customer has another set of 'typed' data that needs to be stored, there is no schema-level change (or change to the SQL that extracts this customer's survey).

    For example, entity response type ids 1-32 could be the original question responses that the application stored about the client. You could just as easily create response entity ids 33-66 that store a physician's responses to a different set of questions with a contextually different meaning determined by the application (but about the same client). That is to say, the database schema would not describe the application's data model, or at least not the part where 1-32 means client responses and 33-66 means physician responses; the application or stored procedure would extract the two contextually different data sets, knowing internally by id what the responses meant.

    Great, so we have just allowed the application the flexibility to change its data model without any change to the db schema. This comes at the cost of needing to recreate the part of the data model that is not in the schema each time we present it logically to the user. For simple parts, like a flat list, there is little effort (or even no schema to reconstruct) in producing a flat list for a specific range of a larger flat list. For more abstracted types, the reconstruction becomes non-trivial.

    As you add more and more abstracted types (and layers of abstraction), you end up reducing the selectivity of the index and increasing the number of seeks by the level of abstraction: for example, the selectivity of the client-question-response index by question id. At some number of layers of abstraction, and some amount of data being indexed, the work of reconstructing the actual types becomes measurable. I doubt a single abstracted type is going to do anything measurable.

    An EAV becomes a candidate for conversion to an indexed, denormalised relational structure with schema-defined types (tables), similar to your proposal, if the application ends up with time-critical grouping demands on the E's for a large number of V's in a grouping order very different from the order indexed (or the initial grouping); i.e. when the ability to view clients by their id is as important as viewing clients by their response to question 1 of a survey done in 2005.

    Alternatively, if you are happy that the relationships between the client type, the question type, and the survey type are unlikely to change, then putting them into the db schema only adds the risk of needing to change the schema (on top of the application change, which would be mandated anyway) if those relationships change in actual use.

  • David: Thanks again for your reply. You did a good job of laying out a decision framework / set of criteria to use for making these kinds of decisions. Nice.

  • This is exactly what I was talking about in my previous post. You might be able to represent this type of Client-Question-Response using an EAV, but you'll find it's better to have a good relational model with "limited" EAV-ness. The only EAV portion I can see in your model would be the Client "Answers" or "Responses" table.

    Basically there are two pieces:

    1.) The storage model for the Surveys, Questions, and Answers:

    Surveys (table)

    Questions (table) - stores the question and the type of question (multiple-choice, true/false, or written). When the question type is "multiple choice", the Answers table will be used. You could also provide a "Score" column here representing the overall points associated with this question if it has more "weight" than others (in the situation where the survey is a "Test / Exam" type).

    Answers (table) - stores the answers, their sequence (so they can be re-ordered), and any other info about them. This table is probably only used for multiple-choice questions. You could also provide a "Score" column in this table if you're trying to derive a total "risk assessment" type of survey.

    2.) The storage model for the Surveys, Clients, and their Responses / Answers. On this side of the design, you are sort of using an EAV for the "ClientAnswers", because you're not sure whether the "Value" column is storing an answer to a true/false or a multiple-choice question.

    Surveys (table) - same as above

    ClientSurveys (table) - stores which surveys the client has taken. Maybe a total score if you're not always summing the score from the ClientAnswers.

    ClientAnswers (table) - stores the answer the client chose for the particular question in the survey. This is a child of ClientSurveys. In this table, the "Answer" (or "Value") column should probably be a Sql_Variant. This way you can either store the Value from one of the multiple-choice answers (an Int usually) or the "T" or "F" as a Bit data type.

    Actually, I'd have two "Value" columns (Value sql_variant NULL, and ValueText text or varchar(MAX) NULL). This way, if one of the questions is a "written answer", storage of the answer will be more efficient, because the text or varchar(MAX) column will be used for only a limited number of records.
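    A hedged sketch of that ClientAnswers layout (the key columns and all names are assumed for illustration):

        CREATE TABLE ClientAnswers
        (
            ClientSurveyID int          NOT NULL,  -- FK to ClientSurveys
            QuestionID     int          NOT NULL,  -- FK to Questions
            Value          sql_variant  NULL,      -- int answer id, or a bit for true/false
            ValueText      varchar(MAX) NULL,      -- populated only for written answers
            PRIMARY KEY (ClientSurveyID, QuestionID)
        );

        -- A multiple-choice answer...
        INSERT INTO ClientAnswers (ClientSurveyID, QuestionID, Value, ValueText)
        VALUES (1, 10, CAST(3 AS sql_variant), NULL);

        -- ...and a written answer
        INSERT INTO ClientAnswers (ClientSurveyID, QuestionID, Value, ValueText)
        VALUES (1, 11, NULL, 'Client reports occasional dizziness.');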

    I hope I haven't confused you too much...

  • tymberwyld: Thank you for taking the time to reply. So much to think about....

  • Interesting discussion. I certainly see where EAV design is effective for front-end systems which expect to receive large numbers of unknown values from multiple sources or when the query objectives are for display of a single response set (to display a risk score on the screen, for example). The time and resource savings for redesign of data models are evident. Great for "getting data in". However, what this practice really does is delay the integration decisions to a later step in the process.

    We data warehouse developers struggle with name-value pair design in our front-end (source) systems. One of the primary purposes of a good data model is communication of the business relationships among the logical entities in which the business has an interest. The EAV model cannot be used as such a communication tool. On the data warehousing side and the downstream reporting side, the intent is to make the data available to end report users and decision makers in an unambiguous, easily queryable form that is tuned to return millions of rows in aggregated form. We are tasked with analyzing and dissecting the value pairs and creating a logical business model that is representative of the integrated content. It's really a case of pay now or pay later.

    We took an approach in our datamart (star schema) where each fact record represented an assessment and had an FK to our question table for each question. This of course reduced the number of storage rows by raising the grain from a single response to a set of responses, but it had the negative and limiting effect of creating multiple joins, which, currently at 24, is slowing performance, not to mention confusing our OLAP reporting tool developers.

    We will now create a view that performs the joins for them, but I am still searching for an approach to the survey problem that combines reporting performance and business clarity with the lowest possible storage requirements. I look forward to T.Asfaw's second article and hope you will address the impact of EAV design on the downstream data warehousing and reporting arenas.
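    A hedged sketch of such a flattening view, assuming one answer FK per question on the fact table (only three of the 24 joins are shown; all names are illustrative):

        CREATE VIEW dbo.vAssessmentFlat
        AS
        SELECT f.AssessmentKey,
               f.ClientKey,
               a1.AnswerDesc AS FallRisk,
               a2.AnswerDesc AS PetRisk,
               a3.AnswerDesc AS HomeOdors
               -- ... one join per question, 24 in total
        FROM FactAssessment f
        JOIN DimAnswer a1 ON a1.AnswerKey = f.Q01AnswerKey
        JOIN DimAnswer a2 ON a2.AnswerKey = f.Q02AnswerKey
        JOIN DimAnswer a3 ON a3.AnswerKey = f.Q03AnswerKey;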
