designing database - people properties

  • If I wanted to collect "any" information, on a whole lotta people, how would

    I design such a database ?

    Okay, Lets say I want to store information from 3 million people, the kind of information is survey based, where each survey is built and executed over 100 000 people. After that a new survey is built and executed over some other 100 000 people.

    Because the properties of future surveys are not known upfront, how would I design such a database.

    Let's start with an example:

    first survey grabs followinf info from people: (haircolor, salary, married, children)

    second survey: (type-o-car, hobbies, pets)

    THe thing I'm having trouble with is the querying afterwards? In a simple database, storing the info as columns, the queries are relatively simple, but if I use kay-value tables....

    any input, or links ??

     

    thanks

  • You might want to consider EAV. http://ycmi.med.yale.edu/nadkarni/Introduction%20to%20EAV%20systems.htm

  • This is one of possible structures for similar purposes (simplified). It is based on the condition, that every question has a predefined set of answers (no "explain in your own words" option - just "check correct answer"). Please take it only as an idea, something to think about, not a solution. A lot depends on what and how you want to do with the data.

    table surveys - header of a survey - when it was run, who did it, what was the objective, name etc.

    table questions - question_ID, survey_ID (FK to surveys), question_text (what is being asked), question_number (to allow correct ordering of questions)

    table answers - answer_ID, question_ID (FK to questions), answer_text, answer_number (numbered from 1 to N for every question)

    If you want to use the DB not only for storing results, but plan for some interactivity when entering data, you can add column "next_question". This will then allow you to navigate through the questions based on answers (like not asking about the color of a car, if the person just answered Do you have a car? with "No").

    This all was structure of the survey, now comes the other part, how to enter the results:

    table persons - person_ID + any basic info you want to store in most surveys

    table results  - person_ID, question_ID, answer_ID (all three are FK to respective tables)

    In your first example, you would insert one row into table surveys, that will identify the survey. Then you insert one row for every question (question_ID = 333, question_text = 'haircolor', question_number = 1 etc.) and one row for every possible answer to each question 

    answer_ID = 781, question_ID = 333, answer_text = 'brown', answer_number = 1

    answer_ID = 782, question_ID = 333, answer_text = 'blonde', answer_number = 2

    answer_ID = 783, question_ID = 333, answer_text = 'green', answer_number = 3

    This should allow you to build reasonable queries.

    HTH, Vladan

  • We do something very similar the only tweak is that the questions have an 'answer type' attribute which defines how the question is answered. Currently we have three types of answer.

    1. An option E.G. 1=Blue, 2=Red, 3=Green

    2. Free Text E.G. 'I did'nt like any of the colours'

    3. Numeric E.G. 1000

  • Thanks for the advice,

    I thought of this solution numerous times, however I 'm in trouble when doing queries on such a database design, the queries seem to get overcomplicated. for example lets say I want to select all the redhaired women, driving a mercedes and are older then 30

    But now for the TRICKY part, sometimes the answers change or an answer is only valid for a couple of days.

    How would you design such a requirement.

    again an example:

    THe first 3 days, the question "cartype?" has 3 answers (mercedes, porsche and buick)

    THe next 3 days, one answertype (buick) gets removed and another is added ("bmw")

    Question is, how to design this and how would one query the answers.

    The query should answer: How many people chose what type of car, and their percentage ? (For me the big problem is the calculation of the percentage, it should not be calculated on the complete total of people surveyed, but only the total of people surveyed when that specific answer is/was available)

    hope it's clear enough?

     

     

     

  • Argh * 2

    I just lost my post and you have a fairly horrid problem - some options that sring to mind are

    1. If Buick is removed code it as a new question.

    2. Buick is ID 3 , BMW is ID 4 (thats probably what I would do).

    3. Look at BiTemporal database design (very hairy). Look for a paper written by a cult guy called Snodgrass.

     

  • I was thinking about adding a creationdate per answer (active, inactive date),

    then get the total count of surveys since that date (or betwen those dates)

    after that I grab the total count for that specific answer...

    that's probably the solution but then on a per answer basis,

    However, is it possible to make a query (by good db design) to compute this

    on a per question basis...

  • Your idea of active dates is pretty much what BiTemporal is about. Try this link http://www.cs.arizona.edu/people/rts

     

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply