How do you Normalize a 300 million record with 300 columns table? Nationwide Consumer Database

  • Hello:

    Thanks in advance. How would you go about normalizing a  300 million record data set with 300 columns ? This is a consumer database with it's demographics.

    The consumers are non-duplicates. 

    Would it make sense to make another 300 million record set normlizing the demographics such as Interests, hobbies etc etc?

  • What's the use here? How often are things updated? Certainly you have duplicate data in that there are a large percentage of M and F in there, so you could break these out.

    I'd look at query patterns, what is likely to be queried. If I am often querying by lots of varied fields, maybe this works. But if I can break out some sets of columns to other tables, because they aren't always queried, and I can index and examine a subset, maybe that helps.
    If this is just read only data, maybe you want to just index most fields.

  • dtran1127 - Friday, August 24, 2018 3:15 PM

    Hello:

    Thanks in advance. How would you go about normalizing a  300 million record data set with 300 columns ? This is a consumer database with it's demographics.

    The consumers are non-duplicates. 

    Would it make sense to make another 300 million record set normlizing the demographics such as Interests, hobbies etc etc?

    300 million consumers?
    It's difficult to say without seeing the details but it sounds like a star schema might be what you're looking for. You would have a fact table with 300 columns and 300 million rows. The columns would just be integer Ids with FKs to 300 other dimension tables.

  • Thanks for that. I will look into star schema. It does look like something that might be better for my type of work. I'm assuming it's because integer id's will run a lot quicker correct?

  • dtran1127 - Friday, August 24, 2018 4:03 PM

    Thanks for that. I will look into star schema. It does look like something that might be better for my type of work. I'm assuming it's because integer id's will run a lot quicker correct?

    Yes, integers don't take up much space. You only need to join to the tables relevant to the query you are running. Also, you shouldn't allow nulls on any of the columns. Instead have a negative value which relates to a negative value on the dimension table.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply