How do you Normalize a 300 million record with 300 columns table? Nationwide Consumer Database

  • dtran1127

    SSC Veteran

    Points: 283

    Hello:

    Thanks in advance. How would you go about normalizing a  300 million record data set with 300 columns ? This is a consumer database with it's demographics.

    The consumers are non-duplicates. 

    Would it make sense to make another 300 million record set normlizing the demographics such as Interests, hobbies etc etc?

  • Steve Jones - SSC Editor

    SSC Guru

    Points: 717443

    What's the use here? How often are things updated? Certainly you have duplicate data in that there are a large percentage of M and F in there, so you could break these out.

    I'd look at query patterns, what is likely to be queried. If I am often querying by lots of varied fields, maybe this works. But if I can break out some sets of columns to other tables, because they aren't always queried, and I can index and examine a subset, maybe that helps.
    If this is just read only data, maybe you want to just index most fields.

  • Jonathan AC Roberts

    SSCoach

    Points: 17171

    dtran1127 - Friday, August 24, 2018 3:15 PM

    Hello:

    Thanks in advance. How would you go about normalizing a  300 million record data set with 300 columns ? This is a consumer database with it's demographics.

    The consumers are non-duplicates. 

    Would it make sense to make another 300 million record set normlizing the demographics such as Interests, hobbies etc etc?

    300 million consumers?
    It's difficult to say without seeing the details but it sounds like a star schema might be what you're looking for. You would have a fact table with 300 columns and 300 million rows. The columns would just be integer Ids with FKs to 300 other dimension tables.

  • dtran1127

    SSC Veteran

    Points: 283

    Thanks for that. I will look into star schema. It does look like something that might be better for my type of work. I'm assuming it's because integer id's will run a lot quicker correct?

  • Jonathan AC Roberts

    SSCoach

    Points: 17171

    dtran1127 - Friday, August 24, 2018 4:03 PM

    Thanks for that. I will look into star schema. It does look like something that might be better for my type of work. I'm assuming it's because integer id's will run a lot quicker correct?

    Yes, integers don't take up much space. You only need to join to the tables relevant to the query you are running. Also, you shouldn't allow nulls on any of the columns. Instead have a negative value which relates to a negative value on the dimension table.

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply