How do you Normalize a 300 million record with 300 columns table? Nationwide Consumer Database

Question

How do you Normalize a 300 million record with 300 columns table? Nationwide Consumer Database

dtran1127

Old Hand

Points: 343
More actions
August 24, 2018 at 3:15 pm

#415502

Hello:
Thanks in advance. How would you go about normalizing a 300 million record data set with 300 columns ? This is a consumer database with it's demographics.
The consumers are non-duplicates.
Would it make sense to make another 300 million record set normlizing the demographics such as Interests, hobbies etc etc?

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Steve Jones - SSC Editor SSC Guru Points: 742834 More actions · Answer 1

What's the use here? How often are things updated? Certainly you have duplicate data in that there are a large percentage of M and F in there, so you could break these out.

I'd look at query patterns, what is likely to be queried. If I am often querying by lots of varied fields, maybe this works. But if I can break out some sets of columns to other tables, because they aren't always queried, and I can index and examine a subset, maybe that helps.
If this is just read only data, maybe you want to just index most fields.

Jonathan AC Roberts SSCoach Points: 18405 More actions · Answer 2

dtran1127 - Friday, August 24, 2018 3:15 PM
Hello:
Thanks in advance. How would you go about normalizing a 300 million record data set with 300 columns ? This is a consumer database with it's demographics.
The consumers are non-duplicates.
Would it make sense to make another 300 million record set normlizing the demographics such as Interests, hobbies etc etc?

300 million consumers?
It's difficult to say without seeing the details but it sounds like a star schema might be what you're looking for. You would have a fact table with 300 columns and 300 million rows. The columns would just be integer Ids with FKs to 300 other dimension tables.

dtran1127 Old Hand Points: 343 More actions · Answer 3

Thanks for that. I will look into star schema. It does look like something that might be better for my type of work. I'm assuming it's because integer id's will run a lot quicker correct?

Jonathan AC Roberts SSCoach Points: 18405 More actions · Answer 4

dtran1127 - Friday, August 24, 2018 4:03 PM
Thanks for that. I will look into star schema. It does look like something that might be better for my type of work. I'm assuming it's because integer id's will run a lot quicker correct?

Yes, integers don't take up much space. You only need to join to the tables relevant to the query you are running. Also, you shouldn't allow nulls on any of the columns. Instead have a negative value which relates to a negative value on the dimension table.