RE: mapping an Excel Spreadsheet to a Table

November 19, 2017 at 6:31 pm

#1968710

etl2016 - Sunday, November 19, 2017 3:31 PM
Hi
I have a client who will send me an Excel spreadsheet and I have to load that to a table.
1) This Excel spreadsheet will be sent only 2 or 3 times a year, not daily.
2) It contains under 5k records.
3) This table will be used as a Reference Lookup table by the programs that need it.
4) This table needs to maintain history of changes, should there be any.
Below is the model I am going to design, please share your thoughts if anything can be added to make it more efficient.
1) Design it as SCD-2, with current-ness identifier column
INSERTs
---------
2) Nominate best possible Excel column as Natural Key, and rest of the columns are used for Change Detection
3) Any NK that is present in the Excel but not in my table will be used for INSERTs
UPDATEs
----------
4) Any row in Excel with its NK that is ALREADY present in my table, will be recognised as an UPDATE, if any of the non-NK columns has changed in comparison with what is in my table
PK-FK Referential Integrity
------------------------------
5) Since the scope of this table is a mere Reference Lookup table, it is NOT functionally related to any other tables (other than for lookup purposes). So, Foreign Key relations can be ignored.
DATA QUALITY
----------------
6) Since this is a hand-written Excel spreadsheet, there is a high chance of data quality issues. I am getting a consensus from the source about the possibility of errors and NULLs/Blanks. I am going to use CHECK constraints to enforce Data Quality while loading to the table.
Please share your thoughts adding-to or correcting the above list.
thank you

If you're going to use Type 2 SCDs, then DON'T use a "current-ness identifier column". You should have two columns to identify the start and end dates of when the row was valid. The start date should be when the row (identified by some key) is first inserted. The end date should be the start date of the next row for the same key, when that key next appears changed in the spreadsheet, OR it should be CONVERT(DATETIME,'9999') for the "current" (latest) row so you don't have to muck around with double-checks in the criteria you'll eventually need to write against the table.
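As a rough sketch of what that could look like (the table name, column names, and data types below are made up for illustration and are not from the original post), here is a minimal Type 2 table with start and end dates and an open-ended end date for the current row. CONVERT(DATETIME,'9999') evaluates to 9999-01-01 in SQL Server.

```sql
-- Hypothetical Type 2 lookup table; every name here is illustrative only.
CREATE TABLE dbo.RefLookup
(
    NaturalKey VARCHAR(50)  NOT NULL,   -- the column nominated as the natural key
    Attribute1 VARCHAR(100) NOT NULL,   -- stands in for the non-key columns
    StartDate  DATETIME     NOT NULL,   -- when this version became valid
    EndDate    DATETIME     NOT NULL    -- when the next version starts
        CONSTRAINT DF_RefLookup_EndDate DEFAULT (CONVERT(DATETIME,'9999')),
    CONSTRAINT PK_RefLookup PRIMARY KEY CLUSTERED (NaturalKey, StartDate),
    CONSTRAINT CK_RefLookup_DateOrder CHECK (StartDate < EndDate)
);
GO

-- Current rows only: a single equality test, no flag column needed.
SELECT NaturalKey, Attribute1
FROM   dbo.RefLookup
WHERE  EndDate = CONVERT(DATETIME,'9999');

-- Rows as of any point in time: the same criteria work for current
-- and historical rows because EndDate is never NULL.
DECLARE @AsOf DATETIME = '20170601';

SELECT NaturalKey, Attribute1
FROM   dbo.RefLookup
WHERE  StartDate <= @AsOf
  AND  EndDate   >  @AsOf;
```

Because the end date of one version equals the start date of the next, the intervals are closed on the start and open on the end, so a given point in time matches at most one row per key.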

Obviously, when you end up with a new row for a given key, you'll need either the loading process or a trigger to find and update the end dates of the existing rows. Despite the coming objections of some of my peers, I recommend the trigger method so that someone else can't screw things up for you. Done correctly, it will be as fast as or faster than a separate chunk of code in your import process.
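A minimal sketch of such a trigger follows. The table, trigger, and column names are hypothetical, and it assumes each load inserts at most one new row per key with a StartDate later than the existing rows for that key. Treat it as a starting point, not a finished implementation.

```sql
-- Minimal hypothetical table so the sketch runs on its own.
CREATE TABLE dbo.RefLookup
(
    NaturalKey VARCHAR(50)  NOT NULL,
    Attribute1 VARCHAR(100) NOT NULL,
    StartDate  DATETIME     NOT NULL,
    EndDate    DATETIME     NOT NULL
        CONSTRAINT DF_RefLookup_EndDate DEFAULT (CONVERT(DATETIME,'9999')),
    CONSTRAINT PK_RefLookup PRIMARY KEY CLUSTERED (NaturalKey, StartDate)
);
GO

CREATE TRIGGER dbo.trg_RefLookup_CloseOut
ON dbo.RefLookup
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Set-based close-out: for every key that just received a newer row,
    -- set the EndDate of the previously open-ended row to the new StartDate.
    -- Assumes one new row per key per INSERT statement.
    UPDATE prev
    SET    prev.EndDate = i.StartDate
    FROM   dbo.RefLookup AS prev
    JOIN   inserted      AS i
           ON  i.NaturalKey = prev.NaturalKey
           AND i.StartDate  > prev.StartDate
    WHERE  prev.EndDate = CONVERT(DATETIME,'9999');
END;
GO

-- Usage: the second insert closes out the first row for key 'A1'.
INSERT INTO dbo.RefLookup (NaturalKey, Attribute1, StartDate)
VALUES ('A1', 'Old value', '20170101');

INSERT INTO dbo.RefLookup (NaturalKey, Attribute1, StartDate)
VALUES ('A1', 'New value', '20171119');

SELECT NaturalKey, Attribute1, StartDate, EndDate
FROM   dbo.RefLookup
ORDER BY NaturalKey, StartDate;
```

After the two inserts, the 'Old value' row has an EndDate equal to the StartDate of the 'New value' row, and the 'New value' row keeps the open-ended 9999 date.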

As for data quality, I recommend NEVER loading data from a spreadsheet directly into the final table. Always use a staging table to pre-validate the data.
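For example, a staging table might accept everything as nullable character data so the load itself never fails, with the checks run as queries afterwards. The names and the two checks below are assumptions for illustration; the real rules depend on what the client's spreadsheet contains.

```sql
-- Hypothetical staging table: wide, nullable columns so any spreadsheet
-- row can land here before it is validated.
CREATE TABLE dbo.RefLookup_Staging
(
    RowNum     INT IDENTITY(1,1) NOT NULL,  -- preserves the order rows were loaded in
    NaturalKey VARCHAR(200) NULL,
    Attribute1 VARCHAR(200) NULL
);
GO

-- Check 1: rows with a NULL or blank value in a required column.
SELECT RowNum, NaturalKey, Attribute1
FROM   dbo.RefLookup_Staging
WHERE  NULLIF(LTRIM(RTRIM(NaturalKey)), '') IS NULL
   OR  NULLIF(LTRIM(RTRIM(Attribute1)), '') IS NULL;

-- Check 2: natural keys that appear more than once in the same file.
SELECT NaturalKey, COUNT(*) AS Occurrences
FROM   dbo.RefLookup_Staging
GROUP BY NaturalKey
HAVING COUNT(*) > 1;
```

If either query returns rows, the file can go back to the client before anything touches the final table.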

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.