Best way to design and apply metadata

  • I read some information regarding metadata for a data warehouse; it was defined as "data about data", which left me a little confused.

    No samples were provided, and there was nothing on how to implement it in a data warehouse.

    Does it help in identifying where the data comes from (e.g. flat files, source systems, etc.)? I hope you can provide some good sample structures, an implementation of metadata, and its significant uses.

    Thank you.

  • In the end, it REALLY depends on what you want to capture.  What data is interesting to you and will save your bacon in the event something seems wrong?  Can you prove the data's accuracy?

    The approach also changes based on experience and requirements.  My general approach to ETL for the data warehouse is to first build staging tables on the DW that are a clone of the live tables at a specific point in time, and then build the reporting data off of the staging tables.  The staging tables contain all of the data from the source plus a column for the current datetime so I can see when they were refreshed, and a column for what SSIS package generated the data, so that if there is a change or bug in it, I can quickly and easily narrow down the problem without having to snoop through hundreds of SSIS packages.
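
    For example, a staging table might look something like this (a minimal sketch - the stg schema and all table/column names here are made-up placeholders, not anything from my actual system):

        CREATE TABLE stg.Widget
        (
            -- columns cloned from the live source table
            WidgetID        int           NOT NULL,
            WidgetName      nvarchar(100) NOT NULL,
            PriceAmount     decimal(19,4) NOT NULL,
            CurrencyCode    char(3)       NOT NULL,
            -- metadata: when the rows were refreshed and which SSIS package loaded them
            LoadDatetime    datetime2(0)  NOT NULL
                CONSTRAINT DF_stg_Widget_LoadDatetime DEFAULT (SYSDATETIME()),
            SSISPackageName nvarchar(260) NOT NULL
        );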

    For the reporting tables, I follow the same approach - they contain a denormalized version of the data built off of the staging tables, and they include a datetime of when they were created as well as what SSIS package loaded them.  If it makes sense, I also include a list of "blame tables" in my reporting tables so it is easy to trace back where the source data came from.  This way if an end user says "this widget's price is wrong.  It should be $a.bc", I can do a quick check and tell them something like "The price is $x.yz due to the widget currency being USD and the report showing things in CAD.  The exchange rate at that time was ABC, which matches up with the value you indicated."
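
    A rough sketch of what I mean (again, the rpt schema and all names are invented for illustration; the currency columns just mirror the example above):

        CREATE TABLE rpt.WidgetPrice
        (
            WidgetID        int           NOT NULL,
            WidgetName      nvarchar(100) NOT NULL,
            PriceCAD        decimal(19,4) NOT NULL,
            ExchangeRate    decimal(19,6) NOT NULL,
            -- metadata: when it was built, which SSIS package built it, and the "blame tables"
            LoadDatetime    datetime2(0)  NOT NULL,
            SSISPackageName nvarchar(260) NOT NULL,
            SourceTables    nvarchar(400) NOT NULL  -- e.g. 'stg.Widget, stg.ExchangeRate'
        );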

    Each staging table is named after its source table, so I don't usually include the source table name in the table itself.  In my scenario, though, all of my ETL is from SQL Server databases; there is no Oracle, flat file, dBase, MySQL, PostgreSQL, etc.  If those were included in my ETL, I would be sure to capture that metadata so I could trace the staging table back to the source of truth.

    Basically, you want to be able to go from the data you present to the end user all the way back to the source of the data so you can prove that what you are presenting is what the data was at that point in time.  If you can't prove it to the end users, they will not trust the data.

    Now as for sample structures, it really depends on what your requirements are.  There is no one "magic" setup that will work for everyone.  MAYBE you aren't using SSIS to capture the data, in which case having a column for the SSIS package doesn't make sense.  MAYBE you don't care about where the data came from because you capture that in the table name (like I do) and it all comes from SQL Server data sources.

    Refresh cycle MAY be interesting to capture as well if you have multiple refresh cycles.
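
    If you do, that could be as simple as one more metadata column, sketched here against the hypothetical staging table above:

        ALTER TABLE stg.Widget
            ADD RefreshCycle varchar(20) NOT NULL
                CONSTRAINT DF_stg_Widget_RefreshCycle DEFAULT ('daily');  -- e.g. 'daily', 'hourly', 'monthly'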

    Basically, you capture the data that is useful and interesting to you and your team.  I would not present the capture timestamp to the end users because they likely don't care.  But maybe they do.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • You are right; things go differently depending on the requirements.  On my current project, we pulled records/tables from different sources, and most of the issues were naming conventions (tables, columns), data formats/templates, and the candidate source of truth (duplicate table contents but different table names).  To address this, we added a consolidation process layer that merges tables with the same data essence into a candidate source table, which entails having a mapping table for value look-ups (adding the name of the origin source table and its source id/natural key).  Even in the dimension-fact conversion, I add columns for the reference key and reference table for tracking purposes (can this be considered metadata?).

    Most of my data processing is done in stored-procedure scripts (conversion, consolidation, mappings).  I haven't implemented a log mechanism for each process, but you have given me the idea that it should be included for better tracking/tracing purposes.  I just don't want to be surprised by the client asking for the availability of metadata for their data warehouse (because my number 1 issue is: am I doing it right?).
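
    To illustrate the mapping table idea, something like this minimal sketch (the map schema and all names below are hypothetical, not the actual project's):

        CREATE TABLE map.ConsolidatedSource
        (
            ConsolidatedKey  int           NOT NULL,  -- key in the merged candidate source table
            SourceTableName  sysname       NOT NULL,  -- which origin table the row came from
            SourceNaturalKey nvarchar(100) NOT NULL,  -- the source id/natural key in that table
            LoadDatetime     datetime2(0)  NOT NULL
                CONSTRAINT DF_map_ConsolidatedSource_LoadDatetime DEFAULT (SYSDATETIME())
        );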
