Today I’m going to continue discussing the data dictionary tables we’ve set up at the Boston Public Schools. The SQL for creating the tables and inserting the baseline example records is in my previous data dictionary post. The data inserted uses a few tables in the AdventureWorks database, and I’ll continue with this example.
At BPS, there were well over a hundred TableInfo records and several thousand DataInfo records that needed to be updated. We started with documenting the tables – here’s a SQL script for getting the necessary update statements:
-- Script to get the update statements for all tables: SELECT ' update TableInfo set tableGroup = '''', description = ''' + description + ''' where tablename = ''' + tablename + '''' FROM TableInfo -- Output from running the above script: UPDATE TableInfo SET tableGroup = 'Location', description = 'Address' WHERE tablename = 'Address' UPDATE TableInfo SET tableGroup = 'Person', description = 'Contact' WHERE tablename = 'Contact' UPDATE TableInfo SET tableGroup = 'Location', description = 'StateProvince' WHERE tablename = 'StateProvince'
We spent a few days focusing on building this out as any data analysis effort will begin at the table level before the field level. At Boston Public Schools, we divided tables into a number of different groups, including Registration, Assignment, Transportation, Enrollment, Attendance, Assessment, Administrative, and a handful of others. The groupings will help identify what tables to focus on when we’re working on a particular functional area for integration into the new Student Information System. For the description column, we generally provided a 1-2 sentence high-level description of what the table is.
After finalizing the table dictionary, we started focusing on the individual fields in these tables. Here’s how we got the update statements for DataInfo along with some example output:
-- SQL to get update statements: SELECT ' update DataInfo set [description] = ''' + columnname + ''', tableref = '''', lutyperef = '''', comments = '''' where tableid = ' + CAST(ti.tableid AS VARCHAR(5)) + ' and columnname = ''' + columnname + '''' + ' -- Group: ' + tableGroup + ', Table: ' + tablename FROM DataInfo di INNER JOIN TableInfo ti ON ti.tableid = di.tableid WHERE di.[description] = columnname AND ti.tablename NOT IN ( -- This check prevents getting update statements for tables that have already been updated in some way. SELECT tablename FROM DataInfo d INNER JOIN TableInfo t ON d.tableid = d.tableid WHERE [columnname] <> d.[description] GROUP BY tablename HAVING COUNT(*) > 0 -- This threshhold can be set higher if desired. ) ORDER BY tableGroup, ti.tablename, columnorder -- first 2 update statements for table Address: UPDATE DataInfo SET [description] = 'AddressID', tableref = '', lutyperef = '', comments = '' WHERE tableid = 1 AND columnname = 'AddressID' -- Group: Uncategorized, Table: Address UPDATE DataInfo SET [description] = 'AddressLine1', tableref = '', lutyperef = '', comments = '' WHERE tableid = 1 AND columnname = 'AddressLine1' -- Group: Uncategorized, Table: Address -- first 2 update statements for table Contact: UPDATE DataInfo SET [description] = 'ContactID', tableref = '', lutyperef = '', comments = '' WHERE tableid = 2 AND columnname = 'ContactID' -- Group: Uncategorized, Table: Contact UPDATE DataInfo SET [description] = 'NameStyle', tableref = '', lutyperef = '', comments = '' WHERE tableid = 2 AND columnname = 'NameStyle' -- Group: Uncategorized, Table: Contact
Notice that there are 4 values being updated: description, tableref, lutyperef, and comments. Here’s some information about each:
Have we finished the documentation process at Boston Public Schools? No, it is still ongoing, although at this point only lower priority tables remain. So far it has been a very interesting exercise – I’ve learned a good deal about the data and table groupings that I didn’t know previously.