Once in a while most of us get a task to "get some data out of some database". I don't know if it's just me, but I keep asked to work with databases I know nothing about. I usually have no documentation and no knowledge about the data model, naming conventions, nothing much at all. Usually I only have some vague information about the things I am looking for, e.g. "all customer's data from our legacy web shop database" in which case I already expect to find some usual customers and orders related tables, like Customers, Companies, Countries, Orders, etc. However one can easily be lost among all the tables and columns.
So in order to make some starting point from which I could learn more about these databases, I've made three little stored procedures, each getting me more detailed information about the objects I'm exploring. I imagine these things might be more elegant to do in other tools, but I only have SQL Server 2000 at my disposal.
The idea is this: you have a database, but don't know which tables to look at, so first try to find the interesting ones (or at least narrow your candidate list). When you know which tables to analyze, then you need to know if these tables are somehow linked to others, so you check dependencies on them - as much as you need to (which might not always be very straightforward or easy using queries). Then you would like to get a feeling about which columns are most likely be worth checking out, so you try your luck finding similarly named columns.
The results will of course very much depend on each situation, but since the Items table usually has some columns named like %item%, then it might be worth checking the tables with similar names. During the process, a list of candidate tables is to be analyzed: how big, what columns, how many distinct values etc. While not always 100% accurate, the figures given are enough for the purpose - which is what matters anyhow. At the end, there will be some manual work with scrolling through the results, but nothing of a overkill.
Since we are looking for tables which are somewhat central to the database, where all the"good data" lies, have lots of rows and lots of dependencies, we can guess that such tables will have at least a few similarly named columns. They are good candidates to review. If I'm looking at a standard Orders table, there will most likely be a column named something very close to [Customer No.]. Exactly the same column name will then probably be used elsewhere in the database, everywhere a customer is important. And that is what we're looking for in the first place.
Big tables and dependencies
The first procedure is an already known BigTables procedure, based on the system procedure sp_spaceused. I've modified and slightly expanded Bill Graziano's code. This procedure lists top tables by their size (row count, space reserved/used and index size), while also displaying the number of dependant objects of each table. This is providing an important hint about the table, because "big tables" might only be some kind of dumps of data, let's say images (lots of disk space), or some tally tables (lots of rows) etc. But if I know that one particular table is used in some views or stored procedures, I can more accurately narrow my focus.
--List big tables-- CREATE PROCEDURE spBigTables As DECLARE @id int DECLARE @pages int DECLARE @used dec(15) DECLARE @tTableName sysname CREATE TABLE #spt_space ( objid int not null, rows int null, reserved dec(15) null, data dec(15) null, indexp dec(15) null, unused dec(15) null, dependants int null ) CREATE TABLE #tDepends (oType smallint, oobjname sysname, oowner varchar(50), osequence smallint) --loop through user tables DECLARE c_TABLEs CURSOR STATIC FORWARD_ONLY READ_ONLY FOR SELECT ID FROM sysobjects WHERE xtype = 'U' OPEN c_TABLEs FETCH NEXT FROM c_TABLEs INTO @id WHILE @@FETCH_STATUS = 0 BEGIN /* : from sp_spaceused */ INSERT INTO #spt_space (objid, reserved) SELECT objid = @id, SUM(reserved) FROM sysindexes WHERE indid in (0, 1, 255) AND id = @id SELECT @pages = SUM(dpages) FROM sysindexes WHERE indid < 2 AND id = @id SELECT @pages = @pages + ISNULL(SUM(used), 0) FROM sysindexes WHERE indid = 255 AND id = @id UPDATE #spt_space SET data = @pages WHERE objid = @id SET @used = (SELECT SUM(used) FROM sysindexes WHERE indid in (0, 1, 255) AND id = @id) UPDATE #spt_space SET indexp = @used - data WHERE objid = @id UPDATE #spt_space SET unused = reserved - @used WHERE objid = @id UPDATE #spt_space SET rows = i.rows FROM sysindexes i WHERE i.indid < 2 AND i.id = @id AND objid = @id --You will receive the error below which is from sp_msdependencies using dump tran. -- Server: Msg 3021, Level 16, State 1, Line 1 -- Cannot perform a backup or restore operation within a transaction. -- Server: Msg 3013, Level 16, State 1, Line 1 -- BACKUP LOG is terminating abnormally. SET @tTableName = (SELECT name FROM sysobjects WHERE id=@id) INSERT INTO #tDepends EXEC sp_MSdependencies @tTableName, null, 1315327 UPDATE #spt_space SET dependants = (SELECT COUNT(*) FROM #tDepends) --USE this if you rely on sysdepends table or don't want errror from sp_MSdependencies, but covers less dependencies. -- SET dependants = ( -- SELECT COUNT(DISTINCT o.name) -- FROM sysobjects o, master.dbo.spt_values v, sysdepends d -- WHERE o.id = d.id and o.xtype = SUBSTRING(v.name,1,2) COLLATE database_default and v.type = 'O9T' AND d.depid = @id and deptype < 2 -- ) WHERE objid = @id TRUNCATE TABLE #tDepends FETCH NEXT FROM c_TABLEs INTO @id END SELECT TOP 25 TABLE_Name = (SELECT LEFT(name,25) FROM sysobjects WHERE id = objid), rows, reservedKB = STR(reserved * d.low / 1024.) + ' ' + 'KB', dataPrcnt = STR(data / nullif(reserved,0) * 100) + ' %', indexPrcnt = STR(indexp / nullif(reserved,0) * 100) + ' %', unusedPrcnt = STR(unused/nullif(reserved,0) * 100) + ' %', dependants FROM #spt_space, master.dbo.spt_values d WHERE d.number = 1 AND d.type = 'E' --page size ORDER BY rows DESC DROP TABLE #spt_space CLOSE c_TABLEs DEALLOCATE c_TABLEs GO --/- /*Ignore Msgs 3021, 3013; check the results EXEC spBigTables
Sample results are on the picture below:
Note: while right-clicking on a table in Enterprise Manager or Management Studio and displaying dependencies is the most straightforward thing to do to check dependant objects, doing so with a query in SQL 2000 is not so elegant. There's a system stored procedure sp_depends which uses sysdepends table and a much more revealing, but undocumented procedure sp_msdependencies, which consists of a pile of code I'm not even trying to grasp, so I'm just using sp_msdependencies to fill a temporary table. Unfortunately, this might trigger some not really harmful errors, so I also included a snippet for using sp_depends, just in case you don't like to see any errors whatsoever). I hate to say it, but: ignore the errors and review query results.
Searching for similarly named columns
The purpose of the second procedure is finding tables with similar column names. It lists all columns of a given table, avoiding some data types (images, timestamps etc.), and not minding the columns not used in indexes in linked tables. That is for narrowing our results. If you don't get enough rows with potentially linked tables, you can try commenting out the line with the join on sysindexes table and/or trying partial matches to column names with %s.
CREATE PROCEDURE spColumnSearch (@tTableName VARCHAR(150)) As SELECT name INTO #tmpCols FROM syscolumns WHERE 1=1 AND id = (SELECT id FROM sysobjects WHERE name=@tTableName and xtype='U') AND xtype not in (34, 35, 99, 189) --skipping image, text, ntext, timestamp SELECT c.name As Column_name, (SELECT data_type FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name=o.name AND column_name=c.name) As Column_type, o.name As Table_name, (SELECT isnull(rowcnt,0) FROM sysindexes WHERE id = (SELECT id FROM sysobjects WHERE name=o.name and xtype='U') AND indid IN(0,1)) As RowsCount FROM sysobjects o, syscolumns c, #tmpCols tc WHERE o.id=c.id and o.xtype='U' and c.name=tc.name --only columns, which are part of some index: and c.id in (select k.id from sysindexkeys k, sysindexes x where k.id=c.id and k.colid=c.colid and k.indid=x.indid and (x.status & 64) = 0 and c.id=x.id) ORDER BY c.name, o.name DROP TABLE #tmpCols GO --/-- /*Example for table Orders: EXEC spColumnSearch 'Orders'
Sample results are on the picture below:
Note: also check foreign keys, of course.
Narrowing on a table level
The third procedure returns some handy data about one chosen table, which might be of value especially if you would like to cancel out empty or almost empty columns (you've seen your tables with few hundred columns, most of those not really used). It provides data about the number of indexes, dependencies (again, review query results, ignore possible errors from sp_msdependencies), rows, columns and distinct entries in a column.
CREATE PROCEDURE spTableQuality (@tTableName VARCHAR(150)) As DECLARE @tTableID int SET @tTABLEID = (SELECT id FROM sysobjects WHERE name=@tTableName and xtype='U') SELECT name INTO #tmpCols FROM syscolumns WHERE 1=1 AND id = @tTABLEID AND xtype not in (34, 35, 36, 99, 189) --skipping image, text, uniqueidentifier, ntext, timestamp CREATE TABLE #tDepends (oType smallint, oobjname sysname, oowner varchar(50), osequence smallint) INSERT INTO #tDepends EXEC sp_MSdependencies @tTableName, null, 1315327 SELECT @tTableName As TableName, (SELECT COUNT(*) FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name=@tTableName) As ColumnsCount, (SELECT COUNT(*) FROM sysindexes si WHERE si.id = @tTableID and (si.status & 64) = 0) As IndexesCount, (SELECT COUNT(*) FROM #tDepends) As DependenciesCount, --Use this if you don't like errors from sp_msdependencies, but might show smaller number of dependant objects. -- (SELECT COUNT(DISTINCT o.name) -- FROM sysobjects o, master.dbo.spt_values v, sysdepends d -- WHERE o.id = d.id and o.xtype = SUBSTRING(v.name,1,2) COLLATE database_default and v.type = 'O9T' and d.depid = @tTableID and deptype < 2) As DependenciesCount, (SELECT ISNULL(rowcnt,0) FROM sysindexes WHERE id = @tTABLEID AND indid IN(0,1)) As RowsCount CREATE TABLE #tTABLEQuality ( Column_Name sysname, Column_Type varchar(50), Count_Distincts int ) DECLARE @tColName VARCHAR(255) DECLARE cCols CURSOR STATIC FORWARD_ONLY READ_ONLY FOR SELECT name FROM #tmpCols OPEN cCols FETCH NEXT FROM cCols INTO @tColName WHILE @@FETCH_STATUS = 0 BEGIN EXECUTE( 'INSERT INTO #tTABLEQuality (Column_Name, Column_Type, Count_Distincts) SELECT t.name, (SELECT data_type FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name='''+@tTableName+''' AND column_name='''+@tColName+'''), COUNT(distinct isnull([' + @tColName + '],0)) FROM [' + @tTableName + '], #tmpCols t WHERE t.name=''' + @tColName + ''' GROUP BY t.name') FETCH NEXT FROM cCols INTO @tColName END CLOSE cCols DEALLOCATE cCols SELECT * FROM #tTABLEQuality order by Count_Distincts asc, Column_Name DROP TABLE #tTABLEQuality DROP TABLE #tmpCols DROP TABLE #tDepends GO --/- /*Example for table KommTran: EXEC spTableQuality 'KommTran'
Sample results are on the picture below:
Note: with the results here one can easily get a sense of what's the deal with the table "KommTrans". Everything's stored in T_DATE field. Other columns are most likely of little use (regarding the content-oriented user).
It might take few seconds for all this to finish on let's say hundreds of tables, but be aware when dealing with larger numbers, as it might be a good idea to fine tune the queries to your situation (I've run it on 200 GB database with 30,000 tables: it took 5 hours to check dependencies!) - more precisely: re-think how you want to check object dependencies. These procedures can be something to start from, but they were valuable for me a couple of times so far. I've done this in SQL 2000 - it works on 2005 also, but there might be easier ways to do this in SQL 2005. One funny note: sp_msdependencies is behaving quite well in SQL 2005, at least in my experiments I got no errors.