SQLServerCentral Article

NULL Column Analysis


As usual, most of my articles start with a specific task at my company; I am not looking for the articles, the articles find me. Once upon a time (as all those magic stories start) I asked myself a question: "How many columns in the corporate database are not used at all?" Not used means that the value of the column is NULL in every row and was never inserted or updated for any row in the table. For example, the column special_note in the CLIENT table was never used and probably never will be, because the database has a notes table where all notes are stored.

Once such statistical information is obtained, it allows an analysis that may help limit the number of columns in each table, find bugs in an application that is supposed to use the column(s), and inform future database design.

Let's try to develop code for a simple analysis of NULL columns, non-NULL columns, and the number of records in a table. This generic query returns the number of records in a table:

Select count(*) from tablenm

Another generic query shows how many records have a NULL value for a given column:

Select count(*)
from tablenm where colnm is NULL
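
As a side note, both counts can be obtained in a single scan, because COUNT(column) skips NULLs. A sketch using the same placeholder names:

-- One scan: COUNT(*) counts all rows, COUNT(colnm) counts only non-NULL values
select count(*) as tot,
       count(*) - count(colnm) as nulls
from tablenm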

Basically, this is all we need to know to be able to gather statistical data for the database. Of course, the code must be dynamic so it can run in any database with any unknown number of tables and columns. Let's check the next code for one database.

It makes sense to gather information only for the columns that allow NULLs (column isnullable = 1 in the syscolumns table). It may also be worth limiting the end result with some other restrictions. For example, our databases include a few tables with the ERR prefix that are not used for data storage but by the error handler. Such tables can be eliminated from the analysis.
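
To preview which columns qualify before running the full analysis, the metadata query can be run on its own (SQL 2000 system tables, as used throughout this article):

-- Nullable columns of user tables, excluding the ERR_% error-handler tables
select so.name as tbl, sc.name as col
from syscolumns sc
  inner join sysobjects so on so.id = sc.id
where so.type = 'U'
  and so.name not like 'ERR_%'
  and sc.isnullable = 1
order by so.name, sc.name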

First, create a permanent work table in pubs.

Create table pubs..tmp_count
(dbnm varchar(50), tbl varchar(255), col varchar(255), nulls int, tot int)
go

It is possible to create the analysis table in the database we are going to analyze, but a generic table in pubs (or another database) allows analyzing all databases on a server and combines the statistical information into one table. The script below shows how to gather statistical data for one database.

----------------------------------------------------------------------------------
-- begin batch
----------------------------------------------------------------------------------
BEGIN
declare @sql Nvarchar(1000), @minid int, @maxid int,
        @table_n varchar(255), @col varchar(255), @dbnm varchar(50)
declare @tmp1 table (id int identity(1,1), tbl varchar(255), col varchar(255))

set @dbnm = db_name()
truncate table pubs..tmp_count
insert into @tmp1(tbl, col)
select so.name, sc.name
from syscolumns sc
  inner join sysobjects so on so.id = sc.id
where so.name not like 'ERR_%' and so.type = 'U'
  and sc.isnullable = 1
select @minid = 1, @maxid = max(id) from @tmp1
while (@minid <= @maxid)
 begin
   select @table_n = tbl, @col = col from @tmp1
   where id = @minid
   select @sql = ' insert into pubs..tmp_count(dbnm, tbl, col, nulls, tot)'
   select @sql = @sql + ' select ''' + @dbnm + ''', ''' + @table_n + ''', ''' +
          @col + ''', (select count(*) from [' + @table_n + '] where [' + @col + '] is null) '
   select @sql = @sql + ' , (select count(*) from [' + @table_n + '])'
   exec ( @sql )
   set @minid = @minid + 1
 end
END
-----------------------------------------------------------------------------------
-- end batch
-----------------------------------------------------------------------------------

Let’s check the result.

select left(dbnm,10) as dbnm, left(tbl,20) as tbl,
       left(col,20) as col, tot, nulls, tot - nulls as NotNulls
from pubs..tmp_count order by tot - nulls
dbnm       tbl                  col           tot     nulls       NotNulls
---------- -------------------- ------------- ------- ----------- -----------
pubs       discounts            stor_id       3       2           1
pubs       discounts            lowqty        3       2           1
pubs       discounts            highqty       3       2           1
pubs       publishers           state         8       2           6
pubs       stores               stor_name     6       0           6 
pubs       stores               stor_address  6       0           6
pubs       stores               city          6       0           6
pubs       stores               state         6       0           6
pubs       stores               zip           6       0           6 
pubs       publishers           country       8       0           8
pubs       pub_info             logo          8       0           8
pubs       pub_info             pr_info       8       0           8

Editor's Note: Results abbreviated

What is the next step in modifying the script? Of course, to get data points from all user databases. The production server I manage has 200+ databases, and I physically cannot analyze the metadata by running the analysis script against every database by hand.

There are multiple ways to get it done. The simplest is to use the undocumented Microsoft stored procedure sp_msforeachdb, which performs the same action for every database on the server.
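
A minimal illustration of how sp_msforeachdb works: the [?] placeholder in the command string is replaced with each database name in turn.

-- Print the name of every database on the server, one pass per database
exec sp_msforeachdb 'use [?] print db_name()'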

-----------------------------------------------------------------
-- begin batch
-----------------------------------------------------------------
declare @sqlscrt Nvarchar(3000)
truncate table pubs..tmp_count
set @sqlscrt = '
BEGIN
declare @sql Nvarchar(1000), @minid int, @maxid int,
        @table_n varchar(255), @col varchar(255), @dbnm varchar(50)
declare @tmp1 table (id int identity(1,1), tbl varchar(255), col varchar(255))
set @dbnm = db_name()
insert into @tmp1(tbl, col)
select so.name, sc.name
from syscolumns sc
  inner join sysobjects so on so.id = sc.id
where so.name not like ''ERR_%'' and so.type = ''U''
  and sc.isnullable = 1
select @minid = 1, @maxid = max(id) from @tmp1
while (@minid <= @maxid)
 begin
   select @table_n = tbl, @col = col from @tmp1
   where id = @minid
   select @sql = '' insert into pubs..tmp_count(dbnm, tbl, col, nulls, tot)''
   select @sql = @sql + '' select '''''' + @dbnm + '''''', '''''' + @table_n + '''''', '''''' +
          @col + '''''', (select count(*) from ['' + @table_n + ''] where ['' + @col + ''] is null) ''
   select @sql = @sql + '' , (select count(*) from ['' + @table_n + ''])''
   exec ( @sql )
   set @minid = @minid + 1
 end
END '
set @sqlscrt = N'use [?] ' + @sqlscrt
exec sp_msforeachdb @command1 = @sqlscrt
-----------------------------------------------------------
-- end batch
-----------------------------------------------------------

Let’s check the result.

select left(dbnm,25) as dbnm, left(tbl,20) as tbl,
       left(col,20) as col, tot, nulls, tot - nulls as NotNulls
from pubs..tmp_count
order by dbnm, tot - nulls

I do not show the result of the query because it returns a lot of rows. The run time was close to 25 minutes on a server with about 150 databases. In many cases I don't need to analyze all databases, only some of them, so I modified the code above to accept criteria: run the script for specified databases, or exclude a few (for example, the system databases).

---------------------------------------------------------------
-- begin batch
---------------------------------------------------------------
BEGIN
declare @sql Nvarchar(1000), @minid int, @maxid int
     , @table_n varchar(255), @col varchar(255)
declare @minid1 int, @maxid1 int, @dbnm varchar(50)
     , @cmd varchar(8000)
truncate table pubs..tmp_count
declare @tmpdb table (tid int identity(1,1), dbnm varchar(50) )
create table #tmp1 (id int identity(1,1)
  , dbnm varchar(50) null, tbl varchar(255), col varchar(255))
insert into @tmpdb(dbnm)
select name from master..sysdatabases
where name in ('pubs', 'admin')
    --add all databases for analysis or exclude those you don’t need
select @minid1 = min(tid), @maxid1 = max(tid) from @tmpdb
while (@minid1 <= @maxid1)
 BEGIN
   select @dbnm = dbnm from @tmpdb
      where tid = @minid1
   truncate table #tmp1
    select @cmd = 'select so.name, sc.name from ' + @dbnm + '..syscolumns sc
      inner join ' + @dbnm + '..sysobjects so on so.id = sc.id
      where so.type = ''U'' and sc.isnullable = 1'
    insert into #tmp1(tbl,col)
    exec(@cmd)
    select @minid = 1, @maxid = max(id) from #tmp1
    while (@minid <=@maxid)
     begin
       select @table_n = tbl, @col = col from #tmp1 where id = @minid
       select @sql = ' insert into pubs..tmp_count(dbnm, tbl, col, nulls, tot)'
       select @sql = @sql + ' select ''' + @dbnm + ''', ''' + @table_n +
            ''', ''' + @col
            + ''', (select count(*) from ' + @dbnm + '..[' + @table_n + '] where [' +
            @col + '] is null) '
        select @sql = @sql + ' , (select count(*) from ' + @dbnm + '..[' + @table_n + '])'
        exec ( @sql )
        select @minid = @minid + 1
     end
    select @minid1 = @minid1 + 1
 end
drop table #tmp1
END
----------------------------------------------------------------------------------
-- end batch
----------------------------------------------------------------------------------

Another obstacle: if a table is owned by a specific user and permissions are not granted, the script will fail for such tables without reporting it. But in a production environment most (if not all) tables are owned by dbo, so this should not cause any problems.
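
To spot such tables in advance, the owner can be checked in the metadata. A quick sketch against the same SQL 2000 system tables:

-- List user tables not owned by dbo; the dynamic SQL may fail on these
select user_name(uid) as owner, name as tbl
from sysobjects
where type = 'U' and uid <> user_id('dbo')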

If you wish, a stored procedure with various parameters can be created. The same batch can also be used for other types of analysis; for example, you can adapt the query to gather statistics about specific column settings or values.

Conclusion

This analysis is not something you need for your daily job, but it is very helpful for a database architect and/or a database performance review. I was able to show that about 20% of columns had not been used in years, could be eliminated from the old databases, and did not need to carry over into the new database design. That made the tables smaller, the design simpler, and, potentially, database performance better.
