SQLServerCentral Article

Finding Values With Numerous Columns

,

Defining the Problem
A properly normalized table should virtually never result in a situation where it unclear in which column a particular type of value would be. However, both database administrators and developers frequently have to deal with tables which are not normalized. Sometimes, this can be a result of deliberate denormalization which can in some cases have a value. In other cases it can be the result of the way the data is brought into the database, such as what can happen when a spreadsheet is brought into SQL but has not yet been normalized and entered into production. But for one reason or another, it often occurs.

In these cases, it may be impossible to determine which column the data you are trying to locate is held. As an almost canonical example, there may be a need to search for a specific phone number, but the table may be laid out with numerous columns for different types of phone numbers.

The Specific Solution
When a certain piece of data needs to be found from a table with repeating columns, the direct and specific approach is to write a query with a string of "or" statements in the where clause and enumerate every column. For instance:

SELECT 
*
FROM
phonebook
WHERE
phone1 = '234-5678'
OR phone2 = '234-5678'
OR workphone = '234-5678'
OR cellphone = '234-5678'

This may often be the best solution for this situation. It finds the value and permits the developer to precisely specify which columns to check. Unfortunately, it becomes very awkward to implement when that number of columns becomes large.

More General Solutions
Another way of addressing the problem completely is to normalize the data and remove repeating columns. This will completely remove the specific phone number problem discussed, and is probably the best overall solution when it is appropriate.

However, there are times when the data cannot be normalized. Such as when it is already tightly coupled with a data application or other concerns prevent the schema from being altered at all. In other circumstances, the value of deliberately denormalizing the data may be viewed as being greater than the advantages that normalization brings. (Whether or not and when this may be true is far beyond the current topic).

Then there are times when normalizing the data would genuinely not solve the problem. For instance, which relatively unusual, it may be necessary to match a part of a text string without knowing what it represents at all.

For those cases, instead of writing a specific select statement with repeated or clauses, it can be useful to have a general script which will search for the value. For instance:

CCREATE PROCEDURE [dbo].[FindValue] 
@TableName NVARCHAR(128), /* Must be a valid table or view name,
must not be quoted or contain a schema*/
@Value NVARCHAR(4000), /*May contain wildcards*/
@schema NVARCHAR(128) = 'dbo' /*May be left out*/

AS

/*
Sample Execution

Exec FindValue
@TableName = 'spt_monitor',
@Value = '8',
@schema = 'dbo'

*/

/*
If given a string it will finds all rows where any char, varchar, or
their Unicode equivalent which contain that string in the selected
table or view. Note that this only works on objects which have entries
in information_schema.columns, which excludes certain system objects. If
given a numeric value it will check those text types for a match as well
as numeric types. If given a possible date, it will also check date type.
The string that is being searched for may contain wildcard characters such as %.
This will NOT search text, ntext, xml, or user defined fields. This may
return a row more than once if the search string is found in more than one
column in that row.
*/

/**************************** Declare Variables ***********************/
DECLARE @columns TABLE
(ColumnName NVARCHAR(128))

DECLARE @sql NVARCHAR(MAX)

/************************** Populate Table Variable *****************/
/*Takes the names of string type columns for the selected table */
INSERT INTO @columns
(ColumnName)
SELECT
Column_name
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
Table_schema = @schema
AND Table_name = @TableName
AND data_type IN ('char', 'nchar', 'varchar', 'nvarchar')

/* If it is numeric, also check the numeric fields */
IF ISNUMERIC(@value) = 1
INSERT INTO @columns
(ColumnName)
SELECT
Column_name
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
Table_schema = @schema
AND Table_name = @TableName
AND data_type IN ('int', 'numeric', 'bigint', 'money',
'smallint', 'smallmoney',
'tinyint', 'float', 'decimal', 'real')

IF ISDATE(@value) = 1
INSERT INTO @columns
(ColumnName)
SELECT
Column_name
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
Table_schema = @schema
AND Table_name = @TableName
AND data_type IN ('datetime', 'smalldatetime')

/********************* Prepare dynamic SQL Statement to Execute **********/
SELECT
@sql =
CASE
WHEN @sql IS NULL
THEN 'Select ''' + ColumnName
+ ''' as ContainingColumn, * From '
+ QUOTENAME(@Schema) + '.' + QUOTENAME(@TableName)
+ ' where ' + ColumnName + ' like ''' + @Value + ''' '
WHEN @sql IS NOT NULL
THEN @sql + 'UNION ALL Select ''' + ColumnName
+ ''' as ContainingColumn, * From ' + QUOTENAME(@Schema)
+ '.' + QUOTENAME(@TableName)
+ ' where ' + ColumnName + ' like ''' + @Value + ''' '
END
FROM
@columns

/******************* Execute Statement and display results ***********/
--print @sql /* This may be uncommented for testing purposes */
EXEC (@sql)

This has the disadvantage that it may search columns that are not needed, for instance, it will search address fields even if it is given a phone number, and this can both make it return false positives and cause it to take longer than necessary. However, this can be a small price to pay compared to hand writing a select statement with numerous "or" clauses.

Modifying the general solution.
Another thing about the sample procedure is that while it works very well for text values, and is relatively good with certain types of numeric values, it does not work very well with approximate data types such as float and will only match dates if the formatted in the default way and are exact matches.
For broader matches on dates and floats, this can be easily modified to look only at columns with the specified data type and could use "between" to give it a limited amount of flexibility in matching the values.

Another area where the procedure could be modified is to bring it into conformance with SQL Server 2008 by adding the separate date and time data types to it. Also, in some cases rather than storing a procedure for frequent use it is good to instead have the script handy for the rare occasions when it is needed. For that, the create statement can be removed and the parameter can be turned into a declared variable set at the beginning of the script. Then it can be used and discarded without remaining resident in the database itself.

Conclusion

Often, the best way of handling repeating columns is to avoid them entirely in the original design phase. When repeating columns have not been avoid during the original design, the best approach is normally to change the design to eliminate them and normalize the tables. In some cases, it is either impossible or unwise to eliminate repeating columns or there are non-repeating columns that still need to all be searched for a single value. In these cases, there is a choice between hand writing an select statement with "or" conditions or using a script to generate that statement automatically, and when that number of columns that needs to be searched is large, it often makes sense to have that script or procedure readily available.

References:
Wikipedia on Normalization ( http://en.wikipedia.org/wiki/Database_normalization )

Tony Davis on "The Mytho of Overnormalization" ( http://www.simple-talk.com/community/blogs/tony_davis/archive/2008/07/21/63094.aspx )

Resources

Rate

3.65 (26)

You rated this post out of 5. Change rating

Share

Share

Rate

3.65 (26)

You rated this post out of 5. Change rating