Revenge of the BLOB

,

Binary objects and unstructured data

In the past, my opinion was that a BLOB or Binary Large Object really had no

place in a relational database. My

experiences with these image & binary data types are that they are slow

performing, and a pain to work with in T-SQL. Any operations working with these BLOBs were usually performed in a separate business layer outside of SQL. This suggested to me it made more sense to put things like files or images where they were designed to be placed - in the file system.

BLOBs tax relational database performance, and though there are many ways to

mitigate this I still believe it is better to have a pointer to the data rather

than storing the data itself. However, sometimes you cannot get away from using a BLOB, or inheriting it from another architecture. The interesting thing about BLOBs is that they are slowly taking over the database world.The offset of this is that more tools are being provided to allow easier manipulation of this data.

This article will describe some of the new features of SQL Server 2005 and SQL Server 2008, and provide an example of pulling a BLOB out of an image column in SQL Server 2005.

Bacronyms and marketing speak

In my last article, Why SQL Server is a 4-letter word, I discussed acronyms. The term blob is a bacronym. This is an acronym that was constructed after-the-fact to cover up that it really was named after "The Blob that ate Cincinnati." The person who first applied the name Blob to a database object was Jim Starkey. He tells a hilarious Dilbert-ish story about the origins of the term here, and how marketers like acronyms.

Acronyms are professional and great for marketing buzz, yet they feel a bit impersonal to me. Would you rather read about the rich-media framework from Microsoft called WPF/E or a cool new technology called Microsoft Silverlight? Which letters makes you feel warmer and fuzzier when they are put

together,  I-B-M or A-p-p-l-e?

That's a lot of rows

A problem I had recently was with a 3rd-party financial system. The vendor provides a proprietary database with a table containing over 300,000 records, 2.8GB of data, and one tiny clustered index on a unique identifier column. Of the 16 columns, it is a column of IMAGE type which stores the most data. Digging further into the column, it is storing a binary representation of 5000 4-byte float or money data type values. New tables are created after each execution of the tool with a unique identifier appended to the table name. The customer would like have all of the data available for querying at

any time after the table is loaded. Since this is 1,665,000,000 unfiltered rows per day there could be a problem with scalability and performance if the data is staged.

Not a lot for a cube

Developing an Analysis Services 2005 cube to aggregate the data would be the

recommended approach in this situation, since it could handle this scale of data and the combinations of queries given to it.However, the customer is most comfortable querying subsets of the data in SQL Server, and the timeframe does not allow for an OLAP-based approach.The interim

solution is to build a parameterized stored procedure and table function in SQL

Server 2005 to de-BLOB the BLOB and return a subset of the rows.

The T-SQL approach

In SQL Server 2000, there is an SDK utility called TextCopy that can manipulate

a BLOB within an image column.

However this is outside the SQL Server environment. In SQL Server 2005 you can use the Import Column data flow task in Integration Services. Again, this is outside of the Query Analyzer window.  Within SQL Server, you can use the new VARBINARY(MAX) data type to convert the IMAGE data type to something workable.

Size up the Data

The first item of business is determining the size of the data column. 

Since I know based on legacy code that the byte size is 4, I would like

to see the number of records stored in each column. 

The function to use here is DATALENGTH.

SELECT convert(varbinary(MAX),value) as ColumnValue
 ,DATALENGTH(value) as Length
 , DATALENGTH(value) /4as 
RecordCount
FROM [BigBlobTable]

Dividing the number by 4 gives us 5000 records in the column.

The Deblobber Table Function

I am still looking for a good way to remove the WHILE function, and the Table

Function too, however this does the trick for now. 

The Deblobber Code

CREATE FUNCTION [dbo].[Deblobber] ( @blob AS VARBINARY(MAX) )
RETURNS @Values TABLE ( Value MONEY )
AS 
BEGIN
DECLARE @length AS INT
 
DECLARE @i AS INT
SET @i = 1
SET @length = DATALENGTH(@blob)
WHILE @i < @length
BEGIN
INSERT  INTO @Values
SELECT  CAST(SUBSTRING(@blob, @i, 4) AS MONEY)
SET @i = @i + 4
END
 
 RETURN 
END

The Deblobber table function is used in the stored procedure below. Names have been changed to protect the innocent. In no way should these names be used in a professional environment.

The Select Statement Code

CREATE PROCEDURE [dbo].[GetBigBlob] (@TableID AS VARCHAR(MAX) = '17627', @PickListID 
AS VARCHAR(MAX) = '12021855,12233944')
AS
-- Create a link to the table BigBlobTable_17627
IF EXISTS ( SELECT  name
FROM    sys.synonyms
WHERE   Name = N'BigBlob' ) 
 DROP SYNONYM N'BigBlob'
EXEC ( '
create synonym N'BigBlob for N'BigBlobTable_' + @TableID + ']' )

-- Turn off row counts for performance
SET nocount ON;
-- Define a Common Table Expression (CTE)
WITH BigBlobCTE as ( 
-- Define an auto-incrementing row identifier
SELECT    ROW_NUMBER() OVER ( ORDER BY [UniqueID] ASC ) AS 
CounterID,
                    
CONVERT(VARBINARY(MAX), value) AS ColumnValue           
FROM      BigBlob
-- Use COALESCE function to ignore NULL parameter values
-- Use Split function to parse comma-delimited list of parameters
          
WHERE     (CHARINDEX(',',@PickListID ) = 0 AND COALESCE(@PickListID 
, PickListID ) = UniqueID)  OR (CHARINDEX(',',@PickListID ) > 0 AND 
UniqueID IN (SELECT value FROM dbo.fnSplit(@PickListID,',')))
)
-- Return values from the CTE
SELECT  resultValues.rowid,
      
resultValues.PnLValue,

     BigBlobCTE.CounterID         [positionUID],  FROM   BigBlobCTE -- JOIN to the User-Defined Table Function CROSS APPLY dbo.Deblobber(BigBlobCTE.ColumnValue) AS resultValues

The Results

The procedure above should, for each blob, return a table, union these tables

together, and return them as a result set with the original data row.

The key SQL Server 2005 features demonstrated in this stored procedure are:

  • Synonyms – Friendly aliases for objects
  • Common Table Expressions (CTEs) – WITH syntax presents a

    table expression above the SQL SELECT statement rather than nested below in

    a derived table. Used for easier readability and reuse of the expression.

  • User-Defined Table (UDT) function Joins - CROSS APPLY

    syntax allows UDTs to be joined to other tables.

  • Dynamic Row Numbering - ROW_NUMBER() syntax provides a

    dynamic row count over a given sort order.

  • Large Binary data type - VARBINARY(MAX) removes 8000

    character limitation of SQL 2000 and allows for manipulating Large Binary

    Objects (LOBs) within SELECT statements.

Problems

There are a few potential issues with this solution.

  •  Concurrency

    issues with the synonym could allow this procedure to run only once at a

    time

  • Pulling the data for a large number of records still

    poses a performance issue

 

Dynamic SQL could solve the concurrency issue, however it was determined that it

will be run by a single user on an ad-hoc basis so it is not a priority.

 

Performance issues should be mitigated as there is no longer a need to pull all

data.Instead, data can be joined

and filtered.

 

Looking forward to SQL Server 2008

Microsoft is offering some potentially great enhancements to working with BLOB

data in SQL Server 2008, including the following:

  • Integrated Full Text Search - no more service, it's

    built-in.

  • Sparse Columns - providing the ability to create millions

    of BLOB columns.

  • New Index Types - Increasing the performance of querying.
  • Data compression - Shrinking the size and providing

    various storage alternatives.

  • And last but not least, a

    filestream datatype in SQL 2008.

For more information on these new enhancements, take a look at the white paper

released in August, 2007, entitled

Managing Unstructured Data with SQL Server 2008. 

Could it be that the SQL

Storage Team borrowed some unstructured data ideas from someone on the

OneNote Team?

Author's Profile

Andrew Sears is a Senior

Solutions Consultant and Microsoft Certified Technology Specialist: Business

Intelligence Developer (MCITBID) at T4G Limited, with over 10 years experience

in the Analytics group implementing Reporting, Data Warehousing and Business

Intelligence Solutions.

T4G Limited

is a leading

full-service, project-based technology services company and a member of the

Microsoft Gold Certified Partner Excellence Program.

Andrew currently maintains

various blogs on topics including

Silverlight,PerformancePoint,

and

SQL Server

Business Intelligence Solutions.

References

Rate

3.1 (10)

Share

Share

Rate

3.1 (10)