SQLServerCentral Article

Owning the Spelling Suggestion Feature

,

There are a couple of barriers in the way of building your own spelling suggestion feature. One, you need data:. you need some kind of look-up table populated with words. Two, one needs some components that use the look-up to evaluate input and offer some user interfaces. I suspect that's why people head straight for the third-party solution. I'd like to comment on these barriers and help break them down some.

What if the user interface were nothing more than a comment like "Did you mean <spell corrected word>?" marked up on a web page? There goes the interface barrier. That kind of interface is probably within reach of many developers. It's also pretty common and therefore practical. It could be very useful on an e-commerce product search page, for instance. However, what's also needed is the look-up table and the data. Is that insurmountable?

Some will find a data provider such as a web service, sign up, sigh relief and move on. But wait. Maybe the world would be a better place if more of us were able to become our own spelling suggestion data provider. If everybody uses an outside Web service and it goes down, the feature stops working all across the land. Like oil, maybe the price would drop if we weren't so reliant on outside providers. Perhaps a provider doesn't offer some of the technical and slang terms used in a particular industry.

For fun and/or profit, let's explore how to create the spelling suggestion dataset.

The first step is to create a look-up table. I created a table similar to this:

---------------------------
-- Create Spelling Look-up Table
---------------------------
CREATE TABLE [tblSpellingLookup] (
[Misspelling] [varchar] (100) ,
[Correction] [varchar] (100))

The next step is to gather the raw materials: words. One might get these from product descriptions. A description is a string of words that can be parsed into a set of records using a set-based operation. I drew my inspiration for this operation by reading a great article by Robyn Page and Phil Factor: http://www.simple-talk.com/sql/t-sql-programming/the-helper-table-workbench/

Robyn suggests using a helper table for set-based operations which is composed of nothing more than records having a sequence of numbers from 1 on up.

The helper table can look like this:

---------------------------
-- Create Helper Table
---------------------------
CREATE TABLE [dbo].[tblNumbers](
[number] [int])
-- Populate the helper table with a sequence of numbers:
DECLARE @size
SET @size = 10000
DECLARE @ii
SET @ii = 1
WHILE (@ii<=@size) 
BEGIN
INSERT INTO NUMBERS(NUMBER) SELECT @II
SELECT @II=@II+1
END
END

Then some T-SQL can break apart a product description.

---------------------------
-- Collect Words
---------------------------
DECLARE @description VARCHAR(255)
SELECT @description= 'this is a great product'
SELECT SUBSTRING(@description+' ', number, 
CHARINDEX(' ', @description+' ', number) - number)
FROM Numbers
WHERE number <= LEN(@description)
AND SUBSTRING(' ' + @description,
number, 1) = ' '
ORDER BY number RETURN

This yields a set of records:
this
is
a
great
product

This step will perhaps also involve cleaning out HTML tags, punctuation and so forth.

I found that product descriptions were mostly spelled correctly. However, I did take an additional step to ensure proper spelling with the help of Microsoft Office COM programming. I suspect there's even better approaches out there.

Getting the correctly spelled words to populate my [Correction] column was pretty straightforward. What about the misspellings one may ask? This was the exciting part for me. Common misspellings can be derived programatically from the correct spellings using T-SQL and the helper table used earlier.

There are four main misspelling types that can be derived programatically:
1. Missing letter. Example: "and" misspelled "nd."
2. Double-typed letter. Example: "and" misspelled "aand."
3. Transposed letters. Example: "and" misspelled "nad."
4. QWERTY keyboard wrong letter: "and" misspelled "anf."

Wouldn't it be cool to develop T-SQL that returns every misspelling of a word from these misspelling types? Let's start with missing letter.

---------------------------
-- Missing Letter
---------------------------
DECLARE @word VARCHAR(255)
SELECT @word= 'and'
SELECT
CASE number
WHEN 1 THEN
SUBSTRING(@word,2,LEN(@word)-1)
ELSE
SUBSTRING(@word,1,number - 1) + SUBSTRING(@word,number + 1, LEN(@word))
END
FROM Numbers
WHERE number <= LEN(@word)
ORDER BY number

This yields:
nd
ad
an

Let's move on to double-typed letter.

---------------------------
-- Double-typed Letter
---------------------------
DECLARE @word VARCHAR(255)
SELECT @word= 'and'
SELECT
CASE number
WHEN 1 THEN
SUBSTRING(@word,number,1) + @word
ELSE
SUBSTRING(@word,1,number - 1) + REPLICATE(SUBSTRING(@word,number,1),2) + SUBSTRING(@word,number + 1, LEN(@word))
END
FROM Numbers
WHERE number <= LEN(@word)
ORDER BY number

This yields:
aand
annd
andd

And now on to transposed letters.

---------------------------
-- Transposed Letters
---------------------------
DECLARE @word VARCHAR(255)
SELECT @word= 'and'
SELECT
CASE number
WHEN 1 THEN
SUBSTRING(@word,2,1) + SUBSTRING(@word,1,1) + SUBSTRING(@word,3,LEN(@word)-1)
ELSE
SUBSTRING(@word,1,number-1) + SUBSTRING(@word,number+1,1) + SUBSTRING(@word,number,1) + SUBSTRING(@word,(number + 2),LEN(@word)- number)
END
FROM Numbers
WHERE number <= LEN(@word) -1
ORDER BY number

This yields:
nad
adn

The QWERTY keyboard wrong letter query benefits from creating a look-up table variable that associates each letter with the surrounding letters on the keyboard.

---------------------------
-- QWERTY Wrong Letter
---------------------------
DECLARE @word VARCHAR(255)
SELECT @word= 'and'
DECLARE @QuertyLookup TABLE 
(
thisLetter VARCHAR(1),
misspelledLetter VARCHAR(1)
)
INSERT INTO @QuertyLookup
SELECT 'q',SUBSTRING('12wsa',number,1) FROM Numbers WHERE number <= LEN('12wsa') 
UNION SELECT 'a',SUBSTRING('qwsz',number,1) FROM Numbers WHERE number <= LEN('qwsz')
UNION SELECT 'z',SUBSTRING('asx',number,1) FROM Numbers WHERE number <= LEN('asx')
UNION SELECT 'w',SUBSTRING('23qedsa',number,1) FROM Numbers WHERE number <= LEN('23qedsa')
UNION SELECT 's',SUBSTRING('qwedxza',number,1) FROM Numbers WHERE number <= LEN('qwedxza')
UNION SELECT 'x',SUBSTRING('asdcz',number,1) FROM Numbers WHERE number <= LEN('asdcz')
UNION SELECT 'e',SUBSTRING('34wrfds',number,1) FROM Numbers WHERE number <= LEN('34wrfds')
UNION SELECT 'd',SUBSTRING('werfcxs',number,1) FROM Numbers WHERE number <= LEN('werfcxs')
UNION SELECT 'c',SUBSTRING('sdfvx',number,1) FROM Numbers WHERE number <= LEN('sdfvx')
UNION SELECT 'r',SUBSTRING('45etgfd',number,1) FROM Numbers WHERE number <= LEN('45etgfd')
UNION SELECT 'f',SUBSTRING('ertgvcd',number,1) FROM Numbers WHERE number <= LEN('ertgvcd')
UNION SELECT 'v',SUBSTRING('dfgbc',number,1) FROM Numbers WHERE number <= LEN('dfgbc')
UNION SELECT 't',SUBSTRING('r56yhgf',number,1) FROM Numbers WHERE number <= LEN('r56yhgf')
UNION SELECT 'g',SUBSTRING('rtyhbvf',number,1) FROM Numbers WHERE number <= LEN('rtyhbvf')
UNION SELECT 'b',SUBSTRING('fghnv',number,1) FROM Numbers WHERE number <= LEN('fghnv')
UNION SELECT 'y',SUBSTRING('t67ujhg',number,1) FROM Numbers WHERE number <= LEN('t67ujhg')
UNION SELECT 'h',SUBSTRING('tyujnbg',number,1) FROM Numbers WHERE number <= LEN('tyujnbg')
UNION SELECT 'n',SUBSTRING('ghjmb',number,1) FROM Numbers WHERE number <= LEN('ghjmb')
UNION SELECT 'u',SUBSTRING('y78ikjh',number,1) FROM Numbers WHERE number <= LEN('y78ikjh')
UNION SELECT 'j',SUBSTRING('yuikmnh',number,1) FROM Numbers WHERE number <= LEN('yuikmnh')
UNION SELECT 'm',SUBSTRING('jk,n',number,1) FROM Numbers WHERE number <= LEN('jk,n')
UNION SELECT 'i',SUBSTRING('u89olkj',number,1) FROM Numbers WHERE number <= LEN('u89olkj')
UNION SELECT 'k',SUBSTRING('uiol,mj',number,1) FROM Numbers WHERE number <= LEN('uiol,mj')
UNION SELECT ',',SUBSTRING('kl.m',number,1) FROM Numbers WHERE number <= LEN('kl.m')
UNION SELECT 'o',SUBSTRING('i90p;lk',number,1) FROM Numbers WHERE number <= LEN('i90p;lk')
UNION SELECT 'l',SUBSTRING('iop;.,k',number,1) FROM Numbers WHERE number <= LEN('ip;lk')
UNION SELECT '.',SUBSTRING('l;/,',number,1) FROM Numbers WHERE number <= LEN('l;/,')
UNION SELECT 'p',SUBSTRING('o0-[;l',number,1) FROM Numbers WHERE number <= LEN('o0-[;l')
UNION SELECT ';',SUBSTRING('op[''/.l',number,1) FROM Numbers WHERE number <= LEN('op[''/.l')
UNION SELECT '/',SUBSTRING(';''.',number,1) FROM Numbers WHERE number <= LEN(';''.')
UNION SELECT '''',SUBSTRING('[];',number,1) FROM Numbers WHERE number <= LEN('[];')
UNION SELECT '-',SUBSTRING('=[p',number,1) FROM Numbers WHERE number <= LEN('=[p')
UNION SELECT '[',SUBSTRING('-=]'';p',number,1) FROM Numbers WHERE number <= LEN('-=]'';p')
UNION SELECT ']',SUBSTRING('[=''',number,1) FROM Numbers WHERE number <= LEN('[=''')
ORDER BY 1
SELECT
CASE number
WHEN 1 THEN
[misspelledLetter] + SUBSTRING(@word,number+1,LEN(@word)-number)
ELSE
SUBSTRING(@word,1,number - 1) + [misspelledLetter] + SUBSTRING(@word,number + 1, LEN(@word)-number)
END
FROM Numbers, @QuertyLookup
WHERE number <= LEN(@word)
AND [thisLetter] = SUBSTRING(@word,number,1)
ORDER BY number

This yields:
snd
znd
qnd
wnd
abd
amd
agd
ahd
ajd
anw
anx
ans
anc
ane
anf
anr

I hope this exercise inspires others to use set-based operations using the helper table technique and to share practical uses with the rest of us.

As an aside, I recently got a phone call from a very well known internet search vendor boasting to me about how their spell suggestion feature was helping a client generate more sales. He assumed this feature was a major selling point. For many, it is. Internally, I was able to think "Yes. I implemented that myself recently. What else have you got?"

Yours,

Bill Nicolich

Rate

3.65 (17)

You rated this post out of 5. Change rating

Share

Share

Rate

3.65 (17)

You rated this post out of 5. Change rating