
Keyword Searching in SQL Server


Introduction

As the concept of search continues to grow in scope, I tend to find myself eschewing multiple levels of categorization in favor of larger groups. I'll give you two examples of what I mean.

Think of your inbox. I used to have folders for different systems, folders for different clients, folders for different colleagues, and so on. Now I tend to leave everything in my inbox and use the search features of my mail client or other search applications to retrieve messages as I need them. I'm not sure whether that's because the searching capabilities of these applications are better than they used to be or because I'm just more prone to searching for things, since that's what all the hip kids are doing these days.

Think of a data-driven web site, a retail site perhaps. As you drill down into the categories of goods, you'll usually see a breadcrumb trail somewhere on the page. But do we really drill that far down nowadays, or do we immediately go to the search box and type in what we're looking for? If it's the latter, what's the point of having a bunch of piddly little categories like Electronics > Digital Cameras > Memory Cards > 512MB? Why not just dump everything into one big category and let the search functionality take care of finding things?

I like to be able to find something on the fly when necessary (what's the login info for that FTP site?) without having to remember how I originally categorized it. Categorization is tedious at best, and oftentimes the same person will choose different categorizations for a given item at different times. What follows is a simple way to implement basic keyword search functionality using SQL Server and nothing else.

Here are the objects we'll end up with:

  • LogEntries (Table)
  • Keywords (Table)
  • LogEntry_Keyword (Table)
  • sp_logentry (Sproc)
  • sp_search (Sproc)

Setting up our tables

I'm no cognitive scientist, so I can't tell you exactly how our brains keep track of bits of information. But the things we need to remember are often very fragmented and defy categorization. I'd venture a guess that our brains are like logs and we make entries into that log. So let's have a table called LogEntries. The most basic column is the actual entry itself - Entry varchar(8000):

"Remember that you increased the budget for the Razzmatazz account by $10K."

And when we make that mental note, it would be nice to remember when we made it - DateEntered datetime:

'YYYY-MM-DD hh:mm:ss'

But when we remember it, do we actually think of the exact phrase we originally thought of? Probably not. We probably remember something more like "razzmatazz account budget increased". So now we have some keywords (or tags) associated with our entry. Tags varchar(200):

'increased budget account'

CREATE TABLE LogEntries
(
  EntryId int IDENTITY(1,1) NOT NULL,
  Entry varchar(8000) NOT NULL,
  Tags varchar(200) NOT NULL,
  DateEntered datetime NOT NULL DEFAULT (GETDATE()),
  CONSTRAINT PK_LogEntries PRIMARY KEY CLUSTERED (EntryId ASC)
)

Each of the individual tags within the Tags column is a keyword describing the data in the LogEntry. So let's have a Keywords table.

CREATE TABLE Keywords
(
  KeywordId int IDENTITY(1,1) NOT NULL,
  Keyword varchar(50) NOT NULL,
  CONSTRAINT PK_Keywords PRIMARY KEY CLUSTERED (KeywordId ASC)
)
CREATE UNIQUE NONCLUSTERED INDEX IX_Keywords ON Keywords
(
  Keyword ASC
)
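
Just to make that constraint concrete, here's a quick (hypothetical) check: the unique index rejects duplicate keywords, which is why the trigger further down looks a keyword up before inserting it.

INSERT INTO Keywords (Keyword) VALUES ('budget')  -- succeeds
INSERT INTO Keywords (Keyword) VALUES ('budget')  -- fails with a duplicate key error on IX_Keywords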

When we make an entry into our log, we'd also like to insert rows into a LogEntry_Keyword table - a table that records the number of occurrences of each keyword in a given entry.

CREATE TABLE LogEntry_Keyword
(
  EntryId int NOT NULL,
  KeywordId int NOT NULL,
  Hits int NOT NULL,
  CONSTRAINT PK_LogEntry_Keyword PRIMARY KEY CLUSTERED
  (
    EntryId ASC,
    KeywordId ASC
  )
)
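
To make the relationships concrete, here's a hypothetical picture (the ids are invented for illustration) of what the three tables would hold once the Razzmatazz entry from above has been logged with tags 'increased budget account', along with a query that reads the hit counts back:

-- Hypothetical contents (ids invented for illustration):
-- LogEntries:       (EntryId 1, Entry 'Remember that you increased the budget...', Tags 'increased budget account')
-- Keywords:         (1, 'increased'), (2, 'budget'), (3, 'account')
-- LogEntry_Keyword: (1, 1, 1), (1, 2, 1), (1, 3, 1)

-- Reading back the keyword hit counts for that entry:
SELECT k.Keyword, lek.Hits
FROM LogEntry_Keyword lek
JOIN Keywords k ON k.KeywordId = lek.KeywordId
WHERE lek.EntryId = 1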

Now for the trigger

In most cases, we'd probably prefer to process the tags of our entries in another application. The String.Split function comes to mind. We'd then build a hash table keyed on the keywords, where each value is the number of occurrences, and later insert those counts into the LogEntry_Keyword table.

But we're keeping it simple and want to make everything self contained. Here's how the trigger does that.

CREATE TRIGGER trgInsertLogEntry
ON LogEntries
FOR INSERT
AS
------------------
-- Declarations --
------------------
DECLARE @tags      AS varchar(200)  -- This will hold the tags string (e.g. 'increase budget such 'n' such account')
DECLARE @keyword   AS varchar(50)   -- An individual keyword from the tags string (e.g. 'increase')
DECLARE @keywordId AS int           -- The keyword id from the keywords table
DECLARE @found     AS int           -- Whether or not the keyword was already in the keywords table
DECLARE @entryId   AS int           -- The entry id of the logentry being inserted (@@identity)
-------------------------------------------------------------------------
-- Temp table for current keyword data for the newly inserted LogEntry --
-------------------------------------------------------------------------
CREATE TABLE #kw
(
  kwid int PRIMARY KEY,  -- The id from the Keywords table
  hits int               -- The number of occurrences of this keyword
)
-------------------------------------------
-- Data from the newly inserted LogEntry --
-------------------------------------------
SET @entryId  = @@identity                   -- Get the newly inserted logentry id
SET @tags     = (SELECT tags FROM INSERTED)  -- Get the newly inserted tags (assumes a single-row insert)
SET @tags     = LTRIM(RTRIM(@tags)) + ' xyz' -- Add a fake keyword to the end that won't get inserted
-------------------------------------------------------------------------
-- While there are still keywords in the newly inserted LogEntry's tag --
-------------------------------------------------------------------------
WHILE (CHARINDEX(' ', @tags) > 0)
BEGIN
  -- Get the next keyword from the tags string
  SET @keyword = SUBSTRING(@tags, 0, CHARINDEX(' ', @tags))
  -- Get the KeywordId from the Keywords table
  SELECT @keywordid = (SELECT KeywordId FROM Keywords WHERE Keyword = @keyword)
  -- Insert the keyword if necessary
  IF (@keywordId IS NULL)
  BEGIN
    INSERT INTO Keywords (Keyword) VALUES (@keyword)
    SET @keywordId = @@identity
  END
  -- See if the keyword id is in the temp table yet
  SELECT @found = (SELECT COUNT(*) FROM #kw WHERE kwid = @keywordId)
  -- If not found insert it
  IF (@found = 0)
    INSERT INTO #kw VALUES (@keywordId, 1)
  -- If found update the hit count
  IF (@found != 0)
    UPDATE #kw SET hits = hits + 1 WHERE kwid = @keywordId
  -- Update the tags by lopping off the keyword just processed
  SET @tags = SUBSTRING(@tags, CHARINDEX(' ', @tags) + 1, LEN(@tags) - CHARINDEX(' ', @tags))
END
---------------
-- End while --
---------------
-------------------------------------------------------------------------------
-- Insert the keywords and their occurrences into the LogEntry_Keyword table --
-------------------------------------------------------------------------------
INSERT INTO LogEntry_Keyword (EntryId, KeywordId, Hits)
SELECT @entryId, kwid, hits
FROM #kw
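
As an aside: if you're on SQL Server 2016 or later, the WHILE loop could be swapped for a set-based version built on STRING_SPLIT. This is only a sketch, not part of the trigger above; the two statements below would sit inside the trigger in place of the loop and the temp table, and they have the side benefit of handling multi-row inserts:

-- Set-based alternative to the WHILE loop (assumes SQL Server 2016+, compatibility level 130 or higher).
-- First add any keywords we haven't seen before...
INSERT INTO Keywords (Keyword)
SELECT DISTINCT s.value
FROM INSERTED i
CROSS APPLY STRING_SPLIT(i.Tags, ' ') s
WHERE s.value <> ''
  AND NOT EXISTS (SELECT 1 FROM Keywords k WHERE k.Keyword = s.value)

-- ...then record one row per (entry, keyword) pair, with Hits counting the occurrences.
INSERT INTO LogEntry_Keyword (EntryId, KeywordId, Hits)
SELECT i.EntryId, k.KeywordId, COUNT(*)
FROM INSERTED i
CROSS APPLY STRING_SPLIT(i.Tags, ' ') s
JOIN Keywords k ON k.Keyword = s.value
GROUP BY i.EntryId, k.KeywordId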

So that takes care of everything we need for our project. Now let's add a couple of stored procedures to make inserting entries and finding them again easier.

Getting the data in there

The whole purpose behind this exercise is to be able to associate a piece of data with string identifiers. When creating the data we want to be able to say (to ourselves):

Okay, I need to remember that at today's meeting we discussed ways of improving communication within the company. The first way was to blah blah blah...

And then say it more succinctly in keyword form:

meeting ways improve communication within company

That's two things. Two arguments for our sp_logentry function:

CREATE PROC sp_logentry(@entry AS varchar(8000), @tags AS varchar(200)) AS
INSERT INTO LogEntries (Entry, Tags)
VALUES (@entry, @tags)

sp_logentry 'Okay, I need to remember...blah blah blah...', 'meeting ways improve communication within company'
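
To see the hit counting in action, you could log a made-up entry whose tags repeat a keyword and read the counts back; the trigger collapses the repeats into a single row with Hits incremented:

-- Hypothetical entry whose tags repeat a keyword:
EXEC sp_logentry 'Budget came up twice at the same meeting, once for Q1 and once for Q2.', 'budget budget meeting'

-- The trigger stores one row per distinct keyword, with Hits counting the repeats:
SELECT k.Keyword, lek.Hits
FROM LogEntry_Keyword lek
JOIN Keywords k ON k.KeywordId = lek.KeywordId
WHERE lek.EntryId = (SELECT MAX(EntryId) FROM LogEntries)
-- Expected result: budget 2, meeting 1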

Getting the data out of there

And now, the last thing we need to do is write our sproc to retrieve our data. This will look quite familiar.

CREATE PROC sp_search(@keywords AS varchar(50)) AS
DECLARE @kws as varchar(50)
DECLARE @kw as varchar(50)
DECLARE @kwid as int
-----------------------------------------
-- Temp table for current keyword data --
-----------------------------------------
CREATE TABLE #kw
(
  kwid int PRIMARY KEY   -- The id from the Keywords table
)
-- Add a fake keyword that won't get inserted
SET @kws = LTRIM(RTRIM(@keywords)) + ' xyz'
------------------------------------
-- While there are still keywords --
------------------------------------
WHILE (CHARINDEX(' ', @kws) > 0)
BEGIN
  SET @kw = SUBSTRING(@kws, 0, CHARINDEX(' ', @kws))                    -- Get the tag from the string
  SELECT @kwid = (SELECT keywordid FROM keywords WHERE keyword = @kw)   -- Find the keyword id
  IF (@kwid IS NOT NULL)                                                -- If found insert the id into the temp table
    INSERT INTO #kw VALUES (@kwid)
  SET @kws = SUBSTRING(@kws, CHARINDEX(' ', @kws) + 1, LEN(@kws)        -- Update the keywords string
             - CHARINDEX(' ', @kws))                                    -- by lopping off the keyword just processed
END

SELECT le.EntryId, Entry, Tags, DateEntered, SUM(Hits) h
FROM LogEntries le
JOIN LogEntry_Keyword lek ON lek.EntryId = le.EntryId
JOIN #kw tkw ON tkw.kwid = lek.keywordid
GROUP BY le.EntryId, Entry, Tags, DateEntered
ORDER BY h DESC

sp_search 'company meeting'

Conclusions

Yes, this is probably more of an exercise than something you'll put into practice, but the basic idea is straightforward and useful when expanded upon. If you decide to do more with it, here are some ideas as well as some things to look out for.

  • Hook it up to a basic UI for a popup notepad.
  • Take care to validate your tags string to make sure it's a space-delimited string.
  • I might also suggest adding a HAVING COUNT(le.EntryId) >= (SELECT COUNT(*) FROM #kw) clause to the search sproc if you only want entries featuring all of the searched keywords (see the sketch after this list).
  • With a little more validation, you could get rid of the Tags column and process the Entry column by itself.
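
For the HAVING suggestion above, the final SELECT inside sp_search would look something like this sketch (same tables and #kw temp table as before):

SELECT le.EntryId, Entry, Tags, DateEntered, SUM(Hits) h
FROM LogEntries le
JOIN LogEntry_Keyword lek ON lek.EntryId = le.EntryId
JOIN #kw tkw ON tkw.kwid = lek.KeywordId
GROUP BY le.EntryId, Entry, Tags, DateEntered
HAVING COUNT(le.EntryId) >= (SELECT COUNT(*) FROM #kw)
ORDER BY h DESC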
