TSQL Spam-killer

It’s very refreshing to take a break from a tedious bit of routine code to try out something unusual or tricky. Sometimes one can be quite startled by finding out how easy it is to do in TSQL.

Jonathan Snook’s blog is my favourite “developer’s blog” of all. It is packed with ideas. I was reading it the other day and came across How I built an effective blog comment spam blocker. This, I thought, was great fun, because it told you how to build it but didn’t give you the code. The algorithm scores a string according to how likely it is to be spam. For everything in a comment that the filter likes like, the string gets a point. For everything it don’t like, the string loses points. If the string totals 1 or higher, it lands on the site as a valid comment. If it scores a 0, it’s set for moderation, If it’s below 0, it’s marked as spam. Some of the things that score negative are rather amusing ( .pl or .cn sites for example) but done from experience.

With Jonathan’s permission, here is a TSQL implementation. I’ve simplified it slightly, to fit it in a blog, by using temporary tables for the word-banks, dodgy URL suffixes and so on. Normally, you’d put these in permanent tables and fine-tune the system as the language of spam changes, without having to alter the code.

The system seems pretty effective. If a spam slips through, or a legitimate comment falls foul of the system, you can tweak it very simply . It is in place in the forthcoming Simple-Talk/SSC Wiki which I’m currently helping to build.

As a strange by-product of developing this, I’ve taken a sudden liking to spam, purely to test the efficiency of the algorithm. Unfortunately, Simple-Talk is a martyr to it, but I can’t find a way of plugging this routine into community server. Still, it means we have a vast test-bank of the stuff to check the routine with!

It occurs to me that there must be a number of ways of writing the SQL code to Jonathan Snooks algorithm. What is the fastest and most effective way? Perhaps we should have a ‘Lionel-style’ competition.

For anyone without Firefox who can’t copy n’ paste, here is the source file