Blog Post

Sniffing out bogus data

,

/* Imagine that you have some names and addresses. You are anxious to look at the addresses that may be spurious. As well as looking at such things as the vowel-to-consonant ratio and the digraph frequencies, (see my previous posts) you might want to also look at the occurence of sequences where you get three or more occurences of the same letter. Names like AAAAAA crop up quite a lot when people don't want to give you their names or addresses. It is not a big deal to cope with this, using a brute-gorce approach. Here is a technique based on the idea of generating the matches on the fly using a 'numbers' table which I'm assuming you have already. All you need to do is to define your alphabetic characters.

Normally, you'd create a rule or a constraint and prevent the stuff getting into the database in the first place, but it is tricky to create a defence before you know what's going to attack your database; Because of this, you'll always find something that slips through your first-line defenses.*/

/* so we'll create a typical name address database and fill it with 50,000 addresses just to check that our algorithm is going to work. I use SQL Data Generator just because makes this sort of operation laughably easy. */

CREATE TABLE MyAddresses(
  
MyAddress_id INT IDENTITY(1,1),
  
Forename VARCHAR(50) NOT NULL,
  
Surname VARCHAR(50) NOT NULL,
  
Address1 VARCHAR(50) NOT NULL,
  
city VARCHAR(50) NOT NULL,
  
[State] VARCHAR(50) NOT NULL,
  
zip VARCHAR(20) NOT NULL,
  
email VARCHAR(50) NOT NULL
   )

--insert lots of rows
--
-- (I use SQL Data Generator here but I've supplied some test data.
--
--and some evil entries to search for
INSERT INTO MyAddresses(Forename,Surname,Address1,City,State,zip,email)
  
SELECT 'aaaaaagddfhgdf','derteeeeeeertert','ddfgdgdfgdfgdfg','nnnnnn','rrrr','   ','mmmm'
INSERT INTO MyAddresses(Forename,Surname,Address1,City,State,zip,email)
  
SELECT 'aaagdf','eeeerfsdf','fghfghfghfghfg','df','vvvvvv','','sfs@gddfm'


DECLARE  @alphabetic VARCHAR(255)--our characters whose repetition we want to look for.
SELECT @alphabetic='abcdefghijklmnopqrstuvwxyz'

SELECT * FROM
    
(
    
SELECT MyAddress_ID,
      
[entry]=forename+' '+ surname+' '+address1+' '+ city+' '+ state+' '+ zip
    
FROM MyAddresses
    
)a
CROSS JOIN
    
(
    
SELECT [pattern]=REPLICATE(SUBSTRING(@alphabetic,number,1),3)--number of times
--you can change it to any value you want, of course.
    
FROM numbers
    
WHERE number <=LEN(@alphabetic)
   )
b
WHERE [entry] LIKE '%'+pattern+'%'
--1 min 47 sec




/* OK. That wasn't fast, but if you can think of a method that is faster, then try it out on the test harness and, if it really motors, let's hear about it! */


Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating