Here is my first attempt to make soundex more selective:
Algorithm sqlserver uses:
Replace consonants with digits as follows (but do not change the first letter):
b, f, p, v => 1
c, g, j, k, q, s, x, z => 2
d, t => 3
l => 4
m, n => 5
r => 6
Collapse adjacent identical digits into a single digit of that value.
Remove all non-digits after the first letter.
Return the starting letter and the first three remaining digits.
If needed, append zeroes to make it a letter and three digits.
Lets extend this to 5 in length to make it more selective.
On lastname data I see a selectivity increase from 1 in 3000 to 1 in 10000.
Function contains no error checking and is not optimized.
drop function dbo.soundex5;
go
create function Soundex5( @name varchar(30))
returns varchar(5)
as
begin
declare
@out varchar(20),
@i int,
@vc varchar(1);
set @name = @name + '00000';
set @out = left(@name,1);
set @i = 2;
while @i <= len(@name)
begin
set @vc = substring(@name, @i, 1)
set @out = @out +
case
when charindex(@vc, 'bfpv') > 0 then '1'
when charindex(@vc, 'cgjkqsxz') > 0 then '2'
when charindex(@vc, 'dt') > 0 then '3'
when charindex(@vc, 'l') > 0 then '4'
when charindex(@vc, 'mn') > 0 then '5'
when charindex(@vc, 'r') > 0 then '6'
when charindex(@vc, '0') > 0 then '0'
else ''
end;
set @i = @i + 1;
end
-- ignore similar consecutive sounds
set @out = REPLACE(@out,'11','1');
set @out = REPLACE(@out,'22','2');
set @out = REPLACE(@out,'33','3');
set @out = REPLACE(@out,'44','4');
set @out = REPLACE(@out,'55','5');
set @out = REPLACE(@out,'66','6');
return left(@out,5);
end
go