December 30, 2004 at 11:14 am
I must agree with Frank, you might want to consider another tool. When I used to work in junk mail, er...Direct Marketing...we'd get millions of names a week to process and used a DOS based package (hey, what works works) to do parsing and genderizing, weeding out companies and duplicate removal, Name (Bill Jones, William Jones, Billy Jones). I can't remember the name of that one, but look around, they're out there.
http://www.semaphorecorp.com/nc/namechop.html
http://www.infoplan.com.au/nameparser/
http://www.guysoftware.com/parserat.htm
-------
at us, very deafly, a most stares
collosal hoax of clocks and calendars
eecummings
December 30, 2004 at 12:35 pm
Thanks for the links. I will check them out. I got a script from SQLTeam that works:
Select
[LName] =
case
when charindex(' ', Col001) = 0 then Col001
Else subString(Col001, 1, charindex(' ', Col001))
End
,[FName] =
case
when charindex(' ', Col001) = 0 then null
when charindex(' ', Col001, charindex(' ', Col001)+1) = 0 then subString(Col001, charindex(' ', Col001)+1, 50)
else subString(Col001, charindex(' ', Col001)+1, charindex(' ', Col001, charindex(' ', Col001)+1) - charindex(' ', Col001))
end
,[MName] = --this leaves everything after 2nd space combined ie: R LEE H
case
when charindex(' ', Col001) = 0 then null
when charindex(' ', Col001, charindex(' ', Col001)+1) = 0 then null
else convert(varChar(25), substring(Col001, charindex(' ', Col001, charindex(' ',Col001)+1)+1, 50))
end
From table
December 30, 2004 at 12:50 pm
Yes, I have seen that. It's exactly what I meant with ugly T-SQL solution above.
--
Frank Kalis
Microsoft SQL Server MVP
Webmaster: http://www.insidesql.org/blogs
My blog: http://www.insidesql.org/blogs/frankkalis/[/url]
December 30, 2004 at 12:55 pm
I hope Yukon will offer more. This seems like it would be a common problem.
December 31, 2004 at 12:12 pm
Hello cgrunner,
The parsing-names problem has existed since names have been automated. I don't think it's susceptible to a technology quick-fix. Typical name data may include: honorific titles (DR., etc.), hyphenated last names that may or may not include the hyphen, generation designations (SR, JR, III, etc.), two persons included in the name (SMITH BOB AND JANE), etc. etc.
Depending on how the names are used, and the business impact of a bad name, the best solution I've seen is to apply a simple formatting rule, followed by a manual review of potential bad names.
You seem to be reasonably sure that the sequence of names will be {last, first, middle}. In that case, I'd suggest a procedure like this:
A. Run all names through the script you got from SQLTeam. It looks like that will work adequately for most names.
B. Identify potential problem names and route these for manual review. As as starter, these probably should include:
- blank first name
- imbedded spaces found in the middle name
If your organization can implement such a procedure, then a good data element to add to your parsed-names table might be "reviewed_by" and "reviewed_ts". Potential bad names that have been reviewed can be excluded from subsequent bad-names lists.
Good luck
Bob Monahon
Viewing 5 posts - 16 through 19 (of 19 total)
You must be logged in to reply to this topic. Login to reply