Parsing the name field with T-SQL

  • I must agree with Frank, you might want to consider another tool.  When I used to work in junk mail, er...Direct Marketing...we'd get millions of names a week to process and used a DOS based package (hey, what works works) to do parsing and genderizing, weeding out companies and duplicate removal, Name  (Bill Jones, William Jones, Billy Jones).  I can't remember the name of that one, but look around, they're out there.

    http://www.semaphorecorp.com/nc/namechop.html

    http://www.infoplan.com.au/nameparser/

    http://www.guysoftware.com/parserat.htm


    -------
    at us, very deafly, a most stares
    collosal hoax of clocks and calendars

    eecummings

  • Thanks for the links.  I will check them out.  I got a script from SQLTeam that works:

     

    Select

    [LName] =

    case

    when charindex(' ', Col001) = 0 then Col001

    Else subString(Col001, 1, charindex(' ', Col001))

    End

    ,[FName] =

    case

    when charindex(' ', Col001) = 0 then null

    when charindex(' ', Col001, charindex(' ', Col001)+1) = 0 then subString(Col001, charindex(' ', Col001)+1, 50)

    else subString(Col001, charindex(' ', Col001)+1, charindex(' ', Col001, charindex(' ', Col001)+1) - charindex(' ', Col001))

    end

    ,[MName] = --this leaves everything after 2nd space combined ie: R LEE H

    case

    when charindex(' ', Col001) = 0 then null

    when charindex(' ', Col001, charindex(' ', Col001)+1) = 0 then null

    else convert(varChar(25), substring(Col001, charindex(' ', Col001, charindex(' ',Col001)+1)+1, 50))

    end

    From table

  • Yes, I have seen that. It's exactly what I meant with ugly T-SQL solution above.

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • I hope Yukon will offer more.  This seems like it would be a common problem.

  • Hello cgrunner,

    The parsing-names problem has existed since names have been automated. I don't think it's susceptible to a technology quick-fix. Typical name data may include: honorific titles (DR., etc.), hyphenated last names that may or may not include the hyphen, generation designations (SR, JR, III, etc.), two persons included in the name (SMITH BOB AND JANE), etc. etc.

    Depending on how the names are used, and the business impact of a bad name, the best solution I've seen is to apply a simple formatting rule, followed by a manual review of potential bad names.

    You seem to be reasonably sure that the sequence of names will be {last, first, middle}.  In that case, I'd suggest a procedure like this:

    A. Run all names through the script you got from SQLTeam. It looks like that will work adequately for most names.

    B. Identify potential problem names and route these for manual review.  As as starter, these probably should include:

     - blank first name

     - imbedded spaces found in the middle name

    If your organization can implement such a procedure, then a good data element to add to your parsed-names table might be "reviewed_by" and "reviewed_ts".  Potential bad names that have been reviewed can be excluded from subsequent bad-names lists.

    Good luck


    Regards,

    Bob Monahon

Viewing 5 posts - 16 through 19 (of 19 total)

You must be logged in to reply to this topic. Login to reply