Name Parser

  • I am working with a number of large data extracts (10 - 50 million records) and need to parse full name fields into their component parts (e.g., first name, middle name, last name, suffix). The format of the full name fields may vary across extracts, but rarely vary within an extract. Here are some example formats:

    Dr. Mary Ann A. Van Ness Jr.

    Van Ness Jr., Dr. Mary Ann A.

    Van Ness Jr. Dr. Mary Ann A.

    I am generally able to code around most variations (e.g., combine compound first names and last names, account for common honorifics, etc.), but this is a time-consuming process, and I was wondering if there is a software tool or SQL function available (commercial or free) that is able to parse full names with a high degree of accuracy. I understand that there are situations where the name parser will fail (e.g., sometimes Mary Ann will be a compound first name, and sometimes Ann will be the middle name), but would like to know if there is a good solution out there that is reasonably priced. I've looked into IBM's name parser (http://publib.boulder.ibm.com/infocenter/gnrgna/v3r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html).

  • This is a tricky one. You could use Regular expressions for this; there is a good script for this in the Regular Expressions Cookbook by O'Reilly which I will try to dig up later today (I don't have the book with me).

    To run regular expressions in SQL you can use the regex CLRs that ship with Master Data Services. This article[/url] has good steps for creating the required Microsoft.MasterDataServices.DataQuality assembly.

    (Cem's article does not include the code for the assembly, you can get the code from here.)

    Instead, however, of running the code to create the mdq.XmlTransformyou you would run the code to create one or more of the following:

    mdq.RegexExtract

    mdq.RegexIsMatch

    mdq.RegexIsValid

    mdq.RegexMask

    mdq.RegexMatches

    mdq.RegexReplace

    mdq.RegexSplit

    That code to create any of the above CLR functions can also be found at the link above... E.g the code for mdq.RegexMatches is located at http://sqlboost.com/Util_mdq_RegexMatches.aspx and looks like this:

    CREATE FUNCTION [mdq].[RegexMatches]

    (@input NVARCHAR(4000), @pattern NVARCHAR(4000), @mask TINYINT)

    RETURNS TABLE (

    [Sequence] INT NULL,

    [Token] NVARCHAR(4000) NULL

    )

    WITH EXECUTE AS CALLER

    AS

    EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[RegexMatches]

    Again, I will see if I can dig up that regular expression I mentioned earlier. I hope this helps.

    Edit: added link, fixed typo

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thank you, Alan... that helps. Looking forward to taking a look at the regular expression if you can find it.

  • jamoore (4/22/2013)


    Thank you, Alan... that helps. Looking forward to taking a look at the regular expression if you can find it.

    What I have here won't be a complete solution but may be something you can take and run with. I found the script I was talking about which is solves a variation of the problem you are dealing with. I don't know how much you know about regular expressions but, with a little knowledge, you should have an easier time coding around most goofy name scenarios.

    I tested the regex script during lunch using mdq.regexReplace (which, unless I am mistaken, uses the .NET flavor of Regex). If you run the code to install the [Microsoft.MasterDataServices.DataQuality] assembly you can create the regexReplace CLR and use it like I did below.

    Let me know if this helps get you in the right direction and if you have some questions.

    /*****************************************************

    (1) Create the function

    *****************************************************/

    IF EXISTS ( SELECT * FROM INFORMATION_SCHEMA.ROUTINES

    WHERE SPECIFIC_SCHEMA='dbo' AND SPECIFIC_NAME='RegexReplace' AND ROUTINE_TYPE='FUNCTION')

    DROP FUNCTION dbo.RegexReplace;

    GO

    CREATE FUNCTION dbo.RegexReplace(@input nvarchar(4000), @pattern nvarchar(4000), @REPLACE nvarchar(4000), @mask tinyint)

    /*

    Takes four parameters:

    @input: the original string

    @pattern: pattern(s) to match

    @REPLACE: how to rearrange the patterns when they are matched

    @mask: I don't know... I just started playing with this CLR today

    */

    RETURNS nvarchar(4000) WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT

    AS

    EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[RegexReplace]

    GO

    /*****************************************************

    (2) Create sample data

    *****************************************************/

    IF OBJECT_ID('tempdb..#names') IS NOT NULL

    DROP TABLE #names;

    CREATE TABLE #names (name_id int identity primary key, name varchar(100))

    INSERT INTO #names (name)

    SELECT 'Robert Downey, Jr.' UNION ALL

    SELECT 'John F. Kennedy' UNION ALL

    SELECT 'Scarlett O''hara' UNION ALL

    SELECT 'J.R.R. Tolkien' UNION ALL

    SELECT 'Catherine Zeta-Zones';

    /*****************************************************

    (3) Run the CLR

    Note: the regex replace script that I am using here

    turns:

    "Robert Downey, Jr." into"Downey, Robert, Jr."

    "John F. Kennedy" into"Kennedy, John F."

    "Scarlett O'hara" into "O'hara, Scarlett", etc..

    *****************************************************/

    WITH cleanedNames AS

    (SELECTname,

    dbo.RegExReplace(n.name,'^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$','$2, $1$3','') AS new_name

    FROM #names n)

    SELECTname AS original_name,

    new_name,

    RIGHT(new_name,LEN(new_name)-CHARINDEX(',',new_name)-1) AS FirstName,

    LEFT(new_name,CHARINDEX(',',new_name)-1) AS LastName

    FROM cleanedNames;

    DROP TABLE #names

    output:

    original_namenew_nameFirstNameLastName

    -------------------------------------------------------------------------------

    Robert Downey, Jr.Downey, Robert, Jr.Robert, Jr.Downey

    John F. KennedyKennedy, John F.John F.Kennedy

    Scarlett O'haraO'hara, ScarlettScarlettO'hara

    J.R.R. TolkienTolkien, J.R.R.J.R.R.Tolkien

    Catherine Zeta-ZonesZeta-Zones, CatherineCatherineZeta-Zones

    Edit: added result set, fixed code typo.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • how about "Mr, Generator Van De Graf"?

    or

    "Rt Hon Wg Cdr Rev Dr Michael Jones III" (He is a member of the UK House of Commons, a decomissioned Wing Commander, holds a PHD, is a man-of-the-cloth and has the same name as his father and grand-father - I made him up .. )

    Not rubbishing the regex because I hate the things but you may also need to output results that can't be cleaned for manual processing

  • The regex expression works in most cases, but our current non-regex solution is a bit more robust and accounts for irish surnames and italian surnames when no apostrophe is present (e.g., O Hara, D Agostino). Also, I do not see how the regex solution would handle compound last names (e.g., De Los Dios; Van der Graaf), but maybe I am missing something.

    Generally, my solution is to identify the number of spaces in the full name with the following code:

    COALESCE(LEN('x' + LTRIM(Full_Name)),'') - LEN(COALESCE(REPLACE(Full_Name,' ', ''), ''))

    We make the assumption that each full name is populated with at least a first name and a last name. If there is only one space in the field, then I am confident in assigning the first part (using Charindex) as the first name, and the second part as the last name. If there are two spaces, then the following situations are possible:

    1. First Name, Middle Name, Last Name

    2. First Name, Last Name, Suffix

    3. Honorific, First Name, Last Name

    4. Compound First Name (Part 1), Compound First Name (Part 2), Last Name

    5. First Name, Compound Last Name (Part 1), Compound Last Name (Part 2)

    I wrote several hundred lines of code to deal with these scenarios, and it works well. In most cases, full names with one or two spaces account for 75% or more of the populations that I am reviewing, and I also wrote several hundred lines of code to deal with scenarios where 3 or more spaces exist. I suppose that I could build a function that automates that code, but I wanted to know if an "off-the-shelf" solution existed that might be better than (or as good as) what I wrote.

    Thanks again for your help.

  • People seem to like this tool.

    There's a free trial, so it can't hurt to try. I've never used it, personally.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply