Identify and compare text between characters

  • Experts,

    interested in cleaning interesting (bad) data ;-).

    I want to compare the text between 1st occurance of @ symbol and 1st occurance of any special character with 2nd occurance of @ symbol and next occurance of any special character.

    Example:

    I have employee email data in a column. My concern is only the domain part of emails.

    Employee ID Email

    ------------ ----------

    1 <mmooney@gmail.com>;"mmoney@gmail.com"

    2 <chrisgardener@yahoo.com>)']chrisgardener@comcast.net)]

    3 "natan@hotmail.com':>"natan@hotmail.com?'

    4 "jennifer@walmart.com}';jennifer@fedex.com

    I want to identify the 1st record and 3rd record as good data (because both domains are same); and 2nd record and 4th record as exceptions (because both domains are different)

    What would be the best way to identify good ones and exceptions? Thanks in advance!

  • Hi chaseurpuli,

    I think the data in "empid 1" is not matching due to spellings mis-match. Please make sue whatever you are asking is really correct. Please try to better format your input and output in future posts. Thanks.

    Anyway following is just a starting point. please check if this helps in your real scenario.

    use tempdb

    GO

    create table #emp

    (

    empno int,

    email varchar(1024)

    )

    INSERT INTO #emp values (1, '<mmooney@gmail.com>;"mmoney@gmail.com"');

    INSERT INTO #emp values (2, '<chrisgardener@yahoo.com>)'']chrisgardener@comcast.net)]');

    INSERT INTO #emp values (3, '"natan@hotmail.com'':>"natan@hotmail.com?''');

    INSERT INTO #emp values (4, '"jennifer@walmart.com}'';jennifer@fedex.com');

    select

    empno,

    email,

    substring(email,2,CHARINDEX('.com', email)+2),

    substring(email,CHARINDEX('.com', email)+4, LEN(email)),

    case when charindex(ltrim(rtrim(substring(email,2,CHARINDEX('.com', email)+2))), ltrim(rtrim(substring(email,CHARINDEX('.com', email)+4, LEN(email))))) > 0 then 1 else 0 end

    from #emp

    drop table #emp

    Cheers.

  • this thread is a duplicate of your previous thread here:

    http://www.sqlservercentral.com/Forums/Topic1320423-392-1.aspx

    both dwain.c and Chris@home, both posters I respect enourmously, gave you a couple of solutions, but you reposted your same question again.

    did you have trouble understanding the solutions they provided?

    the best thing is to reply to the same, original thread with any requests for clarifications.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Looks to me like this thread was the original post looking for help and no one responded here until today.

  • I'm going to recommend doing this in something other than T-SQL. Like a .NET assembly. Where you can use RegEx functionality for it.

    However, if you insist on doing it in T-SQL, here's one way:

    SELECT Emails,

    SUBSTRING(Emails, First@, First@End - First@) AS FirstDomain,

    SUBSTRING(Emails, Second@, Second@End - Second@) AS SecondDomain

    FROM ( VALUES ( '<mmooney@gmail.com>;"mmoney@gmail.com"'),

    ( '<chrisgardener@yahoo.com>)'']chrisgardener@comcast.net)]'), ( '"natan@hotmail.com'':>"natan@hotmail.com?'''),

    ( '"jennifer@walmart.com}'';jennifer@fedex.com') ) AS Vals (Emails)

    CROSS APPLY (SELECT TOP 1

    Number AS [First@]

    FROM dbo.Numbers

    WHERE Number <= LEN(Emails)

    AND SUBSTRING(Emails, Number, 1) = '@'

    ORDER BY Number) AS P1

    CROSS APPLY (SELECT TOP 1

    Number AS [Second@]

    FROM dbo.Numbers

    WHERE Number <= LEN(Emails)

    AND SUBSTRING(Emails, Number, 1) = '@'

    AND Number > [First@]

    ORDER BY Number) AS P2

    CROSS APPLY (SELECT TOP 1

    Number AS [First@End]

    FROM dbo.Numbers

    WHERE Number <= LEN(Emails)

    AND SUBSTRING(Emails, Number, 1) LIKE '[^0-9a-z@]'

    AND Number > [First@]

    ORDER BY Number) AS P3

    CROSS APPLY (SELECT TOP 1

    Number AS [Second@End]

    FROM dbo.Numbers

    WHERE Number <= LEN(Emails)

    AND SUBSTRING(Emails, Number, 1) LIKE '[^0-9a-z@]'

    AND Number > [Second@]

    ORDER BY Number) AS P4;

    This requires the presence of a Numbers table, which is simply a table with integer numbers in it. I have one with all numbers from 0 to 10-thousand, which I use for this kind of thing.

    Note that this does not correct for malformed e-mails, like if you had an entry with "bob@hotmail". Note there's no ".com" on the end of that. It won't work, but the script will get an invalid length parameter error. If you need to deal with malformed e-mails in this, then you'll need to compensate for that. How you compensate for it will depend on what's wrong with the data, so I can't really tell you what to do, since I don't have your data.

    A .NET RegEx will almost certainly be faster, will have better error-handling options available, and will be easier for others to read, refactor, debug, etc. But this should work to the spec you gave.

    Edit: Improved code layout.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Lowell,

    As Lynn said, this was the original post and as no one replied till last night, I have created a re-post of it so that it will be in the first page :-). Sorry for the duplication.

  • chaseurpuli (6/25/2012)


    Lowell,

    As Lynn said, this was the original post and as no one replied till last night, I have created a re-post of it so that it will be in the first page :-). Sorry for the duplication.

    You really shouldn't repost just because noone had responded. Please remember we are volunteers on this site and use our free time to provide assistance. It is simply possible no one who could answer your quest had read it, or were busy with other questions.

    If you want to bring your question to the top again, just post to your original thread. That will bump it.

  • Sorry Lynn, New to this forum. I don't repost from nect time. Thanks.:-)

  • Thanks GSquared, I will go with T-SQL as I am not familiar with .NET Assembly. Very helpful.

  • iBar,

    Your script is of great help. Anyways, my data doesn't necessarily need to have just '.com' in the domain part. It may end with .net, .co, .edu ...etc. The solution given by 'ChrisM@Work' in the post 'http://www.sqlservercentral.com/Forums/Topic1320423-392-1.aspx' perfectly fits to my case. Thanks much.

  • chaseurpuli (6/25/2012)


    iBar,

    Your script is of great help. Anyways, my data doesn't necessarily need to have just '.com' in the domain part. It may end with .net, .co, .edu ...etc. The solution given by 'ChrisM@Work' in the post 'http://www.sqlservercentral.com/Forums/Topic1320423-392-1.aspx' perfectly fits to my case. Thanks much.

    Oh there is lot of background to this issue ....and requirement was not that clear to me.

    Anyway, good luck with your solution.

    Cheers.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply