getting Email addresses from Text

  • Hi Guys,

    I have written a very hasty piece of code and have not been given much time to review this part but, because it is so hasty, I know in my water that this is not best practice, though it works and is picking up 85% of the email addresses (even with rubbish as well). It should be able to be written so that it picks up 9x% of clean data and I know that some of you have done similar work before.

    I am trying to pull out a single email address from data, Test data can be as shown below. As you can see, its pretty dirty. It doesn’t pick just the email address. It ignores any with more than one email address (while ideally the first would be best option) and I end up with just 1,7,8 and 9 with more than just email address.

    Ideally it should be all with possibly the exception of 4 which I don’t really know how to deal with and possibly 5.

    A quick table

    Create Table DBO.Email_Checker

    (

    KeyField Int,

    Email_Address Varchar(75)

    )

    Some Test Data

    Insert Into DBO.Email_Checker

    Values (1,'Freddy strange has an email address of freddy@gogo.com');

    Insert Into DBO.Email_Checker

    Values (2,'Marge.flint@dhlink.co.uk / johnflint@gmail.com');

    Insert Into DBO.Email_Checker

    Values (3,'Email: User1@retail.wholesale.co.uk & User1@Google');

    Insert Into DBO.Email_Checker

    Values (4,'Jerremyblack@gmail.comsnowwhite@ubisux.com');

    Insert Into DBO.Email_Checker

    Values (5,'Email USER2@giraffe,gaff');

    Insert Into DBO.Email_Checker

    Values (6,'LinkLink@hydrager.eu – manklink@hydrager.eu');

    Insert Into DBO.Email_Checker

    Values (7,'Johns sharp jsharp@fritty.co.jp');

    Insert Into DBO.Email_Checker

    Values (8,'lankyman@gloopy.gov.gb');

    Insert Into DBO.Email_Checker

    Values (9,'Mobile: 09120923123 Email hankerchief@dragon.boat.com');

    The Ugly Code -

    select KeyField, Ltrim(Rtrim(Lower(Email_Address))) AS Email

    from DBO.Email_Checker

    Where Lower(Email_Address) LIKE '%_@__%.__%'

    AND LEN(Email_Address) - LEN(REPLACE(Email_Address,'@','')) =1

    Drop Table

    --Drop Table DBO.Email_Checker

    Any ideas?

  • its not much better but here is another way that retrives any email address if its the last test in your string and has a space.

    select reverse(substring(reverse(Email_Address),0,charindex(' ',reverse(Email_Address))))

    from DBO.Email_Checker

    ***The first step is always the hardest *******

  • As you said the data here is horrible to start with. Here is one way to deal with it. Use the DelimitedSplit8K function to first parse each string element apart based on spaces. Then take the list of all string segments and determine which ones are valid emails. The code for the DelimitedSpklit8K can be found by following the link in my signature about splitting strings.

    The email part is in fact quite timely. This topic is being discussed right now in another thread. http://www.sqlservercentral.com/Forums/Topic1406771-386-1.aspx

    I would recommend using the one that Lowell posted using System.Net.MailAddress as a way to validate the address. Of course this assumes you are able to use CLR.

    Using those two functions I was able to return 11 of the possible 12 emails in your sample code. The one that gets missed (Jerremyblack@gmail.comsnowwhite@ubisux.com) is just simply never going to be able to be determined with code.

    Here is the code that I came up with for this.

    --this will split everything into segments first

    ;with cte as

    (

    select * from email_Checker

    cross apply dbo.DelimitedSplit8K(Email_address, ' ')

    )

    , ParseEmail as

    (

    select *, dbo.IsValidEmail(Item) as IsValidEmail from cte

    )

    select * from ParseEmail

    where IsValidEmail = 1

    Make sure you read and understand the code in both of these functions. The splitter function is super fast and easy to use, but understanding it might take a bit of reading and rereading. The CLR function is pretty straight forward.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • I just went through a similar exercise in this thread:

    http://www.sqlservercentral.com/Forums/Topic1402134-391-1.aspx

    It's farily long but I think you'll find some useful examples in it. Towards the end of it is an example that snatches out multiple email addresses from a string.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply