SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


getting Email addresses from Text


getting Email addresses from Text

Author
Message
Gattaca
Gattaca
Grasshopper
Grasshopper (15 reputation)Grasshopper (15 reputation)Grasshopper (15 reputation)Grasshopper (15 reputation)Grasshopper (15 reputation)Grasshopper (15 reputation)Grasshopper (15 reputation)Grasshopper (15 reputation)

Group: General Forum Members
Points: 15 Visits: 70
Hi Guys,
I have written a very hasty piece of code and have not been given much time to review this part but, because it is so hasty, I know in my water that this is not best practice, though it works and is picking up 85% of the email addresses (even with rubbish as well). It should be able to be written so that it picks up 9x% of clean data and I know that some of you have done similar work before.

I am trying to pull out a single email address from data, Test data can be as shown below. As you can see, its pretty dirty. It doesn’t pick just the email address. It ignores any with more than one email address (while ideally the first would be best option) and I end up with just 1,7,8 and 9 with more than just email address.

Ideally it should be all with possibly the exception of 4 which I don’t really know how to deal with and possibly 5.

A quick table
Create Table DBO.Email_Checker
(
KeyField Int,
Email_Address Varchar(75)
)


Some Test Data

Insert Into DBO.Email_Checker
Values (1,'Freddy strange has an email address of freddy@gogo.com');
Insert Into DBO.Email_Checker
Values (2,'Marge.flint@dhlink.co.uk / johnflint@gmail.com');
Insert Into DBO.Email_Checker
Values (3,'Email: User1@retail.wholesale.co.uk & User1@Google');
Insert Into DBO.Email_Checker
Values (4,'Jerremyblack@gmail.comsnowwhite@ubisux.com');
Insert Into DBO.Email_Checker
Values (5,'Email USER2@giraffe,gaff');
Insert Into DBO.Email_Checker
Values (6,'LinkLink@hydrager.eu – manklink@hydrager.eu');
Insert Into DBO.Email_Checker
Values (7,'Johns sharp jsharp@fritty.co.jp');
Insert Into DBO.Email_Checker
Values (8,'lankyman@gloopy.gov.gb');
Insert Into DBO.Email_Checker
Values (9,'Mobile: 09120923123 Email hankerchief@dragon.boat.com');


The Ugly Code -

select KeyField, Ltrim(Rtrim(Lower(Email_Address))) AS Email
from DBO.Email_Checker
Where Lower(Email_Address) LIKE '%_@__%.__%'
AND LEN(Email_Address) - LEN(REPLACE(Email_Address,'@','')) =1


Drop Table

--Drop Table DBO.Email_Checker


Any ideas?
SGT_squeequal
SGT_squeequal
SSC-Addicted
SSC-Addicted (459 reputation)SSC-Addicted (459 reputation)SSC-Addicted (459 reputation)SSC-Addicted (459 reputation)SSC-Addicted (459 reputation)SSC-Addicted (459 reputation)SSC-Addicted (459 reputation)SSC-Addicted (459 reputation)

Group: General Forum Members
Points: 459 Visits: 1060
its not much better but here is another way that retrives any email address if its the last test in your string and has a space.



select reverse(substring(reverse(Email_Address),0,charindex(' ',reverse(Email_Address))))
from DBO.Email_Checker




*************************************************************

The first is always the hardestw00t
Sean Lange
Sean Lange
One Orange Chip
One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)

Group: General Forum Members
Points: 25879 Visits: 17510
As you said the data here is horrible to start with. Here is one way to deal with it. Use the DelimitedSplit8K function to first parse each string element apart based on spaces. Then take the list of all string segments and determine which ones are valid emails. The code for the DelimitedSpklit8K can be found by following the link in my signature about splitting strings.

The email part is in fact quite timely. This topic is being discussed right now in another thread. http://www.sqlservercentral.com/Forums/Topic1406771-386-1.aspx

I would recommend using the one that Lowell posted using System.Net.MailAddress as a way to validate the address. Of course this assumes you are able to use CLR.

Using those two functions I was able to return 11 of the possible 12 emails in your sample code. The one that gets missed (Jerremyblack@gmail.comsnowwhite@ubisux.com) is just simply never going to be able to be determined with code.

Here is the code that I came up with for this.


--this will split everything into segments first
;with cte as
(
select * from email_Checker
cross apply dbo.DelimitedSplit8K(Email_address, ' ')
)
, ParseEmail as
(
select *, dbo.IsValidEmail(Item) as IsValidEmail from cte
)

select * from ParseEmail
where IsValidEmail = 1



Make sure you read and understand the code in both of these functions. The splitter function is super fast and easy to use, but understanding it might take a bit of reading and rereading. The CLR function is pretty straight forward.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
dwain.c
dwain.c
SSCertifiable
SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)SSCertifiable (7.2K reputation)

Group: General Forum Members
Points: 7213 Visits: 6431
I just went through a similar exercise in this thread:
http://www.sqlservercentral.com/Forums/Topic1402134-391-1.aspx

It's farily long but I think you'll find some useful examples in it. Towards the end of it is an example that snatches out multiple email addresses from a string.


My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search