Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

getting Email addresses from Text Expand / Collapse
Author
Message
Posted Tuesday, January 15, 2013 9:41 AM


Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, January 08, 2014 8:28 AM
Points: 15, Visits: 70
Hi Guys,
I have written a very hasty piece of code and have not been given much time to review this part but, because it is so hasty, I know in my water that this is not best practice, though it works and is picking up 85% of the email addresses (even with rubbish as well). It should be able to be written so that it picks up 9x% of clean data and I know that some of you have done similar work before.

I am trying to pull out a single email address from data, Test data can be as shown below. As you can see, its pretty dirty. It doesn’t pick just the email address. It ignores any with more than one email address (while ideally the first would be best option) and I end up with just 1,7,8 and 9 with more than just email address.

Ideally it should be all with possibly the exception of 4 which I don’t really know how to deal with and possibly 5.

A quick table
Create Table DBO.Email_Checker
(
KeyField Int,
Email_Address Varchar(75)
)


Some Test Data
Insert Into DBO.Email_Checker
Values (1,'Freddy strange has an email address of freddy@gogo.com');
Insert Into DBO.Email_Checker
Values (2,'Marge.flint@dhlink.co.uk / johnflint@gmail.com');
Insert Into DBO.Email_Checker
Values (3,'Email: User1@retail.wholesale.co.uk & User1@Google');
Insert Into DBO.Email_Checker
Values (4,'Jerremyblack@gmail.comsnowwhite@ubisux.com');
Insert Into DBO.Email_Checker
Values (5,'Email USER2@giraffe,gaff');
Insert Into DBO.Email_Checker
Values (6,'LinkLink@hydrager.eu – manklink@hydrager.eu');
Insert Into DBO.Email_Checker
Values (7,'Johns sharp jsharp@fritty.co.jp');
Insert Into DBO.Email_Checker
Values (8,'lankyman@gloopy.gov.gb');
Insert Into DBO.Email_Checker
Values (9,'Mobile: 09120923123 Email hankerchief@dragon.boat.com');


The Ugly Code -
select KeyField, Ltrim(Rtrim(Lower(Email_Address))) AS Email
from DBO.Email_Checker
Where Lower(Email_Address) LIKE '%_@__%.__%'
AND LEN(Email_Address) - LEN(REPLACE(Email_Address,'@','')) =1


Drop Table
--Drop Table DBO.Email_Checker


Any ideas?
Post #1407340
Posted Tuesday, January 15, 2013 12:57 PM


SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Monday, April 21, 2014 2:36 PM
Points: 268, Visits: 995
its not much better but here is another way that retrives any email address if its the last test in your string and has a space.


select reverse(substring(reverse(Email_Address),0,charindex(' ',reverse(Email_Address))))
from DBO.Email_Checker




*************************************************************

The first is always the hardest
Post #1407441
Posted Tuesday, January 15, 2013 1:30 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 3:28 PM
Points: 12,034, Visits: 11,062
As you said the data here is horrible to start with. Here is one way to deal with it. Use the DelimitedSplit8K function to first parse each string element apart based on spaces. Then take the list of all string segments and determine which ones are valid emails. The code for the DelimitedSpklit8K can be found by following the link in my signature about splitting strings.

The email part is in fact quite timely. This topic is being discussed right now in another thread. http://www.sqlservercentral.com/Forums/Topic1406771-386-1.aspx

I would recommend using the one that Lowell posted using System.Net.MailAddress as a way to validate the address. Of course this assumes you are able to use CLR.

Using those two functions I was able to return 11 of the possible 12 emails in your sample code. The one that gets missed (Jerremyblack@gmail.comsnowwhite@ubisux.com) is just simply never going to be able to be determined with code.

Here is the code that I came up with for this.

--this will split everything into segments first
;with cte as
(
select * from email_Checker
cross apply dbo.DelimitedSplit8K(Email_address, ' ')
)
, ParseEmail as
(
select *, dbo.IsValidEmail(Item) as IsValidEmail from cte
)

select * from ParseEmail
where IsValidEmail = 1

Make sure you read and understand the code in both of these functions. The splitter function is super fast and easy to use, but understanding it might take a bit of reading and rereading. The CLR function is pretty straight forward.


_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1407466
Posted Tuesday, January 15, 2013 10:46 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 5:39 PM
Points: 3,596, Visits: 5,113
I just went through a similar exercise in this thread:
http://www.sqlservercentral.com/Forums/Topic1402134-391-1.aspx

It's farily long but I think you'll find some useful examples in it. Towards the end of it is an example that snatches out multiple email addresses from a string.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1407599
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse