Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12

Regular Expressions Expand / Collapse
Author
Message
Posted Tuesday, January 15, 2013 10:54 AM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 9:14 AM
Points: 12,918, Visits: 32,088
ok i'm not sure if this is doing a valid performance test or not;

i had initially thought that creating an object would be slower than regex, but that's not what i'm seeing. On my Dev machine with no load on it at all, but limited to a max memory of 1024 Meg, i'm getting results for 1 Million random rows of data in about 35 seconds for the .Net item, but the Regex seems to go for more than eight minutes; i wonder if it's because regex is not System.IDisposable?

does anyone see similar results, or am i missing something obvious?
results:
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== BASELINE ==========

SQL Server Execution Times:
CPU time = 125 ms, elapsed time = 526 ms.
================================================================================
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== Using Mail Object Function ==========

(1000000 row(s) affected)

SQL Server Execution Times:
CPU time = 23665 ms, elapsed time = 30350 ms.
================================================================================
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== Using RegEx Function ==========
--Gave up after Eight Plus minutes

this is what i used for testing:
I generate some random data, that sometimes matches the expected format with @ and .com in it.

WITH
cteBase AS
( --=== Gen enough 8 character random codes
SELECT TOP 1000000 LEFT(NEWID(),8) AS RandomCode
FROM Master.sys.All_Columns ac1
CROSS JOIN Master.sys.All_Columns ac2
)
,
cteManip AS
( --=== Number the codes so we can select unique random codes
SELECT RandomCode
+ CASE WHEN RandomCode LIKE '[A-Z]%' THEN '@' ELSE '' END
+ RandomCode
+ CASE WHEN RandomCode LIKE '[A-Z]%' THEN '.com' ELSE '' END
AS RandomCode
FROM cteBase

) --=== Select 1 million random codes
SELECT identity(int,1,1) as ID,RandomCode
INTO #MySampleEmails
FROM cteManip


DECLARE @HOLDER VARCHAR(20);

PRINT REPLICATE('-',80);
PRINT 'CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE';
PRINT REPLICATE('-',80);
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

PRINT '========== BASELINE ==========';
SET STATISTICS TIME ON;
SELECT @HOLDER = COUNT(*)
FROM #MySampleEmails;
SET STATISTICS TIME OFF;
PRINT REPLICATE('=',80);

PRINT REPLICATE('-',80);
PRINT 'CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE';
PRINT REPLICATE('-',80);
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

PRINT '========== Using Mail Object Function ==========';
SET STATISTICS TIME ON;
SELECT *,dbo.IsValidEmail1(RandomCode) As IsValidEmail
into #deleteme1
FROM #MySampleEmails
SET STATISTICS TIME OFF;

PRINT REPLICATE('=',80);
PRINT REPLICATE('-',80);
PRINT 'CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE';
PRINT REPLICATE('-',80);
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;


PRINT '========== Using RegEx Function ==========';
SET STATISTICS TIME ON;
SELECT *,dbo.IsValidEmail2(RandomCode) As IsValidEmail
into #deleteme2
FROM #MySampleEmails

SET STATISTICS TIME OFF;

PRINT REPLICATE('=',80);
PRINT REPLICATE('-',80);
PRINT 'CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE';
PRINT REPLICATE('-',80);
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

drop table #deleteme1
drop table #deleteme2
drop table #MySampleEmails



Lowell

--There is no spoon, and there's no default ORDER BY in sql server either.
Actually, Common Sense is so rare, it should be considered a Superpower. --my son
Post #1407379
Posted Tuesday, January 15, 2013 12:25 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 11:46 AM
Points: 13,481, Visits: 12,338
Lowell I dropped the million to pull only 1,000 and it is painfully obvious that the MailAddress method blows the doors off the regex version for performance.

Here is the output.


(1000 row(s) affected)
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== BASELINE ==========

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
================================================================================
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== Using Mail Object Function ==========

SQL Server Execution Times:
CPU time = 62 ms, elapsed time = 74 ms.

(1000 row(s) affected)
================================================================================
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== Using RegEx Function ==========

SQL Server Execution Times:
CPU time = 7379 ms, elapsed time = 7501 ms.

(1000 row(s) affected)
================================================================================
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.


Until yesterday I had never had the regex code that I posted used in a CLR but I can't think of anything that would help it perform faster. Seems to me that the MailAddress version will perform better in addition to being more thoroughly tested. Although I did run this a lot of times with the following query .

select d1.RandomCode, d1.IsValidEmail, d2.IsValidEmail 
from #deleteme1 d1
left join #Deleteme2 d2 on d1.Randomcode = d2.randomCode
where d1.IsValidEmail <> isnull(d2.IsValidEmail, 2)

select d1.RandomCode, d1.IsValidEmail, d2.IsValidEmail
from #deleteme2 d2
left join #Deleteme1 d1 on d1.Randomcode = d2.randomCode
where d2.IsValidEmail <> isnull(d1.IsValidEmail, 2)


I was wondering if I would find any of these random codes that returned different results but I didn't seem to have any luck finding any. I wonder if turning them into table valued CLR's would help performance over the scalar udf. I am swamped right now but this sounds like a fun project.


_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1407421
Posted Tuesday, January 15, 2013 9:05 PM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 6:43 AM
Points: 7,127, Visits: 12,656
I was really surprised at how poorly the RegEx did compared to MailAddress. I am now wondering what MailAddress uses to determine validity.

Here are a couple TVFs to go along with the SVFs:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
using System.Collections;
using System.Net.Mail;

public partial class UserDefinedFunctions
{
    
private static readonly Regex re = new Regex("^([0-9a-zA-Z]([-.\\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\\w]*[0-9a-zA-Z]\\.)+[a-zA-Z]{2,9})$");

    
[Microsoft.SqlServer.Server.SqlFunction]
    
public static SqlBoolean IsValidEmailRegEx(SqlString email)
    
{
        
return re.IsMatch(email.ToString());
    
}

    [SqlFunction(FillRowMethodName = "EmailIsValidRegEx",
        TableDefinition = "EmailIsValid bit")]
    
public static IEnumerable EmailValidityRegEx(SqlString emailAddress)
    
{
        ArrayList EmailCollection
= new ArrayList();
        
EmailCollection.Add(emailAddress);
        
return EmailCollection;
    
}

    
//FillRow method. The method name has been specified above as
    //a SqlFunction attribute property
    
public static void EmailIsValidRegEx(object emailAddress,
                                    
out SqlBoolean emailIsValid)
    
{
        emailIsValid
= re.IsMatch(emailAddress.ToString());
    
}

    
}
;



using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections;
using System.Net.Mail;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    
public static SqlBoolean IsValidEmailSystemNet(SqlString email)
    
{
        
try
        
{
            System.Net.Mail.MailAddress ma
= new System.Net.Mail.MailAddress(email.ToString());
            
return (SqlBoolean)true;
        
}
        
catch
        
{
            
return (SqlBoolean)false;
        
}
    }

    [SqlFunction(FillRowMethodName = "EmailIsValidSystemNet",
        TableDefinition = "EmailIsValid bit")]
    
public static IEnumerable EmailValiditySystemNet(SqlString emailAddress)
    
{
        ArrayList EmailCollection
= new ArrayList();
        
EmailCollection.Add(emailAddress);
        
return EmailCollection;
    
}

    
//FillRow method. The method name has been specified above as
    //a SqlFunction attribute property
    
public static void EmailIsValidSystemNet(object emailAddress,
                                    
out SqlBoolean emailIsValid)
    
{
        
try
        
{
            MailAddress ma
= new MailAddress(emailAddress.ToString());
            
emailIsValid = (SqlBoolean)true;
        
}
        
catch
        
{
            emailIsValid
= (SqlBoolean)false;
        
}
    }

}
;


On my lowly 32-bit 2.4 GHz Intel Core 2 Duo laptop w/4GB RAM the MailAddress SVF took the prize although I wonder if on a machine with more CPUs if there would be a tipping where the TVF might finish in a shorter amount of time:

SQL Server parse and compile time: 
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

(1000 row(s) affected)
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== BASELINE ==========

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
================================================================================
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== Using Mail Object Function ==========

SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 61 ms.

(1000 row(s) affected)
================================================================================
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== Using RegEx Function ==========

SQL Server Execution Times:
CPU time = 7145 ms, elapsed time = 7437 ms.

(1000 row(s) affected)
================================================================================
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== Using Table-Valued Mail Object Function ==========

SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 80 ms.

(1000 row(s) affected)
================================================================================
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
========== Using Table-Valued RegEx Function ==========

SQL Server Execution Times:
CPU time = 7207 ms, elapsed time = 7429 ms.

(1000 row(s) affected)
================================================================================
--------------------------------------------------------------------------------
CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE
--------------------------------------------------------------------------------
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.



USE test;
SET STATISTICS TIME OFF;

IF OBJECT_ID(N'tempdb..#MySampleEmails') IS NOT NULL
DROP TABLE #MySampleEmails;
GO

WITH cteBase
AS (
--=== Gen enough 8 character random codes
SELECT TOP 1000
LEFT(NEWID(), 8) AS RandomCode
FROM Master.sys.All_Columns ac1
CROSS JOIN Master.sys.All_Columns ac2
),
cteManip
AS (
--=== Number the codes so we can select unique random codes
SELECT RandomCode + CASE WHEN RandomCode LIKE '[A-Z]%' THEN '@'
ELSE ''
END + RandomCode + CASE WHEN RandomCode LIKE '[A-Z]%' THEN '.com'
ELSE ''
END AS RandomCode
FROM cteBase
)
--=== Select 1 million random codes
SELECT IDENTITY( INT,1,1 ) AS ID,
RandomCode
INTO #MySampleEmails
FROM cteManip;

---------------------------------------

IF OBJECT_ID(N'tempdb..#deleteme1') IS NOT NULL
DROP TABLE #deleteme1;
GO
IF OBJECT_ID(N'tempdb..#deleteme2') IS NOT NULL
DROP TABLE #deleteme2;
GO
IF OBJECT_ID(N'tempdb..#deleteme3') IS NOT NULL
DROP TABLE #deleteme3;
GO
IF OBJECT_ID(N'tempdb..#deleteme4') IS NOT NULL
DROP TABLE #deleteme4;
GO

DECLARE @HOLDER VARCHAR(20);

PRINT REPLICATE('-', 80);
PRINT 'CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE';
PRINT REPLICATE('-', 80);
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

PRINT '========== BASELINE ==========';
SET STATISTICS TIME ON;
SELECT @HOLDER = COUNT(*)
FROM #MySampleEmails;
SET STATISTICS TIME OFF;
PRINT REPLICATE('=', 80);

PRINT REPLICATE('-', 80);
PRINT 'CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE';
PRINT REPLICATE('-', 80);
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

PRINT '========== Using Mail Object Function ==========';
SET STATISTICS TIME ON;
SELECT *,
dbo.IsValidEmailSystemNet(RandomCode) AS IsValidEmail
INTO #deleteme1
FROM #MySampleEmails
OPTION (MAXDOP 0);
SET STATISTICS TIME OFF;

PRINT REPLICATE('=', 80);
PRINT REPLICATE('-', 80);
PRINT 'CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE';
PRINT REPLICATE('-', 80);
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;


PRINT '========== Using RegEx Function ==========';
SET STATISTICS TIME ON;
SELECT *,
dbo.IsValidEmailRegEx(RandomCode) AS IsValidEmail
INTO #deleteme2
FROM #MySampleEmails;

SET STATISTICS TIME OFF;

PRINT REPLICATE('=', 80);
PRINT REPLICATE('-', 80);
PRINT 'CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE';
PRINT REPLICATE('-', 80);
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

PRINT '========== Using Table-Valued Mail Object Function ==========';
SET STATISTICS TIME ON;
SELECT mse.ID,
mse.RandomCode,
x.EmailIsValid
INTO #deleteme4
FROM #MySampleEmails mse
CROSS APPLY dbo.EmailValiditySystemNet(mse.RandomCode) x;

SET STATISTICS TIME OFF;

PRINT REPLICATE('=', 80);
PRINT REPLICATE('-', 80);
PRINT 'CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE';
PRINT REPLICATE('-', 80);
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

PRINT '========== Using Table-Valued RegEx Function ==========';
SET STATISTICS TIME ON;
SELECT mse.ID,
mse.RandomCode,
x.EmailIsValid
INTO #deleteme3
FROM #MySampleEmails mse
CROSS APPLY dbo.EmailValidityRegEx(mse.RandomCode) x;

SET STATISTICS TIME OFF;

PRINT REPLICATE('=', 80);
PRINT REPLICATE('-', 80);
PRINT 'CLEAR CACHE SO NEITHER QUERY GETS AN ADVANTAGE';
PRINT REPLICATE('-', 80);
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
IF OBJECT_ID(N'tempdb..#deleteme1') IS NOT NULL
DROP TABLE #deleteme1;
GO
IF OBJECT_ID(N'tempdb..#deleteme2') IS NOT NULL
DROP TABLE #deleteme2;
GO
IF OBJECT_ID(N'tempdb..#deleteme3') IS NOT NULL
DROP TABLE #deleteme3;
GO
IF OBJECT_ID(N'tempdb..#deleteme4') IS NOT NULL
DROP TABLE #deleteme4;
GO
IF OBJECT_ID(N'tempdb..#MySampleEmails') IS NOT NULL
DROP TABLE #MySampleEmails;
GO



__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Post #1407569
« Prev Topic | Next Topic »

Add to briefcase ««12

Permissions Expand / Collapse