July 16, 2010 at 1:17 pm
Jeff Moden (7/16/2010)
Julie Breutzmann (7/16/2010)
I've modified Lowell's function and performed a time-comparison test. I haven't thoroughly tested the function, since if it's too slow it won't be useful anyway. I used an input file of about 2,000 records and compared on first name, last name, and address line 1 against our database of about 800,000 records. It takes less than 1 second with a simple comparison and over 10 minutes using the function. However, the function DID find 69 matches that the simple comparison did not.
That's what I was afraid of. There is a way to do this same thing using an iTVF (inline Table Valued Function) and a CROSS APPLY that should speed things up by (possibly) an order of magnitude. The problem is that I won't be at a machine with SQL Server on it until Monday, and any code I would write would be untested until then.
This is not a rush project. I'd appreciate anything you can come up with when you're able to get to it.
Julie
July 16, 2010 at 1:42 pm
I'll try to remember it on Monday, then, Julie. Thanks for your patience.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 19, 2010 at 8:37 am
Jeff Moden (7/15/2010)
Eugene Elutin (7/14/2010)
Julie Breutzmann (7/14/2010)
...What would be an efficient way to code this?
...
If you need high-performance, heavy string-manipulation functions, write them as CLR functions in C#. A T-SQL UDF will always be slower...
Maybe in this case... maybe not. I've beaten certain CLR functions.
Any examples? To "some" degree it depends on the developer's proficiency in C#...
But again, as per "Never say never", I should stick to "Never say always"; you might be right...;-)
July 19, 2010 at 4:08 pm
Eugene Elutin (7/19/2010)
Jeff Moden (7/15/2010)
Eugene Elutin (7/14/2010)
Julie Breutzmann (7/14/2010)
...What would be an efficient way to code this?
...
If you need high-performance, heavy string-manipulation functions, write them as CLR functions in C#. A T-SQL UDF will always be slower...
Maybe in this case... maybe not. I've beaten certain CLR functions.
Any examples? To "some" degree it depends on the developer's proficiency in C#...
But again, as per "Never say never", I should stick to "Never say always"; you might be right...;-)
Unfortunately, no. Matt Miller and I went round and round a long time ago and I actually beat some "Regex" CLRs, but I can't put my hands on those particular threads anymore. I'm not sure Matt can either, but I believe he'll vouch for my claims.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 19, 2010 at 5:35 pm
Jeff Moden (7/19/2010)
Eugene Elutin (7/19/2010)
Jeff Moden (7/15/2010)
Eugene Elutin (7/14/2010)
Julie Breutzmann (7/14/2010)
...What would be an efficient way to code this?
...
If you need high-performance, heavy string-manipulation functions, write them as CLR functions in C#. A T-SQL UDF will always be slower...
Maybe in this case... maybe not. I've beaten certain CLR functions.
Any examples? To "some" degree it depends on the developer's proficiency in C#...
But again, as per "Never say never", I should stick to "Never say always"; you might be right...;-)
Unfortunately, no. Matt Miller and I went round and round a long time ago and I actually beat some "Regex" CLRs, but I can't put my hands on those particular threads anymore. I'm not sure Matt can either, but I believe he'll vouch for my claims.
I believe you. However, Regex is not the fastest way to do things in C#: short and elegant, yes; fast, not really; it's too heavy a library. The fastest string manipulation would be byte by byte. Not so elegant, and a lot of manual coding, but very fast. Let's say this: it's not easy to beat a proper implementation in C#. It may be possible, and I have heard such claims, but I have not seen a good example yet...
One more thing, just as an example. In C#, the Replace function is case sensitive. To do a case-insensitive replace, Regex is often used. However, the faster option is to do such a replacement char by char (byte by byte). You will be surprised: this beats the built-in case-sensitive Replace function as well, in all cases except when there is nothing to replace:
http://www.codeproject.com/KB/string/fastestcscaseinsstringrep.aspx
It would be interesting to compare it with the SQL REPLACE...
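For what it's worth, the T-SQL side of such a comparison is just a timed call to the built-in REPLACE. A minimal sketch, assuming a hypothetical test table dbo.TestStrings with a VARCHAR column s (both names are illustrative only):
-- rough timing sketch for the T-SQL side; dbo.TestStrings(s) is a hypothetical test table
SET STATISTICS TIME ON;
SELECT MAX(LEN(REPLACE(s, 'abc', 'xyz')))  -- MAX() just forces the work without returning every row
FROM dbo.TestStrings;
SET STATISTICS TIME OFF;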
July 20, 2010 at 9:20 pm
Eugene Elutin (7/19/2010)
The fastest string manipulation would be byte by byte. Not so elegant, and a lot of manual coding, but very fast.
Now, I'll definitely agree with THAT! Done correctly, it will be very fast, indeed! I also agree that it would be interesting to have a simple race between the built-in REPLACE (T-SQL) and a CLR.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 21, 2010 at 1:40 pm
I agree that this "may" perform better as a CLR; however, if I were stuck doing this in T-SQL, I would create a blacklist table that houses each character I want to scrub along with its replacement value. This gives you a lot more flexibility and makes the process easier to maintain. You could potentially do something like this. I have not tested large-scale performance, but this should be better than a scalar function.
*I am assuming that you have a numbers table*
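If you don't have one yet, a minimal numbers (tally) table along these lines will do; the name dbo.Numbers and its column n are what the function below expects, and the 8,000-row limit is just an assumption sized to typical VARCHAR lengths:
-- minimal numbers (tally) table sketch; dbo.Numbers(n) is the name the function below assumes
CREATE TABLE dbo.Numbers (n INT NOT NULL PRIMARY KEY);
INSERT INTO dbo.Numbers (n)
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b;
GO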
USE [tempdb]
GO
SET NOCOUNT ON;
GO
CREATE TABLE dbo.Blacklist(ExlChar CHAR(1),RepChar CHAR(1));
INSERT INTO Blacklist VALUES ('.','')
INSERT INTO Blacklist VALUES (' ','')
INSERT INTO Blacklist VALUES ('-','')
INSERT INTO Blacklist VALUES ('_','')
INSERT INTO Blacklist VALUES ('~','')
INSERT INTO Blacklist VALUES ('ž','z')
INSERT INTO Blacklist VALUES ('ó','o')
INSERT INTO Blacklist VALUES ('š','s')
INSERT INTO Blacklist VALUES ('í','i')
INSERT INTO Blacklist VALUES ('é','e')
--Create sample table
CREATE TABLE dbo.T(Id INT IDENTITY,col VARCHAR(50));
INSERT INTO t VALUES ('Haínes')
INSERT INTO t VALUES ('Haines')
INSERT INTO t VALUES (' -Haines')
INSERT INTO t VALUES ('-Haines ')
GO
--Create an Inline TVF function
CREATE FUNCTION dbo.fn_CleanStr()
RETURNS TABLE
AS
RETURN(
SELECT
    t1.Id,
    t1.col,
    (
    -- walk the string one character at a time via the numbers table,
    -- swap blacklisted characters for their replacements,
    -- then glue the characters back together with FOR XML PATH
    SELECT LTRIM(RTRIM(COALESCE(b.[RepChar], SUBSTRING(LTRIM(RTRIM(t2.col)), n.n, 1))))
    FROM t t2
    INNER JOIN dbo.Numbers n ON n.n <= LEN(LTRIM(RTRIM(t2.col)))
    LEFT JOIN dbo.Blacklist b ON b.[ExlChar] = SUBSTRING(LTRIM(RTRIM(t2.col)), n.n, 1)
    WHERE t2.Id = t1.Id
    ORDER BY t2.Id, t2.[col], n.n
    FOR XML PATH(''), TYPE
    ).value('.', 'varchar(max)') AS CleanStr
FROM t t1
);
GO
--Begin solution
SELECT fn.*
FROM dbo.fn_CleanStr() AS fn
WHERE fn.CleanStr = 'Haines'
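As a purely illustrative follow-up, the cleaned output could then be matched against a similarly cleaned copy of the input file; dbo.InputFile and its CleanName column below are assumptions for the sketch, not part of the posted solution:
-- hypothetical matching step: dbo.InputFile(CleanName) is assumed to hold the pre-cleaned input records
SELECT fn.Id, fn.col, i.CleanName
FROM dbo.fn_CleanStr() AS fn
INNER JOIN dbo.InputFile AS i
        ON i.CleanName = fn.CleanStr;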
July 22, 2010 at 5:51 am
Jeff Moden (7/20/2010)
...it would be interesting to have a simple race between the built-in REPLACE (T-SQL) and a CLR.
In a simple race, a single T-SQL REPLACE outperforms the simple byte-by-byte CLR version: it is 3 times faster. That's not surprising, as the CLR call doesn't come cheap...
However, for multiple replacements on the same string, a dedicated CLR replace function that handles all of the replacements in a single call catches up with the T-SQL one. Once the number of replacements reaches 15, the CLR version consistently outperforms REPLACE.
Also, I believe the CLR function (based on the code from the link supplied in my previous post) can be tuned further...
using System;
using System.Collections.Generic;
using System.Text;
using System.Data.SqlClient;
using System.Data.SqlTypes;
namespace QuickReplaceX
{
    public class ReplaceX
    {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString Replace(string original, string pattern, string replacement)
        {
            return (SqlString)doReplace(original, pattern, replacement);
        }

        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString ReplaceMltPattern(string original, string patterns, string replacement, char separator)
        {
            string res = original;
            string[] aPatterns = patterns.Split(separator);
            for (int i = 0; i < aPatterns.Length; i++)
            {
                res = doReplace(res, aPatterns[i], replacement);
            }
            return (SqlString)res;
        }

        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString ReplaceMltAll(string original, string patterns, string replacements, char separator)
        {
            string res = original;
            string[] aPatterns = patterns.Split(separator);
            string[] aReplacements = replacements.Split(separator);
            if (aPatterns.Length != aReplacements.Length)
            {
                throw new ApplicationException("ReplaceMltAll: number of patterns must match number of replacements");
            }
            for (int i = 0; i < aPatterns.Length; i++)
            {
                res = doReplace(res, aPatterns[i], aReplacements[i]);
            }
            return (SqlString)res;
        }

        // Case-insensitive replace done character by character, without Regex.
        private static string doReplace(string original, string pattern, string replacement)
        {
            int count, position0, position1;
            count = position0 = position1 = 0;
            string upperString = original.ToUpper();
            string upperPattern = pattern.ToUpper();
            int inc = (original.Length / pattern.Length) *
                      (replacement.Length - pattern.Length);
            char[] chars = new char[original.Length + Math.Max(0, inc)];
            while ((position1 = upperString.IndexOf(upperPattern, position0)) != -1)
            {
                // copy the unmatched stretch, then the replacement text
                for (int i = position0; i < position1; ++i)
                    chars[count++] = original[i];
                for (int i = 0; i < replacement.Length; ++i)
                    chars[count++] = replacement[i];
                position0 = position1 + pattern.Length;
            }
            if (position0 == 0) return original;
            // copy the tail after the last match
            for (int i = position0; i < original.Length; ++i)
                chars[count++] = original[i];
            return new string(chars, 0, count);
        }
    }
}
and to register:
CREATE ASSEMBLY [QuickReplaceX]
AUTHORIZATION [dbo]
FROM '[full path]\QuickReplaceX.dll'
WITH PERMISSION_SET = SAFE
GO
CREATE FUNCTION QReplace(@original NVARCHAR(MAX), @pattern NVARCHAR(MAX), @replacement NVARCHAR(MAX))
RETURNS NVARCHAR(MAX) AS EXTERNAL NAME [QuickReplaceX].[QuickReplaceX.ReplaceX].[Replace]
GO
CREATE FUNCTION QReplaceMltPattern(@original NVARCHAR(MAX), @patterns NVARCHAR(MAX), @replacement NVARCHAR(MAX), @separator NCHAR(1))
RETURNS NVARCHAR(MAX) AS EXTERNAL NAME [QuickReplaceX].[QuickReplaceX.ReplaceX].[ReplaceMltPattern]
GO
CREATE FUNCTION QReplaceMltAll(@original NVARCHAR(MAX), @patterns NVARCHAR(MAX), @replacements NVARCHAR(MAX), @separator NCHAR(1))
RETURNS NVARCHAR(MAX) AS EXTERNAL NAME [QuickReplaceX].[QuickReplaceX.ReplaceX].[ReplaceMltAll]
GO
Don't forget to enable CLR.
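For reference, enabling CLR integration is a standard sp_configure call, and (assuming the functions land in the default dbo schema) a quick smoke test might look like this; the sample values are illustrative only:
-- enable CLR integration on the instance (standard command; requires ALTER SETTINGS permission)
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO
-- quick smoke test with sample values; should return 'HainesSmith'
SELECT dbo.QReplaceMltAll(N'Haínes-Smith', N'í|-', N'i|', N'|');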
July 22, 2010 at 1:31 pm
Thank you Adam. The function that I posted is actually using a partial blacklist table, so I'll look at merging our thoughts.
Thanks Eugene for your assistance. I'll try to understand and implement your solution, also.
It may be a while before I have a chance to explore these options. I am suddenly swamped with other things of higher priority. I will post my results back to this forum when I am able to do so.
Julie
July 22, 2010 at 4:23 pm
Eugene Elutin (7/22/2010)
Jeff Moden (7/20/2010)
...it would be interesting to have a simple race between the built-in REPLACE (T-SQL) and a CLR.
In a simple race, a single T-SQL REPLACE outperforms the simple byte-by-byte CLR version: it is 3 times faster. That's not surprising, as the CLR call doesn't come cheap...
However, for multiple replacements on the same string, a dedicated CLR replace function that handles all of the replacements in a single call catches up with the T-SQL one. Once the number of replacements reaches 15, the CLR version consistently outperforms REPLACE.
Also, I believe the CLR function (based on the code from the link supplied in my previous post) can be tuned further...
Thanks for the testing Eugene.
--Jeff Moden
Change is inevitable... Change for the better is not.