How to use scalar function without WHILE

  • Hey,

I have a scalar function which calculates the similarity of two strings. I use the following query to compare the entries of one table against the value 'Test' and return the entries which have a value > 50:

;WITH cte1 AS (
    SELECT b.FirstName,
           (SELECT fn_similarity('Test', b.FirstName)) AS [Value],
           b.LastName
    FROM [AdventureWorks2012].[Person].[Person] b
)
SELECT *
FROM cte1
WHERE [Value] > 50.00
ORDER BY [Value] DESC

Now I want to run this query against the first 50 entries of the [Person] table, so that the result set includes all the values of the first 50 persons and the entries which are similar to them.

At the moment I use a WHILE loop and write the 50 single result sets into a temporary table. Is there another way / a better way, maybe via a join?

    Best regards

  • -- you have to test each row against all the other rows in the table i.e. CROSS JOIN:

;WITH [50RandomRows] AS (
    SELECT TOP 50 *
    FROM [AdventureWorks2012].[Person].[Person]
    ORDER BY NEWID()
)
SELECT
    a.BusinessEntityID,
    a.FirstName,
    b.BusinessEntityID,
    b.FirstName,
    x.[Value]
FROM [50RandomRows] a
CROSS JOIN [50RandomRows] b
CROSS APPLY (
    SELECT fn_similarity(a.FirstName, b.FirstName) AS [Value]
) x
WHERE a.BusinessEntityID <> b.BusinessEntityID -- don't match to the same row
AND x.[Value] > 50.00

    An inline table-valued function is likely to perform considerably better than this UDF.


Low-hanging fruit picker and defender of the moggies

For better assistance in answering your questions, please read this.

Understanding and using APPLY, (I) and (II) - Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop - Jeff Moden

Perfect, thank you very much, that helped a lot!

    Best regards!

  • You're welcome. If you've run the code you will have noticed that you get duplicates - the same name pair appears twice in the result set, a.firstname + b.firstname and b.firstname + a.firstname.

    If you change this

    WHERE a.BusinessEntityID <> b.BusinessEntityID

    to this

    WHERE a.BusinessEntityID > b.BusinessEntityID

    you will eliminate the dupes, and make the query a little faster too. It's always going to be expensive (read "slow"), and you would be well advised, as I mentioned earlier, to change the UDF to an iTVF. Post up the function code if you're not sure how to do this.
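The effect of the two predicates can be sketched outside SQL. This short Python snippet (illustrative only, not part of the thread's T-SQL) shows that `<>` keeps both orderings of every pair, while `>` keeps exactly one ordering, halving the number of similarity calls without losing any pair:

```python
# Cross join of a set with itself, filtered two ways:
#   a != b keeps both (a, b) and (b, a)  -> every pair scored twice
#   a >  b keeps only one ordering       -> every pair scored once
ids = [1, 2, 3]

full_join = [(a, b) for a in ids for b in ids if a != b]
triangular = [(a, b) for a in ids for b in ids if a > b]

print(len(full_join))   # 6 ordered pairs
print(len(triangular))  # 3 unordered pairs

# Both filters cover exactly the same set of unordered pairs
assert {frozenset(p) for p in full_join} == {frozenset(p) for p in triangular}
```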



Thank you very much ChrisM@home, that helped me a lot!

    Best regards!

  • vip.blade (1/19/2014)


    I have a scalar function, which calculates the similarity of two strings.

    Hi.

    Would you post that function, please? I could use something like that and I might be able to turn it into an iTVF for you.

    Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • There are some points that confuse me.

You say you need the first 50 entries from your table. How do you define the first 50? What order do you have? Why don't you have a TOP 50? Would those be the first 50 with a value greater than 50, or only the rows with a value greater than 50 within the first 50 rows?

And what confuses me more: why do you use a subquery to call the scalar function, instead of something as simple as this?

SELECT b.FirstName,
       fn_similarity('Test', b.FirstName) AS [Value],
       b.LastName
FROM [AdventureWorks2012].[Person].[Person] b

    It won't help the performance but it's easier to read.

    I support the idea of sharing the code of the function to help you to improve it.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Hi,

thanks for your fast response! It's a function to calculate the Levenshtein distance from this article: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/

As far as I know there is no way to transform it into an iTVF, and the only way to get a faster calculation is to use a CLR. ChrisM@home helped me a lot with his approach and it is far better than a WHILE loop.

@Luis Cazares: to get the first 50 rows I used / use a CTE (this was not visible in my first post, sorry)

    Best regards!

  • vip.blade (1/27/2014)


Hi,

thanks for your fast response! It's a function to calculate the Levenshtein distance from this article: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/

As far as I know there is no way to transform it into an iTVF, and the only way to get a faster calculation is to use a CLR. ChrisM@home helped me a lot with his approach and it is far better than a WHILE loop.

@Luis Cazares: to get the first 50 rows I used / use a CTE (this was not visible in my first post, sorry)

    Best regards!

    That amazing, comprehensive, detailed work of art written by Thomas Keller!

I spent a good few hours working through the article and attempting to write an iTVF equivalent, before realising that it was turning into a diminishing-returns exercise. This is what I came up with, more or less. You will need to tinker with [placeholder for any other shortcuts] using LIKE to compare reference and target. In practice, it was more than good enough for matching and acceptable for performance. Have fun and - if you can - post up any improvements.

USE [Matching]
GO
/****** Object: UserDefinedFunction [dbo].[IF_Levenshtein02] Script Date: 27/01/2014 20:12:53 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- this will score around 10,000 word pairs per second on 2010 laptop technology
ALTER FUNCTION [dbo].[IF_Levenshtein02]
(
    @Reference VARCHAR(20), @Target VARCHAR(20)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
( -- output query
    SELECT [Score %] = CASE
        WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5,2))
        WHEN 0 = 1 THEN CAST(100 AS NUMERIC(5,2)) -- placeholder for any other shortcuts
        ELSE (
            SELECT [Score %] = CAST(SUM(LetterScore)*100.0/MAX(WordLength*WordLength) AS NUMERIC(5,2))
            FROM ( -- do
                SELECT
                    seq = t1.n,
                    ref.Letter,
                    v.WordLength,
                    LetterScore = v.WordLength - ISNULL(MIN(tgt.n), v.WordLength)
                FROM ( -- v
                    SELECT
                        Reference = LEFT(@Reference + REPLICATE('_',WordLength), WordLength),
                        Target = LEFT(@Target + REPLICATE('_',WordLength), WordLength),
                        WordLength = WordLength
                    FROM ( -- di
                        SELECT WordLength = MAX(WordLength)
                        FROM (VALUES (DATALENGTH(@Reference)),(DATALENGTH(@Target))) d (WordLength)
                    ) di
                ) v
                CROSS APPLY ( -- t1
                    SELECT TOP(WordLength) n
                    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
                ) t1
                CROSS APPLY (SELECT Letter = SUBSTRING(Reference, t1.n, 1)) ref
                OUTER APPLY ( -- tgt
                    SELECT TOP(WordLength) n = ABS(t1.n - t2.n)
                    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
                    WHERE SUBSTRING(@Target, t2.n, 1) = ref.Letter
                ) tgt
                GROUP BY t1.n, ref.Letter, v.WordLength
            ) do
        )
    END
) -- output query



  • Hi,

thanks for your fast response! It's a function to calculate the Levenshtein distance from this article: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/

As far as I know there is no way to transform it into an iTVF, and the only way to get a faster calculation is to use a CLR. ChrisM@home helped me a lot with his approach and it is far better than a WHILE loop.

@Luis Cazares: to get the first 50 rows I used / use a CTE (this was not visible in my first post, sorry)

First and foremost, the link you posted to Thomas Keller's article is about how to calculate the Damerau–Levenshtein distance. The Damerau–Levenshtein distance (DLD) is a more complex version of the Levenshtein distance. I mention that so that everyone is on the same page about what algorithm we're talking about.

The main difference between the two is that the Levenshtein distance (LD) counts how many inserts, deletes, or character substitutions it would take to transform one word into another. DLD measures the same things but considers a transposition of two adjacent characters a single edit operation. For example, the LD between their and thier is 2 (change e to i, and i to e [or vice versa]); the DLD between their and thier, however, is 1 (you can switch [or transpose] i & e). DLD is a more powerful string metric but it's also much more complex and slower.
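For anyone who wants to verify the their/thier numbers, here is a short illustration in Python (not from the thread or the article; the `damerau_levenshtein` shown is the common optimal-string-alignment variant of DLD):

```python
def levenshtein(s, t):
    # classic dynamic-programming edit distance: insert/delete/substitute
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # substitute
    return d[m][n]

def damerau_levenshtein(s, t):
    # optimal-string-alignment variant: additionally counts an
    # adjacent transposition as a single edit
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[m][n]

print(levenshtein("their", "thier"))          # 2
print(damerau_levenshtein("their", "thier"))  # 1
```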

I mention this because, depending on how accurate your metric needs to be, the Levenshtein distance may also be an option. There are a number of options for LD. My favorite is a CLR function that is super-easy to set up called mdq.Similarity. Using steps 1 & 2 in this article you can create the assembly.

CREATE FUNCTION mdq.[Similarity](@input1 [nvarchar](4000), @input2 [nvarchar](4000), @method [tinyint], @containmentBias [float], @minScoreHint [float])
RETURNS [float] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
AS EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[Similarity]
GO

This CLR ships with all SQL Server 2008 R2 & 2012 Master Data Services/Data Quality Services.

With mdq.Similarity you get Levenshtein, the Jaccard similarity coefficient, a form of the Jaro–Winkler distance, and the longest common subsequence algorithm. Here's a good example of mdq.Similarity in use.

    Regarding a Levenshtein iTVF:

I tried for many months to do a loopless Levenshtein and failed. It was a worthwhile endeavor because I learned a ton about SQL, T-SQL, adjacency matrices and math. I can't even think about a loopless Damerau–Levenshtein.

"I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • ChrisM@home (1/27/2014)


That amazing, comprehensive, detailed work of art written by Thomas Keller! [...] This is what I came up with, more or less.

    I did not see this when I posted my response. This is really amazing work Chris!


  • Thanks Alan. I should point out that it's quite a long way from being a full Levenshtein implementation! I don't think it would be impossible - just too costly.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
