Levenshtein UDF performance 'tuning' advise needed

Question

Levenshtein UDF performance 'tuning' advise needed

mike--p

Mr or Mrs. 500

Points: 550
More actions
November 6, 2014 at 1:20 am

#294986

Hello,
First off let me introduce myself. I'm Mike, 19 years old and living in the Netherlands. I'm currently working as trainee on a C#/SQL application. I'm totally new to SQL so that's why I need some help from you SQL guru's :).
So let me first explain my problem. The goal of the application is to match different files (e.g. Excel, CSV, TXT, database formats, etc.) with eachother. For example, we have a supplier list that contains product descriptions in the first file. The second file (from the administrative system) contains the same description but every description could be a little different compared to the product descriptions in the first file (e.g. extra dots, semicolons, letters, numbers, etc.). Now both of these files are imported to a SQL database table. To improve the matching the user first can remove, insert, change characters from a column (or multiple columns). After these edits the files should have a better match, however it still isn't going to be a 100% match.
This is where the levenshtein algorithm comes into play. With the levenshtein algorithm I'm able to get an approximate match between two strings, e.g. '10000' has 80% similarity to '00000'. This algorithm would be very useful to do the final match between my two imported files. So I made a levenshtein UDF from a dynamic link library. This function has two nvarchar's as parameter. The result from the function is a float that has a range from 0 - 100%. This function is then used in a query that looks like this:
SELECT T1.*, T2.* FROM myTable_1 T1 LEFT JOIN myTable_2 T2
ON (master.dbo.fn_Levenshtein(T1.[column_T1], T2.[column_T2]) >= percentage);
This query is working great! It returns the correct results I'm expecting. So what is the problem then? Performance! To match two tables with 6000 records the query takes around 1 minute and 30 seconds to finish. This is to be expected though, as the query has to iterate through 6000*6000 = 12.000.000 rows. Now I need some help to lower this number of iterations.
I was thinking about two solutions. The first solution I was thinking of is to stop iterating through the rows when two results are returned from the levenshtein comparison. So when in the first 50 rows 2 results are already found with the desired percentage, then break the iteration so we could skip the other 5950 rows.
I also was thinking about the LIKE statement. Not to use that particular function, but how the LIKE mechanism is working. As I could see in the executing plan the LIKE statement is only iterating through the row count of the table (6000 times) and not the +- 12.000.000 I get in my levenshtein function.
If anyone can help me out with 'tuning' the performance of my query I would really appreciate that!
I'll hope everything I explained is clear to understand, if not please tell me so I can try to explain myself even better.

Viewing 15 posts - 1 through 15 (of 25 total)

You must be logged in to reply to this topic. Login to reply

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 1

You may find some useful info in this thread. Post back if you have more questions.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

mike--p Mr or Mrs. 500 Points: 550 More actions · Answer 2

I tried the function you posted in that thread, still have the same results. As you stated you have to compare all the rows from both tables with eachother. So that means you still get alot of calculations the server has to do which result in slow performance.

I thought of another way with a CASE expression. So that I only do the levenshtein calculation whenever no exact matches are found. But I also get a weird execution plan with the CASE expression.

If I use this query(both tables having 36635 rows): SELECT * FROM Vergelijkingstabel_1 AS T1 LEFT JOIN Vergelijkingstabel_2 AS T2 ON (T1.policyID = T2.policyID);

The execution plan will look like this, the query executes within a second:

If i use this query(both tables having 36635 rows): SELECT * FROM Vergelijkingstabel_1 AS T1 LEFT JOIN Vergelijkingstabel_2 AS T2 ON

CASE

WHEN T1.policyID = T2.policyID THEN 'TRUE'

ELSE 'FALSE'

END = 'TRUE';

The execution plan will look like this, I cancelled the execution after 5 minutes :-P:

To me both queries are the same (atleast doing the same thing), does someone know why the performance goes bad on the second query?

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

mike--p (11/6/2014)
I tried the function you posted in that thread, still have the same results. As you stated you have to compare all the rows from both tables with eachother. So that means you still get alot of calculations the server has to do which result in slow performance.

Yes you do - but only once, not twice. You don't have to match (a with b) and also (b with a).

I thought of another way with a CASE expression. So that I only do the levenshtein calculation whenever no exact matches are found. But I also get a weird execution plan with the CASE expression.
If I use this query(both tables having 36635 rows): SELECT * FROM Vergelijkingstabel_1 AS T1 LEFT JOIN Vergelijkingstabel_2 AS T2 ON (T1.policyID = T2.policyID);
The execution plan will look like this, the query executes within a second:
If i use this query(both tables having 36635 rows): SELECT * FROM Vergelijkingstabel_1 AS T1 LEFT JOIN Vergelijkingstabel_2 AS T2 ON
CASE
WHEN T1.policyID = T2.policyID THEN 'TRUE'
ELSE 'FALSE'
END = 'TRUE';
The execution plan will look like this, I cancelled the execution after 5 minutes :-P:
To me both queries are the same (atleast doing the same thing), does someone know why the performance goes bad on the second query?

In the second query, you're forcing SQL Server to evaluate an expression on which to join.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

mike--p Mr or Mrs. 500 Points: 550 More actions · Answer 4

In the second query, you're forcing SQL Server to evaluate an expression on which to join.

Aha that explains why it's slow, thanks.

Is there any way to first check if there is an exact match between the two columns. If there is an exact match, we go and search the next record. When there isn't an exact match we start the levenshtein function. Or isn't this possible at all? Again thanks for all the help!

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 5

Sure. That's probably easiest if you post your existing query to use as a model. You may have to post some sample data too, let's see how we get on with the query first.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

mike--p Mr or Mrs. 500 Points: 550 More actions · Answer 6

Ok I think I figured it out, at last :).

The query:

SELECT T1.*, T3.* FROM table_1 AS T1

LEFT JOIN table_2 AS T2 ON (T1.[column_T1] = T2.[column_T2)

LEFT JOIN table_2 AS T2_2 ON (Levenshtein(T1.[column_T1], T2_2.[column_T2]) > percentage) WHERE T2.[column_T2] IS NULL;

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 7

Try this:

SELECT T1.*, T2.*

FROM table_1 AS T1

INNER JOIN table_2 AS T2

ON T1.[column_T1] < T2.[column_T2]

WHERE dbo.Levenshtein(T1.[column_T1], T2_2.[column_T2]) > percentage

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

mike--p Mr or Mrs. 500 Points: 550 More actions · Answer 8

ChrisM@Work (11/6/2014)
Try this:
SELECT T1.*, T2.*
FROM table_1 AS T1
INNER JOIN table_2 AS T2
ON T1.[column_T1] < T2.[column_T2]
WHERE dbo.Levenshtein(T1.[column_T1], T2_2.[column_T2]) > percentage

I'm going to try it tommorow (well atleast in my country :)). I can see why this should be a better query though, thanks much!

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 9

SELECT T1.*, T2.*

FROM (SELECT rn = ROW_NUMBER() OVER(ORDER BY something) FROM table_1) AS T1

INNER JOIN (SELECT rn = ROW_NUMBER() OVER(ORDER BY something) FROM table_2) AS T2

ON T1.rn < T2.rn

WHERE dbo.Levenshtein(T1.[column_T1], T2_2.[column_T2]) > percentage

This might make more sense.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

mike--p Mr or Mrs. 500 Points: 550 More actions · Answer 10

ChrisM@Work (11/6/2014)
SELECT T1.*, T2.* FROM (SELECT rn = ROW_NUMBER() OVER(ORDER BY something) FROM table_1) AS T1 INNER JOIN (SELECT rn = ROW_NUMBER() OVER(ORDER BY something) FROM table_2) AS T2 ON T1.rn < T2.rn
WHERE dbo.Levenshtein(T1.[column_T1], T2_2.[column_T2]) > percentage
This might make more sense.

Ok so I tried this one, but it didn't work as expected. With this you won't scan the whole table for a match (only until the specific row number). So when you got a string that is placed way beyond that row number, due the sorting, you won't find those particular strings as a possible match. Besides that it is still executing rather slowly. The query I posted is 50% faster. Thanks for all your help though 🙂

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 11

mike--p (11/7/2014)
ChrisM@Work (11/6/2014)
SELECT T1.*, T2.* FROM (SELECT rn = ROW_NUMBER() OVER(ORDER BY something) FROM table_1) AS T1 INNER JOIN (SELECT rn = ROW_NUMBER() OVER(ORDER BY something) FROM table_2) AS T2 ON T1.rn < T2.rn
WHERE dbo.Levenshtein(T1.[column_T1], T2_2.[column_T2]) > percentage
This might make more sense.
Ok so I tried this one, but it didn't work as expected. With this you won't scan the whole table for a match (only until the specific row number). So when you got a string that is placed way beyond that row number, due the sorting, you won't find those particular strings as a possible match. Besides that it is still executing rather slowly. The query I posted is 50% faster. Thanks for all your help though 🙂

Have a play with this simplified version to understand why it does work as intended:

SELECT T1.*, T2.*

FROM (SELECT TOP 5 rn = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM sys.columns) t1

INNER JOIN (SELECT TOP 5 rn = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM sys.columns) t2

ON T1.rn < T2.rn

ORDER BY t1.rn, t2.rn

If the row identified by rn in each table is different, then make this small change:

SELECT T1.*, T2.*

FROM (SELECT TOP 5 rn = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM sys.columns) t1

INNER JOIN (SELECT TOP 5 rn = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM sys.columns) t2

ON T1.rn <= T2.rn

ORDER BY t1.rn, t2.rn

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

mike--p Mr or Mrs. 500 Points: 550 More actions · Answer 12

Let me explain why your method isn't going to work.

The two tables look like this (the highlighted ones HAS to result in a match when the query is executed):

Then I execute this query on the tables (table 1 = vergelijkingstabel_1 and table 2 = vergelijkingstabel 2):

SELECT * FROM (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Omschrijving) FROM Vergelijkingstabel_1) AS T1

INNER JOIN (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Omschrijving) FROM Vergelijkingstabel_2) AS T2

ON T1.rij <= T2.rij

WHERE master.dbo.fn_Levenshtein(T1.Omschrijving, T2.Omschrijving) > 80;

The result set looks like this:

8

As you can see I don't get the match which are highlighted on the first picture, even though 'QQQ313' 'AQQ313' has a match higher than 80% (83.3% to be exactly). This is because the sorting is giving the 'AQQ313' a lower row number than 'QQQ313'. In this picture you can see the row numbers:

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 13

Are you sure your join is correct?

DROP TABLE #Table1

CREATE TABLE #Table1 (Bestelnummer INT, Omschrijving VARCHAR(20))

INSERT INTO #Table1 (Bestelnummer, Omschrijving)

VALUES (101,'ABN101'),(102,'ZBN'),(103,'KDN154'),(104,'ADN235'),(105,'QND999'),(110,'QQQ313'),(112,'ABL513')

-- 7 rows

DROP TABLE #Table2

CREATE TABLE #Table2 (Bestelnummer INT, Omschrijving VARCHAR(20))

INSERT INTO #Table2 (Bestelnummer, Omschrijving)

VALUES (100,'QQQ312'),(101,'ABN101'),(102,'ZBN103'),(103,'KDN154'),(104,'ADN235'),(106,'QND999'),

(365,'AAA555'),(34343,'QND999'),(4444,'QND999'),(111,'ZQQ319'),(11233,'AQQ313')

-- 11 rows

SELECT *

FROM (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Omschrijving) FROM #Table1) AS T1

inner JOIN (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Omschrijving) FROM #Table2) AS T2

ON T1.rij <= T2.rij -- Returns all 56 unique combinations because t2.rij is unrestricted

SELECT *

FROM (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Omschrijving) FROM #Table1) AS T1

INNER JOIN (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Omschrijving) FROM #Table2) AS T2

ON T2.rij <= T1.rij -- Returns only 28 unique combinations because T2.rij cannot be higher than 7

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

mike--p Mr or Mrs. 500 Points: 550 More actions · Answer 14

ChrisM@Work (11/7/2014)
Are you sure your join is correct?
DROP TABLE #Table1
CREATE TABLE #Table1 (Bestelnummer INT, Omschrijving VARCHAR(20))
INSERT INTO #Table1 (Bestelnummer, Omschrijving)
VALUES (101,'ABN101'),(102,'ZBN'),(103,'KDN154'),(104,'ADN235'),(105,'QND999'),(110,'QQQ313'),(112,'ABL513')
-- 7 rows
DROP TABLE #Table2
CREATE TABLE #Table2 (Bestelnummer INT, Omschrijving VARCHAR(20))
INSERT INTO #Table2 (Bestelnummer, Omschrijving)
VALUES (100,'QQQ312'),(101,'ABN101'),(102,'ZBN103'),(103,'KDN154'),(104,'ADN235'),(106,'QND999'),
(365,'AAA555'),(34343,'QND999'),(4444,'QND999'),(111,'ZQQ319'),(11233,'AQQ313')
-- 11 rows
SELECT *
FROM (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Omschrijving) FROM #Table1) AS T1
inner JOIN (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Omschrijving) FROM #Table2) AS T2
ON T1.rij <= T2.rij -- Returns all 56 unique combinations because t2.rij is unrestricted
SELECT *
FROM (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Omschrijving) FROM #Table1) AS T1
INNER JOIN (SELECT *, rij = ROW_NUMBER() OVER(ORDER BY Omschrijving) FROM #Table2) AS T2
ON T2.rij <= T1.rij -- Returns only 28 unique combinations because T2.rij cannot be higher than 7

Yes my join was correct, here the results from your query. As you can see it scans through all the columns (plus one too much highligthed in red). I don't see why I would use this over the query I posted? Thanks 😛