Query performance

  • I have the following SQL query which is executed in a dynamic SQL statement:

    SELECT a.id, b.id,CRS.*

    FROM QBMerge.fo_manchester_keys_ AS a

    INNER JOIN QBMerge.fo_manchester_keys_ AS b

    ON a.MatchKeyType13 = b.MatchKeyType13

    AND a.RAND_ID > b.RAND_ID

    AND NOT EXISTS (SELECT 1 FROM ##fo_manchester_LargeClusters_ as d WHERE d.MatchKey = a.MatchKeyType13)

    CROSS APPLY dbo.CompleteRecordScoring(

    a.NormalisedForename, a.NormalisedMiddleName, a.NormalisedSurname,

    a.PhoneticForename, a.PhoneticMiddleName, a.PhoneticSurname,

    a.NormalisedOrganisationName, a.NormalisedOrganisationName2, a.Premise, a.AddressKey, a.POstOut, a.PostIn,

    a.HomeTelephoneNumber, a.MobileTelephoneNumber, a.EmailAddress,

    b.NormalisedForename, b.NormalisedMiddleName, b.NormalisedSurname,

    b.PhoneticForename, b.PhoneticMiddleName, b.PhoneticSurname,

    b.NormalisedOrganisationName, b.NormalisedOrganisationName2, b.Premise, b.AddressKey, b.PostOut, b.POstIn,

    b.HomeTelephoneNumber, b.MobileTelephoneNumber, b.EmailAddress, 1) AS CRS

    If I run the query without the CROSS APPLY, it takes a few seconds, however when I include the CLR TVF it takes minutes to run.

    The weird thing is, it's not my CLR TVF that's taking long, I have used this code before and it can process hundreds of thousands of rows in a few seconds. I have also added a start end time for each row returned by the CLR TVF and the times are very low (hundredths of milliseconds if that).

    The table :

    The INDEX on MatchKeyType13:

    CREATE NONCLUSTERED INDEX [IX_MatchKeyType13] ON [QBMerge].[fo_manchester_keys_]

    (

    [MatchKeyType13] ASC,

    [Hierarchy] ASC,

    [RAND_ID] ASC

    )

    INCLUDE ( [ID],

    [NormalisedSurname],

    [NormalisedForename],

    [NormalisedMiddleName],

    [PhoneticSurname],

    [PhoneticForename],

    [PhoneticMiddleName],

    [NormalisedOrganisationName],

    [NormalisedOrganisationName2],

    [Premise],

    [AddressKey],

    [PostOut],

    [PostIn],

    [HomeTelephoneNumber],

    [MobileTelephoneNumber],

    [EmailAddress]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    GO

    Another interesting observation when I run the same code on other MatchKeyTypes, I notice the CPU utilisation is maxed out and the query can generate more than a million rows per minute, however, on some match keys the CPU utilisation is more erratic and that's when it takes minutes to do a few hundred thousand records!

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Okay I will try again with a simpler version of my problem.

    Say I have the following table records:

    In order to look for duplicates I create hash keys (converted to BIG INTS) which are made up of combination of columns (for example forename and postcode) which I then index so that I can use them in joins when looking for duplicate records.

    So my query would be:

    SELECT a.RecordID AS Master_ID,

    b.RecordID AS Duplicate_ID

    FROM MyTable AS a

    INNER JOIN MyTable AS b

    ON a.MatchKey = b.MatchKey

    However, all this tell is that these two records have a forename and postcode in common but to decide whether they are duplicates I need to look at the other elements of the records and compare them against each other. This is where I use the CROSS APPLY to pass the fields to a CLR TVF but I am having serious performance issues with this. It's not that my CLR code is badly written. I know this because I've tested it and it's able to process hundreds of thousands of records in less than a minute. Am I misusing the CROSS APPLY? Is there another way of doing this?!

    SELECT a.RecordID AS Master_ID, result.*

    b.RecordID AS Duplicate_ID

    FROM MyTable AS a

    INNER JOIN MyTable AS b

    ON a.MatchKey = b.MatchKey

    CROSS APPLY RecordMatcher(a.Forename, a.surname, a.address1, b.forename, b.surname, b.address1) result

    Please help!

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • I've also tried materialising the result set of the below query into a table:

    SELECT a.RecordID AS Master_ID,

    b.RecordID AS Duplicate_ID,

    a.Forename AS Master_Forename, b.Forename AS Duplicate_Forename

    FROM MyTable AS a

    INNER JOIN MyTable AS b

    ON a.MatchKey = b.MatchKey

    Then once this table is created I call the CLR TVF, this appears to solve the problem but it does mean creating an extra table which can be very big and wide. I've also tried using a TOP clause to force SQL to materialise the result in tempdb but this appears to work really well on some match keys but not others.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Without seeing the actual plan, I'd guess that the tvf is being applied to many more rows than it needs to be, i.e. before all of the available filters have been applied to the result set of the two tables. If you haven't already got an index on the temp table, then I'd put one on to support the NOT EXISTS. The result may be cheap enough to pull the filter to a position prior to the tvf.

    You could try this variant on the query too:

    SELECT a.id, b.id, CRS.*

    FROM QBMerge.fo_manchester_keys_ AS a

    INNER JOIN QBMerge.fo_manchester_keys_ AS b

    ON a.MatchKeyType13 = b.MatchKeyType13

    AND a.RAND_ID > b.RAND_ID

    CROSS APPLY (

    SELECT *

    FROM dbo.CompleteRecordScoring(

    a.NormalisedForename, a.NormalisedMiddleName, a.NormalisedSurname,

    a.PhoneticForename, a.PhoneticMiddleName, a.PhoneticSurname,

    a.NormalisedOrganisationName, a.NormalisedOrganisationName2, a.Premise, a.AddressKey, a.POstOut, a.PostIn,

    a.HomeTelephoneNumber, a.MobileTelephoneNumber, a.EmailAddress,

    b.NormalisedForename, b.NormalisedMiddleName, b.NormalisedSurname,

    b.PhoneticForename, b.PhoneticMiddleName, b.PhoneticSurname,

    b.NormalisedOrganisationName, b.NormalisedOrganisationName2, b.Premise, b.AddressKey, b.PostOut, b.POstIn,

    b.HomeTelephoneNumber, b.MobileTelephoneNumber, b.EmailAddress, 1) x

    WHERE NOT EXISTS (SELECT 1 FROM ##fo_manchester_LargeClusters_ d

    WHERE d.MatchKey = a.MatchKeyType13)

    ) AS CRS

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks for this Chris.

    I've run my original code again to get the actual execution plan (attached below). Interestingly my original query ran in 15 seconds to produce 234990 records but the version you suggested above gets to 270k then hangs for some reason!

    The execution plan might look okay but when I try a different match key (one that produces more rows) the cost of the nested loop increases significantly!

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Abu Dina (8/30/2013)


    Thanks for this Chris.

    I've run my original code again to get the actual execution plan (attached below). Interestingly my original query ran in 15 seconds to produce 234990 records but the version you suggested above gets to 270k then hangs for some reason!

    The execution plan might look okay but when I try a different match key (one that produces more rows) the cost of the nested loop increases significantly!

    The estimates and actuals are wildly different. Try updating stats and rerunning.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Also, change the datatype of d.MatchKey to that of a.MatchKeyType08.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Good point about the data types. These have been changed to match the data type of the match key.

    Have a look at this though:

    Running the below (notice the TOP operator) returns 993585 in 59 seconds! ACTUAL execution plan attached. Without the TOP it's already past the minute mark and it's only done 100k.

    I wonder if the TOP is making SQL Server materialise the query in tempdb then it applies the scores?!

    To make it more interesting, if I change the TOP 1000000 to TOP 50000000 or 100 PERCENT it slows down again WTF?!

    SELECT top 1000000 a.id, b.id, CRS.*

    FROM QBMerge.fo_manchester_keys_ AS a

    INNER JOIN QBMerge.fo_manchester_keys_ AS b

    ON a.MatchKeyType13 = b.MatchKeyType13

    AND a.RAND_ID > b.RAND_ID

    AND NOT EXISTS (SELECT 1 FROM ##fo_manchester_LargeClusters_ d WHERE d.MatchKey = a.MatchKeyType13)

    CROSS APPLY (

    SELECT *

    FROM dbo.CompleteRecordScoring(

    a.NormalisedForename, a.NormalisedMiddleName, a.NormalisedSurname,

    a.PhoneticForename, a.PhoneticMiddleName, a.PhoneticSurname,

    a.NormalisedOrganisationName, a.NormalisedOrganisationName2, a.Premise, a.AddressKey, a.POstOut, a.PostIn,

    a.HomeTelephoneNumber, a.MobileTelephoneNumber, a.EmailAddress,

    b.NormalisedForename, b.NormalisedMiddleName, b.NormalisedSurname,

    b.PhoneticForename, b.PhoneticMiddleName, b.PhoneticSurname,

    b.NormalisedOrganisationName, b.NormalisedOrganisationName2, b.Premise, b.AddressKey, b.PostOut, b.POstIn,

    b.HomeTelephoneNumber, b.MobileTelephoneNumber, b.EmailAddress, 1)) CRS

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Without the TOP operator, the earlier actual plan shows 234,990 rows returned. With the TOP operator you're returning nearly a million rows ... can you confirm these figures are correct?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • My bad, the latest actual execution plan is for match key type 13, it looks like I generated an actual execution plan for a different match key type before!

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Abu Dina (8/30/2013)


    My bad, the latest actual execution plan is for match key type 13, it looks like I generated an actual execution plan for a different match key type before!

    Aarrgghh!!

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Abu Dina (8/30/2013)


    I wonder if the TOP is making SQL Server materialise the query in tempdb then it applies the scores?!

    No. Top just applies a row goal. It doesn't materialise the resultset, use extra work tables or anything like that.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Sorry Chris 😛

    Well, I've taken a slightly different approach as I can't seem to work out this puzzle. I know this is an ugly way of solving the problem but the speed of execution is more consistent.

    I'm creating a real table of my potential matches with all the columns I needs to pass to the CLR TVF like so:

    SELECT a.id AS Master_ID,

    b.id AS Duplicate_ID,

    a.NormalisedForename AS Master_NormalisedForename,

    a.NormalisedMiddleName AS Master_NormalisedMiddleName,

    a.NormalisedSurname AS Master_NormalisedSurname,

    a.PhoneticForename AS Master_PhoneticForename,

    a.PhoneticMiddleName AS Master_PhoneticMiddleName,

    a.PhoneticSurname AS Master_PhoneticSurname,

    a.NormalisedOrganisationName AS Master_NormalisedOrganisationName,

    a.NormalisedOrganisationName2 AS Master_NormalisedOrganisationName2,

    a.Premise AS Master_Premise,

    a.AddressKey AS Master_AddressKey,

    a.POstOut AS Master_POstOut,

    a.PostIn AS Master_PostIn,

    a.HomeTelephoneNumber AS Master_HomeTelephoneNumber,

    a.MobileTelephoneNumber AS Master_MobileTelephoneNumber,

    a.EmailAddress AS Master_EmailAddress,

    b.NormalisedForename AS Duplicate_NormalisedForename,

    b.NormalisedMiddleName AS Duplicate_NormalisedMiddleName,

    b.NormalisedSurname AS Duplicate_NormalisedSurname,

    b.PhoneticForename AS Duplicate_PhoneticForename,

    b.PhoneticMiddleName AS Duplicate_PhoneticMiddleName,

    b.PhoneticSurname AS Duplicate_PhoneticSurname,

    b.NormalisedOrganisationName AS Duplicate_NormalisedOrganisationName,

    b.NormalisedOrganisationName2 AS Duplicate_NormalisedOrganisationName2,

    b.Premise AS Duplicate_Premise,

    b.AddressKey AS Duplicate_AddressKey,

    b.PostOut AS Duplicate_PostOut,

    b.POstIn AS Duplicate_POstIn,

    b.HomeTelephoneNumber AS Duplicate_HomeTelephoneNumber,

    b.MobileTelephoneNumber AS Duplicate_MobileTelephoneNumber,

    b.EmailAddress AS Duplicate_EmailAddress INTO fo_OTHERCOUNTRIES_MatchKeyType1

    FROM QBMerge.fo_OTHERCOUNTRIES_keys_ AS a

    INNER JOIN QBMerge.fo_OTHERCOUNTRIES_keys_ AS b

    ON a.MatchKeyType01 = b.MatchKeyType01

    AND a.RAND_ID > b.RAND_ID

    AND NOT EXISTS (SELECT 1 FROM ##fo_OTHERCOUNTRIES_PotentialMatches_ as c WHERE c.duplicate_ID IN (a.id,b.id))

    Then I'm doing this:

    INSERT INTO ##fo_OTHERCOUNTRIES_PotentialMatches_(Master_Id, Duplicate_Id, MatchKeyType, Name_Score,

    OrgName_Score, OrgName_Score2, Address_Score, Postcode_Score, HomeTelephone_Score, MobileTelephone_Score, Email_Score,

    MasterName_In_DuplicateOrgName_Score, DuplicateName_In_MasterOrgName_Score, MasterOrgName1_In_DupeOrgName2_Score, MasterOrgName2_Vs_DupeOrgName1_Score, ExecutionTime)

    SELECT Master_ID, Duplicate_ID, 3, CRS.*

    FROM fo_OTHERCOUNTRIES_MatchKeyType3

    CROSS APPLY dbo.CompleteRecordScoring(

    Master_NormalisedForename, Master_NormalisedMiddleName, Master_NormalisedSurname,

    Master_PhoneticForename, Master_PhoneticMiddleName, Master_PhoneticSurname,

    Master_NormalisedOrganisationName, Master_NormalisedOrganisationName2, Master_Premise, Master_AddressKey, Master_POstOut, Master_PostIn,

    Master_HomeTelephoneNumber, Master_MobileTelephoneNumber, Master_EmailAddress,

    Duplicate_NormalisedForename, Duplicate_NormalisedMiddleName, Duplicate_NormalisedSurname,

    Duplicate_PhoneticForename, Duplicate_PhoneticMiddleName, Duplicate_PhoneticSurname,

    Duplicate_NormalisedOrganisationName, Duplicate_NormalisedOrganisationName2, Duplicate_Premise, Duplicate_AddressKey, Duplicate_PostOut, Duplicate_POstIn,

    Duplicate_HomeTelephoneNumber, Duplicate_MobileTelephoneNumber, Duplicate_EmailAddress, 1) AS CRS

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply