Advice on query optmization - Multiple Unions

  • Hi,

    I'm trying to understand how this might be better optimized. I currently have these three different queries to union the results of each. The first case finds all records with the where condition of email count < 2 (distinct). The other two cases are looking for records where the email is not unique, and in the first case where the LeadSource has a specific value, and the last case where it does not have this value. The whole batch runs in less than 9 seconds returning 1500 results. Is there a better way to write this for performance?

    Thanks for any advice.

    --condition 1) Users.Email = Lead.Email AND Lead.Email is distinct

    SELECT

    u1.uID AS Uid,

    sfl1.Id AS SfLeadId

    FROM (SELECT

    sfl.Email

    FROM [SalesForce].[dbo].[Lead] sfl

    INNER JOIN [dbo].[Users] u

    ON u.email = sfl.Email

    WHERE LEN(ISNULL(sfl.ConvertedContactId, '')) < 1

    AND LEN(ISNULL(u.salesforce_userid, '')) < 14

    GROUP BY sfl.Email

    HAVING COUNT(*) < 2) t1

    INNER JOIN [dbo].[users] u1

    ON u1.Email = t1.Email

    INNER JOIN [SalesForce].[dbo].[Lead] sfl1

    ON u1.email = sfl1.Email

    WHERE LEN(ISNULL(sfl1.ConvertedContactId, '')) < 1

    AND LEN(ISNULL(u1.salesforce_userid, '')) < 14

    UNION

    --condition 3) Lead.Email is not distinct & Lead.LeadSource equals 'ComplimentaryReg'. Select the Lead with the highest lead score

    SELECT

    u2.uID AS Uid,

    (SELECT TOP (1)

    sfl3.Id

    FROM [SalesForce].[dbo].[Lead] sfl3

    WHERE sfl3.Email = sfl2.email

    ORDER BY sfl3.mkto2__Lead_Score__c DESC)

    AS SfLeadId

    FROM [SalesForce].[dbo].[Lead] sfl2

    INNER JOIN [dbo].[users] u2

    ON u2.email = sfl2.Email

    WHERE LEN(ISNULL(sfl2.ConvertedContactId, '')) < 1

    AND LEN(ISNULL(u2.salesforce_userid, '')) < 14

    AND sfl2.LeadSource = 'ComplimentaryReg'

    GROUP BY sfl2.Email,

    u2.uID

    HAVING COUNT(*) > 2

    UNION

    --condition 3) Lead.Email is not distinct & Lead.LeadSource does not equal 'ComplimentaryReg'. Select the Lead with the highest lead score

    SELECT

    u2.uID AS Uid,

    (SELECT TOP (1)

    sfl3.Id

    FROM [SalesForce].[dbo].[Lead] sfl3

    WHERE sfl3.Email = sfl2.email

    ORDER BY sfl3.mkto2__Lead_Score__c DESC)

    AS SfLeadId

    FROM [SalesForce].[dbo].[Lead] sfl2

    INNER JOIN [dbo].[users] u2

    ON u2.email = sfl2.Email

    WHERE LEN(ISNULL(sfl2.ConvertedContactId, '')) < 1

    AND LEN(ISNULL(u2.salesforce_userid, '')) < 14

    AND sfl2.LeadSource <> 'ComplimentaryReg'

    GROUP BY sfl2.Email,

    u2.uID

    HAVING COUNT(*) > 2

  • Can you post the actual (not estimated) execution plans?

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Also, try to post the DDL and some sample data for the tables referenced in your query. Note the link on how to get help in my signature...

    In the meantime, I formatted your DML to make it a little more readable in case anyone wants to take a stab at this (in the future, try to use the code=sql IFCode Shortcut, it helps make your code more readable:

    --condition 1) Users.Email = Lead.Email AND Lead.Email is distinct

    SELECT

    u1.uID AS Uid,

    sfl1.Id AS SfLeadId

    FROM

    (

    SELECT

    sfl.Email

    FROM [SalesForce].[dbo].[Lead] sfl

    INNER JOIN [dbo].[Users] u ON u.email = sfl.Email

    WHERE LEN(ISNULL(sfl.ConvertedContactId, '')) < 1

    AND LEN(ISNULL(u.salesforce_userid, '')) < 14

    GROUP BY sfl.Email

    HAVING COUNT(*) < 2

    ) t1

    INNER JOIN [dbo].[users] u1 ON u1.Email = t1.Email

    INNER JOIN [SalesForce].[dbo].[Lead] sfl1 ON u1.email = sfl1.Email

    WHERE LEN(ISNULL(sfl1.ConvertedContactId, '')) < 1 AND LEN(ISNULL(u1.salesforce_userid, '')) < 14

    UNION

    --condition 3) Lead.Email is not distinct & Lead.LeadSource equals 'ComplimentaryReg'. Select the Lead with the highest lead score

    SELECT

    u2.uID AS Uid,

    (

    SELECT TOP (1) sfl3.Id

    FROM [SalesForce].[dbo].[Lead] sfl3

    WHERE sfl3.Email = sfl2.email

    ORDER BY sfl3.mkto2__Lead_Score__c DESC

    ) AS SfLeadId

    FROM [SalesForce].[dbo].[Lead] sfl2

    INNER JOIN [dbo].[users] u2 ON u2.email = sfl2.Email

    WHERE LEN(ISNULL(sfl2.ConvertedContactId, '')) < 1

    AND LEN(ISNULL(u2.salesforce_userid, '')) < 14

    AND sfl2.LeadSource = 'ComplimentaryReg'

    GROUP BY sfl2.Email, u2.uID

    HAVING COUNT(*) > 2

    UNION

    --condition 3) Lead.Email is not distinct & Lead.LeadSource does not equal 'ComplimentaryReg'. Select the Lead with the highest lead score

    SELECT

    u2.uID AS Uid,

    (

    SELECT TOP (1) sfl3.Id

    FROM [SalesForce].[dbo].[Lead] sfl3

    WHERE sfl3.Email = sfl2.email

    ORDER BY sfl3.mkto2__Lead_Score__c DESC

    ) AS SfLeadId

    FROM [SalesForce].[dbo].[Lead] sfl2

    INNER JOIN [dbo].[users] u2 ON u2.email = sfl2.Email

    WHERE LEN(ISNULL(sfl2.ConvertedContactId, '')) < 1

    AND LEN(ISNULL(u2.salesforce_userid, '')) < 14

    AND sfl2.LeadSource <> 'ComplimentaryReg'

    GROUP BY sfl2.Email, u2.uID

    HAVING COUNT(*) > 2;

    Having looked at this query there are a few things I can tell you right off the bat.

    1. If/When you post that query plan I expect to see a lot of sort operators. UNION, GROUP BY, ORDER BY all cause sorts unless your tables are indexed correctly and the indexes can be used (which is likely not the case here). I also expect to see TABLE SPOOL(LAZY SPOOLS) as it is not likely that the optimizer can get the job done without writing worktables to the temp db. Work tables are generally bad (not always) because the number of reads spikes.

    2. UNION (not UNION ALL) is the same as DISTINCT which will likely be very costly in this case. Having 2 UNION statements is going to kill you; you could get the same result by changing the first UNION to UNION ALL as the 2nd UNION statement will force a distinct result set.

    3. #2 is a moot point because I'm sure you could combine the three UNION'd queries into one.

    4. Don't do this:

    INNER JOIN [dbo].[users] u1 ON u1.Email = t1.Email

    Even if these Email Columns are indexed that JOIN is going to kill you. Consider a Email table with an EMailID that you can index and join to.

    5. Scalar functions in your WHERE clauses are going to kill you. ISNULL and LEN especially. If you give ConvertedContactId a default value (even if it's a blank default value) then make the column not NULLable you could change

    WHERE LEN(ISNULL(sfl1.ConvertedContactId, '')) < 1

    Could probably be rewritten like this:

    WHERE sfl1.ConvertedContactId <> ''

    6. You have what many refer to as a triangular join going on in each of the UNION'd queries. That will kill you! See this article for more details: http://www.sqlservercentral.com/articles/T-SQL/61539/[/url]

    Lastly when you have to go through all this work to get data from 3 tables you should probably consider redesigning the tables.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • I've attached a screen shot of the plan. So far, I've changed in the first UNION to UNION ALL. Thanks for your advice.

  • When posting the execution plan, please post the plan, not just a picture of it. Capture an actual plan, right-click and choose save as, then attach the .sqlplan file that will be saved. A plan is more than just a picture; behind every icon is a wealth of information hidden in the properties and we cannot see them in your picture.

    As Alan already mentioned, UNION is bad for performance because it has to remove duplicates. You understand the logic of your application better then I possibly can, so you should consider if there is any chance that, when running the queries in isolation, there will be overlapping results. If they are always distinct (and the comments suggest they are, but I have learned not to put too much faith in comments), then you can change each of the UNION operations to UNION ALL and still get the same results (but faster).

    I do also see some red flags in your code. Some can easily be avoided, others will probably take more work.

    WHERE LEN(ISNULL(sfl.ConvertedContactId, '')) < 1

    Alan already commented on this. I see this type of WHERE clause more often and it always me angers me. Just think about it. What are the conditions when this can be true? Either when ConvertedContactId is an empty string (length 0), or when it is NULL (which is replaced by an empty string by the ISNULL function). Every other value for ConvertedContactId will have a non-zero length. So why not write

    WHERE sfl.ConvertedContactId <> ''

    You might be tempted to add "OR sfl.ConvertedContactId IS NULL" but don't. That makes the WHERE clause more complicated for the optimizer (the optimizer is good with AND but struggles with OR), and it adds nothing. For NULL values, the condition evaluates to Unknown, and Unknown is not included in the WHERE clause.

    On the other hand, look at your data as well. Can ConvertedContactId actually be NULL in the data model, or is there a NOT NULL constraint? Can it actually be an empty string? If if can be NULL but not an empty string, then you should take it one step further and replace the condition with WHERE sfl.ConvertedContectId IS NOT NULL.

    AND LEN(ISNULL(u.salesforce_userid, '')) < 14

    Here the ISNULL is not needed either, but you cannot simplify this beyond

    AND LEN(u.salesforce_userid) < 14

    That is a good change to make, but it doesn't make the condition any better. Apart from the performance aspect, I am also worried about what you are doing here. What is the relevance of the length of the userid? Why should people with a longer userid be omitted from this query?

    INNER JOIN [dbo].[users] u1 ON u1.Email = t1.Email

    I disagree with Alan on this. Joining on character is only slightly slower than joining on integers, and the overhead of always joining to the extra table Alan suggest will probably cost more than you save here. Also, when you do use that email table, how will you ensure that you do not accidentally enter the same email address twice (in which case Alan's version of this test would not consider it a match)? Yes, there are ways to prevent that (called "constraints"), but they require procedures to handle the violations to be set up in the client code and in the business processes. And how do you make sure that an email change is handled appropriately? For instance, my son and I might share an email address, but once he is old enough he will get his own email address and tell you to change it - but I still will use the same email address.

    I don't know why Alan refers you to an article about triangular joins. I do not see that pattern in your code. BUt perhaps I am missing something?

    What I do see is that the three queries appear to have a huge overlap. The last two queries especially are remarkably similar; I see no difference except that one filters on sfl2.LeadSource = 'ComplimentaryReg' and the other on sfl2.LeadSource <> 'ComplimentaryReg' - so unless I am overlooking a subtle difference, I think you can replace the set of two with a single query that filters on sfl2.LeadSource IS NOT NULL (because the NULL values will not be included in either of the two queries you now have).

    The first query looks a bit different, but I suspect that this, too, can be combined with the other two. Rewriting this entire thing as a single query will not only improve performance, but also increase maintainability.

    I suggest you post a repro script with CREATE TABLE statements for all tables involved (include all constraints and indexes, but please remove columns that are not used in the query) and INSERT statements with a few well-chosen rows of sample data. (I would normally also ask for expected results, but I guess we can simply use your current query to get them if you post the repro script). Also post a full explanation of what the query should be doing, and some information on rowcounts in the actual tables and number of rows returned by the query.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis (3/23/2016)


    ...

    WHERE LEN(ISNULL(sfl.ConvertedContactId, '')) < 1

    Alan already commented on this. I see this type of WHERE clause more often and it always me angers me. Just think about it. What are the conditions when this can be true? Either when ConvertedContactId is an empty string (length 0), or when it is NULL (which is replaced by an empty string by the ISNULL function). Every other value for ConvertedContactId will have a non-zero length. So why not write

    WHERE sfl.ConvertedContactId <> ''

    You might be tempted to add "OR sfl.ConvertedContactId IS NULL" but don't. That makes the WHERE clause more complicated for the optimizer (the optimizer is good with AND but struggles with OR), and it adds nothing. For NULL values, the condition evaluates to Unknown, and Unknown is not included in the WHERE clause.

    ...

    Just for clarification, I may not have had enough coffee, but it seems those two WHERE clauses do opposite things. The original returns rows where ConvertedContactId is NULL or an empty string. The second returns rows that are neither NULL nor an empty string.

    It could still be stripped down to just an ISNULL(column,'')='', or an ='' OR IS NULL, but I'm not sure there's anything more clever than that.

    Cheers!

  • Jacob Wilkins (3/23/2016)


    Hugo Kornelis (3/23/2016)


    ...

    WHERE LEN(ISNULL(sfl.ConvertedContactId, '')) < 1

    Alan already commented on this. I see this type of WHERE clause more often and it always me angers me. Just think about it. What are the conditions when this can be true? Either when ConvertedContactId is an empty string (length 0), or when it is NULL (which is replaced by an empty string by the ISNULL function). Every other value for ConvertedContactId will have a non-zero length. So why not write

    WHERE sfl.ConvertedContactId <> ''

    You might be tempted to add "OR sfl.ConvertedContactId IS NULL" but don't. That makes the WHERE clause more complicated for the optimizer (the optimizer is good with AND but struggles with OR), and it adds nothing. For NULL values, the condition evaluates to Unknown, and Unknown is not included in the WHERE clause.

    ...

    Just for clarification, I may not have had enough coffee, but it seems those two WHERE clauses do opposite things. The original returns rows where ConvertedContactId is NULL or an empty string. The second returns rows that are neither NULL nor an empty string.

    It could still be stripped down to just an ISNULL(column,'')='', or an ='' OR IS NULL, but I'm not sure there's anything more clever than that.

    Cheers!

    You are right. I somehow mistook < for >=. (And it was already afternoon when I wrote that so I cannot even blame lack of coffee).

    This makes it even more important to investigate whether both NULL and '' are values that can exist in the data. (And please do, I see a lot of developers write these ISNULL expressions "just to make sure" without checking if they are really needed). If only one of them actually exists, then it is a simple test of either Column = '', or Column IS NULL. If both can exist, they you need one of the two variations Jacob writes. Both of them will be bad for performance, but I expect the ='' OR IS NULL version to be slightly better then the version with the function.

    (But it never hurts to test both)


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Is there a table on a linked server?

    If so, it may be better to pull the necessary data into a temp table and then do the unions.

    Also, the subselects with the top 1, may be better to do an outer apply.

    I would test this if I had some DDL.

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply