Select Query - need optimization

  • Table_1 (custid, store, amount)

    Insert into Table_1 values (1, 'store1', 1000)
    Insert into Table_1 values (2, 'store2', 2000)
    Insert into Table_1 values (3, 'store3', 3000)
    Insert into Table_1 values (4, 'store4', 4000)

    Table_2 (store, product, col3, col4...)

    Insert into Table_2 values ('store1', 'prod1', value)
    Insert into Table_2 values ('store1', 'prod2', value)
    Insert into Table_2 values ('store1', 'prod3', value)
    Insert into Table_2 values ('store2', 'prod1', value)
    Insert into Table_2 values ('store2', 'prod5', value)
    Insert into Table_2 values ('store3', 'prod4', value)
    Insert into Table_2 values ('store3', 'prod2', value)
    Insert into Table_2 values ('store3', 'prod6', value)
    Insert into Table_2 values ('store4', 'prod6', value)
    Insert into Table_2 values ('store4', 'prod1', value)
    Insert into Table_2 values ('store4', 'prod2', value)

    select custid, store, Amount
    from Table_1
    where store not in (select store from Table_2 where product != 'prod6')

    Expected output:

    custid, store, amount
    ---------------------
    1, store1, 1000
    2, store2, 2000

    I need a query to get the same output in a more optimized way, because I have 500 million rows in Table_1 and 700 million rows in Table_2.

    Please help me out - it is a production issue and needs to be fixed ASAP.

  • Please post table definitions, index definitions and execution plan, as per http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
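    One way to capture the actual execution plan requested above (a sketch, assuming the query is run from an SSMS session):

    ```sql
    SET STATISTICS XML ON;   -- returns the actual execution plan as XML alongside the results

    SELECT custid, store, Amount
    FROM Table_1
    WHERE store NOT IN (SELECT store FROM Table_2 WHERE product != 'prod6');

    SET STATISTICS XML OFF;
    ```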

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • A quick educated guess, before the requested details turn up:

    CREATE VIEW dbo.T2V
    WITH SCHEMABINDING AS
    SELECT
        t.product,
        t.store,
        COUNT_BIG(*) AS cnt
    FROM dbo.Table_2 AS t
    GROUP BY
        t.store,
        t.product
    GO

    CREATE UNIQUE CLUSTERED INDEX i
    ON T2V (product, store)
    GO

    CREATE INDEX i
    ON dbo.Table_1 (store)
    INCLUDE (custid, amount)

    If this is not Enterprise Edition, you would need to replace FROM Table_2 in the query with FROM T2V WITH (NOEXPAND). Be aware that all access to Table_2, even reading, will be blocked while the indexed view is being created.
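    For reference, the original query rewritten against the view might look something like this (NOEXPAND only being required outside Enterprise Edition):

    ```sql
    SELECT t1.custid, t1.store, t1.amount
    FROM dbo.Table_1 AS t1
    WHERE t1.store NOT IN
    (
        SELECT v.store
        FROM dbo.T2V AS v WITH (NOEXPAND)  -- force use of the indexed view
        WHERE v.product != 'prod6'
    );
    ```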

  • In the meantime, you will get some mileage from this

    SELECT custid, store, Amount
    FROM Table_1 t1
    WHERE NOT EXISTS (SELECT 1 FROM table_2 t2 WHERE t2.store = t1.store AND t2.product != 'prod6')

    It will work ok if [store] is indexed in both tables - and it's kinda standard (I reckon).

    Gail & Paul are likely to come up with something more appropriate for tables of this size.


    Low-hanging fruit picker and defender of the moggies

    For better assistance in answering your questions, please read this.

    Understanding and using APPLY, (I) and (II) - Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop - Jeff Moden

  • ChrisM@home (1/14/2012)


    In the meantime, you will get some mileage from this

    SELECT custid, store, Amount

    FROM Table_1 t1

    WHERE NOT EXISTS (SELECT 1 FROM table_2 t2 WHERE t2.store = t1.store AND t2.product != 'prod6')

    It will work ok if [store] is indexed in both tables - and it's kinda standard (I reckon).

    Gail & Paul are likely to come up with something more appropriate for tables of this size.

    That query only has the same semantics as the original if the store columns are defined as NOT NULL. In my tests, with store constrained to be NOT NULL, both forms (original and yours above) produce exactly the same plan. With both columns allowing NULLs (even if none actually exist), the possibility of different results necessarily produces a different plan (again, for both textual forms).
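    To make the semantic difference concrete, here is a sketch using a hypothetical NULL row in Table_2:

    ```sql
    -- Hypothetical: suppose Table_2 contains a row with a NULL store
    INSERT INTO Table_2 (store, product, value) VALUES (NULL, 'prod1', 1);

    -- NOT IN: comparing store against the NULL yields UNKNOWN for every row,
    -- so this query returns no rows at all
    SELECT custid, store, Amount
    FROM Table_1
    WHERE store NOT IN (SELECT store FROM Table_2 WHERE product != 'prod6');

    -- NOT EXISTS: the NULL row simply never matches t1.store,
    -- so the results are unaffected by it
    SELECT custid, store, Amount
    FROM Table_1 AS t1
    WHERE NOT EXISTS (SELECT 1 FROM Table_2 AS t2
                      WHERE t2.store = t1.store AND t2.product != 'prod6');
    ```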

  • Thanks Paul - looks like it's time to revisit Gail's NOT EXISTS vs etc blog 😉

    Edit: you've got a 700 million row sample data set on your home lappy? Good lord - I need an upgrade.



  • ChrisM@home (1/14/2012)


    Edit: you've got a 700 million row sample data set on your home lappy? Good lord - I need an upgrade.

    No, I just ran:

    UPDATE STATISTICS dbo.Table_1
    WITH ROWCOUNT = 500000000, PAGECOUNT = 5000000

    UPDATE STATISTICS dbo.Table_2
    WITH ROWCOUNT = 700000000, PAGECOUNT = 7000000

    See http://msdn.microsoft.com/en-us/library/ms187348.aspx

  • SQL Kiwi (1/14/2012)


    ChrisM@home (1/14/2012)


    Edit: you've got a 700 million row sample data set on your home lappy? Good lord - I need an upgrade.

    No, I just ran:

    UPDATE STATISTICS dbo.Table_1
    WITH ROWCOUNT = 500000000, PAGECOUNT = 5000000

    UPDATE STATISTICS dbo.Table_2
    WITH ROWCOUNT = 700000000, PAGECOUNT = 7000000

    See http://msdn.microsoft.com/en-us/library/ms187348.aspx

    Cheat. I'm so disappointed.



  • How many distinct stores are there in each table?



  • Sample data for testing:

    DROP TABLE Table_1

    CREATE TABLE Table_1 (
        RowID INT IDENTITY (1,1) PRIMARY KEY CLUSTERED,
        custid INT,
        store VARCHAR(9) NOT NULL,
        amount INT)

    INSERT INTO Table_1 (custid, store, amount) -- 2 million rows
    SELECT [custid], [store], [Amount]
    FROM (SELECT TOP 1000 [custid] = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
          FROM sys.columns a, sys.columns b) customers
    CROSS JOIN (SELECT [store] = 'store' + CAST(rn AS VARCHAR(4)) FROM (
        SELECT TOP 1000 rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM sys.columns a, sys.columns b) d) stores
    CROSS JOIN (SELECT [Amount] = 1000 UNION ALL SELECT 2000) Amounts

    CREATE NONCLUSTERED INDEX [IX_Store] ON Table_1 (store)

    DROP TABLE Table_2

    CREATE TABLE Table_2 (
        RowID INT IDENTITY (1,1),
        store VARCHAR(9) NOT NULL,
        product VARCHAR(8),
        value INT)

    INSERT INTO Table_2 (store, product, value) -- 7920040 rows
    SELECT store, product, value
    FROM (SELECT [store] = 'store' + CAST(rn AS VARCHAR(4)) FROM (
        SELECT TOP 1000 rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM sys.columns a, sys.columns b) d) s
    CROSS JOIN (SELECT [product] = 'prod' + CAST(rn AS VARCHAR(4)) FROM (
        SELECT TOP 2000 rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM sys.columns a, sys.columns b) d) p
    CROSS JOIN (SELECT [Value] = 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) v
    WHERE NOT (product != 'prod6' AND store IN ('store126', 'store185', 'store284', 'store299', 'store359', 'store667', 'store682', 'store793', 'store947', 'store957'))

    /*
    If I create Table_2 with "RowID INT IDENTITY (1,1) PRIMARY KEY CLUSTERED,", then the following error occurs:

    The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

    The error doesn't occur if there are only 2 values for [value], i.e. a table half the size.
    */

    CREATE UNIQUE CLUSTERED INDEX [CX_RowID] ON Table_2 (RowID)

    CREATE NONCLUSTERED INDEX [IX_Store_product] ON Table_2 (store, product)

    Testing:

    DECLARE @custid INT, @store VARCHAR(9), @amount INT

    SET STATISTICS IO, TIME ON

    DBCC DROPCLEANBUFFERS

    PRINT 'Query 1 -----------------------------------------------------------------'
    --DECLARE @custid INT, @store VARCHAR(9), @amount INT
    SELECT @custid = custid, @store = store, @amount = Amount
    FROM Table_1
    WHERE store NOT IN (SELECT store FROM Table_2 WHERE product != 'prod6')
    PRINT 'Query 1 ================================================================='

    DBCC DROPCLEANBUFFERS

    PRINT 'Query 2 -----------------------------------------------------------------'
    --DECLARE @custid INT, @store VARCHAR(9), @amount INT
    ;WITH Table_2_Stores AS
    (
        SELECT store = MIN(T.store)
        FROM dbo.Table_2 T
        WHERE product != 'prod6'

        UNION ALL

        SELECT R.store
        FROM (
            SELECT T.store,
                   rn = ROW_NUMBER() OVER (ORDER BY T.store)
            FROM dbo.Table_2 T
            JOIN Table_2_Stores R
              ON R.store < T.store
            WHERE T.product != 'prod6'
        ) R
        WHERE R.rn = 1
    )
    SELECT @custid = custid, @store = store, @amount = Amount
    FROM Table_1 t1
    WHERE store NOT IN (SELECT store FROM Table_2_Stores)
    OPTION (MAXRECURSION 0);
    PRINT 'Query 2 ================================================================='

    Results:

    Query 1 -----------------------------------------------------------------

    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.

    Table 'Table_2'. Scan count 3, logical reads 31014, physical reads 78, read-ahead reads 7124, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Table_1'. Scan count 3, logical reads 8245, physical reads 43, read-ahead reads 8176, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 7456 ms, elapsed time = 6335 ms.

    Query 2 -----------------------------------------------------------------

    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.

    Table 'Worktable'. Scan count 6, logical reads 17823, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Table_2'. Scan count 2973, logical reads 8922, physical reads 227, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Table_1'. Scan count 3, logical reads 8245, physical reads 43, read-ahead reads 8176, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
       CPU time = 2247 ms, elapsed time = 3330 ms.

    Subjectively, it's different - Q2 runs in about a second, Q1 in about 8 seconds. <-- Edit: corrected

    Q2 is an idea nicked from here, another of Paul White's excellent queries.



  • Hi Chris,

    Yes, this could well be a perfect use for the super-fast DISTINCT recursive CTE - well spotted!

    The error about the engine not being able to obtain a LOCK resource is a consequence of a server with a small amount of configured memory, and an unfixed regression bug in SQL Server 2008+ where lock escalation does not work for INSERT statements. SQL Server continues to take row or page level locks, and runs out of memory (sigh). The workaround is to use a locking hint: INSERT TableName WITH (TABLOCKX) (column list). I do this automatically these days when inserting a significant number of rows, especially if the table is empty. From memory, I think this bug relates to changes made to allow minimally-logged INSERT in SQL Server 2008.
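    Applied to the Table_2 load in Chris's script, the hint would look something like this (the source query is elided here):

    ```sql
    -- Take an exclusive table lock up front so the engine does not
    -- accumulate millions of row/page locks and exhaust lock memory
    INSERT INTO dbo.Table_2 WITH (TABLOCKX) (store, product, value)
    SELECT store, product, value
    FROM ...;  -- whatever the source query is
    ```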

    I'm normally a fan of DBCC DROPCLEANBUFFERS for testing (though one should issue a CHECKPOINT immediately beforehand to ensure dirty pages are flushed too). In this case, though, all we end up testing is which query issues read-ahead most efficiently, and how fast the I/O system is. Assuming we all generally test on laptops and home PCs with a single relatively slow disk, this may not give a fair assessment of how the query might be expected to perform in reality.
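    In other words, a cold-cache timing run would start with:

    ```sql
    CHECKPOINT;              -- flush dirty pages to disk first
    DBCC DROPCLEANBUFFERS;   -- then empty the buffer pool of clean pages
    ```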

    Another consideration is: do we really expect *none* of the tables' pages to be in memory when this query typically runs? Clearly for an important query, one would hope that all the non-leaf index pages would be in buffer pool, and probably a significant proportion of the leaf index and data pages too. For these reasons, I would give more weight to tests run with all pages in memory than tests run with a completely cold cache.

    Finally, what does "subjectively" mean here: "Subjectively, it's different - Q1 runs in about a second, Q2 in about 8 seconds"? I cannot get query 1 to run in a second, even with parallelism and zero disk reads...

  • SQL Kiwi (1/15/2012)


    The error about the engine not being able to obtain a LOCK resource is a consequence of a server with a small amount of configured memory, and an unfixed regression bug in SQL Server 2008+ where lock escalation does not work for INSERT statements. SQL Server continues to take row or page level locks, and runs out of memory (sigh). The workaround is to use a locking hint: INSERT TableName WITH (TABLOCKX) (column list). I do this automatically these days when inserting a significant number of rows, especially if the table is empty. From memory, I think this bug relates to changes made to allow minimally-logged INSERT in SQL Server 2008.

    Since it worked with around 4M rows and failed with around 8M rows, I figured it must be a bug. Thanks for the info.

    I'm normally a fan of DBCC DROPCLEANBUFFERS for testing (though one should issue a CHECKPOINT immediately beforehand to ensure dirty pages are flushed too). In this case, though, all we end up testing is which query issues read-ahead most efficiently, and how fast the I/O system is. Assuming we all generally test on laptops and home PCs with a single relatively slow disk, this may not give a fair assessment of how the query might be expected to perform in reality.

    Another consideration is: do we really expect *none* of the tables' pages to be in memory when this query typically runs? Clearly for an important query, one would hope that all the non-leaf index pages would be in buffer pool, and probably a significant proportion of the leaf index and data pages too. For these reasons, I would give more weight to tests run with all pages in memory than tests run with a completely cold cache.

    There are too many unknowns. We don't know how often the tables are accessed or the configuration of the server. I tested with and without a clean data cache and, from the perspective of comparing the queries, the results were consistent - Q2 is about 4x faster. This lappy has two HDDs, data on one, tempdb and log on the other - I kinda assume this helps it mimic a real situation more closely than a single HDD would.

    Finally, what does subjectively mean here "Subjectively, it's different - Q1 runs in about a second, Q2 in about 8 seconds". I cannot get query 1 to run in a second, even with parallelism and zero disk reads...?

    My mistake, corrected. Q2 runs in about a second on its own (as opposed to in a batch with Q1 and other statements), with 4M rows or 8M rows. Q1 runs in about 4s with 4M rows and about 7.5s with 8M rows.

    I concentrated effort on the subquery on the assumption that the number of stores (qualifying or not) would be relatively small resulting in significant duplication, which the "ultra-fast distinct" excels at. Improving the LHS (table_1) will almost certainly involve invoking sneaky tricks like indexed views.



  • calibar2k (1/14/2012)


    Table_1 (custid, store, amount)

    Insert into Table_1 values (1, 'store1', 1000)
    Insert into Table_1 values (2, 'store2', 2000)
    Insert into Table_1 values (3, 'store3', 3000)
    Insert into Table_1 values (4, 'store4', 4000)

    Table_2 (store, product, col3, col4...)

    Insert into Table_2 values ('store1', 'prod1', value)
    Insert into Table_2 values ('store1', 'prod2', value)
    Insert into Table_2 values ('store1', 'prod3', value)
    Insert into Table_2 values ('store2', 'prod1', value)
    Insert into Table_2 values ('store2', 'prod5', value)
    Insert into Table_2 values ('store3', 'prod4', value)
    Insert into Table_2 values ('store3', 'prod2', value)
    Insert into Table_2 values ('store3', 'prod6', value)
    Insert into Table_2 values ('store4', 'prod6', value)
    Insert into Table_2 values ('store4', 'prod1', value)
    Insert into Table_2 values ('store4', 'prod2', value)

    select custid, store, Amount
    from Table_1
    where store not in (select store from Table_2 where product != 'prod6')

    custid, store, amount
    ---------------------
    1, store1, 1000
    2, store2, 2000

    Hello,

    Reading your post, I see your example results are erroneous. Your inner SELECT,

    select store from table_2 where product != 'prod6'

    will return store1, store2, store3, store4, so your whole SELECT,

    select custid, store, Amount
    from Table_1 where store not in (select store from table_2 where product != 'prod6')

    will return nothing. Where is the mistake - in the results or in the SELECT?

    Anyway, I prefer to code joins rather than subselects; your query can be rewritten as

    select custid, Table_1.store, Amount
    from Table_1
    left join table_2 on Table_1.store = table_2.store and product != 'prod6'
    where table_2.store is null

    Regards,

    Francesc

  • You are right about the sample data not matching the output.

    frfernan (1/16/2012)


    Anyway I prefer code joins rather than subselects...

    The LEFT JOIN...IS NULL construction does not have quite the same semantics as the original query (as mentioned to Chris earlier), and will frequently be less efficient than an anti-semi-join (however it is expressed in T-SQL). See http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/

    I believed that a LEFT JOIN performed better than a NOT IN clause; more generally, I had accepted that a JOIN is usually better in performance than a subselect. You say that is not always so - many thanks for your explanations and links.

    Francesc
