How to change a Scan to a Seek and optimize

  • Hello,

    I've attached a sample query and the actual execution plan. I want to further optimize the query and change the scan to a seek.

    Do I have to create an index or modify the query in this case?  Please let me know your thoughts.

     

  • For others who want to help without downloading, here is the query:

    USE Database1
    set statistics io on;
    set statistics time on;

    SELECT
    Object1.Column1,
    Object1.Column2,
    Object1.Column3,
    Object1.Column4,
    Object1.Column5,
    Object1.Column6,
    Object1.Column7,
    Object1.Column8,
    Object1.Column9

    FROM Schema1.Object2 Object1
    INNER JOIN
    (SELECT Column1, Function1(Column3) AS Column3
    FROM Schema1.Object2 GROUP BY Column1) Object3 ON Object1.Column1 = Object3.Column1 AND Object1.Column3 = Object3.Column3

    GO

    My question is: why do you want to change it from a SCAN to a SEEK?  If you look at the execution plan, your SEEK has a higher estimated cost than the SCAN and had 8 executions vs. the 1 for the scan.  This is because your SCAN (the subquery) needs to get all of the rows from Object2 in order to compare them to the main query on Object2.

    Blindly trying to change SCANs to SEEKs may hurt performance or offer no benefit.  If you NEED to look at all rows, a SCAN is often the best way to do it.

    Adding indexes alone won't change this SCAN to a SEEK, because there is no predicate to seek on.  Adding a WHERE clause on an indexed column should change the SCAN to a SEEK, but then you will be filtering your result set.
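
    To illustrate what that combination looks like (using the anonymized names; the index name, the @someValue variable, and the INT type are all made up for this example, so treat it as a sketch rather than a recommendation):

    CREATE NONCLUSTERED INDEX IX_Object2_Column1
        ON Schema1.Object2 (Column1) INCLUDE (Column3);  -- hypothetical index

    DECLARE @someValue int = 1;  -- hypothetical filter value and type

    SELECT Column1, Column3
    FROM Schema1.Object2
    WHERE Column1 = @someValue;  -- this predicate is what allows a SEEK on the index, but it also filters the rows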

    But your table has 8 rows; chances are this is not a query that is worth tuning any further than it already is.  I expect this completes nearly instantly.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thanks for your input. Yes, the query is running in a second now; however, the table is expected to grow over time, from 8 rows to many more. So I am thinking of tuning it as a proactive measure. The SCAN will be a problem at a later point when we load a million records into this table. If not a SCAN, do you see anything else that could make this query run in mere milliseconds?  By a WHERE clause, do you mean adding something like "WHERE ... = NULL" at the end of the query? Please advise. Thanks!!

    There is always something; it is just a question of what you are willing to do.

    Selecting fewer columns can improve performance, removing functions can improve performance, reducing the result set can improve performance.

    Fewer columns may require a different index to get the best performance, but indexes are a tricky beast when doing performance tuning.  Each index you add will slow down INSERT, UPDATE and DELETE performance.

    Removing the function may help performance too, as it is presumably a calculation being done on Column3, possibly a lookup related to Column3, or it could just be returning Column3 after doing no real work.  I cannot see that function, so it is hard to say whether removing it will help or not.  Depending on what Function1 does, you might be able to rewrite the query without the INNER JOIN.  You may get a performance boost by changing the INNER JOIN to a WHERE clause, since you aren't actually using any columns from that subquery apart from filtering the result set.

    Since you are not filtering the result set but looking at every row, even if you do switch the SCAN to a SEEK you are still essentially going to be doing a SCAN.

    Reducing the result set (i.e. adding a WHERE clause and filtering out some of the data) will likely improve performance, but that likely isn't what you want to do.  Heck, adding "WHERE 1=0" will improve performance but will give you no results.

    Since your subquery has no WHERE clause, you are going to be looking at all rows in the table.  The main query is doing a SEEK, but it looks at every row, so it's essentially a SCAN.  Since the query will need to look at all rows with every execution, as the table grows in size it will get slower.

    If you are wanting to remove the INNER JOIN, I THINK this query should be identical and should have similar performance (test it though):

    USE Database1
    set statistics io on;
    set statistics time on;

    SELECT
    Object1.Column1,
    Object1.Column2,
    Object1.Column3,
    Object1.Column4,
    Object1.Column5,
    Object1.Column6,
    Object1.Column7,
    Object1.Column8,
    Object1.Column9

    FROM Schema1.Object2 Object1
    WHERE Object1.Column3 = Function1(Object1.Column3)
    GO

    My understanding is that the above query should be identical to what you had, but with only 1 lookup on the table.  It'll still be a SCAN because we need to look at each and every row, but a single table lookup SHOULD be faster than 2 table lookups.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thank you for sharing the updated query!!  I will test it out for sure.

    To answer your question about the function: it is a MAX function, e.g. MAX(Column3).

    Sincerely!

  • That function is going to be a performance killer - no matter what.  As it is used - it must be a scalar function and therefore needs to execute for every row.

    If the function is required - then converting it to an inline table-valued function could possibly improve the performance.  No way to tell for sure unless you post that function.
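
    For reference, the general shape of an inline table-valued function is below.  This is a purely hypothetical sketch - the function name, the @Column1 parameter, and the INT type are all assumptions, since the real function body is unknown:

    CREATE FUNCTION Schema1.fn_MaxColumn3 (@Column1 int)
    RETURNS TABLE
    AS
    RETURN
    (
        -- single-statement body, so the optimizer can inline it into the calling query
        SELECT MAX(o.Column3) AS Column3
        FROM Schema1.Object2 o
        WHERE o.Column1 = @Column1
    );

    The calling query would then reference it with CROSS APPLY instead of invoking a scalar function once per row.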

    The subquery is performing a group by...not sure if that is necessary either, but it depends on the data and the requirement for this query.  Since the query does not return any data from that subquery, it may be possible to convert that to an EXISTS instead of a join but again it will depend on what the function does.
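
    If the function really is just a MAX per Column1 group, the EXISTS shape would look something like this (an untested sketch using the anonymized names; the o2 alias is made up):

    SELECT Object1.Column1,
    Object1.Column2,
    Object1.Column3,
    Object1.Column4,
    Object1.Column5,
    Object1.Column6,
    Object1.Column7,
    Object1.Column8,
    Object1.Column9
    FROM Schema1.Object2 Object1
    WHERE EXISTS (SELECT 1
                  FROM Schema1.Object2 o2
                  WHERE o2.Column1 = Object1.Column1
                  GROUP BY o2.Column1
                  HAVING MAX(o2.Column3) = Object1.Column3);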

    Any way you can post that function and maybe some sample data?

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • If the function is just a MAX, I would change the query to be something more like this:

    DECLARE @maxValue int;

    SELECT @maxValue = MAX(Column3)
    FROM Schema1.Object2;

    SELECT Object1.Column1,
    Object1.Column2,
    Object1.Column3,
    Object1.Column4,
    Object1.Column5,
    Object1.Column6,
    Object1.Column7,
    Object1.Column8,
    Object1.Column9
    FROM Schema1.Object2 Object1
    WHERE Object1.Column3 = @maxValue;

    That should get you a bit of a performance boost, as your function only needs to be run once.  If you have an index on Column3, that will help this query but likely hurt the performance of other queries.  I am also assuming that Column3 is an INT... since it was a MAX, I just assumed it was an INT.
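
    If you do decide an index on Column3 is worth the trade-off, it would be something along these lines (the index name is made up):

    CREATE NONCLUSTERED INDEX IX_Object2_Column3
        ON Schema1.Object2 (Column3);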

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Yes, you are correct!  It is the PK, INT, NOT NULL.

  • Oh nice.  Then it is likely already sorted, so the query I gave SHOULD give you pretty good performance.

    My opinion, that is likely the best you will be able to get.  Offhand, I am not thinking of anything that could optimize that query any more, apart from doing something incredibly strange.  What I mean by incredibly strange is having a second table that ONLY holds the columns from Object2 where Column3 = MAX(Column3), populated by a trigger (AFTER INSERT, UPDATE, DELETE).  That would make your query as fast as possible, but it would hurt the performance of INSERT, UPDATE and DELETE a lot, could cause deadlocks and blocking, and would make end users grumpy.
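
    Purely to illustrate what I mean, a minimal sketch of that trigger approach might look like the following.  The Schema1.Object2_Latest side table and the trigger name are made up, and the side table would have to already exist with the same columns as Object2:

    CREATE TRIGGER Schema1.trg_Object2_Latest
    ON Schema1.Object2
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- rebuild the side table so it only ever holds the row(s) with the current MAX(Column3)
        DELETE FROM Schema1.Object2_Latest;

        INSERT INTO Schema1.Object2_Latest
            (Column1, Column2, Column3, Column4, Column5, Column6, Column7, Column8, Column9)
        SELECT Column1, Column2, Column3, Column4, Column5, Column6, Column7, Column8, Column9
        FROM Schema1.Object2
        WHERE Column3 = (SELECT MAX(Column3) FROM Schema1.Object2);
    END;

    Again, I would not actually do this unless read performance is critical and you can live with the extra write cost.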

    That being said, if a close-to-real-time result is all you require, you could set up a job to call an SSIS package that pulls the data out of Object2 where Column3 = MAX(Column3) on a schedule.  But that too feels like overkill for what you are going for.  Using the query I mentioned previously, you should get decent performance even with large data sets.  And if you start getting slower performance as the max value for Column3 changes, you could add this to the end:

    OPTION (OPTIMIZE FOR (@maxValue UNKNOWN))

    This will help with the parameter-sniffing problem and (in the case of a stored procedure) does not need the RECOMPILE hint; that is presuming the above query is the slow bit.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • @Brian - your solution is not the same as the original query.  The original query is joining to the max value of Column3 for each Column1 - not the max across the whole table.

    We would need to see the original query - and not the anonymized version.  I suspect that the subquery is actually using the same table and is set up to get the latest row available.  If that is the case, this can be done using ROW_NUMBER() OVER().

    SELECT TOP 1 WITH TIES
    Object1.Column1,
    Object1.Column2,
    Object1.Column3,
    Object1.Column4,
    Object1.Column5,
    Object1.Column6,
    Object1.Column7,
    Object1.Column8,
    Object1.Column9
    FROM Schema1.Object2 Object1
    ORDER BY ROW_NUMBER() OVER (PARTITION BY Object1.Column1 ORDER BY Object1.Column3 DESC)

    You could also use a CTE to get the row number...
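
    As an untested sketch of the CTE form, using the same anonymized names (the Ranked CTE name and the rn alias are made up):

    WITH Ranked AS
    (
        SELECT Object1.Column1,
               Object1.Column2,
               Object1.Column3,
               Object1.Column4,
               Object1.Column5,
               Object1.Column6,
               Object1.Column7,
               Object1.Column8,
               Object1.Column9,
               ROW_NUMBER() OVER (PARTITION BY Object1.Column1 ORDER BY Object1.Column3 DESC) AS rn
        FROM Schema1.Object2 Object1
    )
    SELECT Column1, Column2, Column3, Column4, Column5, Column6, Column7, Column8, Column9
    FROM Ranked
    WHERE rn = 1;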

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Thanks Jeffrey!  You are right.  My query is not going to work due to the comparison on Column1.  My query would have only returned rows that had Column3 = MAX(Column3) across the whole table, which could exclude rows that should be included.

    Jeffrey's query hits the nail on the head though (assuming I understand the anonymized query correctly).

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • The query does not have any WHERE condition, so there is nothing to SEEK there.

    An aggregation query still has to go through every row in the table to return the result set.

    Having an index on (Column1, Column3) might help a bit by reducing the amount of data to be scanned, but not by much.  It will still be an Index Scan.
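
    Something along these lines, as a sketch (the index name is made up):

    CREATE NONCLUSTERED INDEX IX_Object2_Column1_Column3
        ON Schema1.Object2 (Column1, Column3);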

     

    _____________
    Code for TallyGenerator
