A Plunge Into TOP

,

Introduction

TOP is one of the many syntactical operators available in T-SQL and at a first view, it could seem very simple and not particularly interesting. According to the official documentation, it “limits the rows returned in a query result set to a specified number of rows or percentage of rows”.

The following is the syntax for SQL Server and Azure SQL Database:

[
TOP (expression) [PERCENT]
[ WITH TIES ]
]

TOP has several characteristics that are worth mentioning:

  • It is not an ANSI standard operator
  • If you specify PERCENT, the expression argument is converted to the FLOAT data type. Otherwise the argument is converted to a BIGINT; in any case the argument must be greater or equal to zero
  • It can be used without the ORDER BY clause. In this case you, do not know to which order the results of the TOP are related
  • It always runs in a serial zone of the execution plan

In this article, I am going to dig into two interesting scenarios. The first one is when the TOP argument is small enough to leverage the Row Goal. The second one is when it is big enough to ensure no rows are cut from the results, where we will see this can be used as a powerful tuning tool.

Notes

  1. I have carried out all the scripts on SQL Server 2019 CTP 2.5
  2. All the examples, with the exception of the last one, are based on AdventureWorks database with the addition of two tables (Sales.SalesOrderHeaderEnlarged and Sales.SalesOrderDetailEnlarged) created by the Jonathan Kehayas’ script that you can find here. The last example is self-consistent and database independent.
  3. I am often going to use a particular number: 2147483647 (231-1). This is the maximum for the INT data type, and it is supposed to be higher than the cardinality of each table I will refer to. If you think that it is not big enough for your environment, you could use the maximum for the BIGINT data type and for sure, you will not have a table with such number of rows: 9223372036854775807 (263-1)

The Row Goal

Referring to the official documentation, we know that “when the Query Optimizer estimates the cost of a query execution plan, it usually assumes that all qualifying rows from all sources have to be processed. However, some queries cause the Query Optimizer to search for a plan that will return a smaller number of rows faster”.

If you want to limit the number of the returned rows specifying the target number, you can use one of the specific keywords listed below.

  • TOP, IN or EXISTS clauses
  • FAST number_of_rows query hint
  • SET ROWCOUNT number_of_rows statement

In this case, and if other conditions that we will see later are observed, the Query Optimizer (QO) can leverage the “Row Goal”, that basically is an optimization strategy.

As an example, take a look at the following query and its plan:

select top 10 * from Production.Product;

Fig. 1

At a first view, one could think that the Clustered Index Scan operator performs a scan of the whole table and then the TOP operator cuts all the rows except the first ten. Actually, as it is clear from the tooltip, the Clustered Index Scan is aware that only ten records are requested. Therefore, the number of rows that are read and the cost of the whole query are reduced.

When the Row Goal is applied, the QO usually chooses non-blocking operators (nested loop joins, index seeks and lookups) instead of blocking operators (sort and hash).

Is there the Row Goal in the plan?

Starting from SQL Server 2014 SP3, SQL Server 2016 SP2, and SQL Server 2017 CU3, we have the new EstimateRowsWithoutRowGoal attribute for each operator of the plan affected by the Row Goal. It returns the number of rows that would have been estimated if the Row Goal were not applied.

For the versions of SQL Server prior to the ones mentioned above, it is not evident if the Row Goal is acting, but in this case we have at least two clues. The first one is about the Estimated Number of Rows for a scan of a table; if we see that there is not any residual predicate and that the Estimated Number of Rows is less than the cardinality of the table, then the Row Goal could be applied.

The second clue is when the following condition is true:

Estimated Operator Cost < (Estimated I/O Cost + Estimated CPU Cost) * Estimated Number of Executions

Although basically the Row Goal is an optimization strategy, in several circumstances it leads to an inefficient query plan. Very often, this is due to a higher number of rows to manage at runtime than expected or to an adverse data distribution.

In these situations, one possible workaround is to disable the Row Goal with the query hint, DISABLE_OPTIMIZER_ROWGOAL, available from SQL Server 2016 SP1, or the trace flag, 4138.

Final considerations

I would like to underline a couple of important aspects related to the Row Goal. The first one is that different Row Goals can be applied in different sections of the plan. Think for example about a query with several CTEs, each one with its own TOP clause.

The second one is that the Row Goal is applied when the ‘goal’ is less than the regular estimate; moreover, the Row Goal affects the optimizer choices, if there are any! If there are not more than one choice, the Row Goal is not applied.

From here, it should be clear that the only presence of TOP, FAST, IN, EXISTS in a query or ROWCOUNT in a script does not ensure the Row Goal.

A Tuning Tool

T-SQL is a declarative language. This means that the user enters code and lets the engine take care of choosing the best control flow to get the result. In short, writing the code, we express WHAT we want to do but not HOW to do; this is the QO’s job.

Using the TOP operator, there is the possibility to keep control of several choices of the QO and to address them towards the desired direction. In this sense, we can consider TOP as a tuning tool.

A First Example

As a first demonstrative example, look at the following queries. They logically do the same thing, but actually, they are very different although they give the same result.

SELECT
      BusinessEntityID
    , PersonType
    , FirstName
    , LastName
    , ModifiedDate
FROM  Person.Person
WHERE BusinessEntityID = 10000;
SELECT *
FROM
       (
           SELECT TOP 2147483647
                  BusinessEntityID
                , PersonType
                , FirstName
                , LastName
                , ModifiedDate
           FROM   Person.Person
       ) AS t
WHERE  BusinessEntityID = 10000;

These are the two related plans:

Fig. 2

The first one performs a Clustered Index Seek, 3 logical reads and ends in 2 ms; the second one performs a Clustered Index Scan, 3821 logical reads and ends in 184 ms.

What produces such a different plan in the second query is the presence of TOP in the derived table. The QO can foresee, using the statistics, the number of rows that will be extracted and that this number will probably be less than 2147483647, but it must guarantee not to go beyond that number. Therefore, at the beginning it performs a full scan and then, if it has more rows than requested, it applies the TOP criteria and only afterwards activates the filter of the WHERE condition.

The change done in this example has worsened the performance, but this is not the point. The important thing to understand is that the usage of the TOP operator inside a derived table forces the QO to change the shape of the original plan.

In the next sections, we will see several examples of how TOP can be helpful during the query tuning.

Join order

When we write a query with several joined tables, probably we write the join chain in a way that seems logical to us, but when we look at the query plan, very often we are surprised by the QO choices. Every T-SQL developer has asked himself if changes in the join order of the query can produce some modifications in the plan. Well, they do not.

All the following queries have the same plan, nevertheless the different join order, the usage of derived tables or CTEs:

--Original query
SELECT
               o.OrderDate
             , o.PurchaseOrderNumber
             , od.OrderQty
             , c.TerritoryID
FROM
               Sales.SalesOrderHeader AS o
    INNER JOIN Sales.SalesOrderDetail AS od
        ON o.SalesOrderID = od.SalesOrderID
    INNER JOIN Sales.Customer AS C
        ON o.CustomerID   = c.CustomerID
WHERE
               o.OrderDate       = '2011-05-31'
               AND c.TerritoryID = 6;
--Join order changed
SELECT
               o.OrderDate
             , o.PurchaseOrderNumber
             , od.OrderQty
             , c.TerritoryID
FROM
               Sales.Customer AS C
    INNER JOIN Sales.SalesOrderHeader AS o
        ON o.CustomerID   = c.CustomerID
    INNER JOIN Sales.SalesOrderDetail AS od
        ON o.SalesOrderID = od.SalesOrderID
WHERE
               o.OrderDate       = '2011-05-31'
               AND c.TerritoryID = 6;
--With a derived table
SELECT
               x.OrderDate
             , x.PurchaseOrderNumber
             , x.OrderQty
             , c.TerritoryID
FROM
               (
                   SELECT
                                  o.OrderDate
                                , o.PurchaseOrderNumber
                                , od.OrderQty
                                , o.CustomerID
                   FROM
                                  Sales.SalesOrderHeader AS o
                       INNER JOIN Sales.SalesOrderDetail AS od
                           ON o.SalesOrderID = od.SalesOrderID
               ) AS x
    INNER JOIN Sales.Customer AS C
        ON x.CustomerID = c.CustomerID
WHERE
               x.OrderDate       = '2011-05-31'
               AND c.TerritoryID = 6;
--With a Cte
;
WITH x
AS (SELECT
                   o.OrderDate
                 , o.PurchaseOrderNumber
                 , od.OrderQty
                 , o.CustomerID
    FROM
                   Sales.SalesOrderHeader AS o
        INNER JOIN Sales.SalesOrderDetail AS od
            ON o.SalesOrderID = od.SalesOrderID)
SELECT
               x.OrderDate
             , x.PurchaseOrderNumber
             , x.OrderQty
             , c.TerritoryID
FROM
               x
    INNER JOIN Sales.Customer AS C
        ON x.CustomerID = c.CustomerID
WHERE
               x.OrderDate       = '2011-05-31'
               AND c.TerritoryID = 6;

The following is the plan related to all of them:

Fig. 3

As we can see, the QO joins SalesOrderHeader with Customer first and then it joins the result with SalesOrderDetails.

Now, imagine for a moment that for tuning reasons we wanted to see if joining SalesOrderHeader with SalesOrderDetails first could produce a better plan. It is possible to do it by using a join hint, maintaining the “Nested Loop” join type.

SELECT
                    o.OrderDate
                  , o.PurchaseOrderNumber
                  , od.OrderQty
                  , c.TerritoryID
FROM
                    Sales.SalesOrderHeader AS o
    INNER LOOP JOIN Sales.SalesOrderDetail AS od
        ON o.SalesOrderID = od.SalesOrderID
    INNER JOIN      Sales.Customer AS C
        ON o.CustomerID   = c.CustomerID
WHERE
                    o.OrderDate       = '2011-05-31'
                    AND c.TerritoryID = 6;

Fig. 4

In this way, we have reached the goal to change the join order, but this solution has two drawbacks. The first one is that we have forced the QO to use a loop join and it will never be free to change this type of join even if the statistics and the data distribution should change.

The second drawback is that a single join hint enforces the join order of the entire query. We do not want this. We want to join two tables first considering them as a single block leaving the QO free to join this block and the other tables as it wants.

The TOP operator can help to reach our goal.

;WITH x
AS (SELECT         TOP 2147483647
                   o.OrderDate
                 , o.PurchaseOrderNumber
                 , od.OrderQty
                 , o.CustomerID
    FROM
                   Sales.SalesOrderHeader AS o
        INNER JOIN Sales.SalesOrderDetail AS od
            ON o.SalesOrderID = od.SalesOrderID)
SELECT
               x.OrderDate
             , x.PurchaseOrderNumber
             , x.OrderQty
             , c.TerritoryID
FROM
               x
    INNER JOIN Sales.Customer AS C
        ON x.CustomerID = c.CustomerID
WHERE
               x.OrderDate       = '2011-05-31'
               AND c.TerritoryID = 6;

Fig. 5

The TOP operator in the CTE forces the QO to join the SalesOrderHeader and SalesOrderDetails first. This solution is better than the previous one because we do not have the two drawbacks anymore. Obviously, we can apply the same technique also in the derived table version of the query.

Join Type

With the same technique, it is possible to try changing the type of join chosen by the QO in order to see if a different one can help to improve the performance. As an example, carry out the following query:

SELECT
               p.ProductSubcategoryID
             , d.SalesOrderID
             , d.SalesOrderDetailID
             , d.CarrierTrackingNumber
FROM
               Production.Product AS p
    INNER JOIN Sales.SalesOrderDetail AS d
        ON d.ProductID = p.ProductID
           AND d.ModifiedDate
           BETWEEN '20111201' AND '20111228';

It ends in 128 ms using the Hash Match operator.

Fig. 6

Now, suppose that we wanted to see if the performance would benefit from a Merge Join instead of the Hash Match. One essential condition for the usage of the Merge Join is to have the two sets of data ordered in the same way. Since the Product table is read with a scan of the clustered index that is ordered by ProductId, the idea is to substitute the table SalesOrderDetail with a derived table where the data are ordered in the same way.

The TOP operator is essential for the usage of the ORDER BY clause inside the derived table. The following is the new query version and its plan:

SELECT
        p.ProductSubcategoryID
      , d.SalesOrderID
      , d.SalesOrderDetailID
      , d.CarrierTrackingNumber
FROM
        Production.Product AS p
    INNER JOIN
        (
            SELECT   TOP 2147483647
                     SalesOrderID
                   , SalesOrderDetailID
                   , CarrierTrackingNumber
                   , ProductID
            FROM     Sales.SalesOrderDetail
            WHERE
                     ModifiedDate
            BETWEEN  '20111201' AND '20111228'
            ORDER BY ProductID
        ) AS d
        ON d.ProductID = p.ProductID;

Fig. 7

In this way, the QO has chosen the Merge Join instead of the Hash Join and the query ends in 120 ms.

What if we wanted to test the Nested Loop join? We only have to cheat the QO making it believe that the derived table has very few rows. Once again, we can do it using the TOP operator with a variable in order to decouple the compilation from the execution.

DECLARE @i INT = 2147483647;
SELECT
        p.ProductSubcategoryID
      , d.SalesOrderID
      , d.SalesOrderDetailID
      , d.CarrierTrackingNumber
FROM
        Production.Product AS p
    INNER JOIN
        (
            SELECT   TOP (@i)
                     SalesOrderID
                   , SalesOrderDetailID
                   , CarrierTrackingNumber
                   , ProductID
            FROM     Sales.SalesOrderDetail
            WHERE
                     ModifiedDate
            BETWEEN  '20111201' AND '20111228'
            ORDER BY ProductID
        ) AS d
        ON d.ProductID = p.ProductID
OPTION (OPTIMIZE FOR (@i = 1));

Fig. 8

As it was foreseeable, the performance has worsened substantially (309 ms) but it is without any doubt an interesting example of the usage of TOP operator with a variable.

Breakdown

When dealing with a complex query that is not particularly fast with several joins that involve tables, CTEs, or derived tables, one possible useful technique to speed up the execution is to break the query into multiple queries and materialize the intermediate result sets into temporary tables. In this way, SQL Server has the possibility to manage smaller plans with more affordable statistics related to the intermediate sets of data.

As we saw above, the TOP operator is a way to break down a query in memory, even if in general the materialization is more efficient and not always this technique gets the desired result in terms of performance. The following is a demonstrative example.

SELECT
               h.OrderDate
             , h.Status
             , d.OrderQty
             , d.UnitPrice
             , c.rowguid
             , c.ModifiedDate
             , t.TerritoryID
             , t.Name
             , crc.CardNumber
             , PrName = p.Name
             , p.ListPrice
             , sc.Name
             , cat.Name
             , m.Name
             , m.Instructions
             , st.Name
             , cat.Name
             , d.ProductID
FROM
               Sales.SalesOrderHeaderEnlarged AS h
    INNER JOIN Sales.SalesOrderDetailEnlarged AS d
        ON h.SalesOrderID         = d.SalesOrderID
    INNER JOIN Sales.Customer AS C
        ON c.CustomerID           = h.CustomerID
    INNER JOIN Sales.Store AS st
        ON st.BusinessEntityID    = c.StoreID
    INNER JOIN Sales.SalesTerritory AS t
        ON t.TerritoryID          = CASE h.SalesOrderID % 3
                                        WHEN 0 THEN
                                            1
                                        WHEN 1 THEN
                                            2
                                        WHEN 2 THEN
                                            5
                                        ELSE
                                            10
                                    END
    INNER JOIN Sales.CreditCard AS crc
        ON crc.CreditCardID       = h.CreditCardID
    INNER JOIN Production.Product AS p
        ON p.ProductID            = d.ProductID
    INNER JOIN Production.ProductSubcategory AS sc
        ON p.ProductSubcategoryID = sc.ProductSubcategoryID
    INNER JOIN Production.ProductCategory AS cat
        ON cat.ProductCategoryID  = sc.ProductCategoryID
    INNER JOIN Production.ProductModel AS m
        ON m.ProductModelID       = p.ProductModelID
WHERE
               d.OrderQty                                > 12
               AND DATEDIFF(DAY, h.OrderDate, h.DueDate) < 100
               AND d.ProductID                           = 712;

The above query takes an average of 1230 ms to complete. In order to break down this query, I am going to use a derived table; it gathers the tables Sales.SalesOrderHeaderEnlarged, Sales.SalesOrderDetailEnlarged and their related WHERE conditions. The following query does logically the same thing and extracts the same rows. It takes an average of 630 ms that means it is more or less 50% faster than the previous one.

SELECT
               h.OrderDate
             , h.Status
             , h.OrderQty
             , h.UnitPrice
             , c.rowguid
             , c.ModifiedDate
             , t.TerritoryID
             , t.Name
             , crc.CardNumber
             , PrName = p.Name
             , p.ListPrice
             , sc.Name
             , cat.Name
             , m.Name
             , m.Instructions
             , st.Name
             , cat.Name
             , h.ProductID
FROM
               (
                   SELECT         TOP 2147483647
                                  h.OrderDate
                                , h.Status
                                , h.CustomerID
                                , h.SalesOrderID
                                , h.CreditCardID
                                , d.OrderQty
                                , d.UnitPrice
                                , d.ProductID
                   FROM
                                  Sales.SalesOrderHeaderEnlarged AS h
                       INNER JOIN Sales.SalesOrderDetailEnlarged AS d
                           ON h.SalesOrderID = d.SalesOrderID
                   WHERE
                                  DATEDIFF(DAY, h.OrderDate, h.DueDate) < 100
                                  AND d.OrderQty                        > 12
                                  AND d.ProductID                       = 712
               ) AS h
    INNER JOIN Sales.Customer AS C
        ON c.CustomerID           = h.CustomerID
    INNER JOIN Sales.Store AS st
        ON st.BusinessEntityID    = c.StoreID
    INNER JOIN Sales.SalesTerritory AS t
        ON t.TerritoryID          = CASE h.SalesOrderID % 3
                                        WHEN 0 THEN
                                            1
                                        WHEN 1 THEN
                                            2
                                        WHEN 2 THEN
                                            5
                                        ELSE
                                            10
                                    END
    INNER JOIN Sales.CreditCard AS crc
        ON crc.CreditCardID       = h.CreditCardID
    INNER JOIN Production.Product AS p
        ON p.ProductID            = h.ProductID
    INNER JOIN Production.ProductSubcategory AS sc
        ON p.ProductSubcategoryID = sc.ProductSubcategoryID
    INNER JOIN Production.ProductCategory AS cat
        ON cat.ProductCategoryID  = sc.ProductCategoryID
    INNER JOIN Production.ProductModel AS m
        ON m.ProductModelID       = p.ProductModelID;

From the Field

Some months ago, I had to face a problem with a query that broke. The following code reconstructs the situation that is not particularly brilliant from the point of view of the table design.

We have the table T1, with the column RecValue (VARCHAR type) whose content can be a simple alphanumeric string or a string representation of a “date” or of a “number”. The table T1 has a reference to table T2, where there is the ContentType column that indicates the type of data contained in T1.RecValue. The allowed values of ContentType are ‘D’ that stands for Date, ‘N’ for Number and ‘C’ for Character.

The following script inserts three records in T1, each one referencing a different record in table T2 and with a content congruent with the ContentType value.

CREATE TABLE T2
(   Id          INT     NOT NULL PRIMARY KEY
  , ContentType CHAR(1) NOT NULL
  , CONSTRAINT Chk_ContentType CHECK (ContentType IN ( 'C', 'D', 'N' )));
GO
CREATE TABLE T1
(   Id       INT         IDENTITY(1, 1) PRIMARY KEY
  , TypeId   INT         NOT NULL
  , RecValue VARCHAR(50) NULL
  , CONSTRAINT Fk_T2 FOREIGN KEY (TypeId) REFERENCES T2 (Id));
GO
INSERT INTO T2 VALUES (1, 'C'), (2, 'D'), (3, 'N');
GO
INSERT INTO T1 VALUES (1, 'ABC'), (2, '2019-02-28'), (3, '12.3');
GO

The aim of the author of the query below is to FIRST select all the records that contain numeric values in T1.RecValue according to T2.ContentType and THEN cast T1.RecValue into a DECIMAL data type, filtering only those records that have that value higher than 10.

SELECT         NumericValue = CAST(T1.RecValue AS DECIMAL(17, 2))
FROM
               T1
    INNER JOIN T2
        ON T1.TypeId = T2.Id
WHERE
               T2.ContentType                          = 'N'
               AND CAST(T1.RecValue AS DECIMAL(17, 2)) > 10;

Unfortunately for the author, the QO’s strategy is different; it performs a Clustered Index Scan (look at the Estimated Plan in Fig. 9) with a predicate that carries out the cast into DECIMAL (Fig. 10) for every record of the table. Only after the scan, it joins with the records of the table T2 that have ‘N’ in ContentType.

Fig. 9

Fig. 10

The result is that as soon as the scan of table T1 encounters a value that cannot be converted into DECIMAL, the query breaks giving the following error: “Error converting data type varchar to numeric”.

The solution has been to rewrite the query forcing the QO FIRST to filter the records that should have numeric values in RecValue field and THEN comparing that value, casted into DECIMAL, with 10. All this has been possible thanks to TOP operator:

SELECT NumericValue = CAST(x.RecValue AS DECIMAL(17, 2))
FROM
       (
           SELECT         TOP 2147483647
                          T1.RecValue
           FROM
                          T1
               INNER JOIN T2
                   ON T1.TypeId = T2.Id
           WHERE          T2.ContentType = 'N'
       ) AS x
WHERE  CAST(x.RecValue AS DECIMAL(17, 2)) > 10;

The following is the related plan:

Fig. 11

In this plan, the Compute Scalar operator that follows the Clustered Index Scan represents the conversion into DECIMAL but actually it only defines the formula and it does not compute it. The calculation is performed only when the plan needs the result of it; this happens in the Filter operator that is AFTER the join with table T2. For this reason, the query ends without any error.

Conclusion

As a matter of fact, we can consider the TOP operator as a powerful tuning tool. It can be crucial in several situations but also useless or even damaging in others. For this reason, I would recommend to test a lot before using the techniques shown above in a production environment.

If you want to read more on these topics, my suggestion is to have a look at the following two cool and very interesting resources: an article written by Paul White and a video by Adam Machanic.

Rate

5 (6)

Share

Share

Rate

5 (6)