Strange behavior with TVF execution plan

  • Hi friends

    I had a quiet strange behavior with joining a table valued function last week.

    Business case:

    We've got a distributed system. To keep data in sync we use export and import processes. We have to import files with several thousand rows (using XML). After parsing the file we have to update existing data and add new rows. (I must not use any technology like BCP, SSIS or SS2k8 MERGE.)

    To avoid thousands of single selects I used a inline table-valued function which gets all concatenated keys. Internally, this function calls a T-SQL split function (like Jeff Moden's function). I concatenate all keys within the client process, send the keys to the server and JOIN the split function to the data table.

    My first split function used a inline Numbers table created with CROSS JOINS:

    ---========================================================

    -- create a Jeff Moden style string split function

    -- using a inline Numbers table

    IF (OBJECT_ID('SplitStringInlineNumbers') IS NULL)

    EXECUTE ('CREATE FUNCTION SplitStringInlineNumbers () RETURNS TABLE AS RETURN SELECT 1 A');

    GO

    ALTER FUNCTION SplitStringInlineNumbers (

    @text VARCHAR(MAX)

    ,@separator CHAR(1)

    )

    RETURNS TABLE

    AS

    RETURN

    WITH

    n1 (Num) AS (SELECT 1 UNION ALL SELECT 1), -- 2

    n2 (Num) AS (SELECT 1 FROM n1 CROSS JOIN n1 b), -- 4

    n3 (Num) AS (SELECT 1 FROM n2 CROSS JOIN n2 b), -- 16

    n4 (Num) AS (SELECT 1 FROM n3 CROSS JOIN n3 b), -- 256

    n5 (Num) AS (SELECT 1 FROM n4 CROSS JOIN n4 b), -- 65536

    -- 4294967296

    Numbers (Num) AS (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))

    FROM n5 CROSS JOIN n5 b

    )

    SELECT

    SUBSTRING(

    @text

    ,t.Num + 1

    ,CHARINDEX(@separator, @text, t.Num + 1) - t.Num - 1

    ) AS Item

    FROM Numbers t

    WHERE

    t.Num < LEN(@text)

    AND SUBSTRING(@text, t.Num, 1) = @separator

    GO

    Everything worked fine, though, as I looked into the ERRORLOG I noticed that SQL Server has no clue about SQL :-D. There have been 32 "Missing Join Predicate" warnings to tell me that my cool function appears to be scrap whenever SQL Server created a execution plan for this function.

    Next thought was, no problem use a database Numbers table. (Bad idea...). I slightly changed the function and removed the inline Numbers table:

    ---========================================================

    -- create a Jeff Moden style string split function

    -- using a database Numbers table

    IF (OBJECT_ID('SplitStringDbNumbers') IS NULL)

    EXECUTE ('CREATE FUNCTION SplitStringDbNumbers () RETURNS TABLE AS RETURN SELECT 1 A');

    GO

    ALTER FUNCTION SplitStringDbNumbers (

    @text VARCHAR(MAX)

    ,@separator CHAR(1)

    )

    RETURNS TABLE

    AS

    RETURN

    SELECT

    SUBSTRING(

    @text

    ,t.Num + 1

    ,CHARINDEX(@separator, @text, t.Num + 1) - t.Num - 1

    ) AS Item

    FROM Numbers t

    WHERE

    t.Num < LEN(@text)

    AND SUBSTRING(@text, t.Num, 1) = @separator

    Warnings gone, but ends up with a completely strange execution plan using a huge index spool and table spool. The old (potentially bad warned) function took 110ms for thousand keys, the new function took about 85,000ms!?!

    I noticed both functions are marked as non-deterministic (why?), so I changed my previously used TVFs to a procedure which executes the split function into a table variable and joins this table to the data. Finally everything works fine.

    Anyway, why are both functions marked as non-deterministic? Why does the inline-numbers table work 850 times faster than the database-numbers table?

    In respect to help you to help me, here is a sample environment which can be used as copy-paste. (Also copy the previous two functions into tempdb.)

    Setup tables and data on tempdb

    SET NOCOUNT ON;

    USE tempdb;

    GO

    ---========================================================

    -- drop existing test tables

    IF (OBJECT_ID('TestReferences') IS NOT NULL)

    DROP TABLE TestReferences;

    IF (OBJECT_ID('TestData') IS NOT NULL)

    DROP TABLE TestData;

    IF (OBJECT_ID('Numbers') IS NOT NULL)

    DROP TABLE Numbers;

    GO

    ---========================================================

    -- create test tables

    -- we need a numbers table

    CREATE TABLE Numbers

    (

    Num INT NOT NULL

    PRIMARY KEY CLUSTERED

    WITH (FILLFACTOR = 100)

    );

    -- any sample data

    CREATE TABLE TestData

    (

    Id INT NOT NULL IDENTITY

    PRIMARY KEY CLUSTERED

    ,SomeInt INT

    );

    -- several references which relate to the test data

    CREATE TABLE TestReferences

    (

    Id INT NOT NULL IDENTITY

    PRIMARY KEY CLUSTERED

    ,DataId INT NOT NULL

    REFERENCES TestData (Id)

    ,Reference VARCHAR(36)

    );

    -- reference column is indexed

    CREATE INDEX IX_TestReferences_Reference ON TestReferences (Reference);

    GO

    ---========================================================

    -- data

    -- create a numbers table with 50,000 rows

    INSERT INTO Numbers

    SELECT TOP(50000)

    ROW_NUMBER() OVER (ORDER BY (SELECT 1))

    FROM master.sys.all_columns c1

    CROSS JOIN master.sys.all_columns c2;

    -- create 10,000 data rows

    INSERT INTO TestData

    SELECT TOP(10000)

    Num

    FROM Numbers

    -- create two references for each data row

    INSERT INTO TestReferences

    SELECT

    d.Id

    ,CONVERT(VARCHAR(36), NEWID())

    FROM TestData d

    CROSS JOIN (SELECT 1 A UNION ALL SELECT 2) blah

    GO

    Call the functions

    SET NOCOUNT ON;

    USE tempdb;

    GO

    --DBCC FREEPROCCACHE;

    ---==================================================================

    -- call the inline-numbers function joined to the data table

    -- get some references from TestReferences table to search

    DECLARE @keys VARCHAR(MAX);

    SELECT @keys =

    CHAR(9)

    + (SELECT TOP(1000) Reference + CHAR(9) FROM TestReferences FOR XML PATH(''));

    --SET STATISTICS XML ON;

    --SET STATISTICS TIME ON;

    -- select all TestData related to the specified keys

    SELECT DISTINCT

    d.*

    FROM SplitStringInlineNumbers(@keys, CHAR(9)) keys

    JOIN TestReferences r ON keys.Item = r.Reference

    JOIN TestData d ON r.DataId = d.Id;

    --SET STATISTICS TIME OFF;

    --SET STATISTICS XML OFF;

    GO

    ---==================================================================

    -- call the database-numbers function joined to the data table

    -- get some references from TestReferences table to search

    DECLARE @keys VARCHAR(MAX);

    SELECT @keys =

    CHAR(9)

    + (SELECT TOP(1000) Reference + CHAR(9) FROM TestReferences FOR XML PATH(''));

    --SET STATISTICS XML ON;

    --SET STATISTICS TIME ON;

    -- select all TestData related to the specified keys

    SELECT DISTINCT

    d.*

    FROM SplitStringDbNumbers(@keys, CHAR(9)) keys

    JOIN TestReferences r ON keys.Item = r.Reference

    JOIN TestData d ON r.DataId = d.Id;

    --SET STATISTICS TIME OFF;

    --SET STATISTICS XML OFF;

    GO

    ---==================================================================

    -- call the database-numbers function into a table variable and

    -- join this to the data table

    -- get some references from TestReferences table to search

    DECLARE @keys VARCHAR(MAX);

    SELECT @keys =

    CHAR(9)

    + (SELECT TOP(1000) Reference + CHAR(9) FROM TestReferences FOR XML PATH(''));

    -- copy result of split function into a table variable

    DECLARE @lookup TABLE (Item VARCHAR(36) PRIMARY KEY CLUSTERED);

    INSERT INTO @lookup

    SELECT

    Item

    FROM SplitStringDbNumbers(@keys, CHAR(9));

    -- select all data by joining the lookup table

    SELECT DISTINCT

    d.*

    FROM @lookup keys

    JOIN TestReferences r ON keys.Item = r.Reference

    JOIN TestData d ON r.DataId = d.Id;

    First part calls the inline-numbers split function joined to the data table.

    Second part calls the database-numbers split function joined to the data table. BAD PART!

    Third part calls the database-numbers split function into a table-variable and joins it. (Performs good).

    (I cannot attach the execution plans, since I get an error while upload)

    I'd be glad about any suggestions about the WHY?

    Thanks

    Flo

    PS to the regulars: Splitting will never end :hehe:

  • Flo,

    The first part of your question is easy: SQL Server marks all functions as non-deterministic unless they are schema-bound. Schema-binding a function forces the engine to evaluate the content of the function to decide if it is deterministic or not.

    Simply change: RETURNS TABLE to RETURNS TABLE WITH SCHEMABINDING and then

    select OBJECTPROPERTYEX(object_id('SplitStringInlineNumbers', 'IF'), 'IsDeterministic') returns '1'.

    For details, see http://blogs.msdn.com/sqlprogrammability/default.aspx?p=5

    That's one of my most favouritest links of all time by the way 🙂

    Paul

  • Ha, so on second look, the same thing answers the second question. When the function does data access (the database numbers table) it can't be deterministic, so SQL Server can't rely on it always returning the same output for the same input, so it can't safely cache the results in a table spool - it must be recalculated every time, and it must access the database each time too! The link I posted before explains it pretty well, let me know if it is less than clear.

    Paul

  • Paul White (11/7/2009)


    Simply change: RETURNS TABLE to RETURNS TABLE WITH SCHEMABINDING and then

    select OBJECTPROPERTYEX(object_id('SplitStringInlineNumbers', 'IF'), 'IsDeterministic') returns '1'.

    Cool, thanks! Learned something new

    For details, see http://blogs.msdn.com/sqlprogrammability/default.aspx?p=5

    That's one of my most favouritest links of all time by the way 🙂

    Immediately added to my blog list 🙂

    Greets

    Flo

  • Paul White (11/7/2009)


    Ha, so on second look, the same thing answers the second question. When the function does data access (the database numbers table) it can't be deterministic, so SQL Server can't rely on it always returning the same output for the same input, so it can't safely cache the results in a table spool - it must be recalculated every time, and it must access the database each time too! The link I posted before explains it pretty well, let me know if it is less than clear.

    Paul

    Thanks again, the link you posted explains everything. Apparently SS2k5 does not mark the inline-numbers based solution as deterministic but handles it as a deterministic function.

    Thanks for opening my eyes!

    Flo

  • Hmmm well the issue is correct even if my rushed explanation there isn't quite right. The cause is still the deterministic thing, and the query plan problems still stem directly from that, but not in quite the way I had assumed. Nevertheless, you seem to be totally across the reasons for the behaviour, and that was the point of the thread really 🙂

    It's also good to be able to post that link again - there are still far too many experienced SQL people that don't know about that behaviour with SCHEMABINDING. Even those that do don't always realise that it applies to functions that don't do any data access...

    One of these days I will find a T-SQL function that does data access which doesn't suck horribly, but I am not holding my breath. I sometimes wish data access from T-SQL functions would be deprecated and swiftly removed, for all the problems that accompany it. If functions couldn't do data access, they could always be checked for determinism without (sometimes pointless) schema-binding. Oh well.

    Paul

  • Paul White (11/7/2009)


    Nevertheless, you seem to be totally across the reasons for the behaviour, and that was the point of the thread really 🙂

    Yep 🙂

    It's also good to be able to post that link again - there are still far too many experienced SQL people that don't know about that behaviour with SCHEMABINDING. Even those that do don't always realise that it applies to functions that don't do any data access...

    Like me.. Seems to be a topic I have to study - when I find the time. Unfortunately there are too many other important things I have to study (especially WCF, what is a huge universe with many gains and pains).

    I sometimes wish data access from T-SQL functions would be deprecated and swiftly removed, for all the problems that accompany it. If functions couldn't do data access, they could always be checked for determinism without (sometimes pointless) schema-binding. Oh well.

    Good point. And I learned something really important with problem. TVFs (especially single-statement) look really cool but should be used carefully.

    Greets

    Flo

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply