Question regarding the difference between two different ways to populate a new table in TSQL

  • TomThomson (8/24/2014)


    Jeff Moden (8/23/2014)


    There's nothing wrong with challenging a DA when it comes to performance

    up to there I agree with you 100%

    because, and with absolutely no malice in my heart, DAs don't actually need to know how to write a lick of code, never mind high performance code, to do their jobs correctly.

And I don't agree with that at all; a DA had better be able to produce a schema design which enables high-performance code to be written, and if the DA knows nothing about code, and in particular nothing about high-performance code, that DA is not going to be able to do that.

    I absolutely agree that they'd better be able to pull the proverbial rabbit out of the hat on schema design to enable high performance code to be written but, with the tools they use to do such a thing and the design patterns they follow, they don't actually need to know a thing about how to write the code.

I went through that about a year ago. A company wanted me to come in to help them write some code. They introduced me to the DA and he spent about an hour going over the schema, the table keys, and the relationships for a huge database. It was a work of art bordering on genius. I asked him why, with such a great design, they needed me to write code. His answer was (and I literally nearly fell backwards in my chair) "I only know how to design the databases... I don't actually know how to write code".

As another example, I went through that at a previous company that I worked for. They built a whole new OSS in Oracle. Again, the design of the database was outstanding in every way I could think of. I'd been assigned a task that I was having problems writing code for (it was early in my experience with Oracle), so I went to the DA that designed it and started drawing code on his white-board. He asked me to stop because he couldn't help. When I asked "Why not?", his comment was "This is gibberish to me... I don't know how to do a 3-table join, never mind what you've written so far". When I asked him how he designed the database, his reply was "The tools make it easy. I just need to fill in the right blanks and draw the right lines. It even builds the correct indexes for me".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Eirikur Eiriksson (8/24/2014)


    TomThomson (8/24/2014)


    Jeff Moden (8/23/2014)


    There's nothing wrong with challenging a DA when it comes to performance

    up to there I agree with you 100%

    because, and with absolutely no malice in my heart, DAs don't actually need to know how to write a lick of code, never mind high performance code, to do their jobs correctly.

And I don't agree with that at all; a DA had better be able to produce a schema design which enables high-performance code to be written, and if the DA knows nothing about code, and in particular nothing about high-performance code, that DA is not going to be able to do that.

We wouldn't want to leave the design of the internals of a building to the builders, would we?

    😎

No, and neither would I want to use a coder (as opposed to a developer) to design code. But I wouldn't want to use an architect who didn't know about joins and unions and projections and restrictions and filters and intersections and symmetric differences and so on to design a database - and if someone knows about all those things, they can presumably write relational expressions in some notation, even if that notation isn't SQL or T-SQL.

    edit; typos

    Tom

  • Jeff Moden (8/24/2014)


    I absolutely agree that they'd better be able to pull the proverbial rabbit out of the hat on schema design to enable high performance code to be written but, with the tools they use to do such a thing and the design patterns they follow, they don't actually need to know a thing about how to write the code.

I went through that about a year ago. A company wanted me to come in to help them write some code. They introduced me to the DA and he spent about an hour going over the schema, the table keys, and the relationships for a huge database. It was a work of art bordering on genius. I asked him why, with such a great design, they needed me to write code. His answer was (and I literally nearly fell backwards in my chair) "I only know how to design the databases... I don't actually know how to write code".

As another example, I went through that at a previous company that I worked for. They built a whole new OSS in Oracle. Again, the design of the database was outstanding in every way I could think of. I'd been assigned a task that I was having problems writing code for (it was early in my experience with Oracle), so I went to the DA that designed it and started drawing code on his white-board. He asked me to stop because he couldn't help. When I asked "Why not?", his comment was "This is gibberish to me... I don't know how to do a 3-table join, never mind what you've written so far". When I asked him how he designed the database, his reply was "The tools make it easy. I just need to fill in the right blanks and draw the right lines. It even builds the correct indexes for me".

    That sounds as if you've met some amazing data analysts - perhaps people who are using tools based on the research by people like Bernstein and Zaniolo and Beeri and Fagin and Raiha and Melkanoff and so on which I thought had been ignored by the industry (except IBM, who published some of it, but as far as I know didn't actually use it) and indeed by most of academia. I wish I'd met some of those DAs and got my hands on some of their tools.

    I still reckon a good DA can probably write relational expressions in some notation though, although it may not be SQL so some people won't recognise it as code.

    Tom

  • Thank you everyone for your responses. I obviously need to work on my confidence level and challenge our DAs when necessary. I am new to the organization and haven't felt confident enough to push back when I am directed to do something that does not make sense to me. Sounds like in this case in particular, I would have been better off pushing back a bit and requiring some supporting evidence before changing all of my code.

  • TomThomson (8/25/2014)


    Jeff Moden (8/24/2014)


    I absolutely agree that they'd better be able to pull the proverbial rabbit out of the hat on schema design to enable high performance code to be written but, with the tools they use to do such a thing and the design patterns they follow, they don't actually need to know a thing about how to write the code.

I went through that about a year ago. A company wanted me to come in to help them write some code. They introduced me to the DA and he spent about an hour going over the schema, the table keys, and the relationships for a huge database. It was a work of art bordering on genius. I asked him why, with such a great design, they needed me to write code. His answer was (and I literally nearly fell backwards in my chair) "I only know how to design the databases... I don't actually know how to write code".

As another example, I went through that at a previous company that I worked for. They built a whole new OSS in Oracle. Again, the design of the database was outstanding in every way I could think of. I'd been assigned a task that I was having problems writing code for (it was early in my experience with Oracle), so I went to the DA that designed it and started drawing code on his white-board. He asked me to stop because he couldn't help. When I asked "Why not?", his comment was "This is gibberish to me... I don't know how to do a 3-table join, never mind what you've written so far". When I asked him how he designed the database, his reply was "The tools make it easy. I just need to fill in the right blanks and draw the right lines. It even builds the correct indexes for me".

    That sounds as if you've met some amazing data analysts - perhaps people who are using tools based on the research by people like Bernstein and Zaniolo and Beeri and Fagin and Raiha and Melkanoff and so on which I thought had been ignored by the industry (except IBM, who published some of it, but as far as I know didn't actually use it) and indeed by most of academia. I wish I'd met some of those DAs and got my hands on some of their tools.

    I still reckon a good DA can probably write relational expressions in some notation though, although it may not be SQL so some people won't recognise it as code.

Indeed, they were amazing, IMHO. Still, it would have been nice if they understood the T-SQL language a bit more, but it was a happy sacrifice to have a really well-built database for a change.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • GilaMonster (8/24/2014)


    Jeff Moden (8/23/2014)


    Cathy DePaolo (8/22/2014)


I checked the execution plans of each, and they are the same. I am of the opinion that the subquery is not going to improve performance of the query, and the execution plans seem to prove me correct. I prefer the straight "select into" version because it is less code and, in my opinion, is a cleaner version.

    I was wondering if anyone in the community knows why I would be told, by a database architect, that the subquery version of this query would be better performing. I was forced to change all of my queries to use the subquery version, rather than a straight select into. I was not given a reason that satisfied me, so I am hoping someone in this community can tell me why I would be directed in this manner.

    Thank you for your time.
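(For illustration only: the original queries aren't included in this excerpt, but the two patterns being compared look roughly like the sketch below, with made-up table and column names.)

-- Pattern 1: straight SELECT ... INTO.
SELECT  CustomerID, OrderDate, TotalDue
  INTO  #NewTable
  FROM  dbo.SourceTable
 WHERE  OrderDate >= '20140101';

-- Pattern 2: the same result routed through a subquery (derived table) in the FROM clause.
SELECT  src.CustomerID, src.OrderDate, src.TotalDue
  INTO  #NewTable2
  FROM  (
         SELECT CustomerID, OrderDate, TotalDue
           FROM dbo.SourceTable
          WHERE OrderDate >= '20140101'
        ) src;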

Gosh, be careful now. In this case, it works out, but the Execution plans never prove anything having to do with performance. Not even the ACTUAL execution plans.

    Sorry, but I'm going to disagree with you there.

    If two queries have identical execution plans, then they will (in the absence of blocking or waits) perform identically.

    I don't mean same costs (costs are always estimated), I mean identical plans. If the operators are the same and the row counts are the same and the number of executions for each operator is the same, then the two queries will be executed the same way because the plan is the 'recipe' given to the execution engine.

    It's a good way to tell if two different ways of writing a query are equivalent. If they produce identical plans, then they have to be logically equivalent.

    I enjoy doing this to people who insist on 'subqueries in the from clause are BAD' or 'subqueries in the from clause are good' (I've seen both recently), as I can just write the queries with and without the subquery (logically equivalent ways) and then show that the execution plans are absolutely, 100%, completely identical in every way and hence there's no difference between the two queries by the time they reach the query execution engine.
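(To see this for yourself, here's a minimal sketch with an illustrative table name, not Gail's original example. Run both queries with the actual execution plan enabled; the optimizer typically flattens such a derived table, so the plans come out identical.)

-- Without a subquery in the FROM clause.
SELECT  o.CustomerID, SUM(o.TotalDue) AS TotalDue
  FROM  dbo.Orders o
 WHERE  o.OrderDate >= '20140101'
 GROUP  BY o.CustomerID;

-- With a logically equivalent subquery (derived table) in the FROM clause.
SELECT  d.CustomerID, SUM(d.TotalDue) AS TotalDue
  FROM  (
         SELECT CustomerID, OrderDate, TotalDue
           FROM dbo.Orders
        ) d
 WHERE  d.OrderDate >= '20140101'
 GROUP  BY d.CustomerID;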

    Sorry for the delay, ol' friend. Work has been a bugger and it took me a bit to write this up.

As with all else, "It Depends". First, let me state that under normal circumstances, I'd agree with you 100%. But there are edge cases where the execution plans are identical and yet the queries perform quite differently because of "other factors". If the "other factors" aren't present, then I agree... queries with identical execution plans will perform identically, as well. I just don't trust the execution plans even when they're identical, and so I always prove (and recommend proving) it some other way to remove all doubt.

    [font="Arial Black"]Edge Case #1[/font]

    For the first edge case, we need to build some good volume of data to make the differences apparent and obvious. Here's the code to build such data.

    [font="Arial Black"]Test Data[/font]

--===== If the test table exists, drop it to make reruns in SSMS easier.
IF OBJECT_ID('tempdb..#JBMTest','U') IS NOT NULL DROP TABLE #JBMTest;
GO
--===== Create and populate the test table on-the-fly.
     -- Note that the ISNULL turns the SomeID column into a NOT NULL column
     -- so we can add the clustered PK afterwards.
SELECT TOP (1000000)
       SomeID       = ISNULL(ROW_NUMBER() OVER (ORDER BY (SELECT 1)),0),
       SomeInt      = ABS(CHECKSUM(NEWID()))%50000+1,
       SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
       SomeFiller   = 'Now is the time for all good men to come to the aid of their country.',
       SomeMoney    = ABS(CHECKSUM(NEWID()))%10000 /100.0,
       SomeDate     = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
       SomeHex09    = RIGHT(NEWID(),09)
  INTO #JBMTest
  FROM sys.all_columns t1
 CROSS JOIN sys.all_columns t2;
--===== Add the PK.
     -- Takes about 1 second to execute.
ALTER TABLE #JBMTest
  ADD PRIMARY KEY CLUSTERED (SomeID);
GO

    [font="Arial Black"]Target Test Tables[/font]

    For this experiment, we also need two target test tables, one of which will contain a single row while the other has no rows in it. Other than that and the names of the tables and their related indexes, the tables are identical.

--===== Drop the test tables to make reruns in SSMS easier.
IF OBJECT_ID('tempdb..#JBMTest01','U') IS NOT NULL DROP TABLE #JBMTest01;
IF OBJECT_ID('tempdb..#JBMTest02','U') IS NOT NULL DROP TABLE #JBMTest02;
GO
--===== Create two identical but empty copies of the #JBMTest table.
SELECT * INTO #JBMTest01 FROM #JBMTest WHERE 1=0;
SELECT * INTO #JBMTest02 FROM #JBMTest WHERE 1=0;
ALTER TABLE #JBMTest01 ADD PRIMARY KEY CLUSTERED (SomeID);
ALTER TABLE #JBMTest02 ADD PRIMARY KEY CLUSTERED (SomeID);
GO
--===== Add just one row to the first test table.
INSERT INTO #JBMTest01 SELECT * FROM #JBMTest WHERE SomeID = 1;
GO

    [font="Arial Black"]The Test[/font]

Enable the actual execution plan and run the following code. Feel free to reverse the order of the two INSERT...SELECT statements just to make sure (you'll need to rerun the setup code above for a new run). You'll see that, except for the necessary table and index name changes to keep this demo simple, the execution plans are otherwise identical. Then check out the Messages tab and see that the queries operate quite differently.

--===== With the necessary exception for table and index names just for ease of demonstration,
     -- these two queries produce identical execution plans but the resource usage is quite different.
SET STATISTICS TIME,IO ON;
INSERT INTO #JBMTest01 SELECT * FROM #JBMTest WITH (TABLOCK) WHERE SomeID > 1;
INSERT INTO #JBMTest02 SELECT * FROM #JBMTest WITH (TABLOCK) WHERE SomeID > 1;
SET STATISTICS TIME,IO OFF;
GO

    [font="Arial Black"]Results[/font]

Here are the stats results from those queries...

Table '#JBMTest01__________________________________________________________________________________________________________0000000000C0'.
Scan count 0, logical reads 3046184, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#JBMTest____________________________________________________________________________________________________________0000000000BD'.
Scan count 1, logical reads 15423, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 1997 ms,  elapsed time = 2097 ms.

(999999 row(s) affected)

Table '#JBMTest____________________________________________________________________________________________________________0000000000BD'.
Scan count 1, logical reads 15423, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 1217 ms,  elapsed time = 2176 ms.

(999999 row(s) affected)

    [font="Arial Black"]The Reason[/font]

The reason why the stats are so different between the two identical queries is, of course, that the insert into #JBMTest02 is minimally logged whereas the insert into #JBMTest01 is not (because of the row already present in its clustered index).
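(If you'd rather see the logging difference directly instead of inferring it from STATISTICS IO, one way - a sketch of my own, not part of the original demo - is to compare the log bytes each insert generates inside an explicit transaction. It needs VIEW SERVER STATE, and since temp tables live in tempdb, that's the database we look at. Rebuild the target tables above first, then run each block and compare the byte counts.)

BEGIN TRANSACTION;
    INSERT INTO #JBMTest01 SELECT * FROM #JBMTest WITH (TABLOCK) WHERE SomeID > 1;
    -- Log bytes written by the current session's transaction in tempdb.
    SELECT dt.database_transaction_log_bytes_used
      FROM sys.dm_tran_database_transactions dt
      JOIN sys.dm_tran_session_transactions st
        ON st.transaction_id = dt.transaction_id
     WHERE st.session_id = @@SPID
       AND dt.database_id = 2; -- tempdb
ROLLBACK;

BEGIN TRANSACTION;
    INSERT INTO #JBMTest02 SELECT * FROM #JBMTest WITH (TABLOCK) WHERE SomeID > 1;
    SELECT dt.database_transaction_log_bytes_used
      FROM sys.dm_tran_database_transactions dt
      JOIN sys.dm_tran_session_transactions st
        ON st.transaction_id = dt.transaction_id
     WHERE st.session_id = @@SPID
       AND dt.database_id = 2; -- tempdb
ROLLBACK;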

    For the newbies out there that might not know, if you just close the window that you ran the above code from, the Temp Tables will clear leaving only a very small skeleton that will eventually go away. Feel free to explicitly drop the tables if that's what you'd prefer.

    [font="Arial Black"]Edge Case #2[/font]

    This next example is one of my favorites because I use the second part of the code to demonstrate that a recursive CTE is frequently slower and more resource intensive than a While Loop.

    [font="Arial Black"]The Test[/font]

In this case, we're going to use two identical While Loops, the second of which has some "other factors" that come into play. Each of the 10 execution plans (5 for each While Loop) is identical in every way, including the table name and the system-generated name of the clustered index.

DECLARE @StartTime DATETIME = GETDATE();
SET NOCOUNT OFF;

IF OBJECT_ID('tempdb..#JBMTest01','U') IS NOT NULL
   DROP TABLE #JBMTest01;

CREATE TABLE #JBMTest01 (N INT PRIMARY KEY CLUSTERED);

DECLARE @Counter INT = 1;
WHILE @Counter <= 5
BEGIN
    INSERT INTO #JBMTest01 (N) VALUES(@Counter);
    SELECT @Counter = @Counter + 1;
END;

SELECT '1st While Loop Duration: ' + CAST(DATEDIFF(ms,@StartTime,GETDATE()) AS VARCHAR(10)) + ' ms';
GO
---------------------------------------------------------
DECLARE @StartTime DATETIME = GETDATE();
SET NOCOUNT ON;

IF OBJECT_ID('tempdb..#JBMTest01','U') IS NOT NULL
   DROP TABLE #JBMTest01;

CREATE TABLE #JBMTest01 (N INT PRIMARY KEY CLUSTERED);

DECLARE @Counter INT = 1;
BEGIN TRANSACTION;
WHILE @Counter <= 5
BEGIN
    INSERT INTO #JBMTest01 (N) VALUES(@Counter);
    SELECT @Counter = @Counter + 1;
END;
COMMIT;

SELECT '2nd While Loop Duration: ' + CAST(DATEDIFF(ms,@StartTime,GETDATE()) AS VARCHAR(10)) + ' ms';
GO

    [font="Arial Black"]Results[/font]

    If you change both loops to count to 1 million instead of just 5 and disable the execution plan (so you don't get an error about too many execution plans), you end up seeing a rather significant difference (2nd While Loop is twice as fast).

    1st While Loop Duration: 12203 ms

    2nd While Loop Duration: 6226 ms

    [font="Arial Black"]The Reason[/font]

Again, Gail will already know the reason for the performance boost in the second While Loop. I'm probably not using the right terminology, but the simple explanation is that the log is only "committed" once in the second While Loop, because of the explicit transaction, instead of a million times as in the first While Loop (unless you have implicit transactions turned off, of course, and I don't recommend that).

To a lesser extent, the second While Loop also has the automatic generation of row counts turned off. It does take some time and energy to generate a million row counts of "1" as done by default on most systems. Using SET NOCOUNT ON with While Loops isn't just for the sake of appearance, or to keep from filling up job logs that may capture the output of the job, or to prevent unexpected row counts being returned to the GUI. It's for performance, as well.
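(Since the recursive CTE comparison is mentioned at the top of this edge case but its code isn't shown, here's a minimal sketch - my own illustration, not the original demo - of a recursive-CTE counterpart that builds the same 1..N sequence and can be timed against the two While Loops above.)

DECLARE @StartTime DATETIME = GETDATE();
IF OBJECT_ID('tempdb..#JBMTest01','U') IS NOT NULL DROP TABLE #JBMTest01;
CREATE TABLE #JBMTest01 (N INT PRIMARY KEY CLUSTERED);

-- Build the sequence with a recursive CTE instead of a loop.
WITH cte AS
(
    SELECT N = 1
    UNION ALL
    SELECT N + 1 FROM cte WHERE N < 1000000
)
INSERT INTO #JBMTest01 (N)
SELECT N FROM cte
OPTION (MAXRECURSION 0); -- the default recursion limit is 100

SELECT 'Recursive CTE Duration: ' + CAST(DATEDIFF(ms,@StartTime,GETDATE()) AS VARCHAR(10)) + ' ms';
GO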

    So, yes, I absolutely agree with your point about identical execution plans providing identical performance, but only under normal circumstances. I just want folks to understand that there are places where other factors (I'm sure that I've not explored them all) may come into play and that's why I made the blanket recommendation to not trust the execution plan to make any absolute decision about performance.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (9/3/2014)


Again, Gail will already know the reason for the performance boost in the second While Loop. I'm probably not using the right terminology, but the simple explanation is that the log is only "committed" once in the second While Loop, because of the explicit transaction, instead of a million times as in the first While Loop (unless you have implicit transactions turned off, of course, and I don't recommend that).

I think you mean autocommit mode, not implicit transactions; if implicit transactions were on instead of autocommit mode, the first While Loop would start a single transaction and that transaction would never be committed (until some code you haven't included committed it).

    Tom
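(To make the distinction concrete, a small sketch using a made-up temp table: in autocommit mode each statement commits itself, while with IMPLICIT_TRANSACTIONS ON the first data-modifying statement quietly opens a transaction that stays open until you commit it yourself.)

SET NOCOUNT ON;
IF OBJECT_ID('tempdb..#AutoCommitDemo','U') IS NOT NULL DROP TABLE #AutoCommitDemo;
CREATE TABLE #AutoCommitDemo (N INT);

-- Autocommit mode (the default): the INSERT is its own transaction and commits on its own.
INSERT INTO #AutoCommitDemo (N) VALUES (1);
SELECT @@TRANCOUNT AS OpenTrans_Autocommit; -- 0

-- Implicit transactions: the INSERT opens a transaction that remains open
-- until an explicit COMMIT (or ROLLBACK).
SET IMPLICIT_TRANSACTIONS ON;
INSERT INTO #AutoCommitDemo (N) VALUES (2);
SELECT @@TRANCOUNT AS OpenTrans_Implicit; -- 1 (still open)
COMMIT;
SET IMPLICIT_TRANSACTIONS OFF;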
