SELECT INTO vs INSERT INTO on Columnstore

  • Thordog

    Ten Centuries

    Points: 1212

    SELECT INTO does place locks on system metadata, which can cause issues for users, until the SELECT INTO completes. INSERT INTO can be minimally logged by using a TABLOCK. ~Boyd 🙂

    Enhancements in SQL Server 2008

    SQL Server 2008 introduces a few important enhancements regarding minimally logged insert methods. The regular INSERT SELECT statement (as opposed to using the BULK rowset provider) can now also be handled with minimal logging. This improvement is important for two reasons. First, you can fully control the target table’s \\[schema\\] definition. Second, unless there are logical reasons for you to put the CREATE TABLE and INSERT SELECT statements in the same transaction, you can run the two in separate transactions. The CREATE TABLE statement will finish very quickly and release all locks on metadata, preventing the problem I described earlier with the SELECT INTO statement. So in SQL Server 2008, a best practice is to use CREATE TABLE with INSERT SELECT instead of SELECT INTO.

    For example, instead of

    SELECT x, y, z INTO TargetTable FROM SourceTable;

    use

    CREATE TABLE TargetTable(x …, y …, z …);

    INSERT INTO TargetTable WITH (TABLOCK) (x, y, z)

    http://sqlmag.com/t-sql/minimally-logged-inserts

    [font="Courier New"]____________________________________________________________________________________________
    Remember as you walk down lifes road, don't forget to stop and pee on the bushes - Thordog
    [/font]

  • ramya.sqlservr

    Ten Centuries

    Points: 1068

    You are right. We can use either of the approaches for small tables because it doesn't make much difference. But, for large tables, insert into is better when select into is causing space issues by filling up the database.

    We have faced this problem in our production and databases got filled up performing select into on tables with more than 100 million rows. That's the reason I wanted to write this article so that it would be of help if anyone is facing such issues.

    Thanks,

    Ramya

  • ramya.sqlservr

    Ten Centuries

    Points: 1068

    JoeS 3024 (6/9/2015)


    So for those of us who aren't DBA experts does this mean that choosing between a SELECT INTO and INSERT INTO is one of those "It Depends" choices? Using a SELECT INTO maybe OK for a small table or a DB that isn't heavily used but when you have a large DB or heavily used table take the hit and choose INSERT INTO so as to not cause potential blocking issues. Does that sound reasonable?

    Thanks

    You are right. We can use either of the approaches for small tables because it doesn't make much difference. But, for large tables, insert into is better when select into is causing space issues by filling up the database.

    We have faced this problem in our production and databases got filled up performing select into on tables with more than 100 million rows. Thats the reason I wanted to share this article so that it would be of help if anyone is facing such issues.

  • ramya.sqlservr

    Ten Centuries

    Points: 1068

    dbishop (6/9/2015)


    I'm sorry if I am being too critical, but this seemed like more of an article to provide a lot of screen shots, and not a lot of substance. SELECT INTO and then creating ANY index is almost always going to be faster than INSERT unless it is a clustered index and data is being inserted in the correct order, whether it is being done in SQL Server 2000 or 2014. SELECT INTO is minimally logged, so the log growth will always be smaller, again regardless of the version of SQL Server. These are facts that are almost always true whether dealing with a CS index, a clustered index or is just a heap.

    It was kind of sucky lduvall that the DBA didn't go into any explanation of why it is faster to use SELECT INTO. That really didn't teach you anything. But the main reason is that it is minimally logged, where INSERT logs every row being inserted into the table.

    I was focusing on tables with clustered columnstore indexes here. The name says clustered but the concept is different. If we create a clustered columnstore index on a table,it means the table is compressed (not sorted). When we perform a SELECT INTO on a table that is compressed, then SELECT INTO creates a new table and inserts the data without any index on it. It means the new table created is in uncompressed state. Suppose if a table is of size 400 GB, the compressed table will be of size 50 GB approximately. Imagine if we are using SELECT INTO and the new table is 400 GB, and if we are doing 4 such SELECT INTO's on tables parallely, then the database will be filled up very fast. This is the scenario we faced on our production and I wanted to write this article so that it will be helpful to others if they are experiencing the same.

  • Jeff Moden

    SSC Guru

    Points: 997132

    Thordog (6/9/2015)


    SELECT INTO does place locks on system metadata, which can cause issues for users, until the SELECT INTO completes. INSERT INTO can be minimally logged by using a TABLOCK. ~Boyd 🙂

    Enhancements in SQL Server 2008

    SQL Server 2008 introduces a few important enhancements regarding minimally logged insert methods. The regular INSERT SELECT statement (as opposed to using the BULK rowset provider) can now also be handled with minimal logging. This improvement is important for two reasons. First, you can fully control the target table’s \\[schema\\] definition. Second, unless there are logical reasons for you to put the CREATE TABLE and INSERT SELECT statements in the same transaction, you can run the two in separate transactions. The CREATE TABLE statement will finish very quickly and release all locks on metadata, preventing the problem I described earlier with the SELECT INTO statement. So in SQL Server 2008, a best practice is to use CREATE TABLE with INSERT SELECT instead of SELECT INTO.

    For example, instead of

    SELECT x, y, z INTO TargetTable FROM SourceTable;

    use

    CREATE TABLE TargetTable(x …, y …, z …);

    INSERT INTO TargetTable WITH (TABLOCK) (x, y, z)

    http://sqlmag.com/t-sql/minimally-logged-inserts

    The only issue I've ever seen with the schema locks that occur when using SELECT INTO is that the DBA sometimes can do what he wants in the Object Explorer. 😀 That, notwithstanding, I agree but IF and ONLY IF you wrap a SELECT INTO in an explicit transaction (or the equivalent server wide setting is turned on) will SELECT INTO cause contention with other items in TempDB.

    You also need to mention the fact that none of the minimally logging stuff works for either SELECT INTO or INSERT SELECT unless you're in some other recovery model other than the FULL recovery model. Even if you take the "safe" route of using the Bulk Logged recovery model for such an evolution, you've still screwed any point in time restore for any log file that has any minimally logged action in it.

    Also, under normal cases and to the best of my knowledge, SELECT INTO does NOT cause any unnecessary growth any more than INSERT or INSERT SELECT would cause, which is none. Using SELECT INTO or INSERT SELECT to insert a large number of rows might both cause log file growth.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden

    SSC Guru

    Points: 997132

    PatrickSimons (6/9/2015)


    I agree 100%. Especially when you SELECT INTO inside a transaction to a temporary table (##) = worst case....

    What problems have you had there? Or are you citing the age old myth that was once true and is no longer true about SELECT INTO locking up TempDB for the duration?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ramya.sqlservr

    Ten Centuries

    Points: 1068

    TheSQLGuru (6/9/2015)


    I think you missed a VERY important type of test in your examples: 1M row inserts at a time, in discreet transactions. Please try that (primarily for the insert to the table that already has the CCI on it) and report back.

    This will avoid the logs filling up in case if destination table doesn't require clustered columnstore index to be created on it.

  • ramya.sqlservr

    Ten Centuries

    Points: 1068

    Jeff Moden (6/9/2015)


    The scenario descriptions in the final chart of the article also don't seem to match the actual scenario descriptions in the text of the article. That also means (I haven't done the check) that the data in that chart might also be incorrect.

    The text of the article contains individual times taken and the chart contains total time taken for the entire operation. For example, when I was performing SELECT INTO + Create CCI , the time displayed in chart is of total time taken for both the operations. Please revisit the article to understand the entire flow.

  • Jeff Moden

    SSC Guru

    Points: 997132

    BTW, if anyone wants some official MS documentation on the subject, please see the following article.

    FIX: SELECT INTO Locking Behavior

    Also notice the line where it says that this isn't a problem since SQL Server 7. To clarify that, though, it's still going to lock things up if you use SELECT INTO inside an explicit transact where INSERT SELECT will not but anyone making that mistake probably deserves it.

    And, as I previously stated, all bets are off when using SELECT INTO against a linked server (although that's not documented and is based solely on my personal experience).

    I'll also admit that being able to do minimally logged inserts using INSERT SELECT is a wonderful change as of 2008. Just remember that nothing is minimally logged in the FULL recovery mode.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden

    SSC Guru

    Points: 997132

    ramya.sqlservr (6/9/2015)


    Jeff Moden (6/9/2015)


    The scenario descriptions in the final chart of the article also don't seem to match the actual scenario descriptions in the text of the article. That also means (I haven't done the check) that the data in that chart might also be incorrect.

    The text of the article contains individual times taken and the chart contains total time taken for the entire operation. For example, when I was performing SELECT INTO + Create CCI , the time displayed in chart is of total time taken for both the operations. Please revisit the article to understand the entire flow.

    Crud... My humble apologies. Never mind about the graphic. I missed the columns that have the "Yes" and "No" in them.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • PatrickSimons

    SSChasing Mays

    Points: 637

    Jeff Moden (6/9/2015)


    PatrickSimons (6/9/2015)


    I agree 100%. Especially when you SELECT INTO inside a transaction to a temporary table (##) = worst case....

    What problems have you had there? Or are you citing the age old myth that was once true and is no longer true about SELECT INTO locking up TempDB for the duration?

    It was an explicit transaction with SELECT INTO which blocked up other connections (they couldn't create theirself a temp. table). So I only use (if ever) SELECT INTO outside explicit transactions.


    Patrick SIMONS, MCP

  • r.e.kiser

    Grasshopper

    Points: 21

    It's my understanding that SELECT INTO locks one or more system database tables until the data completes loading, thus preventing any other user on the entire server instance from performing a DDL operation until it is over. Because of this, use of SELECT INTO on a server with many users can carry a very high cost.

    The exception might be the use of SELECT TOP 0 INTO to create an empty copy of a table. Since no data is moved, the operation completes quickly and system table locks are released almost immediately.

  • Jeff Moden

    SSC Guru

    Points: 997132

    r.e.kiser (6/10/2015)


    It's my understanding that SELECT INTO locks one or more system database tables until the data completes loading, thus preventing any other user on the entire server instance from performing a DDL operation until it is over. Because of this, use of SELECT INTO on a server with many users can carry a very high cost.

    The exception might be the use of SELECT TOP 0 INTO to create an empty copy of a table. Since no data is moved, the operation completes quickly and system table locks are released almost immediately.

    That's not entirely correct unless the SELECT INTO is 1) Inside an explicit transactions or 2) the server is setup so that everything defaults to being an explicit transaction or 3) a distributed transaction is formed (as would be across a linked server). It USED to be true but has not been true since SQL Server 6.5 SP1. Please see the article I previously provided a link for. And, yes, I've tested the hell out of this. The only "nasty" side affect that I've seen is that the Object Explorer won't update until such code is complete.

    To be sure, it does place some locks on some tables but they're incredibly short during the table creation phase of the command which lasts only as long as if you had done a SELECT INTO WHERE 1=0 type of thing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden

    SSC Guru

    Points: 997132

    PatrickSimons (6/10/2015)


    Jeff Moden (6/9/2015)


    PatrickSimons (6/9/2015)


    I agree 100%. Especially when you SELECT INTO inside a transaction to a temporary table (##) = worst case....

    What problems have you had there? Or are you citing the age old myth that was once true and is no longer true about SELECT INTO locking up TempDB for the duration?

    It was an explicit transaction with SELECT INTO which blocked up other connections (they couldn't create theirself a temp. table). So I only use (if ever) SELECT INTO outside explicit transactions.

    Ah. Now THAT I agree with 100%. Thanks, Patrick. You also need to ensure that no distributed transaction is formed or it will lockup the source table.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Alexey Voronin

    SSC Enthusiast

    Points: 109

    If anyone want to dive deeper into this subject -- try partitioning on Columnstore. It's a rare situation where you create table once and in full size. Normally you have to add data in parts over time. Here partitioning over columnstore comes in.

Viewing 15 posts - 16 through 30 (of 37 total)

You must be logged in to reply to this topic. Login to reply