SQL query is taking 9 hours to complete

  • Overview:

    Table D (20 mil records, 200 columns)

    Table S (400k records, 4 columns)

    on coalesce(D.[O_NAME_04],'') = coalesce(S.[O_NAME_04],'')

    And coalesce(D.[O_NAME_05],'') = coalesce(S.[O_NAME_05],'')

    And coalesce(D.[O_NAME_06],'') = coalesce(S.[O_NAME_06],'');

    All the above columns may contain null values

    This query is taking at least 8 - 9 hours min to produce 355753582 rows

    About the environment

    Microsoft SQL Server 2016 (SP1) (KB3182545) - 13.0.4001.0 (X64) Oct 28 2016 18:17:30 Copyright (c) Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2012 R2 Standard 6.3 <X64> (Build 9600: ) (Hypervisor)

    64GB RAM

    HD- TBs of free space

    What I have did so far:

    1. To avoid coalesce function, I have introduced 3 new columns (Big int)

    2. Created cluster index

    3. Join these 3 new columns

    still no changes in the performance. In other words it is taking more time.

     

    Any suggestions to cut down the run time?

    Thanks in advance

    RR

     

  • All those NULLs and Blanks that you're joining on?  Yeah... guess what they're doing?  MASSIVE accidental CROSS JOIN which is more politely called "many-too-many" join just on those alone.

    Second, are the name "triplets" in either of the two tables actually unique within the two tables/

    Other than that and without you posting the CREATE TABLE and CREATE INDEX statements for either table and haven't told us what you're trying to do and haven't posted the rest of the code for the offending query, there's not much else we can to do help until we have such information.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Create two temporary tables which are copies of D and S but with the additional columns of [O_NAME_04_NOT_NULL], [O_NAME_05_NOT_NULL], [O_NAME_06_NOT_NULL] , then add a clustered index on the name columns.

    e.g. for for table D:

    SELECT *,
        coalesce(D.[O_NAME_04],'') AS [O_NAME_04_NOT_NULL],
        coalesce(D.[O_NAME_05],'') AS [O_NAME_05_NOT_NULL],
        coalesce(D.[O_NAME_06],'') AS [O_NAME_06_NOT_NULL]
    INTO #D
    FROM D;

    CREATE CLUSTERED INDEX IX_#D_1 on #D([O_NAME_04_NOT_NULL], [O_NAME_05_NOT_NULL], [O_NAME_06_NOT_NULL]);

    Do the same for table S.

    Then  join these two temporary tables using the  _NOT_NULL columns:

     ON D.[O_NAME_04_NOT_NULL] = S.[O_NAME_04_NOT_NULL]
    AND D.[O_NAME_05_NOT_NULL] = S.[O_NAME_05_NOT_NULL]
    AND D.[O_NAME_06_NOT_NULL] = S.[O_NAME_06_NOT_NULL]

    Hopefully you will then get a merge join which will be about as fast as it can get.

  • I'm thinking that's still going to result in a massive accidental cross join I previously spoke of for the rows where all 3 of the join criteria are blank or null.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jonathan and Jeff for your guidance and support.

    Jonathan: Those 3 newly introduced columns contains KEYs I have created and using that for join. YES It is cluster index

    Jeff: You are right, it is a massive data explode.

    Let me explain why the data exploded. The main table (join keys + data) is the data table, and the reference table (user details + join keys) is the security table. The goal is to figure out who can view what. The output generates data sets for each user. For example if you are me are at the same level and have the same privilege to see a data set, this logic will produce a data set for you as well as for me. This will be used at the reporting level, with an AD-based filter applied.

    This was created four years ago by a consultant, however it is now being delayed owing to data growth.

    What I am doing?

    When it came to my knowledge, I noticed the below issues

    1. 3 columns used to join the two tables (Info)
    2. I found no index
    3. Join key used coalesce
    4. Those 3 columns are Nvarchar 255 data type
    5. Join Keys contains NULLs

    I introduced two intermediate tables

    1. Introduced 3 new KEY columns with not null
    2. Created cluster index

    Even after these, I wasn't able to find any performance improvement. That's a  surprise to me.

    Hope I have explained enough!

    Thanks & Regards

    RR

  • Ok... so the first and very important step is to "know thy enemy".  With that, we have to know the "gazintas" for both tables.  Run the following query one each of your two tables.  You might want to store the results of each run in a working table so that you can do additional analysis later...

     SELECT Name_04, Name_05, Name_06, Cnt = COUNT(*)
    INTO dbo.SomeWorkingTable --TODO: Change the table name here
    FROM dbo.OneOfTheTwoTables --TODO: Change the table name here
    GROUP BY Name_04, Name_05, Name_06
    ;

    Then figure out what it really is that you're after from any group of rows identified in the working table as having more than 1 row in the group.

     

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • to produce 355753582 rows

    Producing 335 mil rows cannot be fast, no matter what indexes are used in a query.

    And any suggestions regarding COALESCE replacement might be generally correct but totally useless in this case.

    You need to rethink the whole approach to the query.

    How that huge recordset is used further?

    If at the end you need to figure out the set of privileges for a specific user, then you need to build a query for a specific user, not for all of them at once.

    Then that indexing and other optimization techniques might come to play.

    _____________
    Code for TallyGenerator

  • (1) You need to verify that the 3 values are unique on at least one table; otherwise you will get the equivalent of a CROSS JOIN for all non-unique groups.  If the 3 values are not unique in one table, add $IDENTITY to the key to make it unique without SQL having to "uniquify" it.

    (2) Are you encoding the char values to convert them to a bigint?  (3)I don't see why you need a bigint rather than just an int, given that you have fewer than 60M different values.

    (4) What is the fragmentation of the tables like?

    (5) What is the tables' DDL, the query text and the query plan (estimated is fine for now)?

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Hey Jeff,

    "SELECT Name_04, Name_05, Name_06, Cnt = COUNT(*)"

    I executed the above query in the reference (security) table and the findings are as below.  In total 1271 rows in total.

    Cnt

    299

    300

    301

    302

    303

    304

    305

    306

    307

    308

    309

    311

    312

    313

    314

    315

    316

    317

    318

    319

    320

    321

    322

    323

    326

    328

    337

  • Hi Sergiy

    Thanks for your time. I think so. It a new learning for me as the Index process took more time than the non-index query 🙂

    The whole idea is to create a data set for each individual users + their access level. The output of these joins will bring an addition column as a username for each row. This username is the key value for each users to filter their entitlement to see the data.

    Thanks & Regards

    RR

     

     

  • surafeb7 wrote:

    The whole idea is to create a data set for each individual users + their access level. The output of these joins will bring an addition column as a username for each row. This username is the key value for each users to filter their entitlement to see the data.

    So, you better put those joins into a iTVF (aka "parameterized view) and invoke the function for each particular user/access privilege.

    If built correctly, such function will be lightning fast very effective.

    Building those "overall" huge recordsets is as terrible as common approach to this kind of tasks.

    _____________
    Code for TallyGenerator

  • surafeb7 wrote:

    Hey Jeff,

    "SELECT Name_04, Name_05, Name_06, Cnt = COUNT(*)"

    I executed the above query in the reference (security) table and the findings are as below.  In total 1271 rows in total.

    Cnt 299 300 301 302 303 304 305 306 307 308 309 311 312 313 314 315 316 317 318 319 320 321 322 323 326 328 337

    So, what is in that table that you need to query all 20 million rows instead of just the 1,271 rows?  Keep in mind that you've told us absolutely nothing about what is in the tables other than the 3 columns you're joining on and you've posting nothing of your query except the join criteria.  You've not even explained what exactly it is that you're looking for other than the very nebulous description of "reference (security) table" and equally nebulous requirement of "The whole idea is to create a data set for each individual users + their access level."

    The code I posted is step 1 and that is to identify the individual users.  But, like I said, everything else is a total guess because of how little you've provided.  You didn't even provide how long the query I provided took to run. :p

     

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • My apologies, COVID delayed my progress.

    Start working slowly.

     

    Regards

    RR

  • surafeb7 wrote:

    My apologies, COVID delayed my progress.

    Start working slowly.

    Regards

    RR

    You came down with COVID?  Sorry to hear and glad you're still around.  I know I sound like a Grandma but take care of yourself first and get some good sleep.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply