How to get recursive CTE at least 3000 times faster?

  • This heavily recursive CTE is the engine room of my application. It may have to work on 4 columns of up to 7 million rows. Queries currently can take over 30 minutes with just 1.2 million rows, both on my home development laptop and on my shared web host.

    To reduce the waiting time of most users, I save minimalist two-column tables that capture the results of long-running recursive CTE queries into a separate "cache" database. Thousands of these two-column tables, some containing nearly 7 million rows, accumulate in this database. Clearly, the caching is a major pain, especially since the tables must often be updated when new individuals are added to the records. Ideally, the CTE would just run a whole lot faster.

    Is there a better way to do this? This is my first CTE. What am I not getting?

    Recursive CTE:

    USE [relationship]
    GO

    /****** Object: StoredProcedure [home].[proportion] Script Date: 2/24/2016 9:51:22 PM ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO

    -- =============================================
    -- Author:      Me
    -- Create date: today
    -- Description: cte with normalized tbls
    -- =============================================
    ALTER PROCEDURE [home].[proportion]
        @id int
    AS
    BEGIN
        -- SET NOCOUNT ON added to prevent extra result sets from
        -- interfering with SELECT statements.
        SET NOCOUNT ON;

        -- Clear any previous results for this originator.
        DELETE FROM relationship.home.persistedTemp WHERE originator = @id;

        -- Walk the graph from @id downward, halving the contribution at each
        -- level. Note: percent is a reserved word, so it must be bracketed.
        WITH PctCTE (id, [percent], kind) AS
        (
            SELECT individual, CONVERT(DECIMAL(28,25), 100.0) AS [percent], model
            FROM relationship.home.relationshipRecords AS constituent
            WHERE constituent.individual = @id

            UNION ALL

            SELECT derivation.individual,
                   CONVERT(DECIMAL(28,25), constituent.[percent] / 2.0),
                   derivation.model
            FROM relationship.home.relationshipRecords AS derivation
            INNER JOIN PctCTE AS constituent
                ON (constituent.kind = 'M'  AND derivation.inputKindB = constituent.id)
                OR (constituent.kind <> 'M' AND derivation.inputKindA = constituent.id)
        ),
        mergeCTE (i, p) AS
        (
            SELECT id, SUM([percent])
            FROM PctCTE
            GROUP BY id
        )
        INSERT INTO relationship.home.persistedTemp
        SELECT @id, tmp.i, tmp.p, w.yr
        FROM mergeCTE AS tmp
        LEFT JOIN relationship.home.beginning AS w
            ON tmp.i = w.individual;

        -- Drop the self-row for the originator.
        DELETE FROM relationship.home.persistedTemp
        WHERE originator = @id AND i = @id;
    END

    Note: inputKindA may be used to create many individuals (1000s), inputKindB may be used to create fewer individuals (up to 20), and both often create no individuals.
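    For context, the procedure is run once per person and the results read back from persistedTemp; the id value here is just an example:

        EXEC relationship.home.proportion @id = 123;

        SELECT i, p, y
        FROM relationship.home.persistedTemp
        WHERE originator = 123;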

  • samuel.johnson (3/1/2016)


    ...since the tables must often be updated when new individuals are added to the records.

    How often is "often"? I ask because I have a plan.

    Also, is this for some form of MLM?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I am projecting "often" to be at least 3,000/year, and possibly up to 30,000/year. I have yet to design this update mechanism. I have trouble thinking in terms of sets, and am frankly a bit daunted by trying to construct another CTE that basically traces in the opposite direction, from child to parent (see the sketch at the end of this post). So I have put it off for quite a while now and worked on other things.

    It is not for an MLM (I had to look that one up), but I would rather discuss that angle via private messaging.
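    For reference, the child-to-parent traversal can be sketched along these lines. This is an untested sketch that assumes inputKindA and inputKindB hold the ids of the two contributing individuals; T-SQL allows more than one recursive member as long as they are combined with UNION ALL:

        DECLARE @childId int = 123; -- example value

        WITH Ancestors (id, lvl) AS
        (
            -- anchor: start at the child itself
            SELECT @childId, 0

            UNION ALL

            -- follow the inputKindA contributor upward
            SELECT r.inputKindA, a.lvl + 1
            FROM relationship.home.relationshipRecords AS r
            INNER JOIN Ancestors AS a ON r.individual = a.id
            WHERE r.inputKindA IS NOT NULL

            UNION ALL

            -- follow the inputKindB contributor upward
            SELECT r.inputKindB, a.lvl + 1
            FROM relationship.home.relationshipRecords AS r
            INNER JOIN Ancestors AS a ON r.individual = a.id
            WHERE r.inputKindB IS NOT NULL
        )
        SELECT DISTINCT id, lvl
        FROM Ancestors
        OPTION (MAXRECURSION 0); -- lift the default 100-level cap for deep trees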

  • samuel.johnson (3/1/2016)


    I am projecting "often" to be at least 3,000/year, and possibly up to 30,000/year. I have yet to design this update mechanism. I have trouble thinking in terms of sets, and am frankly a bit daunted by trying to construct another CTE that basically traces in the opposite direction, from child to parent. So I have put it off for quite a while now and worked on other things.

    It is not for an MLM (I had to look that one up); I would rather discuss that angle via private messaging, but it says that your allotment is full and won't let me message you.

    You could send me a private email. PMs are a pain.

    When you say "at least 3000/year", I'm not looking for the number of people you might modify, add, or delete (MADs)... I'm looking for how often you would make such a change. Once an hour... once a day? For example, how often are you currently updating the cache you previously identified?

    The methods I'm thinking of will update what is known as "Nested Sets" at the rate of at least a million rows in about 54 seconds. Once updated, you won't need to see hide nor hair of another recursive CTE, no matter which direction you search in, until you need to do another set of MADs.
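    For anyone unfamiliar with the technique, here is a minimal sketch of what a Nested Sets table and a descendant lookup look like; the table and column names are illustrative, not from Samuel's schema:

        -- Each node stores left/right bounds assigned by a depth-first
        -- walk of the hierarchy.
        CREATE TABLE home.HierarchySets
        (
            individual int NOT NULL PRIMARY KEY,
            lft        int NOT NULL,
            rgt        int NOT NULL
        );

        -- Every descendant of @id falls strictly inside its (lft, rgt)
        -- interval, so the lookup is a range seek with no recursion at all.
        SELECT c.individual
        FROM home.HierarchySets AS p
        INNER JOIN home.HierarchySets AS c
            ON c.lft > p.lft AND c.rgt < p.rgt
        WHERE p.individual = @id;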

    Resist the temptation to listen to someone who suggests the use of the HierarchyID datatype. If something goes haywire there, it's nearly impossible for a human to set it straight again, and it has a reputation (I've never used it, so I can't say for sure) for having performance issues.

    --Jeff Moden

  • You're right, I was unknowingly referring to MADs. I would hope to update daily. Nested Sets sound right on. In reference to my question at http://www.sqlservercentral.com/Forums/Topic1765709-373-1.aspx, how would using nested sets affect the best location to store "cache" data (or is that question no longer relevant)?

  • samuel.johnson (3/1/2016)


    You're right, I was unknowingly referring to MADs. I would hope to update daily. Nested Sets sound right on. In reference to my question at http://www.sqlservercentral.com/Forums/Topic1765709-373-1.aspx, how would using nested sets affect the best location to store "cache" data (or is that question no longer relevant)?

    You would no longer need to cache data.

    You and I talked a bit through email. This isn't a DAG (Directed Acyclic Graph) problem (like an org chart), which has one and only one parent per node. This is an ancestry problem where each node has two parents. I've not had to deal with such a thing before, and I'm fairly sure that Nested Sets isn't going to work for it.

    I've got a little studying to do on the problem to see if we can make it as fast as Nested Sets. In the meantime, we can work on trying to increase the performance of your rCTE with a little "Divide'n'Conquer" along with the correct indexes (a sketch follows below). I don't know if we can hit the 3000x mark using an rCTE, but that's a good target.
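    As a first cut at the "Divide'n'Conquer", the OR in the recursive join can be split into two recursive members so that each one can do a clean single-column index seek. An untested sketch against your tables ([percent] is renamed pct here only to sidestep the reserved word):

        WITH PctCTE (id, pct, kind) AS
        (
            SELECT individual, CONVERT(DECIMAL(28,25), 100.0), model
            FROM relationship.home.relationshipRecords
            WHERE individual = @id

            UNION ALL

            -- 'M' rows chase inputKindB only...
            SELECT d.individual, CONVERT(DECIMAL(28,25), c.pct / 2.0), d.model
            FROM relationship.home.relationshipRecords AS d
            INNER JOIN PctCTE AS c ON d.inputKindB = c.id
            WHERE c.kind = 'M'

            UNION ALL

            -- ...and non-'M' rows chase inputKindA only.
            SELECT d.individual, CONVERT(DECIMAL(28,25), c.pct / 2.0), d.model
            FROM relationship.home.relationshipRecords AS d
            INNER JOIN PctCTE AS c ON d.inputKindA = c.id
            WHERE c.kind <> 'M'
        )
        SELECT id, SUM(pct) AS pct
        FROM PctCTE
        GROUP BY id;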

    --Jeff Moden

  • For others interested in this post, Samuel had some additional information on a near-identical post. I'm posting it below to try to "keep it all in one place".


    samuel.johnson (3/1/2016)


    Performance is my overriding concern, as I have a heavily recursive CTE that is the engine room of my application. This CTE may have to work on 4 columns of up to 7 million rows. Queries can take over 30 minutes with just 1.2 million rows.

    To reduce the waiting time of most users, I save minimalist two-column tables that capture the results of these long-running recursive CTE queries into a separate "cache" database. Thousands of these two-column tables, some containing nearly 7 million rows, accumulate in this database. I have to maintain concurrency with the base data in the CTE-specific database as new records are added. At the moment I keep and maintain the same data in both databases.

    Is there a better way to do all of this? Should I just keep all the data in one database? If so, how will my long-running CTEs interact with lots of interruptions, and won't all users then have to wait longer, as I would suspect (based on what I know about queueing theory)? Clearly, the caching is a major pain, especially since the content must often be updated when new individuals are added to the records. Ideally, the CTE would just run a whole lot faster.

    Recursive CTE: identical to the procedure (and the fan-out note) posted at the top of this thread.

    The database associated with the CTE is called relationship and contains the following tables:

    myids
        record_id    char(6)        PK FK
        reg          nchar(10)      PK
        name         nvarchar(60)
        individual   int            FK U1
        main         int
        FK_myids_names
        IX_myids_1   individual (ASC) Unique
        PK_myids     record_id (ASC), reg (ASC) Unique Clustered

    names
        individual   int            PK {I}
        name         nvarchar(60)
        PK_names     individual (ASC) Unique Clustered

    relationshipRecords
        individual   int            FK
        inputKindA   int            {I}
        inputKindB   int            I2
        model        char(1)        I3
        FK_relationshipRecords_names
        IX_relationshipRecords_B    inputKindB (ASC)
        IX_relationshipRecords_mdl  model (ASC)
        IX_relationshipRecords_id   individual (ASC) Unique
        IX_relationshipRecords_A    inputKindA (ASC) Clustered

    beginning
        individual   int            FK
        yr           smallint       {I}
        color        nvarchar(50)
        FK_beginning_names
        IX_beginning_culr  color (ASC)
        IX_beginning_id    individual (ASC) Unique
        IX_beginning_yr    yr (ASC) Clustered

    persistedTemp
        originator   int
        i            int
        p            decimal(28,25)
        y            smallint
        IX_persistedTemp  originator (ASC) Clustered

    and record_log
        record_id    char(6)        PK
        version_id   char(3)
        cntry_id     char(2)
        record_nm    nvarchar(255)
        notes        nvarchar(max)
        updtd        int
        base_yr      smallint
        user_id      nvarchar(255)
        dflt_id      char(15)
        dflt_nm      nvarchar(255)
        css_ref      nvarchar(255)
        url_ref      nvarchar(255)
        bt_link      nvarchar(150)
        PK_record_log  record_id (ASC) Unique Clustered

    Finally, the database that stores the cached long-query results has the following tables:

    hinfo, which holds redundant data from the names, relationshipRecords, and beginning tables above:
        individual   int            PK U1
        reg          nvarchar(255)
        name         nvarchar(60)   U1
        kinds        nvarchar(121)
        year         smallint       {I}
        color        nvarchar(50)
        model        char(1)
        PK_hinfo      individual (ASC) Unique
        IX_hinfo      year (ASC) Clustered
        IX_nameorder  individual (ASC), name (ASC) Unique

    srch_log
        individual   int            PK FK
        last_id      int
        last_dt      datetime2(7)
        hit_cnt      int
        lapse        int
        saved        bit
        PK_srch_log   individual (ASC) Unique Clustered

    and mostly this database stores many tables of the following type:

    p000001
        i   int             PK
        p   decimal(28,25)  I1
        PK_p000001  i (ASC) Unique
        IX_p000001  p (ASC) Clustered

    I am at the beginning stages of a revision of my web application and the above structure is likely to change a bit in light of what I learned the first time around.
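    One thing worth testing based on the indexes above (this is an assumption on my part, not something measured): since inputKindA is already the clustered key of relationshipRecords, seeks on it are covered, but the inputKindB index would need INCLUDEd columns to avoid key lookups in the recursive join. A hypothetical covering index (the name is mine):

        CREATE NONCLUSTERED INDEX IX_relationshipRecords_B_cover
            ON relationship.home.relationshipRecords (inputKindB)
            INCLUDE (individual, model);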

    --Jeff Moden
