Hierarchies on Steroids #2: A Replacement for Nested Sets Calculations

Question

Hierarchies on Steroids #2: A Replacement for Nested Sets Calculations

Viewing 6 posts - 46 through 50 (of 50 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004149 More actions · Answer 1

faredoon (12/6/2016)
Hi Jeff. Excellent article and techniques.
One question: What would be the best way to use this when we expect CRUD operations on the Employee list. Take a typical app that calls, say, a stored proc that adds an employee to the list - should we re-generate everything on every little chatty call like Add/Delete or is there something more efficient.
I am looking for the technique that would best suit this use case.

Thanks for the feedback. I appreciate it.

To answer your question, which I'm sure many others have also thought about, we need to know 3 things:

1. How many rows are in the table.

2. How often will moves, adds, and deletes occur.

3. How long a duration for an "outage" of the data will the users easily tolerate.

With those 3 things in mind, let's look back at the typical run times that occurred on my little i5 laptop for the code in either of the two articles (2nd article is only slightly slower, too little to mince about).

Duration for 1,000 Nodes = 00:00:00.053

Duration for 10,000 Nodes = 00:00:00.323

Duration for 100,000 Nodes = 00:00:03.867

Duration for 1,000,000 Nodes = 00:00:54.283

[font="Arial Black"]Non-Hybrid Methods[/font]

For most applications, a 5 second delay is the absolute maximum that users seem to tolerate. Of course, they prefer "instant" returns. If you have up to about 30,000 nodes in the Adjacency List table, a full rebuild takes less than a second and you might be able to keep it simple by doing the rebuild after every change (I'd likely use the "Hybrid Methods" further below, though). Of course, if you have dozen's or hundreds of individual changes being made by many people every minute, you might want to do the rebuild something like once a minute or once every 5 minutes.

If you have more than 30,000 nodes in Adjacency List table, then the runs will start taking longer than a second and no longer be perceived as "instantaneous". Synchronous runs may no longer be the way to go and you might want to make sure that even for a million row run of nearly a minute in duration, that the current data is always available except for a couple of milliseconds. To do that, you might want to "flop" two tables using a synonym where one final hierarchy table is online while the other is being rebuilt. Once the rebuild is complete, simply drop and rebuild the synonym to point at the newly rebuilt table. The other table will now become available for the next rebuild.

Of course, once you get over the 100,000 node mark, you might want to advise the customers that their changes are being processed and the updated hierarchy will be available in a minute or so.

[font="Arial Black"]Hybrid Methods[/font]

The Non-Hybrid Methods only consider the ultimate goal of displaying the Hierarchy using the Nested Sets. One of the reasons why I recommend keeping both the Adjacency List and the Nested Sets is because they both have advantages that the other doesn't have.

With that in mind, you could create a bit of a hybrid system. Users are making changes to the Adjacency List because it's very quick (virtually "instantaneous") and relatively easy to do so. The Nested Sets are used for very high speed reporting but take comparatively long to rebuild (even traditional methods would need to update all nodes below and to the "right" of a node that has been added, deleted, or moved.

A hybrid system could easily be made where a traditional rCTE (Recursive CTE) could be used to nearly instantly provide the feedback of "Your change has been accepted and your downline tree now looks like the following. Your changes will be available for reporting purposes before HH:MM." Substitute the scheduled time + a minute for the next rebuild run and change the word "downline" to "department", "parts manifest", or whatever is appropriate.

The time estimate will almost always be predictable because the number of changes have no bearing on the time it takes to rebuild the Nested Sets hierarchy. Only the number of nodes in the Adjacency List has any bearing on the duration. If you kept track of how long the rebuilds take, you could take the maximum duration for the day and add that to the next scheduled run time and likely not disappoint any customers.

Of course, the two-table-synonym-flop would still come into play on larger hierarchies. Even for small node counts, I'd still do that so I wouldn't have to do a panic change when the number of nodes increases and reaches the point of being slow as perceived by the other customers waiting to do some reporting or changes.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

faredoon SSC Veteran Points: 222 More actions · Answer 2

Jeff Moden (12/7/2016)
faredoon (12/6/2016)
Hi Jeff. Excellent article and techniques.
One question: What would be the best way to use this when we expect CRUD operations on the Employee list. Take a typical app that calls, say, a stored proc that adds an employee to the list - should we re-generate everything on every little chatty call like Add/Delete or is there something more efficient.
I am looking for the technique that would best suit this use case.
Thanks for the feedback. I appreciate it.
To answer your question, which I'm sure many others have also thought about, we need to know 3 things:
1. How many rows are in the table.
2. How often will moves, adds, and deletes occur.
3. How long a duration for an "outage" of the data will the users easily tolerate.
With those 3 things in mind, let's look back at the typical run times that occurred on my little i5 laptop for the code in either of the two articles (2nd article is only slightly slower, too little to mince about).
Duration for 1,000 Nodes = 00:00:00.053
Duration for 10,000 Nodes = 00:00:00.323
Duration for 100,000 Nodes = 00:00:03.867
Duration for 1,000,000 Nodes = 00:00:54.283
[font="Arial Black"]Non-Hybrid Methods[/font]
For most applications, a 5 second delay is the absolute maximum that users seem to tolerate. Of course, they prefer "instant" returns. If you have up to about 30,000 nodes in the Adjacency List table, a full rebuild takes less than a second and you might be able to keep it simple by doing the rebuild after every change (I'd likely use the "Hybrid Methods" further below, though). Of course, if you have dozen's or hundreds of individual changes being made by many people every minute, you might want to do the rebuild something like once a minute or once every 5 minutes.
If you have more than 30,000 nodes in Adjacency List table, then the runs will start taking longer than a second and no longer be perceived as "instantaneous". Synchronous runs may no longer be the way to go and you might want to make sure that even for a million row run of nearly a minute in duration, that the current data is always available except for a couple of milliseconds. To do that, you might want to "flop" two tables using a synonym where one final hierarchy table is online while the other is being rebuilt. Once the rebuild is complete, simply drop and rebuild the synonym to point at the newly rebuilt table. The other table will now become available for the next rebuild.
Of course, once you get over the 100,000 node mark, you might want to advise the customers that their changes are being processed and the updated hierarchy will be available in a minute or so.
[font="Arial Black"]Hybrid Methods[/font]
The Non-Hybrid Methods only consider the ultimate goal of displaying the Hierarchy using the Nested Sets. One of the reasons why I recommend keeping both the Adjacency List and the Nested Sets is because they both have advantages that the other doesn't have.
With that in mind, you could create a bit of a hybrid system. Users are making changes to the Adjacency List because it's very quick (virtually "instantaneous") and relatively easy to do so. The Nested Sets are used for very high speed reporting but take comparatively long to rebuild (even traditional methods would need to update all nodes below and to the "right" of a node that has been added, deleted, or moved.
A hybrid system could easily be made where a traditional rCTE (Recursive CTE) could be used to nearly instantly provide the feedback of "Your change has been accepted and your downline tree now looks like the following. Your changes will be available for reporting purposes before HH:MM." Substitute the scheduled time + a minute for the next rebuild run and change the word "downline" to "department", "parts manifest", or whatever is appropriate.
The time estimate will almost always be predictable because the number of changes have no bearing on the time it takes to rebuild the Nested Sets hierarchy. Only the number of nodes in the Adjacency List has any bearing on the duration. If you kept track of how long the rebuilds take, you could take the maximum duration for the day and add that to the next scheduled run time and likely not disappoint any customers.
Of course, the two-table-synonym-flop would still come into play on larger hierarchies. Even for small node counts, I'd still do that so I wouldn't have to do a panic change when the number of nodes increases and reaches the point of being slow as perceived by the other customers waiting to do some reporting or changes.

Thanks very much for the reply. Much appreciated.

I was thinking more along the lines of a "live" app that constantly displays the hierarchical tree nodes and allowing live edits. It's a situation where the reporting is live rather than static and the feedback is more-or-less instantaneous. I will play around with the hybrid approach and see if it works in this case.

Jeff Moden SSC Guru Points: 1004149 More actions · Answer 3

faredoon (12/7/2016)
I was thinking more along the lines of a "live" app that constantly displays the hierarchical tree nodes and allowing live edits. It's a situation where the reporting is live rather than static and the feedback is more-or-less instantaneous. I will play around with the hybrid approach and see if it works in this case.

Let us know how it works out either way. BTW... how many nodes in the hierarchy and what are you expecting for an hourly modification rate?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

faredoon SSC Veteran Points: 222 More actions · Answer 4

Jeff Moden (12/7/2016)
faredoon (12/7/2016)
I was thinking more along the lines of a "live" app that constantly displays the hierarchical tree nodes and allowing live edits. It's a situation where the reporting is live rather than static and the feedback is more-or-less instantaneous. I will play around with the hybrid approach and see if it works in this case.
Let us know how it works out either way. BTW... how many nodes in the hierarchy and what are you expecting for an hourly modification rate?

Thanks Jeff. Currently, we do not know the stats. Since it's an employee list, we expect a Todo list style app where a user can play around quite a lot. So, it may not involve heavy hourly changes, but it may see a lot of tree operations (i.e. an entire sub-tree in the hierarchy moved somewhere else in the tree, or new Sales figures updated). Currently the node count is very low (< 10K). But, going forward, we would like to know how to scale it).

Jeff Moden SSC Guru Points: 1004149 More actions · Answer 5

faredoon (12/8/2016)
Jeff Moden (12/7/2016)
faredoon (12/7/2016)
I was thinking more along the lines of a "live" app that constantly displays the hierarchical tree nodes and allowing live edits. It's a situation where the reporting is live rather than static and the feedback is more-or-less instantaneous. I will play around with the hybrid approach and see if it works in this case.
Let us know how it works out either way. BTW... how many nodes in the hierarchy and what are you expecting for an hourly modification rate?
Thanks Jeff. Currently, we do not know the stats. Since it's an employee list, we expect a Todo list style app where a user can play around quite a lot. So, it may not involve heavy hourly changes, but it may see a lot of tree operations (i.e. an entire sub-tree in the hierarchy moved somewhere else in the tree, or new Sales figures updated). Currently the node count is very low (< 10K). But, going forward, we would like to know how to scale it).

I don't believe you'll have a problem with scale. If you use the "hybrid" method and the synonym "flop" that we previously talked about, even a million rows shouldn't be a problem unless someone goes crazy on using the rCTE method on the root node. 10K rows are still in the 300ms range.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

faredoon SSC Veteran Points: 222 More actions · Answer 6

Jeff Moden (12/8/2016)
faredoon (12/8/2016)
Jeff Moden (12/7/2016)
faredoon (12/7/2016)
I was thinking more along the lines of a "live" app that constantly displays the hierarchical tree nodes and allowing live edits. It's a situation where the reporting is live rather than static and the feedback is more-or-less instantaneous. I will play around with the hybrid approach and see if it works in this case.
Let us know how it works out either way. BTW... how many nodes in the hierarchy and what are you expecting for an hourly modification rate?
Thanks Jeff. Currently, we do not know the stats. Since it's an employee list, we expect a Todo list style app where a user can play around quite a lot. So, it may not involve heavy hourly changes, but it may see a lot of tree operations (i.e. an entire sub-tree in the hierarchy moved somewhere else in the tree, or new Sales figures updated). Currently the node count is very low (< 10K). But, going forward, we would like to know how to scale it).
I don't believe you'll have a problem with scale. If you use the "hybrid" method and the synonym "flop" that we previously talked about, even a million rows shouldn't be a problem unless someone goes crazy on using the rCTE method on the root node. 10K rows are still in the 300ms range.

That's correct Jeff. Thanks again. I will try both Hybrid and non-hybrid and see how it goes.