Hi Jeff. Excellent article and techniques.
One question: What would be the best way to use this when we expect CRUD operations on the Employee list? Take a typical app that calls, say, a stored proc that adds an employee to the list - should we regenerate everything on every little chatty call like Add/Delete, or is there something more efficient?
I am looking for the technique that would best suit this use case.
Thanks for the feedback. I appreciate it.
To answer your question, which I'm sure many others have also thought about, we need to know 3 things:
1. How many rows are in the table.
2. How often will moves, adds, and deletes occur.
3. How long an "outage" of the data the users will easily tolerate.
With those 3 things in mind, let's look back at the typical run times that occurred on my little i5 laptop for the code in either of the two articles (the 2nd article is only slightly slower - too little to quibble about).
Duration for 1,000 Nodes = 00:00:00.053
Duration for 10,000 Nodes = 00:00:00.323
Duration for 100,000 Nodes = 00:00:03.867
Duration for 1,000,000 Nodes = 00:00:54.283
[font="Arial Black"]Non-Hybrid Methods[/font]
For most applications, a 5 second delay is the absolute maximum that users seem to tolerate. Of course, they prefer "instant" returns. If you have up to about 30,000 nodes in the Adjacency List table, a full rebuild takes less than a second and you might be able to keep it simple by doing the rebuild after every change (I'd likely use the "Hybrid Methods" further below, though). Of course, if you have dozens or hundreds of individual changes being made by many people every minute, you might want to do the rebuild something like once a minute or once every 5 minutes.
If you have more than 30,000 nodes in the Adjacency List table, the runs will start taking longer than a second and will no longer be perceived as "instantaneous". Synchronous runs may no longer be the way to go, and you might want to make sure that even for a million-row run of nearly a minute in duration, the current data is always available except for a couple of milliseconds. To do that, you might want to "flop" two tables using a synonym, where one final hierarchy table is online while the other is being rebuilt. Once the rebuild is complete, simply drop and rebuild the synonym to point at the newly rebuilt table. The other table will then become available for the next rebuild.
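The "flop" can be sketched in T-SQL roughly as follows. This is only a sketch: the table names (dbo.Hierarchy_A, dbo.Hierarchy_B), the synonym name (dbo.Hierarchy), and the rebuild proc are all illustrative, not from the articles.

```sql
-- Two physical tables, one synonym. Readers always query dbo.Hierarchy;
-- the rebuild happens in whichever table the synonym is NOT pointing at.
DECLARE @Standby SYSNAME
       ,@sql     NVARCHAR(MAX)
;
--===== Find the table the synonym does NOT currently point at.
 SELECT @Standby = CASE base_object_name
                        WHEN '[dbo].[Hierarchy_A]' THEN N'dbo.Hierarchy_B'
                        ELSE N'dbo.Hierarchy_A'
                   END
   FROM sys.synonyms
  WHERE name = N'Hierarchy'
;
--===== Rebuild the Nested Sets into the standby table (proc name is hypothetical).
-- EXEC dbo.RebuildNestedSets @TargetTable = @Standby;

--===== "Flop" the synonym. The outage is only the couple of milliseconds
     -- it takes to drop and recreate the synonym.
BEGIN TRANSACTION;
   DROP SYNONYM dbo.Hierarchy;
 SELECT @sql = N'CREATE SYNONYM dbo.Hierarchy FOR ' + @Standby + N';';
   EXEC (@sql); -- CREATE SYNONYM can't take a variable directly, hence dynamic SQL.
COMMIT;
```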
Of course, once you get over the 100,000 node mark, you might want to advise the customers that their changes are being processed and the updated hierarchy will be available in a minute or so.
[font="Arial Black"]Hybrid Methods[/font]
The Non-Hybrid Methods only consider the ultimate goal of displaying the Hierarchy using the Nested Sets. One of the reasons why I recommend keeping both the Adjacency List and the Nested Sets is because they both have advantages that the other doesn't have.
With that in mind, you could create a bit of a hybrid system. Users make their changes to the Adjacency List because it's very quick (virtually "instantaneous") and relatively easy to do. The Nested Sets are used for very high speed reporting but take comparatively long to rebuild (even traditional methods would need to update all nodes below and to the "right" of a node that has been added, deleted, or moved).
A hybrid system could easily be made where a traditional rCTE (Recursive CTE) could be used to nearly instantly provide the feedback of "Your change has been accepted and your downline tree now looks like the following. Your changes will be available for reporting purposes before HH:MM." Substitute the scheduled time + a minute for the next rebuild run and change the word "downline" to "department", "parts manifest", or whatever is appropriate.
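For the "instant feedback" piece, an rCTE against the Adjacency List is all that's needed. A minimal sketch, assuming a hypothetical dbo.Employee table with EmployeeID, ManagerID, and EmployeeName columns:

```sql
--===== Show a node's "downline" straight from the Adjacency List,
     -- right after the change has been accepted.
DECLARE @StartID INT = 42; -- the node that was just added or moved (example value)

   WITH cteDownline AS
(
 SELECT EmployeeID, ManagerID, EmployeeName, HLevel = 1
   FROM dbo.Employee
  WHERE EmployeeID = @StartID
  UNION ALL
 SELECT e.EmployeeID, e.ManagerID, e.EmployeeName, d.HLevel + 1
   FROM dbo.Employee e
   JOIN cteDownline d ON e.ManagerID = d.EmployeeID
)
 SELECT EmployeeID
       ,ManagerID
       ,IndentedName = REPLICATE('  ', HLevel - 1) + EmployeeName
   FROM cteDownline
;
```

Because the rCTE only touches the node's own subtree, it stays fast no matter how large the full hierarchy is.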
The time estimate will almost always be predictable because the number of changes has no bearing on the time it takes to rebuild the Nested Sets hierarchy. Only the number of nodes in the Adjacency List has any bearing on the duration. If you keep track of how long the rebuilds take, you can take the maximum duration for the day, add that to the next scheduled run time, and likely not disappoint any customers.
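Tracking those durations takes nothing more than a small log table. A sketch, with hypothetical names throughout:

```sql
--===== Log one row per rebuild.
 CREATE TABLE dbo.RebuildLog
        (
         RebuildLogID INT IDENTITY(1,1) PRIMARY KEY
        ,StartDT      DATETIME2(3) NOT NULL
        ,EndDT        DATETIME2(3) NOT NULL
        ,NodeCount    INT          NOT NULL
        )
;
--===== Estimated availability = next scheduled run
     -- + today's worst rebuild duration + a 1 minute cushion.
DECLARE @NextScheduledRun DATETIME2(3) = DATEADD(mi, 5, SYSDATETIME()); -- example schedule

 SELECT EstimatedAvailableDT =
        DATEADD(mi, 1,
        DATEADD(ms, MAX(DATEDIFF(ms, StartDT, EndDT)), @NextScheduledRun))
   FROM dbo.RebuildLog
  WHERE StartDT >= CAST(GETDATE() AS DATE)
;
```

That's the value you'd substitute into the "available before HH:MM" message.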
Of course, the two-table-synonym-flop would still come into play on larger hierarchies. Even for small node counts, I'd still do that so I wouldn't have to do a panic change when the number of nodes increases and reaches the point of being slow as perceived by the other customers waiting to do some reporting or changes.