SQLServerCentral Article

Horizontal Trees

,

Introduction

The sp_HorizontalTree procedure introduced here may be used to audit tables with dependent columns. A column is dependent on another if the meaning of the first column depends on the meaning of the second. For example, in a table containing accounting information, the meaning of a Fiscal Quarter column depends on the Fiscal Year to which it belongs.

When manually building tables with dependent columns, you may wish to audit the values that one column has for a given value of another. For example, does every Fiscal Year in your table have four Fiscal Quarters? Do they have the same number of dates? Which Accounting Periods span multiple Fiscal Quarters? Which ones have the largest number of dates? What are those dates?

This proc provides a table summarizing the dependencies between selected columns on any table, so questions like the above can be easily answered (often by just looking at the summary). This summary table makes no reference to the column names of the original table, so any SQL written for it won’t use them either. Because of that, SQL written for the summary of one table can be applied to the summary of any other table. It will know about column values, however, so any SQL that references them would only be meaningful when comparing tables whose selected columns have similar content.

The program works by building a tree with the observed values of the first column sitting below the top node as children, and below each of them are the observed values of the second column, and below each of them are the observed values of the third column, and so on. At each depth of the tree, the observed values may repeat but below any parent, they are always distinct.

This solution is reminiscent of the one developed in a previous article [1] where a table with a parent relationship between records is viewed as a tree whose node attributes numChildren, numDescendants, numLeaves, Height, Depth allows you to easily build sophisticated queries on the table. For such tables, one migrates from node to node by scrolling up and down the table (vertical trees). Here, however, a tree is dynamically built from the table itself using its dependent columns across the top (horizontal trees).

The following tree represents the four columns FY, FQ, AP and DT of a simple accounting table with dummy data (the proc doesn’t care much about data types):

Here, three values are observed for FY (I, J, K), four values for FQ (A, B, C, D), six values for AP (U, V, W, X, Y, Z) and five values for DT (1,2,3,4,5). Below each parent the names of its children (ie. the values of their common column) are always distinct, by way of how they’re defined, but they may overlap with the values of another parent at the same depth (or even different depths).

The tree’s underlying table has three columns: NodeId (primary key of the table), Node (name of the node) and ParentId (foreign key to a record’s parent). Its primary key values are denoted in red in the above diagram.

What does this tree mean?

One way to explain it is the following: the top node represents the set of all records in the data set. The nodes directly below it represent the various (necessarily disjoint and non-empty) subsets of those records with a specific value on the first column. So, node 2 is the set of records whose FY value is I, node 3 is the set of records whose FY value is J and so on. The nodes below them are defined in the same way. So, node 5 is the set of records of node 2 whose FQ value is A, etc. Note that the children of any node always partition that node’s set of records, and for a given depth they partition the entire data set.

Another way of viewing the tree is to recognize that each path from the top node to one of its leaves represents all the records whose column values agree with the node names (or values) on that path.

For example, the left-most path represents all the records in the data set whose column values for FY, FQ, AP, DT are I, A, U, 1 (respectively). If the column of the leaf nodes (DT) is a primary key of the table from which it was defined, every path represents a unique record (but it doesn’t have to be that way in general). The arrows represent the subset relation of a set of records, but they can also be viewed in a different way.

For example, there are two children of node 5 which means that for any record in the data set whose first two columns are I and A, the third column must be U or V. Furthermore, its child U has three children while V has only one. This may be of interest to you when examining how the value of one column (with its ancestors) affects the possible values of the next one. In other words, the tree describes the “shape” of the data.

The sp_HorizontalTree proc builds the tree’s underlying table before passing it to the sp_VerifyTree proc [1], which verifies that it’s a tree and computes its node attributes numChildren, numDescendants, numLeaves, Height, Depth. These attributes may be employed to study the shape of the data using simple SQL (which would be harder to write without these pre-computed values).

For example, the following query lists the Accounting Periods spanning multiple Fiscal Quarters for any accounting table:

SELECT Node FROM ##HorizTree WHERE Depth = 3
GROUP BY NODE
HAVING COUNT(*) > 1

The ##HorizTree table is the proc’s global temporary table defining the tree. Its name has a random suffix generated by the proc at run-time to avoid clashing with other sessions (but is not generally shown here for simplicity).

The result for the accounting table in the upcoming sample audit is:

2019-04
2019-07
2019-10
2020-04
2020-07
2020-10
2021-04
2021-07
2021-10

In fact, this query can be used on the horizontal tree of any table since it knows nothing about column names. It simply lists the names of those nodes of depth = 3 that occur more than once. So, it can be applied to any accounting table, regardless of its column names. In this manner a library of auditing queries for all future tables may be built. For non-accounting tables this query may, or may not, be of interest.

Sample audit

The D_DATE dimension table that’s commonly found in a database warehouse (but with more columns) will now be audited. A script to generate this table can be found in the resource section.

First, run the proc to generate its tree. You’ll need to specify the server, database, schema, table and list of dependent columns ordered by dependency. That is, the second column is dependent on the first, the third is dependent on the second, and so on. As well, another parameter (@FirstNodes) lists the values of the first column that will filter the data set before processing begins. The proc doesn’t really care about column data types, but the first column must be character based. Extensive error checking ensures that the parameters and incoming data make sense.

The following invocation of the proc examines Fiscal_Year, Fiscal_Yr_Qtr, Accounting_Period, Date_Dim_Key of D_DATE:

DECLARE @ERROR_NUMBER   INT
DECLARE @ERROR_MESSAGE  VARCHAR(MAX)
DECLARE @ROW_COUNT      INT
EXEC @ERROR_NUMBER      =
[dbo].[sp_HorizontalTree]
@Server           = (Your Server)
,@Database        = (Your database)
,@Schema          = 'dbo'
,@Table           = 'D_DATE'
,@Columns         = 'Fiscal_Year, Fiscal_Yr_Qtr, Accounting_Period, Date_Dim_Key'
,@FirstNodes      = '2019/2020,2020/2021, 2021/2022'
,@DEBUG           = 0  -- Set to 1 to see global temporary tables
,@ERROR_NUMBER    = @ERROR_NUMBER OUTPUT
,@ERROR_MESSAGE   = @ERROR_MESSAGE OUTPUT
,@ROW_COUNT       = @ROW_COUNT OUTPUT
SELECT @ERROR_NUMBER AS ERROR_NUMBER, @ERROR_MESSAGE AS ERROR_MESSAGE, @ROW_COUNT AS ROW_COUNT -- Optional

The output is the following tree:

NodeId Node       ParentId NumChildren NumDescendants NumLeaves Height Depth
1      Top_Node   NULL     3           1159           1096      4      0
2      2019/2020  1        4           386            366       3      1
3      2020/2021  1        4           385            365       3      1
4      2021/2022  1        4           385            365       3      1
5      2020 Q2    3        4           96             92        2      2
6      2021 Q3    4        4           96             92        2      2
7      2020 Q3    3        4           96             92        2      2
8      2021 Q1    4        4           95             91        2      2
9      2019 Q2    2        4           96             92        2      2
10     2020 Q4    3        4           94             90        2      2
11     2019 Q4    2        4           95             91        2      2
12     2021 Q2    4        4           96             92        2      2
13     2021 Q4    4        4           94             90        2      2
14     2020 Q1    3        4           95             91        2      2
15     2019 Q1    2        4           95             91        2      2
16     2019 Q3    2        4           96             92        2      2
17     2019-01    15       32          32             32        1      3
18     2019-02    15       28          28             28        1      3
19     2019-03    15       28          28             28        1      3
20     2019-04    9        25          25             25        1      3
21     2019-04    15       3           3              3         1      3
22     2019-05    9        28          28             28        1      3
23     2019-06    9        28          28             28        1      3
24     2019-07    9        11          11             11        1      3
25     2019-07    16       17          17             17        1      3
26     2019-08    16       28          28             28        1      3
27     2019-09    16       28          28             28        1      3
28     2019-10    11       9           9              9         1      3
29     2019-10    16       19          19             19        1      3
30     2019-11    11       28          28             28        1      3
31     2019-12    11       28          28             28        1      3
32     2019-13    11       26          26             26        1      3
33     2020-01    14       30          30             30        1      3
34     2020-02    14       28          28             28        1      3
35     2020-03    14       28          28             28        1      3
36     2020-04    5        23          23             23        1      3
37     2020-04    14       5           5              5         1      3
38     2020-05    5        28          28             28        1      3
39     2020-06    5        28          28             28        1      3
40     2020-07    5        13          13             13        1      3
41     2020-07    7        15          15             15        1      3
42     2020-08    7        28          28             28        1      3
43     2020-09    7        28          28             28        1      3
44     2020-10    7        21          21             21        1      3
45     2020-10    10       7           7              7         1      3
46     2020-11    10       28          28             28        1      3
47     2020-12    10       28          28             28        1      3
48     2020-13    10       27          27             27        1      3
49     2021-01    8        29          29             29        1      3
50     2021-02    8        28          28             28        1      3
51     2021-03    8        28          28             28        1      3
52     2021-04    8        6           6              6         1      3
53     2021-04    12       22          22             22        1      3
54     2021-05    12       28          28             28        1      3
55     2021-06    12       28          28             28        1      3
56     2021-07    6        14          14             14        1      3
57     2021-07    12       14          14             14        1      3
58     2021-08    6        28          28             28        1      3
59     2021-09    6        28          28             28        1      3
60     2021-10    6        22          22             22        1      3
61     2021-10    13       6           6              6         1      3
62     2021-11    13       28          28             8         1      3
63     2021-12    13       28          28             28        1      3
64     2021-13    13       28          28             28        1      3
65     20190401   17       0           0              1         0      4
66     20190402   17       0           0              1         0      4
etc.
etc.
ERROR_NUMBER ERROR_MESSAGE ROW_COUNT
0                          1160

Note that the SELECT @ERROR_NUMBER query in the proc’s invocation is optional, but without it you will receive no error messages when running the proc. At the very least, check that @ERROR_NUMBER is 0 before proceeding further in your calling application. Errors in the parameter values, for example, will prevent any tree building but the SELECT @ERROR_NUMBER query will tell you why. ROW_COUNT is the number of nodes in the tree.

Now that you have your tree, you can begin to audit its underlying table for accuracy.

Assuming you don’t know anything about leap years, the first thing you might ask is: why does node '2019/2020' have 366 days? How did that happen in the model you’re auditing? Are duplicate dates present?

If you ran the proc with @DEBUG = 1, you may look at the node’s children: SELECT * FROM ##HorizTree WHERE ParentId = 2 and see that there are four Fiscal Quarters (one of which must contain the extra date):

NodeId Node      ParentId NumChildren NumDescendants NumLeaves Height Depth
9      2019 Q2   2        4           96             92        2      2
11     2019 Q4   2        4           95             91        2      2
15     2019 Q1   2        4           95             91        2      2
16     2019 Q3   2        4           96             92        2      2

Looking at their children: SELECT * FROM ##HorizTree WHERE ParentId IN (9, 11, 15, 16) you see that there are 16 Accounting Periods (one of which must contain the extra date):

NodeId Node     ParentId NumChildren NumDescendants NumLeaves Height Depth
17     2019-01  15       32          32             32        1      3
18     2019-02  15       28          28             28        1      3
19     2019-03  15       28          28             28        1      3
20     2019-04  9        25          25             25        1      3
21     2019-04  15       3           3              3         1      3
22     2019-05  9        28          28             28        1      3
23     2019-06  9        28          28             28        1      3
24     2019-07  9        11          11             11        1      3
25     2019-07  16       17          17             17        1      3
26     2019-08  16       28          28             28        1      3
27     2019-09  16       28          28             28        1      3
28     2019-10  11       9           9              9         1      3
29     2019-10  16       19          19             19        1      3
30     2019-11  11       28          28             28        1      3
31     2019-12  11       28          28             28        1      3
32     2019-13  11       26          26             26        1      3

Browsing through their children SELECT * FROM ##HorizTree WHERE ParentId IN(17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32) you find the date you never expected:

NodeId Node      ParentId NumChildren NumDescendants NumLeaves Height Depth
399    20200229  31       0           0              1         0      4

You’ve just discovered leap years. You’ll probably want to write some SQL to audit any future accounting tables to ensure that leap years are included for every Fiscal Year divisible by 4. Of course, your SQL will need to assume that the nodes directly below the top node represent Fiscal Year so it can perform the necessary computations (recall that the column names are unknown). And your SQL will be restricted to tables whose first column, whatever it’s called, looks like Fiscal Year.

The actual name of the global temporary table holding the tree (for a given session) may be found in the SELECT @ERROR_NUMBER query or the messages tab of SSMS whenever @DEBUG = 1 (eg. ##HorizTree51407). The tree’s name ends in a 5-character suffix (eg. 51407) which, like all the global temporary tables created by the proc, is randomly generated when the proc begins (to avoid clashing with other sessions). Therefore, you can run the proc simultaneously in another pane of SSMS or write SQL on the global temporary tables (including the source table). But note that when @DEBUG = 0 all such tables are dropped when the proc concludes. In any event, closing the pane from which the proc was run (in either mode) will drop all global temporary tables that the proc created (such as the temporary lookup tables for the columns in the original table).

Special feature of horizontal trees

Because horizontal trees are built from data sets, some interesting attributes may be added to the tree. For example, each path in the tree represents the set of records in the data set whose column values agree with the nodes on that path. An attribute related to this would be numRecords, which counts the number of records passing through each node. That way, one could compute the probability of realizing the possible values of any parent’s children. In other words, determine which children are more favored to appear in the data set (given their genealogy). The computation for this would be nearly identical to that which defines numDescendants (simply compute the volume of records for each leaf and work upwards).

How does the proc work?

First, it copies the table @Table into a global temporary table, ##Table, upon which filtering occurs. It then builds lookup global temporary tables for the observed values of each column. After that it builds a global temporary table, ##TempStage, containing just the columns appearing in the @Columns parameter, along with lookup columns for each of them that reference the column lookup tables. From here it builds the tree in the table, ##HorizTree, one column at a time.

Set @DEBUG = 1 to view these tables as the tree is being built, where these views include a new column (Column 0, Column1, etc.) displaying the name of the lookup table.

For the above example we have these results. The lookup table for top node is:

Column 0          NodeId        Node
##TopNode         1             TOP

The lookup table for first column:

Column 1          NodeId        Node
##Fiscal_Year     2             2019/2020
##Fiscal_Year     3             2020/2021
##Fiscal_Year     4             2021/2022

At this point the first four nodes of the tree may be created using the original NodeId values, where the top node is the parent:

NodeId  Node          ParentId
1       Top_Node      NULL
2       2019/2020     1
3       2020/2021     1
4       2021/2022     1

But the lookup tables for the remaining columns have NodeId starting at 1, which will clash with the nodes already inserted. The script has no idea what these values should be until the preceding insertions have occurred. That’s where ##TempTable comes in. Assigning the start value for its NodeId as one more than the current maximum NodeId in the tree, the current column is inserted into a fresh copy of ##TempTable without its NodeId column. Those NodeId values are automatically computed by the IDENTITY specification of ##TempTable before being inserted into the tree. After the insertion, the lookup values for the current column are updated in ##TempStage using the tree’s newly created key values, so the foreign key of the next column will be correct when it’s processed. The key point here is that the foreign key for any column is the primary key of the column preceding it.

Lookup table for second column:

Column 2          NodeId        Node
##Fiscal_Yr_Qtr   1             2021 Q2
##Fiscal_Yr_Qtr   2             2020 Q2
##Fiscal_Yr_Qtr   3             2021 Q3
##Fiscal_Yr_Qtr   4             2021 Q4
##Fiscal_Yr_Qtr   5             2021 Q1
##Fiscal_Yr_Qtr   6             2019 Q3
##Fiscal_Yr_Qtr   7             2019 Q4
##Fiscal_Yr_Qtr   8             2019 Q2
##Fiscal_Yr_Qtr   9             2019 Q1
##Fiscal_Yr_Qtr   10            2020 Q3
##Fiscal_Yr_Qtr   11            2020 Q1
##Fiscal_Yr_Qtr   12            2020 Q4

Lookup table for third column:

Column 3             NodeId     Node
##Accounting_Period  1          2019-01
##Accounting_Period  2          2019-02
##Accounting_Period  3          2019-03
##Accounting_Period  4          2019-04
##Accounting_Period  5          2019-05
##Accounting_Period  6          2019-06
##Accounting_Period  7          2019-07
##Accounting_Period  8          2019-08
##Accounting_Period  9          2019-09
##Accounting_Period  10         2019-10
##Accounting_Period  11         2019-11
##Accounting_Period  12         2019-12
##Accounting_Period  13         2019-13
##Accounting_Period  14         2020-01
##Accounting_Period  15         2020-02
etc.
etc

Lookup table for fourth column:

Column 4              NodeId    Node
##Date_Dim_Key        1         20190401
##Date_Dim_Key        2         20190402
##Date_Dim_Key        3         20190403
##Date_Dim_Key        4         20190404
##Date_Dim_Key        5         20190405
##Date_Dim_Key        6         20190406
##Date_Dim_Key        7         20190407
##Date_Dim_Key        8         20190408
##Date_Dim_Key        9         20190409
##Date_Dim_Key        10        20190410
##Date_Dim_Key        11        20190411
##Date_Dim_Key        12        20190412
etc.
etc.

##TempStage before initial lookup values are computed:

I Top_Node  Top_Node_Id Fiscal_Year Fiscal_Year_Id Fiscal_Yr_Qtr Fiscal_Yr_Qtr_Id  etc.
1 NULL      NULL        2019/2020   NULL           2019 Q1       NULL 
2 NULL      NULL        2019/2020   NULL           2019 Q1       NULL 
3 NULL      NULL        2019/2020   NULL           2019 Q1       NULL 
4 NULL      NULL        2019/2020   NULL           2019 Q1       NULL 
5 NULL      NULL        2019/2020   NULL           2019 Q1       NULL 
6 NULL      NULL        2019/2020   NULL           2019 Q1       NULL 
7 NULL      NULL        2019/2020   NULL           2019 Q1       NULL
8 NULL      NULL        2019/2020   NULL           2019 Q1       NULL
9 NULL      NULL        2019/2020   NULL           2019 Q1       NULL
10 NULL     NULL        2019/2020   NULL           2019 Q1       NULL 
etc.
etc.

##TempStage after initial lookup values are computed:

I Top_Node  Top_Node_Id Fiscal_Year Fiscal_Year_Id Fiscal_Yr_Qtr Fiscal_Yr_Qtr_Id etc.
1 Top_Node  1           2019/2020   2              2019 Q1       9
2 Top_Node  1           2019/2020   2              2019 Q1       9
3 Top_Node  1           2019/2020   2              2019 Q1       9 
4 Top_Node  1           2019/2020   2              2019 Q1       9 
5 Top_Node  1           2019/2020   2              2019 Q1       9
6 Top_Node  1           2019/2020   2              2019 Q1       9 
7 Top_Node  1           2019/2020   2              2019 Q1       9 
8 Top_Node  1           2019/2020   2              2019 Q1       9 
9 Top_Node  1           2019/2020   2              2019 Q1       9 
10 Top_Node 1           2019/2020   2              2019 Q1       9
etc.
etc.

##TempStage after final lookup values are computed:

I  Top_Node Top_Node_Id Fiscal_Year Fiscal_Year_Id Fiscal_Yr_Qtr Fiscal_Yr_Qtr_Id etc.
1  Top_Node 1           2019/2020   2              2019 Q1       15
2  Top_Node 1           2019/2020   2              2019 Q1       15
3  Top_Node 1           2019/2020   2              2019 Q1       15 
4  Top_Node 1           2019/2020   2              2019 Q1       15 
5  Top_Node 1           2019/2020   2              2019 Q1       15 
6  Top_Node 1           2019/2020   2              2019 Q1       15 
7  Top_Node 1           2019/2020   2              2019 Q1       15 
8  Top_Node 1           2019/2020   2              2019 Q1       15 
9  Top_Node 1           2019/2020   2              2019 Q1       15 
10 Top_Node 1           2019/2020   2              2019 Q1       15
etc.
etc.

Tree (see above listing for more detail):

NodeId   Node      ParentId NumChildren NumDescendants NumLeaves Height Depth
1        Top_Node  NULL     3           1159           1096      4      0
2        2019/2020 1        4           386            366       3      1
3        2020/2021 1        4           385            365       3      1
4        2021/2022 1        4           385            365       3      1
5        2020 Q2   3        4           96             92        2      2
6        2021 Q3   4        4           96             92        2      2
7        2020 Q3   3        4           96             92        2      2
8        2021 Q1   4        4           95             92        2      2
etc.
etc.

Resources

The resource section contains a print-ready copy of this article, along with scripts to install the procs, built the D_DATE table, and run the demo.

References

[1] “Verifying Trees”, www.sqlservercentral.com, 20 Jan 2023

Resources

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating