Recursive query problem - find ALL related records for a given record

  • Federico;

    The result I need is actually in the submission ID column of Jonathan's solution.  SELECT DISTINCT [SUBMISSION_ID] FROM #tmp2

    Jonathan;

    My present solution approach, in Tableau, was predicated on joining primary data table, containing rows representing the submissions, to the result of the query I've been trying to develop e.g. [Primary_source].[Submission_ID] = [Related_Submissions_Query].[Related_Submission_ID].  Then I would use [Related_Submissions_Query].[Submission_ID] as a filter to display the list of related submissions. Tableau can use stored procedures although I've never needed to do that.  I'm unsure about how to pass a parameter that would be the selection of the Submission_Id for which to find related submissions and also unsure how to use the stored procedure result.

    I hope this explanation makes sense.

  • good to know how this is supposed to be used.

    from what I know of Tableau using a SP will not work as they can't be joined as part of a tableau query

    see https://help.tableau.com/current/pro/desktop/en-us/connect_basic_stored_procedures.htm

    you can try the approach I mentioned with 2 levels and most likely it would get the majority of cases - but to get them fully (unless Tableau has a trick I am unaware of) one possible solution would be to use a CLR function within a view.

  • CanuckBuck wrote:

    My present solution approach, in Tableau, was predicated on joining primary data table, containing rows representing the submissions, to the result of the query I've been trying to develop e.g. [Primary_source].[Submission_ID] = [Related_Submissions_Query].[Related_Submission_ID].  Then I would use [Related_Submissions_Query].[Submission_ID] as a filter to display the list of related submissions. Tableau can use stored procedures although I've never needed to do that.  I'm unsure about how to pass a parameter that would be the selection of the Submission_Id for which to find related submissions and also unsure how to use the stored procedure result.

    I hope this explanation makes sense.

    You could write a table valued function, then just select from it as you would with a normal table:

    IF OBJECT_ID(N'dbo.RelatedItems', N'TF') IS NULL
    BEGIN
    EXEC ('CREATE FUNCTION [dbo].[RelatedItems] () RETURNS @table TABLE (VALUE nchar(1)) AS BEGIN RETURN END')
    END
    GO
    ALTER FUNCTION dbo.RelatedItems
    (
    Id int
    )
    RETURNS @Results TABLE
    (
    Submission_ID int,
    Related_Submission_Id int
    )
    AS
    BEGIN
    declare @rowcount int = -1
    while @rowcount <> 0 begin

    insert into @Results
    (
    [Submission_ID],
    [Related_Submission_Id]
    )
    select [Submission_ID],
    [Related_Submission_Id]
    from tmp t
    where (@id in (t.Submission_ID, t.Related_Submission_Id)
    or exists(select *
    from @Results t2
    where t2.Related_Submission_Id in (t.Submission_ID, t.Related_Submission_Id)
    or t2.Submission_ID in (t.Submission_ID, t.Related_Submission_Id))
    )
    and not exists(select *
    from @Results t2
    where t2.Submission_ID = t.Submission_ID
    and (t2.Related_Submission_Id = t.Related_Submission_Id
    or (t2.Related_Submission_Id is null
    and t.Related_Submission_Id is null))
    )
    set @rowcount = @@ROWCOUNT
    end

    RETURN
    END
    GO

    Then to select from it just:

    select * from dbo.RelatedItems(85)

    Note: you will have to make #tmp a permanent table.  I've changed it to be named tmp in the code above.

  • Frederico and Jonathan;

    First.  Thank you for your support on this.  I really appreciate it.

    Jonathan;  it took me a few hours to figure out what your code is doing.  The code I write is pretty basic by comparison.  If I understand correctly, your code functions in much the same way as a CTE.  Here's what I think I understand.  Please correct my understanding if I'm wrong.

    In a loop, find and insert into a temp table - #tmp2;

    • Any rows from the source table - #tmp where

      • either column of #tmp matches the input variable

        OR either column of #tmp matches any value in either column from rows previously inserted in #tmp2

      • AND which haven't previously been inserted into #tmp2 (the AND NOT EXISTS statement)

    • The loop repeats until no more rows are inserted
    • The Submission_ID column lists all Submission IDs that are related to the Input variable.

    Unfortunately, Tableau can not use functions.  It can only use stored procedures.   It can also execute code (effectively, pass-through SQL) at runtime (when a workbook is opened and connects to a database).  I've used this technique in the past to create global temp tables which I then join to other tables in a database.  Hence my initial approach to "pre-build" a temp table containing all relationships (in the form that my code above creates.  I should point out that for testing purposes I've segregated the different scenarios but in operation the query would run against all relationships for all submissions in the database) and then joining that to my main table.  Once the tables are joined then I can use a simple filter on any column from either table.

    In Tableau I can create a parameter to enable a user to make a selection of a value (e.g. a Submission Id).  I can also create a something called a set which can be used as a filter.  What I don't know, yet, is if I can use a parameter selection to run a query / call a stored procedure etc. and populate the set with the result of the procedure call.

  • I think I might have found a solution!

    Connect to a Custom SQL Query - Tableau

    I'll report back once I've tried this.

  • >> I need a way to find all of the related records sic in a set of relationships given any record_id sic in the set. The nature of the relationships is variable. The 'happy path' is that the relationships represent a simple 'daisy-chain' where a child sic is related only to its immediately preceding sic record sic - simple recursion. Unfortunately there are other types of relationships expressed in the data; <<

    Our first problem is that you're not expressing your situation in relational terms. Rows are not record sics; this is completely totally different concept. We don't have "children" and "parents" in RDBMS; we have "referenced" and "referencing" tables. A table can play both roles in the schema. Since rows have no ordering, the concept of preceding simply doesn't exist. Since RDBMS and SQL are based on declarative languages, we don't like concept of recursion is an operation on tables. Yes, I know we voted into the standards, but it is both expensive and awkward. We would much prefer to put any recursion we have in the schema rather than in the queries.

    Your mindset is still back in the days of pointer chains and network databases. And you're looking for that kind of solution in SQL. I would suggest that you switch your mindset to a relational model based on sets and not pointer chains. That was Bachman and the network model of databases.

    >> A 'child' record sic can be related to more than one 'parent' record sic (like in real human life, although in this scenario there could potentially be more than two...) multiple 'children' can be related to a single 'parent' record sic

    some records may skip a 'generation' in the daisy-chain, being related to potentially any 'parent' up sic the chain sic a given set of relationships may include some of all of the scenarios expressed above. <<

    Instead of thinking of a "boxes and arrows" model for your data, try thinking of "Euler circles" or what are incorrectly called "Venn diagrams"; circles nested inside each other

    >> Data rules governing the relationships in the source data;

    The first column can not be NULL

    The 2nd column can be NULL

    The value of the 2nd column for a row will always be less than the value if the first column

    The value in the first column can be repeated i.e. it is related to more than one record sic <<

    Would you mind showing us some valid DDL? I'd like to know which one is a key, what their names (with their meaning) are and why you think that columns have any particular ordering (that's just a part of how you have to write things in SQL and has nothing to do with a valid RDBMS model).

    Unlike filesystems or other non-relational data storage, a table must have a key by definition. This is not an option. You should have known this and yet you failed to do anything about it in your DDL. Why?

    CREATE TABLE Submissions -- replace uselss non-table name

    (submission_id INTEGER NOT NULL,

    related_submission_id INTEGER );

    Identifiers are measured with a nominal scale. This means anything that ends with "_id" should be a character string, by its very nature. Your your first day in your data modeling class? Scales and measurements? But if you're thinking of pointer chains, they are usually integers! And that's exactly what you've done. Again you're using a nonrelational network database model.

    In fact, because you have no keys, you can wind up with circular references in your recursion, which will lead to infinite recursion. You probably already discovered that 🙁

    Also, since tables have no ordering by their nature, a query should not have an "ORDER BY" clause to converted into a cursor. But when you're dealing with pointer chains and procedural languages, you really like this nonrelational stuff.

    As an aside, putting the comma at the start of the line (new punchcard) is a trick we used in the 1960s. We did it so we could rearrange and reuse a deck of punch cards. Before you ask, yes I am that old, and I've been involved with SQL since the 1986 standards. We stopped doing it because it actually decreases readability. People who speak Latin alphabet languages parse from left to right and expect a comma to come at the end of something not at the front of it. Their eyes physically jump backward (see the University of Maryland eye motion studies done for AIRMICS from decades ago), so your error rate increases and you can't read as fast as you want to. The general rule is that your code should look "Kinda like the way you would write in English" to avoid errors. Disclaimer: I investigated some of this research when I worked for AIRMICS.

    If you want to do hierarchical structures, I have an entire book devoted to treason hierarchies and SQL. Get a copy easily enough. I believe, but I'm not sure, there is a proof that all graphs without cycles can be reduced to a tree. Perhaps someone can check me on that.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko;

    Thanks for your response.  I've attached the DDL of the pertinent tables.

    In the interest of self preservation 🙂 it's worth noting the design, naming conventions etc. of the database I'm working with are not mine.  I had no part in it's development and have no opportunity to influence changes.  I'm only a consumer.  My role is to provide BI products from the content of the database.  The database is a reporting "data warehouse" - a sort of copy (similar but not identical in structure) of a bespoke enterprise OLTP database which is modelled mostly on the PPDM data model.  A nightly ETL process captures changes from the OLTP system into the data warehouse.

    The use-case involving the two tables in the DDL is that external parties are required to submit Site (geographical location) Condition reports describing the condition of the site with respect to environmental contamination.  The parties are required to submit subsequent reports of the condition of the site at some frequency (varying from months to years) until the site is remediated.  The OLTP system permits a party to relate a new submission to a previous submission (usually the most current previous submission but not necessarily) that they have made by referencing it's Submission ID.  In fact the system permits referencing multiple previous submissions.  Effectively, the party is indicating that the new submission updates information in the related submission(s).

    This relationship information is persisted in the EVENT_XREF table where the SUBMISISON_ID of the submission currently being added - the referencing (record?  row?) is captured in EVENT1_ID and ID(s) of the submission(s) to which it is related - the referenced (record? row?) are captured in EVENT2_ID.

    Submissions are one category of EVENT that the OLTP system handles. The EVENT_XREF table is a generic structure that supports cross-references for many different categories of events (Applications, of various types; Submissions, of various types; Notifications, of various types etc.).

    The customer I'm building the BI product for needs to be able to find all of the submissions related to any given submission.  I'm using Tableau (I don't get to choose any other solution).  Tableau likes to work with tables that are related to one another.  Jonathan's solution appears to work from a "finding all related submissions" perspective.  However, I'm not yet certain if I can incorporate that into Tableau.

    Your mention of commas is interesting.  I've adopted that convention, in part, because that's the way SSMS generates a query when you right-click a table in the object explorer and choose "select top 1000 rows".  I've just become accustomed to it being in the same place for each line of a query.  For me it's become the visual cue for a new column and particularly beneficial when the statement for the preceding column spans multiple lines.

    I hope I've clarified and not mystified...

    • This reply was modified 2 years, 10 months ago by  CanuckBuck.
    • This reply was modified 2 years, 10 months ago by  CanuckBuck.
    • This reply was modified 2 years, 10 months ago by  CanuckBuck.
    Attachments:
    You must be logged in to view attached files.
  • So, to be clear, the code specified above, inserting rows into #tmp would be analogous to what would be found in the EVENT_XREF table defined in the DDL.  The corresponding submissions would be in the SUBMISSION table.

    Further, for the sake of clarity, you'll notice that both the SUBMISSION table and the EVENT_XREF table have _SK columns (e.g. EVENT1_SK in the EVENT_XREF table).  The _ID columns in the tables are "business keys".  The _SK columns are ETL generated values that are actually the related columns from each table.  In the case of related submissions - EVENT1_SK = SUBMISSION_SK for the referencing record and EVENT2_SK = SUBMISSION_SK for the referenced record.

    Again, I hope this clarifies and not mystifies...

  • Looking a the scenarios, it seems that there is always a single starting point or root ancestor.  In scenario1 it is 2 scenario2 , 3 and scenario 3, 4.  If you know the root node you can use a recursive CTE to find all the 'child' nodes.  Similarly from any 'child' node you can use a recursive CTE to find all the ancestors including the root ancestor node. Combining these two I was able to return all related records for any starting point in the hierarchy.  My table called licenses has an id and a parent_id field among others.  I am using MySQL so the syntax is a bit different to SQL Server but the principle should be the same.

    SET @lid = '90';
    WITH RECURSIVE
    cte2 (id, parent_id, ind, level) AS (
    -- Anchor member returns the root ancestor*****************
    SELECT id, parent_id, CAST(',' AS CHAR(21844)) AS ind, 0 FROM licenses WHERE id = (
    -- This embedded CTE returns the id of the root ancestor*************
    WITH RECURSIVE
    cte (id, parent_id, ind, level) AS (
    SELECT id, parent_id, CAST(',' AS CHAR(21844)) AS ind, 0 FROM licenses WHERE id = @lid
    UNION ALL
    SELECT licenses.id, licenses.parent_id, CONCAT(cte.ind, licenses.id, ','), cte.level + 1
    FROM cte JOIN licenses ON licenses.id = cte.parent_id AND LOCATE(CONCAT(',', licenses.id, ','), cte.ind) = 0
    ) SELECT DISTINCT id FROM cte ORDER BY level DESC LIMIT 1
    -- End of embedded CTE******************
    )
    -- END of anchor member*********************************
    UNION ALL
    -- Recursive member**************************************
    SELECT licenses.id, licenses.parent_id, CONCAT(cte2.ind, licenses.id, ','), cte2.level + 1
    FROM cte2 JOIN licenses ON licenses.parent_id = cte2.id AND LOCATE(CONCAT(',', licenses.id, ','), cte2.ind) = 0
    -- End of Recursive member**************************
    ) SELECT DISTINCT id,level FROM cte2 ORDER BY level ASC

    Hope this helps someone.

Viewing 9 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply