Find records which are within the depth range of other records

  • Hi,

    Many Thanks for your help in advance, this one has my stumped. I would like to concatenate the text fields of x numbers of rows if the rows depths overlap one anothers.

    Example table:

    CREATE TABLE [dbo].[TEMP1](

    [well_id] [int] NOT NULL,

    [Top1] [real] NOT NULL,

    [BASE1] [real] not NULL,

    [comment] [nvarchar] (50) not null

    ) ON [PRIMARY]

    GO

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (5443, 2665, 2710, 'A')

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 10, 20, 'A')

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 15, 25, 'b')

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 0, 0, 'A')

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 34.66, 44.45, 'A')

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 44.25, 54.25, 'Ac')

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 54.25, 67.31, 'A')

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 93.44, 106.5, 'A')

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 1833, 2100, 'A')

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 2100, 2305, 'A')

    INSERT [dbo].[TEMP1] ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 2305, 2855, 'A')

    I would like to find all the wells with the same id, that have overlapping top and base depths, e.g. row 2; 10 -20 overlaps with row 3; 15-25 in this case I would like to merge the comments from those two rows into a new record with the maximum range of those two rows populated (e.g. 5732,10,25, Ab). Any rows from the same well that do not overlap can simply be put into the new table.

    Can anyone give me some clues as to how I should go about doing this.

    Many Thanks,

    Oliver

  • Not very efficient, but should give the correct results

    SELECT s1.well_id,

    s1.Top1,

    MIN(t1.BASE1) AS BASE1,

    (SELECT DISTINCT c.comment AS "text()"

    FROM TEMP1 c

    WHERE c.well_id = s1.well_id

    AND c.Top1 BETWEEN s1.Top1 AND MIN(t1.BASE1)

    AND c.BASE1 BETWEEN s1.Top1 AND MIN(t1.BASE1)

    ORDER BY c.comment

    FOR XML PATH(''),TYPE).value('.','VARCHAR(100)') AS comments

    FROM TEMP1 s1

    INNER JOIN TEMP1 t1 ON t1.well_id=s1.well_id

    AND s1.Top1 <= t1.BASE1

    AND NOT EXISTS(SELECT * FROM TEMP1 t2

    WHERE t2.well_id=t1.well_id

    AND t1.BASE1 >= t2.Top1 AND t1.BASE1 < t2.BASE1)

    WHERE NOT EXISTS(SELECT * FROM TEMP1 s2

    WHERE s2.well_id=s1.well_id

    AND s1.Top1 > s2.Top1 AND s1.Top1 <= s2.BASE1)

    GROUP BY s1.well_id,s1.Top1

    ORDER BY s1.well_id,s1.Top1;

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • wow, is amazing. thank you so much.

    I will now try and understand what you have done here. I dont really understand how you are comparing more than one row at a time, bulk will break it down and see how I get on.

    Many Thanks for your help,

    Oliver

  • This yields slightly different results to Mark's solution - I'm assuming that you want the comments from the middlemen in a group:

    -- identify rows with/without overlaps

    ;WITH SplitData AS (

    SELECT a.*,

    Parent = CASE WHEN EXISTS (

    SELECT 1

    FROM TEMP1 b

    WHERE b.well_id = a.well_id

    AND (b.Top1 BETWEEN a.Top1 AND a.Base1 OR b.Base1 BETWEEN a.Top1 AND a.Base1)

    AND NOT (b.Top1 = a.Top1 AND b.Base1 = a.Base1)

    ) THEN 1 ELSE 0 END

    FROM TEMP1 a

    ),

    -- sequence the rows for each well_id, partitioned by whether or not there's an overlap

    SequencedData AS (

    SELECT *,

    seq = ROW_NUMBER() OVER (PARTITION BY a.well_id, a.parent ORDER BY a.Top1, a.Base1)

    FROM SplitData a

    ),

    rCTE AS (

    SELECT -- anchor: no subsequent overlaps, or first in a sequence

    level = 1, seq, Parent,

    well_id, Top1, BASE1, comment,

    Newcomment = CAST(ISNULL(comment,'') AS VARCHAR(100)),

    NewTop1 = Top1,

    NewBase1 = BASE1

    FROM SequencedData

    WHERE seq = 1

    UNION ALL

    SELECT

    level = lr.level + 1, tr.seq, tr.Parent,

    tr.well_id, tr.Top1, tr.BASE1, tr.comment,

    Newcomment = CAST(lr.Newcomment + ISNULL(tr.comment,'') AS VARCHAR(100)),

    NewTop1 = CASE WHEN lr.NewTop1 < tr.Top1 THEN lr.NewTop1 ELSE tr.Top1 END,

    NewBase1 = CASE WHEN lr.NewBase1 > tr.BASE1 THEN lr.NewBase1 ELSE tr.BASE1 END

    FROM SequencedData tr

    INNER JOIN rCTE lr

    ON lr.well_id = tr.well_id AND tr.seq > lr.seq

    AND (tr.Top1 BETWEEN lr.Top1 AND lr.Base1 OR tr.Base1 BETWEEN lr.Top1 AND lr.Base1)

    )

    SELECT well_id, NewTop1, NewBase1, Newcomment

    FROM (

    SELECT *, Maxlevel = MAX(level) OVER(PARTITION BY well_id)

    FROM rCTE

    ) d

    WHERE Parent = 0

    OR level = Maxlevel

    ORDER BY well_id, level

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thank you very much for this. I am slowly working through this.

    Currently though I dont class two rows one row ending and the other row starting on the same number as an overlap. Therefore do I need to change your statement from between to > and <

    Many Thanks again.

    Oliver

  • I changed the code to this to sort that change. Is there any problem with this?

    -- identify rows with/without overlaps

    ;WITH SplitData AS (

    SELECT a.*,

    Parent = CASE WHEN EXISTS (

    SELECT 1

    FROM TEMP1 b

    WHERE b.well_id = a.well_id

    AND ((b.Top1 > a.Top1 AND b.Top1 < a.Base1) OR (b.Base1 > a.Top1 AND b.Base1 < a.Base1))

    AND NOT (b.Top1 = a.Top1 or b.Base1 = a.Base1)

    ) THEN 1 ELSE 0 END

    FROM TEMP1 a

    ),

    -- sequence the rows for each well_id, partitioned by whether or not there's an overlap

    SequencedData AS (

    SELECT *,

    seq = ROW_NUMBER() OVER (PARTITION BY a.well_id, a.parent ORDER BY a.Top1, a.Base1)

    FROM SplitData a

    ),

    rCTE AS (

    SELECT -- anchor: no subsequent overlaps, or first in a sequence

    level = 1, seq, Parent,

    well_id, Top1, BASE1, comment,

    Newcomment = CAST(ISNULL(comment,'') AS VARCHAR(100)),

    NewTop1 = Top1,

    NewBase1 = BASE1

    FROM SequencedData

    WHERE seq = 1

    UNION ALL

    SELECT

    level = lr.level + 1, tr.seq, tr.Parent,

    tr.well_id, tr.Top1, tr.BASE1, tr.comment,

    Newcomment = CAST(lr.Newcomment + ISNULL(tr.comment,'') AS VARCHAR(100)),

    NewTop1 = CASE WHEN lr.NewTop1 < tr.Top1 THEN lr.NewTop1 ELSE tr.Top1 END,

    NewBase1 = CASE WHEN lr.NewBase1 > tr.BASE1 THEN lr.NewBase1 ELSE tr.BASE1 END

    FROM SequencedData tr

    INNER JOIN rCTE lr

    ON lr.well_id = tr.well_id AND tr.seq > lr.seq

    AND (tr.Top1 BETWEEN lr.Top1 AND lr.Base1 OR tr.Base1 BETWEEN lr.Top1 AND lr.Base1)

    )

    SELECT well_id, NewTop1, NewBase1, Newcomment

    FROM (

    SELECT *, Maxlevel = MAX(level) OVER(PARTITION BY well_id)

    FROM rCTE

    ) d

    WHERE Parent = 0

    OR level = Maxlevel

    ORDER BY well_id, level

    Many Thanks

  • oliver.morris (9/17/2012)


    I changed the code to this to sort that change. Is there any problem with this?

    -- identify rows with/without overlaps

    ;WITH SplitData AS (

    SELECT a.*,

    Parent = CASE WHEN EXISTS (

    SELECT 1

    FROM TEMP1 b

    WHERE b.well_id = a.well_id

    AND ((b.Top1 > a.Top1 AND b.Top1 < a.Base1) OR (b.Base1 > a.Top1 AND b.Base1 < a.Base1))

    AND NOT (b.Top1 = a.Top1 or b.Base1 = a.Base1)

    ) THEN 1 ELSE 0 END

    FROM TEMP1 a

    ),

    -- sequence the rows for each well_id, partitioned by whether or not there's an overlap

    SequencedData AS (

    SELECT *,

    seq = ROW_NUMBER() OVER (PARTITION BY a.well_id, a.parent ORDER BY a.Top1, a.Base1)

    FROM SplitData a

    ),

    rCTE AS (

    SELECT -- anchor: no subsequent overlaps, or first in a sequence

    level = 1, seq, Parent,

    well_id, Top1, BASE1, comment,

    Newcomment = CAST(ISNULL(comment,'') AS VARCHAR(100)),

    NewTop1 = Top1,

    NewBase1 = BASE1

    FROM SequencedData

    WHERE seq = 1

    UNION ALL

    SELECT

    level = lr.level + 1, tr.seq, tr.Parent,

    tr.well_id, tr.Top1, tr.BASE1, tr.comment,

    Newcomment = CAST(lr.Newcomment + ISNULL(tr.comment,'') AS VARCHAR(100)),

    NewTop1 = CASE WHEN lr.NewTop1 < tr.Top1 THEN lr.NewTop1 ELSE tr.Top1 END,

    NewBase1 = CASE WHEN lr.NewBase1 > tr.BASE1 THEN lr.NewBase1 ELSE tr.BASE1 END

    FROM SequencedData tr

    INNER JOIN rCTE lr

    ON lr.well_id = tr.well_id AND tr.seq > lr.seq

    AND (tr.Top1 BETWEEN lr.Top1 AND lr.Base1 OR tr.Base1 BETWEEN lr.Top1 AND lr.Base1)

    )

    SELECT well_id, NewTop1, NewBase1, Newcomment

    FROM (

    SELECT *, Maxlevel = MAX(level) OVER(PARTITION BY well_id)

    FROM rCTE

    ) d

    WHERE Parent = 0

    OR level = Maxlevel

    ORDER BY well_id, level

    Many Thanks

    Hi Oliver

    It looks ok to me and works in the way you anticipate on the small sample data set you've provided. Test it thoroughly against a more substantial data set - which only you have - before committing it to production.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Mark,

    I have spent the last couple of hours looking at your code and I cant work it out, if you could place a few text pointers around that would be great.

    A few things, why the group by (min base) and why s1.top <= t1.base

    I am sorry I am stupid but there are two many steps removed for me to learn how you came up with this.

    Sorry to be a pain.

    Many Thanks,

    Oliver

  • This is a just couple of queries rolled into one.

    This first query finds the start points of the intervals by finding all start points that don't lie inside an earlier interval.

    SELECT s1.well_id,

    s1.Top1

    FROM TEMP1 s1

    WHERE NOT EXISTS(SELECT * FROM TEMP1 s2

    WHERE s2.well_id=s1.well_id

    AND s1.Top1 > s2.Top1 AND s1.Top1 <= s2.BASE1)

    This is the same idea but for end points instead.

    SELECT t1.well_id,

    t1.BASE1

    FROM TEMP1 t1

    WHERE NOT EXISTS(SELECT * FROM TEMP1 t2

    WHERE t2.well_id=t1.well_id

    AND t1.BASE1 >= t2.Top1 AND t1.BASE1 < t2.BASE1)

    Finally, the start and end points are paired up by joining on well_id and finding the earliest end point for a start point, hence the GROUP BY and MIN. You could equally do this pairing up using ROW_NUMBER.

    The "FOR XML" bit is a technique to concatenate rows, you'll find plenty of other examples of it in posts on SSC.

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Hi,

    Many thanks for the break down. Sorry to bother you I am just keen to get my head round this.

    Taken another look at it this evening. I made a change to is to ensure that the overlapping numbers must be within the range as opposed to <= or >=

    SELECT s1.well_id,

    s1.Top1,

    MIN(t1.BASE1) AS BASE1,

    (SELECT distinct c.comment AS "text()"

    FROM TEMP1 c

    WHERE c.well_id = s1.well_id

    AND c.Top1 BETWEEN s1.Top1 AND MIN(t1.BASE1)

    AND c.BASE1 BETWEEN s1.Top1 AND MIN(t1.BASE1)

    ORDER BY c.comment

    FOR XML PATH(''),TYPE).value('.','VARCHAR(400)') AS comments

    FROM TEMP1 s1

    INNER JOIN TEMP1 t1 ON t1.well_id=s1.well_id

    AND s1.Top1 < t1.BASE1

    AND NOT EXISTS(SELECT * FROM TEMP1 t2

    WHERE t2.well_id=t1.well_id

    AND t1.BASE1 > t2.Top1 AND t1.BASE1 < t2.BASE1)

    WHERE NOT EXISTS(SELECT * FROM TEMP1 s2

    WHERE s2.well_id=s1.well_id

    AND s1.Top1 > s2.Top1 AND s1.Top1 < s2.BASE1)

    GROUP BY s1.well_id,s1.Top1;

    I always find the best way I can understand it is to explain it, so here goes:

    SELECT s1.well_id,

    s1.Top1

    FROM TEMP1 s1

    WHERE NOT EXISTS(SELECT * FROM TEMP1 s2

    WHERE s2.well_id=s1.well_id

    AND s1.Top1 > s2.Top1 AND s1.Top1 < s2.BASE1)

    SELECT t1.well_id,

    t1.BASE1

    FROM TEMP1 t1

    WHERE NOT EXISTS(SELECT * FROM TEMP1 t2

    WHERE t2.well_id=t1.well_id

    AND t1.BASE1 > t2.Top1 AND t1.BASE1 < t2.BASE1)

    These two sub queries build up list of ranges where top>base (for Top values) and base < top (For Base Values)

    Then these are joined by well_ID and min base to each top in the list - to be honest I am still not clear about how this part works any why you only need to choose min base and not max top. If you would be kind enough to show how it looks with row number that would be really helpful.

    Then regarding the XML essentially this runs the same queries again but the XML is concatenating the comments field using the text() node in XML Path.

    Really appreciate your help, sorry I am a slow learner.

    Oliver

  • My first thought, like ChrisM@Work was to use a rCTE for this, showing as always that great minds think alike! 😛

    So, to be different, we'll propose a QU approach, which I have found works interestingly well in many cases for gaps and islands (well, islands anyway).

    First the setup data in a temp table, with two additional helper columns added.

    CREATE TABLE #TEMP1(

    [well_id] [int] NOT NULL,

    [Top1] [real] NOT NULL,

    [BASE1] [real] not NULL,

    [comment] [nvarchar] (50) not null,

    [cmin] [real] NULL,

    [cmax] [real] NULL,

    PRIMARY KEY CLUSTERED ([well_id], [Top1], [BASE1])

    ) ON [PRIMARY]

    GO

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (5443, 2665, 2710, 'A')

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 10, 20, 'A')

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 15, 25, 'b')

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (5732, 0, 0, 'A')

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 34.66, 44.45, 'A')

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 44.25, 54.25, 'Ac')

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 54.25, 67.31, 'A')

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (6679, 93.44, 106.5, 'A')

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 1833, 2100, 'A')

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 2100, 2305, 'A')

    INSERT #TEMP1 ([well_id], [Top1], [BASE1],[comment]) VALUES (7201, 2305, 2855, 'A')

    Now my proposed solution:

    UPDATE #TEMP1

    SET @cmin = CASE

    WHEN @cwell <> [well_id] THEN [Top1]

    WHEN [Top1] > @cmax AND [Top1] > @cmin THEN [Top1]

    WHEN [Top1] > @cmin THEN @cmin

    ELSE NULL END

    ,@cmax = CASE

    WHEN @cwell <> [well_id] THEN [BASE1]

    WHEN [BASE1] > @cmax AND [Top1] < @cmin THEN @cmax

    ELSE [BASE1] END

    ,@cwell = [well_id]

    ,cmin = @cmin

    ,cmax = @cmax

    SELECT well_id, [Top1]=cmin, [BASE1]=MAX(cmax)

    FROM #TEMP1

    GROUP BY well_id, cmin

    ORDER BY well_id, cmin

    SELECT * FROM #TEMP1 ORDER BY well_id, [Top1]

    DROP TABLE #TEMP1

    Note that it may require a bit of tweaking that would only be uncovered by additional test data. I'd bet it's pretty fast too.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply