Aggregating records representing ranges

  • I have a bit of an obscure problem.

    I have a table that looks like this:

    Filename  range_start  range_End
    file1     A            B
    file1     B            C
    file1     C            D
    file1     E            F
    file1     F            G
    file2     A            B
    file2     B            C
    file2     D            F
    file2     E            F
    file2     F            G
    ...

    I want to use a query to aggregate these records to look like this:

    file1  A  D
    file1  E  G
    file2  A  C
    file2  D  G

    Can anyone help me? I've been hitting my head against a brick wall for days now.
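
    In case it helps, here's the sample data above as a rough create/insert script, using MyTable as a placeholder table name (the column types are just a guess):

    [font="Courier New"]--minimal sketch of the sample table; the name and types are assumptions
    CREATE TABLE MyTable (Filename VARCHAR(10), range_start VARCHAR(10), range_End VARCHAR(10))

    INSERT MyTable (Filename, range_start, range_End) SELECT 'file1','A','B'
    INSERT MyTable (Filename, range_start, range_End) SELECT 'file1','B','C'
    INSERT MyTable (Filename, range_start, range_End) SELECT 'file1','C','D'
    INSERT MyTable (Filename, range_start, range_End) SELECT 'file1','E','F'
    INSERT MyTable (Filename, range_start, range_End) SELECT 'file1','F','G'
    INSERT MyTable (Filename, range_start, range_End) SELECT 'file2','A','B'
    INSERT MyTable (Filename, range_start, range_End) SELECT 'file2','B','C'
    INSERT MyTable (Filename, range_start, range_End) SELECT 'file2','D','F'
    INSERT MyTable (Filename, range_start, range_End) SELECT 'file2','E','F'
    INSERT MyTable (Filename, range_start, range_End) SELECT 'file2','F','G'[/font]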

  • Try this

    SELECT s1.Filename,
           s1.range_start,
           MIN(t1.range_End) AS range_End
    FROM MyTable s1
    INNER JOIN MyTable t1
            ON s1.Filename = t1.Filename
           AND s1.range_start <= t1.range_End
           AND NOT EXISTS (SELECT * FROM MyTable t2
                           WHERE t1.Filename = t2.Filename
                             AND t1.range_End >= t2.range_start
                             AND t1.range_End < t2.range_End)
    WHERE NOT EXISTS (SELECT * FROM MyTable s2
                      WHERE s1.Filename = s2.Filename
                        AND s1.range_start > s2.range_start
                        AND s1.range_start <= s2.range_End)
    GROUP BY s1.Filename, s1.range_start

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Is it wrong to use code you don't understand? 😉

    It'll take me a little while to figure out exactly what this does but it does indeed work.

    Thanks so much!

  • Hi. I'm presently using this solution to my problem and it does indeed work. The only drawback is that it takes a long time.

    This bit of code is the last step in a chain of 16 queries. The preceding 15 take a total of less than a minute; this single segment takes more than 3 minutes.

    I'm working with an input dataset of almost 900,000 rows!

    Is there a quicker way to get the same result?

  • If you don't mind using a temp table - you can cut that to 15 seconds.

    [font="Courier New"]--Set up Test scenario

    DROP TABLE files

    GO

    CREATE TABLE files([file_name] VARCHAR(10), range_start VARCHAR(10), Range_end VARCHAR(10))

    INSERT files([file_name],range_start)

    SELECT TOP 1000000

       'file'+CAST(CAST(RAND(checksum(NEWID()))*60000 AS INT) AS VARCHAR(10)),

       CHAR(65+CAST(RAND(checksum(NEWID()))*25 AS INT))

    FROM sys.all_columns sc1, sys.all_columns sc2

    UPDATE files

    SET range_end =CHAR(ASCII(range_start)+1)

    GO

    --start the actual work

    --first some variables

    DECLARE @g DATETIME  --just to test how fast

       SET @g=GETDATE();

    DECLARE @groupnum INT

       SET @groupnum=0;

    DECLARE @currfile VARCHAR(10)

       SET @currfile='';

    DECLARE @currrange VARCHAR(10)

       SET @currrange='';

    --create the temp table

    SELECT DISTINCT *,0 AS groupnum

    INTO #tmpfiles

    FROM files

    CREATE UNIQUE CLUSTERED INDEX pk_files ON #tmpfiles(FILE_NAME,range_start)

    UPDATE #tmpfiles

    SET @groupnum=groupnum=CASE WHEN @currfile=FILE_NAME AND @currrange=range_start THEN @groupnum

                               WHEN NOT(@currfile=FILE_NAME) THEN 1

                               ELSE @groupnum+1 END,

       @currfile=FILE_NAME,

       @currrange=range_end

    FROM #tmpfiles WITH (tablock, INDEX(pk_files))

    --just to see how fast that was

    SELECT DATEDIFF(ms,@g,GETDATE()) --9076ms

    --show the results

    SELECT FILE_NAME,groupnum, MIN(range_start) AS rstart, MAX(range_end) AS rend

    FROM #tmpfiles

    GROUP BY FILE_NAME,groupnum

    ORDER BY FILE_NAME,groupnum

    --just to see how fast that was up until now --14983ms

    SELECT DATEDIFF(ms,@g,GETDATE())

    SELECT COUNT(*) FROM #tmpfiles --729675 distinct values

       [/font]

    DROP TABLE #tmpfiles

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Thank you for your thoughts and effort on this.

    Your answer seems oh so close and it certainly is fast! However, it doesn't quite work the way I need it to.

    Here is a subset of the data that I'm trying to aggregate:

    RecNo  UWI    Seq  PType        OpenerTop  OpenerBase  GroupNum
    324    File1  1    Perforation  836.7      838.2       0
    325    File1  2    Perforation  839.7      840.6       0
    329    File1  6    Perforation  841.8      842.7       0
    326    File1  3    Perforation  841.9      842.8       0

    What I want is:

    File1  836.7  838.2  1
    File1  839.7  840.6  2
    File1  841.8  842.7  3
    File1  841.9  842.8  3

    Which would aggregate to:

    File1  836.7  838.2
    File1  839.7  840.6
    File1  841.8  842.8

    What I get is:

    File1  836.7  838.2  1
    File1  839.7  840.6  2
    File1  841.8  842.7  3
    File1  841.9  842.8  4

    Is there any way to modify this query so I get what I want?

    Thanks for your help on this.

  • You went and changed the specs on me! Sorry to say - by having to do range checks - it gets a little uglier. You unfortunately lose a little performance in the process as well... we're up to 30 seconds.

    [font="Courier New"]--Set up Test scenario

    --DROP TABLE files

    --GO

    --CREATE TABLE files([file_name] VARCHAR(10), range_start decimal(10,1), Range_end decimal(10,1))

    --

    --INSERT files([file_name],range_start)

    --SELECT TOP 1000000

    --   'file'+CAST(CAST(RAND(checksum(NEWID()))*60000 AS INT) AS VARCHAR(10)),

    --   CAST(RAND(checksum(NEWID()))*200+500.0 AS decimal(10,1))

    --FROM sys.all_columns sc1, sys.all_columns sc2

    --UPDATE files

    --SET range_end =range_start+RAND(checksum(NEWID()))*3

    GO

    --start the actual work

    --first some variables

    DECLARE @g DATETIME  --just to test how fast

       SET @g=GETDATE();

    DECLARE @groupnum INT

       SET @groupnum=0;

    DECLARE @currfile VARCHAR(10)

       SET @currfile='';

    DECLARE @currrange_start VARCHAR(10)

       SET @currrange_start=0;

    DECLARE @currrange_end VARCHAR(10)

       SET @currrange_end=0;

    --create the temp table        

    SELECT DISTINCT *,0 AS groupnum

    INTO #tmpfiles

    FROM files

    CREATE UNIQUE CLUSTERED INDEX pk_files ON #tmpfiles(FILE_NAME,range_start,range_end)

    SELECT DATEDIFF(ms,@g,GETDATE())

    UPDATE #tmpfiles

    SET @groupnum=groupnum= CASE WHEN @currfile=FILE_NAME AND range_start BETWEEN @currrange_start AND @currrange_end THEN @groupnum

                                 WHEN NOT(@currfile=FILE_NAME) THEN 1

                                 ELSE @groupnum+1 END,

       @currfile=FILE_NAME,

       @currrange_start=  CASE WHEN NOT(@currfile=FILE_NAME) THEN range_start ELSE @currrange_start END,

       @currrange_end=        CASE WHEN NOT(@currfile=FILE_NAME) OR @currrange_end<range_end THEN range_end

                                ELSE @currrange_end END

    FROM #tmpfiles WITH (tablock, INDEX(pk_files))

    SELECT DATEDIFF(ms,@g,GETDATE())

    CREATE NONCLUSTERED INDEX ixtmpfiles ON #tmpfiles (FILE_NAME,groupnum) include (range_start,range_end)

    SELECT DATEDIFF(ms,@g,GETDATE())

    --show the results

    SELECT FILE_NAME,groupnum, MIN(range_start) AS rstart, MAX(range_end) AS rend

    FROM #tmpfiles

    GROUP BY FILE_NAME,groupnum

    ORDER BY FILE_NAME,groupnum

    SELECT DATEDIFF(ms,@g,GETDATE())

    SELECT COUNT(*) FROM #tmpfiles

      

    DROP TABLE #tmpfiles

    [/font]


  • Apparently the answer is in the details. Thanks again for spending time on this. Unfortunately, it's still not quite there.

    The code seems to work correctly if I use a WHERE clause and specify a single file name. However, when I run it on the whole table, most of the rows come back with a groupnum of 1 even when they should have a different group number, which means they then get aggregated when they shouldn't.

    Am I correct in thinking that the script is expecting to update the records in the order specified by the index - sort of like walking a cursor where you can retrieve records in a defined order? If that thinking is correct, it doesn't appear to be behaving that way.


    Perhaps a little background would help.

    The records I'm processing represent perforated intervals of an oil or gas well. Briefly (and leaving out many details): a well gets drilled and then "completed" (tubing is pushed down the well to seal off the sides so that many bad things don't happen). Once the company determines which depth intervals of the well have the gas or oil, they "perforate" those intervals (punch holes in the tubing) in order to produce the oil or gas.

    Over time a well may have many of these perforation events. To complicate matters, a company can also seal off previously perforated intervals, re-perforate the same or overlapping intervals etc. It can get very complex with respect to opening and closing intervals over time.

    The table that I'm starting with contains these event records sorted by date for each file (the "file" is actually the identifier for the well. I've translated that to a file name to protect confidential data).

    My goal is to determine for each well (file) what intervals are actually presently open. We have certain wells which can have 60 or 70 events and actually don't have any presently open!

    I've processed the opening and closing events to the point where I'm only left with the presently open intervals. The only thing left to do - which you've been helping me with - is aggregating the overlapping and contiguous intervals for each well.

    Interesting problem, no?

    If it would help and you're interested, I could post a zip of a CSV of the table you create as #tempfiles. It's about 6 MB zipped.

  • A few specifics:

    - The group numbers reset when you go from one filename to another, so every filename should have a groupnum 1. That's just to make sure we can tell the query to group by filename AND groupnum. My test data was definitely not aggregating multiple filenames together, because my understanding was that they're not supposed to be; grouping by BOTH is the key there. Now, if you'd prefer the groupnums to just keep incrementing forever, all you have to do is make the following change:

    WHEN NOT(@currfile=FILE_NAME) THEN 1 --<-- replace the 1 with @groupnum+1

    - You're right, the clustered index is key. Keep in mind that the clustered index is required, since it's the only way to consistently force the order into this. Even switching to a non-clustered index would make the technique unreliable.

    Once the order is there, then yes - it's a fancy, high-speed way to walk the dataset in order, in a running totals/aggregates fashion (kind of like a cursor would do, except that it's all one big operation, and not 900K individual update operations like a cursor, which is why this takes a fraction of the time a cursor would). We use the variables to be able to compare the current row's value to the previous row's, and make decisions based on that.
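
    If it helps to see the mechanics in isolation, here's the same variable-carrying update pattern distilled to a toy running-total example (the table and column names here are invented purely for illustration):

    [font="Courier New"]--toy illustration only: an invented table, same ordered-update technique
    CREATE TABLE #toy (grp VARCHAR(10), val INT, running_total INT)
    CREATE CLUSTERED INDEX ix_toy ON #toy (grp, val)  --the order we want to walk in

    INSERT #toy (grp, val, running_total) SELECT 'a', 1, 0
    INSERT #toy (grp, val, running_total) SELECT 'a', 2, 0
    INSERT #toy (grp, val, running_total) SELECT 'b', 5, 0
    INSERT #toy (grp, val, running_total) SELECT 'b', 7, 0

    DECLARE @rt INT
       SET @rt=0
    DECLARE @grp VARCHAR(10)
       SET @grp=''

    --each row is visited in clustered-index order; the variables carry the
    --previous row's state, just like the groupnum logic above
    UPDATE #toy
    SET @rt = running_total = CASE WHEN @grp = grp THEN @rt + val ELSE val END,
        @grp = grp
    FROM #toy WITH (tablock, INDEX(ix_toy))

    SELECT grp, val, running_total FROM #toy ORDER BY grp, val
    --running_total restarts at each new grp: 1, 3, 5, 12

    DROP TABLE #toy[/font]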

    The technique I'm using is detailed here - with a rather lively follow-on discussion:

    http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

    Take a look through the data and see if you can pinpoint things that don't work for you in what I'm doing. We'd have to formalize the actual business rules for what belongs together versus what doesn't, since there still seems to be some disconnect.


  • I've got it!

    The problem was in the order of the set statements.

    Because @currrange_start and @currrange_end rely on the value of @currfile, setting @currfile to the current record's value before it was used in the tests that set @currrange_start and @currrange_end was what was causing the problem.

    Here's what it should look like:

    SET @groupnum=groupnum= CASE WHEN @currfile=FILE_NAME AND range_start BETWEEN @currrange_start AND @currrange_end THEN @groupnum
                                 WHEN NOT(@currfile=FILE_NAME) THEN 1
                                 ELSE @groupnum+1 END,
       @currrange_start= CASE WHEN NOT(@currfile=FILE_NAME) THEN range_start ELSE @currrange_start END,
       @currrange_end=   CASE WHEN NOT(@currfile=FILE_NAME) OR @currrange_end<range_end THEN range_end
                              ELSE @currrange_end END,
       @currfile=FILE_NAME

    This gives me exactly the same answer as the other code and completes in about 45 seconds!

    This is awesome!

    Thank you SO much for your help!

    One more question if you don't mind... What is the purpose of the WITH (tablock, INDEX(pk_tempfiles)) statement? I know that it's a table hint but I don't understand what the performance impact/benefit is.

  • FYI...

    The total time for my process now has gone from ~ 4:30 to ~ 1:00!

    Now that's performance!

  • Michael (4/10/2008)


    One more question if you don't mind... What is the purpose of the WITH (tablock, INDEX(pk_tempfiles)) statement? I know that it's a table hint but I don't understand what the performance impact/benefit is.

    Sure - this is the "forcing the order" I mentioned earlier and that you were picking up on. The table lock is icing on the cake (since that will happen anyway during a full-table update), but the INDEX(PK_tempfiles) "hint" is telling the optimizer to use the order set up by the PK_TEMPFILES clustered index. It's how we make sure it doesn't get creative and use whatever order it likes...


  • Michael (4/10/2008)


    FYI...

    The total time for my process now has gone from ~ 4:30 to ~ 1:00!

    Now that's performance!

    Glad it helped!

