Previous row where condition exists

  • Yes that's correct.  In the new dataset I renamed RIN# to Cl_ID and License# to VEH.  But yes, you would be correct.

  • tacy.highland wrote:

    There are definitely rows without dupes. And in the full dataset that's the vast majority.  Any records that have a 1 rownum, (which are not followed by a 2 or 3 rownum in the next row) are non-duplicates.  And you were also right before, where I used the Row_number to create the rownum column.  If you have a better way, I'm all ears!

    Thanks!

    I'm trying to get you to tell me what you partitioned the ROW_NUMBER() by so I can show you what I'm thinking instead of depending on the ROW_NUMBER().

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • tacy.highland wrote:

    Yes that's correct.  In the new dataset I renamed RIN# to Cl_ID and License# to VEH.  But yes, you would be correct.

    Ah.. ok.  Thanks.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ok... give this a try and let me know if this does the trick or not.  I left your RowNum in there instead of sorting by the ID.  I think you'll like the first 2 columns in the result set.  Takes out the guesswork...

    The code has absolutely no "previous row" dependencies  and can easily handle any number of dupes per row or total number of dupes.

       WITH ctePreAgg AS
    (
    SELECT DupeGrp = DENSE_RANK() OVER (ORDER BY ldate,CL_ID,Provider_Name,VEH,pick_up,drop_off,Modifier,CALCULATED_FARE)
    ,GrpCnt = COUNT(*) OVER (PARTITION BY ldate,CL_ID,Provider_Name,VEH,pick_up,drop_off,Modifier,CALCULATED_FARE)
    ,*
    FROM #duplicatetrips2
    )
    SELECT *
    FROM ctePreAgg
    WHERE GrpCnt > 1
    ORDER BY DupeGrp, RowNum
    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • p.s.  And what I was getting at is you could do your RowNum at the same time if you really need it but I'm thinking that sorting by ID is good enough... you don't actually need ROW_NUMBER() in this case.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Nice! This also does the trick, with the additional DupeGrp and GrpCnt columns.  I like it.  And so simple, like you said, just a Dense Rank.

    Thank you, Jeff!  And thank you, Scott!  I'm so glad people like you are so willing to share the knowledge and help out others. This forum has been such a great resource to learn from!

    ~cheers

  • It has no dependencies because it sorts the values again (as I understood the q).  If you have a lot of data that has already been sorted, that may not be ideal.  Based on the q setup, it seemed that was the case, but maybe not.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    It has no dependencies because it sorts the values again (as I understood the q).  If you have a lot of data that has already been sorted, that may not be ideal.  Based on the q setup, it seemed that was the case, but maybe not.

    I changed the RID# and License# columns so the UNION ALL method would work with the latest data posted... it has two accidental cross joins between the 30 rows and the 7 columns, each producing 210 rows.  It would be interesting to see what that does with a larger row set.  Sometimes the execution plan will change to eliminate the accidental cross joins but usually not.

    A good "quirky update" may be faster than any of this.    Maybe an UPDATE with a LAG accumulator would work as well.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    ScottPletcher wrote:

    It has no dependencies because it sorts the values again (as I understood the q).  If you have a lot of data that has already been sorted, that may not be ideal.  Based on the q setup, it seemed that was the case, but maybe not.

    I changed the RID# and License# columns so the UNION ALL method would work with the latest data posted... it has two accidental cross joins between the 30 rows and the 7 columns, each producing 210 rows.  It would be interesting to see what that does with a larger row set.  Sometimes the execution plan will change to eliminate the accidental cross joins but usually not.

    A good "quirky update" may be faster than any of this.    Maybe an UPDATE with a LAG accumulator would work as well.

    It would be interesting to see.  The ORDER BY / PARTITION BY will require a full sort, unless there's a clustered index that covers it (which is often best but probably problematic and unlikely here because of the number of columns involved).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • I've been playing with this some more, Scott.  It turns out that the Partitioned COUNT(*) is just freakin' nasty (creates 145 extra reads on a 2 page table) compared to the other Windowing functions.  I do have a rather odd Lead/Lag work around that keeps reads at 2/1 that doesn't depend on the sequence of an ID column and I'll try to remember to post it tonight.

    My main concern has been that no one can predict the number of dupes that may appear and writing a SELECT to cover each possible dupe just didn't sit in my craw comfortably.  It DOES produce fewer reads (still relatively high, though) than the bloody Partitioned COUNT(*) but still to many.

    The really cool part about this problem that I'm learning "other possibilities" from this post that will help the world I work in and so I thank you for your comments and feedback.  They've been helpful.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yes, I love the compactness and "slickness" of the code, but for large number of rows, performance-wise it can be a relative disaster.

    Of course, in general, sorting is big overhead anyway.  If you have to sort large amounts of data, keep the sort result (stored) if possible, to avoid having to re-sort.  For most tables, you'll want to leverage clustered indexes for that if a table is (almost) always accessed by the same key column(s).  Especially child tables, where you also then benefit enormously from MERGE joins, which are extremely efficient when applicable.

     

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Like I said, I didn't find the sorts to be obnoxious for the basic Windowing functions... it's the bloody partitioned COUNT(*) that was doing it and I found a work around for that.  Unfortunately, I'm still working even at this late hour and don't have the time tonight to write up my findings.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This was removed by the editor as SPAM

  • CREATE TABLE Foobar

    (rin CHAR(10) NOT NULL

    CHECK(rin LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'),

    provider_name VARCHAR(255) NOT NULL,

    license_nbr NVARCHAR(50) NOT NULL,

    pickup_timestamp DATETIME2(0) NOT NULL,

    dropoff_timestamp DATETIME2(0) NOT NULL,

    modifier_code NVARCHAR(519) NOT NULL,

    calulated_fare_amt DECIMAL(10,2) NOT NULL,

    PRIMARY KEY (rin, ??));

    INSERT INTO Foobar

    VALUES

    ('015953094', 'ElmedTransportatio', '808547AM', '1976-03-17 22:34:00', '1976-03-17 23:33:00', 'A0428', 340.01),

    ('134745595', 'ElmedTransportatio', '8511MC', '2017-11-24 13:01:00', '2017-11-24 13:23:00', 'A0425, A0130', 43),

    etc.

    Have you ever once thought about doing an SQL database? By definition, not as an option, a table must have a key. Again, by definition, not as an option, you can't have duplicate rows. Furthermore, since a table models a set, there is no such concept as a "previous row" in the language. You. You have just used SQL to write a very badly designed file.

    .

    Why are you storing a date as a character string when we have a date data type?

    Why are you storing times as a character string when we have a time data type?

    Why did you split a timestamp into a date and a time column instead of keeping it as a DATEDATETIME2(0)2(n) data type?

    Did you know that the octothrop (#) is not a valid character in ANSI/ISO standard SQL? The reason for this is that it's used in a lot of other ANSI/ISO standards and has special meaning in various other standards. Younger people call it a hashtag.

    Why do you have so many large NVARCHAR(n) columns? This is very unusual and invites garbage data, so if you really don't need to have the entire Heart Sutra in standard Chinese characters in a column, you want to make them simple VARCHAR(n) . This is why I guess you should have posted:

    rin CHAR(10) NOT NULL

    CHECK(rin LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'),

    I am assuming that people reading this know what your "RIN#" means in your trade, so it will be clear.

    When I look at modifier_code 'A0425, A0130', it looks like you violated first normal form by having a comma separated list in a single column. That is one of the most awful mistakes you can make in RDBMS!

    I'm also trying to figure out what your calculated fare amount means. My first thought was that was some kind of financial calculation, but that would be illegal because you're using a floating-point number. I don't mean just that it's going to have all kinds of cumulative errors; I mean it's illegal. Please read the GAAP or EU financial regulations

    >> I can identify the duplicates by the RowNum column; if it's > 1 then it's a duplicate of the previous row. (There are some cases where there are more than 1 duplicate so the rownum will be 3). What I need is to be able to pull BOTH rows that are duplicates of each other. <<

    Dr. Codd introduced the concept of a "degree of duplication" value, which shows how many times a row is repeated in a table, this lets you have a key without having to do tricks with row numbering, IDENTITY table properties, or other kludges. You need to start over and actually use a relational database.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

Viewing 14 posts - 16 through 28 (of 28 total)

You must be logged in to reply to this topic. Login to reply