Any way of joining data on two unique records, trying to cancel out matches

  • In MSSQL 2016 I am trying to find charges that cancel each other out. For example: if a group of data contained a positive charge and a negative charge then they would be canceled out, but if they had two of the same charge and only one was cancelled out then I only want the one. Example Below:

    CREATE TABLE #tmp_charges (
        PK_ID int IDENTITY(1, 1) NOT NULL PRIMARY KEY NONCLUSTERED,
        groupid INT,
      charge_code VARCHAR(9) INDEX ix_charge_code NONCLUSTERED,
        Amount DECIMAL(8,2),
        Qty INT,
      Void BIT
    )

    INSERT INTO #tmp_charges (groupid, charge_code, amount, qty, void)
    VALUES (1, '1000', 10, 1, 0),
        (1, '1000', -10, -1, 0),
      (1, '1000', 10, 1, 0),
      (1, '1000', 10, 1, 0),
      (1, '1000', 10, 1, 0),
      (1, '1000', -10, -1, 0),
      (1, '1000', -20, -2, 0),
      (2, '1001', 50, 2, 0),
      (2, '1000', -10, -1, 0),
      (2, '1000', 20, 2, 0),
      (2, '1001', -25, -1, 0),
      (3, '1001', 25, 1, 0),
      (4, '1001', 25, 1, 0),
      (4, '1001', -25, -1, 0),
      (4, '1001', -25, -1, 0)
     
    SELECT jtc.PK_ID PK_IDMatch, jtc.qty cqty, jtc.amount camt, tc.* FROM #tmp_charges tc
        JOIN #tmp_charges jtc ON jtc.groupid = tc.groupid
        AND jtc.charge_code = tc.charge_code
        AND jtc.qty = ABS(tc.qty)
      AND jtc.amount = ABS(tc.amount)
        WHERE tc.qty < 0

    This gives me a list of all charges that match each other, which is close to what I need:


    "PK_IDMatch"    "cqty"    "camt"    "PK_ID"    "groupid"    "charge_code"    "Amount"    "Qty"    "Void"
    "1"    "1"    "10"    "2"    "1"    "1000"    "-10"    "-1"    "False"
    "3"    "1"    "10"    "2"    "1"    "1000"    "-10"    "-1"    "False"
    "4"    "1"    "10"    "2"    "1"    "1000"    "-10"    "-1"    "False"
    "5"    "1"    "10"    "2"    "1"    "1000"    "-10"    "-1"    "False"
    "1"    "1"    "10"    "6"    "1"    "1000"    "-10"    "-1"    "False"
    "3"    "1"    "10"    "6"    "1"    "1000"    "-10"    "-1"    "False"
    "4"    "1"    "10"    "6"    "1"    "1000"    "-10"    "-1"    "False"
    "5"    "1"    "10"    "6"    "1"    "1000"    "-10"    "-1"    "False"
    "13"    "1"    "25"    "14"    "4"    "1001"    "-25"    "-1"    "False"
    "13"    "1"    "25"    "15"    "4"    "1001"    "-25"    "-1"    "False"

    What I need help with is only pulling up a one to one match and not selecting the same one twice, so this as the correct output:
    "PK_IDMatch"    "cqty"    "camt"    "PK_ID"    "groupid"    "charge_code"    "Amount"    "Qty"    "Void"
    "1"    "1"    "10"    "2"    "1"    "1000"    "-10"    "-1"    "False"
    "5"    "1"    "10"    "6"    "1"    "1000"    "-10"    "-1"    "False"
    "13"    "1"    "25"    "14"    "4"    "1001"    "-25"    "-1"    "False"

    The PK_IDMatch could be (1,3,4,5) in group 1 just as long as none are the same.  So if the first record chose 5, then the next match would have to be (1,3, or 4)
    The other negative amounts in my sample data does not contain matches in their own group, so we will not worry about them.  In the case of the last row, two of them have the same match, but the one would just not match anything because the first one found that match.

    My ultimate goal is to find these one to one matches and update the void column to 1 showing that they void each other out.  So this would an example of what my final data would look like (see the void column):

    "PK_ID"    "groupid"    "charge_code"    "Amount"    "Qty"    "Void"
    "1"    "1"    "1000"    "10"    "1"    "True"
    "2"    "1"    "1000"    "-10"    "-1"    "True"
    "3"    "1"    "1000"    "10"    "1"    "False"
    "4"    "1"    "1000"    "10"    "1"    "False"
    "5"    "1"    "1000"    "10"    "1"    "True"
    "6"    "1"    "1000"    "-10"    "-1"    "True"
    "7"    "1"    "1000"    "-20"    "-2"    "True"
    "8"    "2"    "1001"    "50"    "2"    "False"
    "9"    "2"    "1000"    "-10"    "-1"    "True"
    "10"    "2"    "1000"    "20"    "2"    "False"
    "11"    "2"    "1001"    "-25"    "-1"    "True"
    "12"    "3"    "1001"    "25"    "1"    "False"
    "13"    "4"    "1001"    "25"    "1"    "True"
    "14"    "4"    "1001"    "-25"    "-1"    "True"
    "15"    "4"    "1001"    "-25"    "-1"    "True"

    I have some code working by looping over the data with code and doing the work but I need to process this one a lot of data and it is slow and I am wondering if their is a way to do it in SQL so I can then just do a UPDATE SET void=1 SELECT statement.

    Thanks for any suggestions to help speed this up or give me new ideas on how to accomplish something like this.

  • Well, what if in your supplied data, for group id = 1, the -20 is actually a void of the two most recent +10 values?   That's typically why void transactions usually need an identifier associated with the entire void transaction, that references the original transaction or transactions.  If you don't have that kind of information in your database, this is going to be a tad random, and sufficiently arbitrary that if customer statements are affected, then customers might well get confused.   Just wanted to be sure you give that some thought...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • This gives results similar to (but not exactly the same as) your expected results.  It's not clear why you expect the second one to match on PK_ID 5 rather than PK_ID 3.

    ;
    WITH tc AS
    (
        SELECT *, ROW_NUMBER() OVER(PARTITION BY tc.groupid, tc.charge_code, tc.Qty, tc.Amount ORDER BY tc.PK_ID) AS rn, SIGN(tc.Amount) AS sgn
        FROM #tmp_charges tc
    )
    SELECT *
    FROM tc AS jtc
    INNER JOIN tc
        ON jtc.groupid = tc.groupid
            AND jtc.charge_code = tc.charge_code
            AND jtc.Qty = -tc.Qty
            AND jtc.Amount = -tc.Amount
            AND jtc.rn = tc.rn
    WHERE jtc.sgn = 1

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • sgmunson - Tuesday, March 7, 2017 11:12 AM

    Well, what if in your supplied data, for group id = 1, the -20 is actually a void of the two most recent +10 values?   That's typically why void transactions usually need an identifier associated with the entire void transaction, that references the original transaction or transactions.  If you don't have that kind of information in your database, this is going to be a tad random, and sufficiently arbitrary that if customer statements are affected, then customers might well get confused.   Just wanted to be sure you give that some thought...

    If you need to be able to take this into account, then you should read the following article.  Solving FIFO Queues Using Windowed Functions

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • sgmunson - Tuesday, March 7, 2017 11:12 AM

    Well, what if in your supplied data, for group id = 1, the -20 is actually a void of the two most recent +10 values?   That's typically why void transactions usually need an identifier associated with the entire void transaction, that references the original transaction or transactions.  If you don't have that kind of information in your database, this is going to be a tad random, and sufficiently arbitrary that if customer statements are affected, then customers might well get confused.   Just wanted to be sure you give that some thought...

    I have given this some thought and I don't see how I can possibly take care of those situations since sometimes you might have a +10, but a -5, well then I would have to add records or modify records.  In those rare cases, which do happen, then we just won't display it properly and they will have to figure it out for themselves.  But most cases are a one to one deletion and that is good enough.

    drew.allen - Tuesday, March 7, 2017 
    It's not clear why you expect the second one to match on PK_ID 5 rather than PK_ID 3

    Like I mentioned it doesn't matter as long as it is different than the first match.  I just took the PK_ID that was closest to it.  So your results are valid and look good so far.  Can you help explain to me what it is doing, because it looks awesome, but I don't quite understand how it is working right.

  • Hate to be a bad news delivery agent, but if these are financial records, and you're voiding them out, I have to wonder why?   What if you were subject to an audit?  Wouldn't you need to be able to transactionally trace the voiding of same?   You may have a bigger problem than just cancelling things out, which just doesn't seem like a good idea.   You might even run afoul of regulatory compliance (SOX ?)...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Tuesday, March 7, 2017 12:14 PM

    Hate to be a bad news delivery agent, but if these are financial records, and you're voiding them out, I have to wonder why?   What if you were subject to an audit?  Wouldn't you need to be able to transactionally trace the voiding of same?   You may have a bigger problem than just cancelling things out, which just doesn't seem like a good idea.   You might even run afoul of regulatory compliance (SOX ?)...

    Don't worry, it is for display and analytical purposes, the records still exist, thus the reason for the void column.  It is so I can easily select the records that are really charged and show information without the cancelled ones messing up the numbers, except for those rogue ones that don't match, but at least it's better.

  • douglas.t - Tuesday, March 7, 2017 12:25 PM

    sgmunson - Tuesday, March 7, 2017 12:14 PM

    Hate to be a bad news delivery agent, but if these are financial records, and you're voiding them out, I have to wonder why?   What if you were subject to an audit?  Wouldn't you need to be able to transactionally trace the voiding of same?   You may have a bigger problem than just cancelling things out, which just doesn't seem like a good idea.   You might even run afoul of regulatory compliance (SOX ?)...

    Don't worry, it is for display and analytical purposes, the records still exist, thus the reason for the void column.  It is so I can easily select the records that are really charged and show information without the cancelled ones messing up the numbers, except for those rogue ones that don't match, but at least it's better.

    From a practical perspective, that's kind of like putting lipstick on a pig... of course, I've been known to paint pig lips because no one was willing to invest in fixing it so that it gets done right...  so I'm with you on the "at least it's better" part.   In the long run though, it's a band-aid that just makes it that much harder to fix down the road because it's one more piece of stuff that has to get "scraped off" when someone finally decides they've had enough...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Tuesday, March 7, 2017 12:32 PM

    From a practical perspective, that's kind of like putting lipstick on a pig... of course, I've been known to paint pig lips because no one was willing to invest in fixing it so that it gets done right...  so I'm with you on the "at least it's better" part.   In the long run though, it's a band-aid that just makes it that much harder to fix down the road because it's one more piece of stuff that has to get "scraped off" when someone finally decides they've had enough...

    I don't completely agree, we can query the sum of the balance or qty and get accurate numbers, what I am using this for is to do some association analysis on the charge codes.  I want to select all the charge codes per transaction without bringing across the ones that shouldn't really be on the transaction.  So with that in mind, if we had a +10 and -5 then the charge code actually would be valid.  In the case of a +10 and +10 and we had a -20, then it wouldn't match and the charge codes may still come across.  It may change the report slightly but it doesn't matter on this.  This does not seem like a band-aid, this seems like a more accurate and faster way of doing it than we had before.

  • douglas.t - Tuesday, March 7, 2017 12:40 PM

    sgmunson - Tuesday, March 7, 2017 12:32 PM

    From a practical perspective, that's kind of like putting lipstick on a pig... of course, I've been known to paint pig lips because no one was willing to invest in fixing it so that it gets done right...  so I'm with you on the "at least it's better" part.   In the long run though, it's a band-aid that just makes it that much harder to fix down the road because it's one more piece of stuff that has to get "scraped off" when someone finally decides they've had enough...

    I don't completely agree, we can query the sum of the balance or qty and get accurate numbers, what I am using this for is to do some association analysis on the charge codes.  I want to select all the charge codes per transaction without bringing across the ones that shouldn't really be on the transaction.  So with that in mind, if we had a +10 and -5 then the charge code actually would be valid.  In the case of a +10 and +10 and we had a -20, then it wouldn't match and the charge codes may still come across.  It may change the report slightly but it doesn't matter on this.  This does not seem like a band-aid, this seems like a more accurate and faster way of doing it than we had before.

    I guess I didn't do a very good job of communicating my concern.   If these charges and subsequent voids are somehow not relevant, then not displaying them just hides the truth of exactly what took place from a transactional point of view.   If that data represents mistakes, that's one thing, but once a mistake is made, it's not that it's too late to fix it, it's just the how that matters.  The other thing you didn't mention is why you need to void these transactions.   After all, they ARE a part of what ACTUALLY took place.   If the only reason to void them is convenience, then I'd wonder if the business process surrounding data entry is where it needs to be from a quality perspective.   Of course, I may not have enough information yet about your "association analysis".   Seems to me that the term would suggest analysis of the number of charges followed by a subsequent negative charge would be useful to analyze.   Voiding these transactions might make them invisible to other processes aimed at getting a full history or other analysis.   In any case, I'd really want to be able to transactionally trace these updates, and have a void transaction identifier that is added to that table to represent the single transaction that associates a given charge with it's negative twin.   That way, you can always know which transactions were paired by this process.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Tuesday, March 7, 2017 1:04 PM

    douglas.t - Tuesday, March 7, 2017 12:40 PM

    sgmunson - Tuesday, March 7, 2017 12:32 PM

    From a practical perspective, that's kind of like putting lipstick on a pig... of course, I've been known to paint pig lips because no one was willing to invest in fixing it so that it gets done right...  so I'm with you on the "at least it's better" part.   In the long run though, it's a band-aid that just makes it that much harder to fix down the road because it's one more piece of stuff that has to get "scraped off" when someone finally decides they've had enough...

    I don't completely agree, we can query the sum of the balance or qty and get accurate numbers, what I am using this for is to do some association analysis on the charge codes.  I want to select all the charge codes per transaction without bringing across the ones that shouldn't really be on the transaction.  So with that in mind, if we had a +10 and -5 then the charge code actually would be valid.  In the case of a +10 and +10 and we had a -20, then it wouldn't match and the charge codes may still come across.  It may change the report slightly but it doesn't matter on this.  This does not seem like a band-aid, this seems like a more accurate and faster way of doing it than we had before.

    I guess I didn't do a very good job of communicating my concern.   If these charges and subsequent voids are somehow not relevant, then not displaying them just hides the truth of exactly what took place from a transactional point of view.   If that data represents mistakes, that's one thing, but once a mistake is made, it's not that it's too late to fix it, it's just the how that matters.  The other thing you didn't mention is why you need to void these transactions.   After all, they ARE a part of what ACTUALLY took place.   If the only reason to void them is convenience, then I'd wonder if the business process surrounding data entry is where it needs to be from a quality perspective.   Of course, I may not have enough information yet about your "association analysis".   Seems to me that the term would suggest analysis of the number of charges followed by a subsequent negative charge would be useful to analyze.   Voiding these transactions might make them invisible to other processes aimed at getting a full history or other analysis.   In any case, I'd really want to be able to transactionally trace these updates, and have a void transaction identifier that is added to that table to represent the single transaction that associates a given charge with it's negative twin.   That way, you can always know which transactions were paired by this process.

    Thank you for caring enough to point out potential problems and in case you do want to know what I am doing better I will try to explain: Let's say you purchase some batteries and they scan it twice, then they void one, or better yet, you change your mind and you don't buy the batteries.  The history shows it was scanned and removed and always will.  Association Analysis or Market Basket or Affinity, are methods that determine when someone bought a flashlight they also bought batteries.  If things were added by mistake and removed I don't want to know about those, I just want the associations that were made.  I don't have access to the actual financial tables, I just get a copy of the data to do my stuff and I wanted a better way of querying the data excluding those that really shouldn't be included.  Does that make sense?

  • douglas.t - Tuesday, March 7, 2017 1:56 PM

    sgmunson - Tuesday, March 7, 2017 1:04 PM

    douglas.t - Tuesday, March 7, 2017 12:40 PM

    sgmunson - Tuesday, March 7, 2017 12:32 PM

    From a practical perspective, that's kind of like putting lipstick on a pig... of course, I've been known to paint pig lips because no one was willing to invest in fixing it so that it gets done right...  so I'm with you on the "at least it's better" part.   In the long run though, it's a band-aid that just makes it that much harder to fix down the road because it's one more piece of stuff that has to get "scraped off" when someone finally decides they've had enough...

    I don't completely agree, we can query the sum of the balance or qty and get accurate numbers, what I am using this for is to do some association analysis on the charge codes.  I want to select all the charge codes per transaction without bringing across the ones that shouldn't really be on the transaction.  So with that in mind, if we had a +10 and -5 then the charge code actually would be valid.  In the case of a +10 and +10 and we had a -20, then it wouldn't match and the charge codes may still come across.  It may change the report slightly but it doesn't matter on this.  This does not seem like a band-aid, this seems like a more accurate and faster way of doing it than we had before.

    I guess I didn't do a very good job of communicating my concern.   If these charges and subsequent voids are somehow not relevant, then not displaying them just hides the truth of exactly what took place from a transactional point of view.   If that data represents mistakes, that's one thing, but once a mistake is made, it's not that it's too late to fix it, it's just the how that matters.  The other thing you didn't mention is why you need to void these transactions.   After all, they ARE a part of what ACTUALLY took place.   If the only reason to void them is convenience, then I'd wonder if the business process surrounding data entry is where it needs to be from a quality perspective.   Of course, I may not have enough information yet about your "association analysis".   Seems to me that the term would suggest analysis of the number of charges followed by a subsequent negative charge would be useful to analyze.   Voiding these transactions might make them invisible to other processes aimed at getting a full history or other analysis.   In any case, I'd really want to be able to transactionally trace these updates, and have a void transaction identifier that is added to that table to represent the single transaction that associates a given charge with it's negative twin.   That way, you can always know which transactions were paired by this process.

    Thank you for caring enough to point out potential problems and in case you do want to know what I am doing better I will try to explain: Let's say you purchase some batteries and they scan it twice, then they void one, or better yet, you change your mind and you don't buy the batteries.  The history shows it was scanned and removed and always will.  Association Analysis or Market Basket or Affinity, are methods that determine when someone bought a flashlight they also bought batteries.  If things were added by mistake and removed I don't want to know about those, I just want the associations that were made.  I don't have access to the actual financial tables, I just get a copy of the data to do my stuff and I wanted a better way of querying the data excluding those that really shouldn't be included.  Does that make sense?

    Yes, considerably more sense.  Just knowing it's a copy of the real data is a huge difference.  Wondering if it might be best to use an SSIS package to copy the data over, and just not keep those records in your copy to begin with?

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • douglas.t - Tuesday, March 7, 2017 10:56 AM

    INSERT INTO #tmp_charges (groupid, charge_code, amount, qty, void)
    VALUES (1, '1000', 10, 1, 0),
        (1, '1000', -10, -1, 0),
      (1, '1000', 10, 1, 0),
      (1, '1000', 10, 1, 0),
      (1, '1000', 10, 1, 0),
      (1, '1000', -10, -1, 0)

    Requirements need some clarification.

    Let's take a subset of data, like the above.
    How do you decide which of 4 "charges" have to be voided by those 2 "discharges"?
    How you can distinguish identical charge records from each other to choose which one of them is voided by any particular discharge?

    _____________
    Code for TallyGenerator

  • Sergiy - Tuesday, March 7, 2017 7:55 PM

    douglas.t - Tuesday, March 7, 2017 10:56 AM

    INSERT INTO #tmp_charges (groupid, charge_code, amount, qty, void)
    VALUES (1, '1000', 10, 1, 0),
        (1, '1000', -10, -1, 0),
      (1, '1000', 10, 1, 0),
      (1, '1000', 10, 1, 0),
      (1, '1000', 10, 1, 0),
      (1, '1000', -10, -1, 0)

    Requirements need some clarification.

    Let's take a subset of data, like the above.
    How do you decide which of 4 "charges" have to be voided by those 2 "discharges"?
    How you can distinguish identical charge records from each other to choose which one of them is voided by any particular discharge?

    The answer by drew.allen works great for me.  It doesn't matter which one it chooses, it just can't choose the same one for two of the same on the claim, so whatever matches the first negative then shouldn't be selected as the next match for the second negative.

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply