How to find out outlier between records [Rosolved]

  • Hello,

    I have a table that contains every month stats, the sample data is as below:

    CREATE TABLE dbo.Transactions(
    Date1 Varchar(5) Not NULL,
    Amount1 FLOAT,
    Amount2 FLOAT
    )

    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201801, 1000000, 2000000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201802, 980000 , 1980000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201803, 1500000, 18500000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201804, 1400000, 19400000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201805, 1350000, 19500000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201806, 1443000, 14430000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201807, 1380000, 13800000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201808, 1400000, 14000000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201809, 1490000, 14490000)

    I need to compare both Amount1 and Amount2 (in fact there are more stats columns in the real table)
    Using Amount1 as example, as you can see, 201803 is an outlier to 201802 because the delta is too much compare to 201802 vs 201801;

    We need only compare the previous month because this is run monthly basis; in 201803 we have 201801-201803 data, in 201808 we have 201801-201808 data, and the ourlier check is made to only the previous month, with that being said, 201804 is NOT outlier anymore, however, 201809 is.

    Threshold will be defined (e.g. 20% to Amount1 and 10% to Amount2), hard coded in the query is fine.

    Thank you very much.

  • I think 201809 is not an outlier as percentage deviation is less than 20% for amount1 and less than 10 % for amount2

    CREATE TABLE dbo.Transactions(
     Date1 Varchar(6) Not NULL,
     Amount1 FLOAT,
     Amount2 FLOAT 
    )

    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201801, 1000000, 2000000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201802, 980000 , 1980000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201803, 1500000, 18500000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201804, 1400000, 19400000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201805, 1350000, 19500000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201806, 1443000, 14430000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201807, 1380000, 13800000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201808, 1400000, 14000000)
    INSERT INTO Transactions (Date1, Amount1, Amount2) VALUES (201809, 1490000, 14490000)

    ; with cte as
    (
    Select
    Date1,
    Amount1 as cur_Amount1,
    lag(Amount1) over ( order by Date1 asc) as prv_Amount1,
    Amount2 as cur_Amount2,
    lag(Amount2) over ( order by Date1 asc) as prv_Amount2
    from Transactions
    )

    Select Date1
    ,CASE WHEN (cur_Amount1-prv_Amount1)*100/prv_Amount1 >=20  THEN 'Amount1 is an outlier' else 'Amount1 is not an outlier' END AS AMOUNT1_OUTLIER
    ,CASE WHEN (cur_Amount2-prv_Amount2)*100/prv_Amount2 >=10  THEN 'Amount1 is an outlier' else 'Amount1 is not an outlier' END AS AMOUNT2_OUTLIER

    from cte

  • Here's my attempt in one pass:
    CREATE TABLE #Transactions (
        Date1 date NOT NULL PRIMARY KEY CLUSTERED,
        Amount1 int,
        Amount2 int
    );
    INSERT INTO #Transactions (Date1, Amount1, Amount2)
        VALUES    ('20180101', 1000000, 2000000),
                ('20180201', 980000 , 1980000),
                ('20180301', 1500000, 18500000),
                ('20180401', 1400000, 19400000),
                ('20180501', 1350000, 19500000),
                ('20180601', 1443000, 14430000),
                ('20180701', 1380000, 13800000),
                ('20180801', 1400000, 14000000),
                ('20180901', 1490000, 14490000);

    SELECT
        CURR.Date1,
        CURR.Amount1,
        CURR.Amount2,
        CASE
            WHEN ABS(((CURR.Amount1 - LAG(CURR.Amount1, 1) OVER(ORDER BY CURR.Date1)) * 100)/LAG(CURR.Amount1, 1) OVER(ORDER BY CURR.Date1)) >= 20 THEN 1
            ELSE 0
        END AS Amount1_Is_OutlierFlag,
        CASE
            WHEN ABS(((CURR.Amount2 - LAG(CURR.Amount2, 1) OVER(ORDER BY CURR.Date1)) * 100)/LAG(CURR.Amount2, 1) OVER(ORDER BY CURR.Date1)) >= 10 THEN 1
            ELSE 0
        END AS Amount2_Is_OutlierFlag
    FROM #Transactions AS CURR

    DROP TABLE #Transactions;

    Note that the determination of what is an outlier does NOT agree with the original poster's claim of what is an outlier, and also identifies an outlier based on a change in amounts in EITHER direction instead of just for increases.  Not sure if that is the intent, so will need the original poster to speak up.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Thank you both very much for the replies, both work!

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply