Getting most common value from several fields into one record

  • Here's the example that I'm working with:

    CREATE TABLE #TestTable

    (

    ID INT,

    Field1 VARCHAR(MAX),

    Field2 VARCHAR(MAX),

    Field3 VARCHAR(MAX),

    Field4 VARCHAR(MAX),

    Field5 VARCHAR(MAX)

    )

    INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)

    VALUES(1, 'a', 'b', 'c', 'd', 'l')

    INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)

    VALUES(1, 'a', 'b', 'e', 'f', 'g')

    INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)

    VALUES(1, 'b', 'b', 'c', 'g', 'l')

    INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)

    VALUES(1, 'c', 'b', 'c', 'f', 'e')

    INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)

    VALUES(2, 'd', 'e', 'c', 'i', 'r')

    INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)

    VALUES(2, 'd', 'f', 'd', 'j', 'q')

    INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)

    VALUES(2, 'i', 'a', 'g', 'j', 'q')

    INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)

    VALUES(2, 'e', 'c', 'a', 'j', 'e')

    INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)

    VALUES(2, 'f', 'e', 'f', 'j', 'q')

    SELECT * FROM #TestTable

    CREATE TABLE #ResultTable

    (

    ID INT,

    Field1 VARCHAR(MAX),

    Field2 VARCHAR(MAX),

    Field3 VARCHAR(MAX),

    Field4 VARCHAR(MAX),

    Field5 VARCHAR(MAX)

    )

    INSERT INTO #ResultTable

    SELECT DISTINCT ID, '', '', '', '', ''

    FROM #TestTable

    SELECT * FROM #ResultTable

    DROP TABLE #TestTable

    DROP TABLE #ResultTable

    What I'm trying to accomplish is this. Basically, let's say I have several columns in a table which is identified with a non-unique identifier. What I want to do is get the values from each column which are the most commonly occurring values for each identifier.

    So, from the sample data which I provided above, what the result should be in #ResultTable, is:

    ID Field1 Field2 Field3 Field4 Field5

    1 a b c f l

    2 d e ! j q

    The ! represents an "unknown". For the purposes of this example, we can just assume any value will suffice.

    I can't really come up with an efficient way of doing this. The best I can think of, is to create a CTE for each field, and do something like:

    WITH

    cte1 AS

    (

    SELECT ID, Field1, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum

    FROM #TestTable

    GROUP BY ID, Field1

    ),

    cte2 AS

    (

    SELECT ID, Field2, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum

    FROM #TestTable

    GROUP BY ID, Field2

    ),

    cte3 AS

    (

    SELECT ID, Field3, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum

    FROM #TestTable

    GROUP BY ID, Field3

    ),

    cte4 AS

    (

    SELECT ID, Field4, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum

    FROM #TestTable

    GROUP BY ID, Field4

    ),

    cte5 AS

    (

    SELECT ID, Field5, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum

    FROM #TestTable

    GROUP BY ID, Field5

    )

    UPDATE r

    SET r.Field1 = c1.Field1,

    r.Field2 = c2.Field2,

    r.Field3 = c3.Field3,

    r.Field4 = c4.Field4,

    r.Field5 = c5.Field5

    FROM #ResultTable r

    JOIN cte1 c1 ON c1.ID = r.ID AND c1.rowNum = 1

    JOIN cte2 c2 ON c2.ID = r.ID AND c2.rowNum = 1

    JOIN cte3 c3 ON c3.ID = r.ID AND c3.rowNum = 1

    JOIN cte4 c4 ON c4.ID = r.ID AND c4.rowNum = 1

    JOIN cte5 c5 ON c5.ID = r.ID AND c5.rowNum = 1

    SELECT * FROM #ResultTable

  • I would create aggregate CLR function for this in C#.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • A CLR would definitely work, but I don't think it's necessary here.

    I used a table variable instead of temp table for my own reasons but you can do whatever on your system, of course.

    declare @t_temp table

    (

    ID INT,

    Field1 VARCHAR(MAX),

    Field2 VARCHAR(MAX),

    Field3 VARCHAR(MAX),

    Field4 VARCHAR(MAX),

    Field5 VARCHAR(MAX)

    )

    INSERT INTO @t_temp(ID, Field1, Field2, Field3, Field4, Field5)

    select 1, 'a', 'b', 'c', 'd', 'l'

    unionselect 1, 'a', 'b', 'e', 'f', 'g'

    unionselect 1, 'b', 'b', 'c', 'g', 'l'

    unionselect 1, 'c', 'b', 'c', 'f', 'e'

    unionselect 2, 'd', 'e', 'c', 'i', 'r'

    unionselect 2, 'd', 'f', 'd', 'j', 'q'

    unionselect 2, 'i', 'a', 'g', 'j', 'q'

    unionselect 2, 'e', 'c', 'a', 'j', 'e'

    unionselect 2, 'f', 'e', 'f', 'j', 'q';

    This is one way to do what you're trying to accomplish:

    with cteTemp (ID, Field, FieldVal, FieldCount)

    as

    (

    select ID,

    1 as Field,

    Field1 as FieldVal,

    COUNT(Field1) as FieldCount

    from @t_temp

    group by ID, Field1

    union

    select ID,

    2,

    Field2,

    COUNT(Field2)

    from @t_temp

    group by ID, Field2

    union

    select ID,

    3,

    Field3,

    COUNT(Field3)

    from @t_temp

    group by ID, Field3

    union

    select ID,

    4,

    Field4,

    COUNT(Field4)

    from @t_temp

    group by ID, Field4

    union

    select ID,

    5,

    Field5,

    COUNT(Field5)

    from @t_temp

    group by ID, Field5

    )

    select ID,

    isnull(max([1]), '!') as Field1,

    isnull(max([2]), '!') as Field2,

    isnull(max([3]), '!') as Field3,

    isnull(max([4]), '!') as Field4,

    isnull(max([5]), '!') as Field5

    from

    (

    select ID,

    Field,

    FieldVal,

    FieldCount

    from cteTemp c

    where not exists

    (

    select sq.FieldVal

    from cteTemp sq

    where sq.ID = c.ID

    and sq.Field = c.Field

    and sq.FieldVal <> c.FieldVal

    and sq.FieldCount >= c.FieldCount

    )

    ) as pvt

    pivot

    (

    MIN(FieldVal)

    FOR Field IN ([1],[2],[3],[4],[5])

    ) as PivotTable

    group by ID;

    There are certainly some improvements you could make, but hopefully you get the idea.

    └> bt



    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • *nod* this one is pretty similar to the one that I listed.

    Basically what I was hoping for was a solution which would not require me to run X repeat queries against the same set of data. I mean, thinking about it, I can't really come up with any logical way that you could do what I'm asking for without running repeated queries, but if someone else could, it would be awesome 😛

  • Not sure what you mean by repeated queries.

    └> bt



    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • Both our solutions are doing N number of queries against the same table, in order to get the most common occurring value for each column.

  • kramaswamy (5/20/2010)


    Both our solutions are doing N number of queries against the same table, in order to get the most common occurring value for each column.

    Maybe I'm not understanding what you're actually trying to do. If there are a static 5 columns this really doesn't matter. If you have a variable number of columns you may want to look at using dynamic sql. Independent of that, if it bothers you to have separate select statements you can use UNPIVOT instead. I don't think that it will make a performance difference however.

    └> bt



    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • elutin (5/20/2010)


    I would create aggregate CLR function for this in C#.

    Cool... lets see it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Details are in the comments folks. Heh... ya wanted to get rid of "repeating code", right?

    --===== Conditionally drop the test table. It makes reruns easier

    -- and leaves a persistent table for easier troubleshooting

    -- than a table variable.

    IF OBJECT_ID('TempDB..#TestTable','U') IS NOT NULL

    DROP TABLE #TestTable

    --===== Recreate the test table.

    CREATE TABLE #TestTable

    (

    ID INT,

    Field1 VARCHAR(MAX),

    Field2 VARCHAR(MAX),

    Field3 VARCHAR(MAX),

    Field4 VARCHAR(MAX),

    Field5 VARCHAR(MAX)

    )

    ;

    --===== Populate the test table... I added an extra scenario in ID 3

    INSERT INTO #TestTable

    (ID, Field1, Field2, Field3, Field4, Field5)

    SELECT 1, 'a', 'b', 'c', 'd', 'l' UNION ALL

    SELECT 1, 'a', 'b', 'e', 'f', 'g' UNION ALL

    SELECT 1, 'b', 'b', 'c', 'g', 'l' UNION ALL

    SELECT 1, 'c', 'b', 'c', 'f', 'e' UNION ALL

    SELECT 2, 'd', 'e', 'c', 'i', 'r' UNION ALL

    SELECT 2, 'd', 'f', 'd', 'j', 'q' UNION ALL

    SELECT 2, 'i', 'a', 'g', 'j', 'q' UNION ALL

    SELECT 2, 'e', 'c', 'a', 'j', 'e' UNION ALL

    SELECT 2, 'f', 'e', 'f', 'j', 'q' UNION ALL

    SELECT 3, 'd', 'e', 'c', 'i', 'r' UNION ALL

    SELECT 3, 'd', 'f', 'c', 'j', 'q' UNION ALL

    SELECT 3, 'i', 'a', 'g', 'j', 'q' UNION ALL

    SELECT 3, 'e', 'c', 'g', 'j', 'e' UNION ALL

    SELECT 3, 'f', 'e', 'f', 'j', 'q'

    ;

    --===== Solve the problem. This gets rid of 119 logical reads and 8 scans

    -- which are caused by a "Triangular Join" in the other methods.

    WITH

    cteUnpivotAndCount AS

    ( --=== Unpivot the data and count the occurances at the same time.

    SELECT unpvt.ID, unpvt.Field, unpvt.Value, COUNT(*) AS ValueCount

    FROM (SELECT ID, Field1, Field2, Field3, Field4, Field5 FROM #TestTable) src

    UNPIVOT (Value FOR Field IN (Field1, Field2, Field3, Field4, Field5))AS unpvt

    GROUP BY unpvt.ID, unpvt.Field, unpvt.Value

    )

    ,

    cteTotalItems AS

    ( --=== Count the number of rows for each ID

    SELECT ID, COUNT(*) AS TotalCount

    FROM #TestTable

    GROUP BY ID

    )

    ,

    cteRank AS

    ( --=== Create "opposing" rank columns

    SELECT uac.ID, uac.Field, uac.Value, uac.ValueCount,

    ROW_NUMBER() OVER (PARTITION BY uac.ID, uac.Field ORDER BY uac.ValueCount ASC) AS RevRank,

    ROW_NUMBER() OVER (PARTITION BY uac.ID, uac.Field ORDER BY uac.ValueCount DESC) AS FwdRank,

    ti.TotalCount

    FROM cteUnpivotAndCount uac

    INNER JOIN cteTotalItems ti

    ON uac.ID = ti.ID

    ) --=== We have enough information for the criteria to work now. A Cross Tab like this is a little faster than PIVOT.

    SELECT ID,

    MAX(CASE WHEN Field = 'Field1' THEN Value ELSE '!' END) AS Field1,

    MAX(CASE WHEN Field = 'Field2' THEN Value ELSE '!' END) AS Field2,

    MAX(CASE WHEN Field = 'Field3' THEN Value ELSE '!' END) AS Field3,

    MAX(CASE WHEN Field = 'Field4' THEN Value ELSE '!' END) AS Field4,

    MAX(CASE WHEN Field = 'Field5' THEN Value ELSE '!' END) AS Field5

    FROM cteRank

    WHERE (FwdRank = 1 AND ValueCount + RevRank + FwdRank -2 >= TotalCount)

    GROUP BY ID

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • And what I wanted was:

    select ID

    ,dbo.MostCommonValue(Field1)

    ,dbo.MostCommonValue(Field2)

    ,dbo.MostCommonValue(Field3)

    ,dbo.MostCommonValue(Field4)

    ,dbo.MostCommonValue(Field5)

    from #t_temp

    group by ID

    Easy hey?

    Now, only what is left is just to write CLR.

    OK, it may be not the best implementation, but...

    Here we are:

    using System;

    using System.Collections.Generic;

    using System.Collections;

    using System.Text;

    using Microsoft.SqlServer.Server;

    using System.Data.SqlTypes;

    using System.IO;

    namespace StrAggr

    {

    [Serializable]

    [SqlUserDefinedAggregate(

    Format.UserDefined, //use custom serialization to serialize the intermediate result

    IsInvariantToNulls = true, //optimizer property

    IsInvariantToDuplicates = false, //optimizer property

    IsInvariantToOrder = false, //optimizer property

    MaxByteSize = 8000) //maximum size in bytes of persisted value

    ]

    public class MostCommonValue : IBinarySerialize

    {

    private Dictionary<string, int> vals;

    public void Init()

    {

    this.vals = new Dictionary<string, int>();

    }

    // If the next value is not null, and not in dictionary already - add it, if it's in dictionary - increase count

    public void Accumulate(SqlString value)

    {

    if (value.IsNull)

    {

    return;

    }

    if (this.vals.ContainsKey(value.Value))

    {

    this.vals[value.Value]++;

    }

    else

    {

    this.vals.Add(value.Value, 1);

    }

    }

    //Merges the partial aggregate with this aggregate

    public void Merge(MostCommonValue part)

    {

    foreach (KeyValuePair<string, int> item in part.vals)

    {

    if (this.vals.ContainsKey(item.Key))

    {

    this.vals[item.Key] = this.vals[item.Key] + item.Value;

    }

    else

    {

    this.vals.Add(item.Key, item.Value);

    }

    }

    }

    //Returns the result of the aggregation when finished

    public SqlString Terminate()

    {

    //there are other ways of finding maximum by using List and Sort,

    //but SQL Server doesn't like using delegates...

    int maxOcc = 0;

    string res = "!";

    if (this.vals != null)

    {

    foreach (KeyValuePair<string, int> item in this.vals)

    {

    if (maxOcc < item.Value)

    {

    res = item.Key;

    maxOcc = item.Value;

    }

    else if (maxOcc == item.Value)

    {

    res = "!";

    }

    }

    }

    return new SqlString(res);

    }

    public void Write(BinaryWriter writer)

    {

    //would be really helpfull if Dictionary was serializable....

    //we need to simulate serialization here.

    //probably, better implemenation required (I hate to choose separators...)

    StringBuilder sb = new StringBuilder();

    foreach (KeyValuePair<string, int> item in this.vals)

    {

    sb.Append(item.Key);

    sb.Append("{:}");

    sb.Append(item.Value.ToString());

    sb.Append("{-}");

    }

    writer.Write(sb.ToString());

    }

    public void Read(BinaryReader reader)

    {

    this.vals = new Dictionary<string, int>();

    string sVals = reader.ReadString();

    string [] valKP = sVals.Split(new string[]{"{-}"},StringSplitOptions.None);

    foreach (string kp in valKP)

    {

    string[] valpair = kp.Split(new string[] { "{:}" }, StringSplitOptions.None);

    if (valpair.Length==2) this.vals.Add(valpair[0], Int32.Parse(valpair[1]));

    }

    }

    }

    }

    Compile it into library and registry assemply:

    -- just in case if it's disabled

    EXEC sp_configure 'clr enabled', 1;

    GO

    RECONFIGURE WITH OVERRIDE;

    GO

    --DROP AGGREGATE MostCommonValue

    --DROP ASSEMBLY StrAggr

    GO

    CREATE ASSEMBLY StrAggr FROM 'C:\_temp\SQLDLL\straggr.dll'

    GO

    CREATE AGGREGATE MostCommonValue (@input nvarchar(4000)) RETURNS nvarchar(4000)

    EXTERNAL NAME StrAggr.[StrAggr.MostCommonValue]

    GO

    Now you can try my first query...

    Cheers,

    Me

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • I have an idea for Read and Write method which will not use any separators to simulate serialization of Dictionary. I will do it on Monday.

    Is anyone know the best way to serialize Dictionary object in C#? I don't need support fo generic Dictionary. It will be ok just for string & int KeyValuePairs.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • I like your solution a lot Jeff, definitely avoids the necessity of adding a new CTE for every column you have. Only thing that needs to be added now is another set of columns in the CTE and in the SELECT query.

    I've modified it slightly though. I noticed you had invested a decent amount of code into handling scenarios where there is not one distinct value which is the highest in count. This scenario is not actually important to me - I'm fine with choosing any one of them. So, here's my modification of your code:

    --===== Conditionally drop the test table. It makes reruns easier

    -- and leaves a persistent table for easier troubleshooting

    -- than a table variable.

    IF OBJECT_ID('TempDB..#TestTable','U') IS NOT NULL

    DROP TABLE #TestTable

    --===== Recreate the test table.

    CREATE TABLE #TestTable

    (

    ID INT,

    Field1 VARCHAR(MAX),

    Field2 VARCHAR(MAX),

    Field3 VARCHAR(MAX),

    Field4 VARCHAR(MAX),

    Field5 VARCHAR(MAX)

    )

    ;

    --===== Populate the test table... I added an extra scenario in ID 3

    INSERT INTO #TestTable (ID, Field1, Field2, Field3, Field4, Field5)

    SELECT 1, 'a', 'b', 'c', 'd', 'l' UNION ALL

    SELECT 1, 'a', 'b', 'e', 'f', 'g' UNION ALL

    SELECT 1, 'b', 'b', 'c', 'g', 'l' UNION ALL

    SELECT 1, 'c', 'b', 'c', 'f', 'e' UNION ALL

    SELECT 2, 'd', 'e', 'c', 'i', 'r' UNION ALL

    SELECT 2, 'd', 'f', 'd', 'j', 'q' UNION ALL

    SELECT 2, 'i', 'a', 'g', 'j', 'q' UNION ALL

    SELECT 2, 'e', 'c', 'a', 'j', 'e' UNION ALL

    SELECT 2, 'f', 'e', 'f', 'j', 'q' UNION ALL

    SELECT 3, 'd', 'e', 'c', 'i', 'r' UNION ALL

    SELECT 3, 'd', 'f', 'c', 'j', 'q' UNION ALL

    SELECT 3, 'i', 'a', 'g', 'j', 'q' UNION ALL

    SELECT 3, 'e', 'c', 'g', 'j', 'e' UNION ALL

    SELECT 3, 'f', 'e', 'f', 'j', 'q'

    ;

    --===== Solve the problem. This gets rid of 119 logical reads and 8 scans

    -- which are caused by a "Triangular Join" in the other methods.

    WITH cteUnpivotAndCount AS

    ( --=== Unpivot the data and count the occurances at the same time.

    SELECT

    unpvt.ID,

    unpvt.Field,

    unpvt.Value,

    COUNT(*) AS ValueCount,

    ROW_NUMBER() OVER (PARTITION BY unpvt.ID, unpvt.Field ORDER BY COUNT(*) DESC) AS FwdRank

    FROM (SELECT ID, Field1, Field2, Field3, Field4, Field5 FROM #TestTable) src

    UNPIVOT (Value FOR Field IN (Field1, Field2, Field3, Field4, Field5)) AS unpvt

    GROUP BY unpvt.ID, unpvt.Field, unpvt.Value

    )

    SELECT

    ID,

    MAX(CASE WHEN Field = 'Field1' THEN Value END) AS Field1,

    MAX(CASE WHEN Field = 'Field2' THEN Value END) AS Field2,

    MAX(CASE WHEN Field = 'Field3' THEN Value END) AS Field3,

    MAX(CASE WHEN Field = 'Field4' THEN Value END) AS Field4,

    MAX(CASE WHEN Field = 'Field5' THEN Value END) AS Field5

    FROM cteUnpivotAndCount

    WHERE FwdRank = 1

    GROUP BY ID

  • kramaswamy (5/25/2010)


    This scenario is not actually important to me - I'm fine with choosing any one of them.

    Heh... now that would have been nice to know up front. 😛 Thanks for the feedback and the code, kramaswamy. I appreciate the time and I'm glad I could help.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply