count occurrences/frequency in a multiple value column

  • Hi every one,

    Wondering how to extract the number of occurrence in a multiple value column. See example below:

    Table User_Language

    User      |  Language

    Person 1   English, Spanish, French

    Person 2  Advanced English, Intermidiate Spanish

    Person 3  English, Intermidiate English

    Person 4  Japanese, Fluent English

    Person 5  Advanced English, Spanish

    Person 6 Advanced English

    Expected Result

    Languange           |      Frequency

    Advanced English      3

    English                          2

    Spanish                          2

    French                            1

    Intermidiate Spanish  1

    Intermidiate English    1

    Fluent English              1

    Japanese                        1

    For instance, have tried CASE and IFs, not being able to achieve the result above, since the case statement will count only once per row.

    Need SQL clause to retrieve the info from the table User_Language

  • You have to split the values first, so you have a normalized structure so you can do the aggregation.  I used Jeff Moden's DelimitedSplit8K function

    SELECT Lang2, COUNT(*) AS Freq
    FROM
    (SELECT SpeakerID
    , Lang2 = TRIM(ds.Item)
    FROM
    (SELECT SpeakerID
    , Lang1 = REPLACE(REPLACE(REPLACE(Lang,'Advanced ',''),'Intermidiate',''),'Fluent','')
    FROM #Langs) la
    CROSS APPLY Testdb.dbo.DelimitedSplit8K(Lang1,',') ds) lst
    GROUP BY lst.Lang2
  • If you have a short list of languages and don't mind a bunch of SUM and CASE statements, you could get by using those

    Such as:

    DECLARE @tmpTbl TABLE
    (
    [USERname] VARCHAR(255)
    , [LANG] VARCHAR(255)
    );
    INSERT INTO @tmpTbl
    VALUES
    (
    'p1'
    , 'english,spanish,french'
    )
    , (
    'p2'
    , 'advanced english, intermediate spanish'
    )
    , (
    'p3'
    , 'english,advanced english'
    );

    SELECT
    [Languages].[Lang]
    , [Languages].[LangCount]
    FROM
    (
    SELECT(SUM(CASE
    WHEN [LANG] LIKE '%english%'
    THEN 1
    ELSE 0
    END
    ) - SUM( CASE
    WHEN [LANG] LIKE '%Advanced English%'
    THEN 1
    ELSE 0
    END
    ) - SUM( CASE
    WHEN [LANG] LIKE '%Intermediate English%'
    THEN 1
    ELSE 0
    END
    )
    ) AS [English]
    , SUM( CASE
    WHEN [LANG] LIKE '%Advanced English%'
    THEN 1
    ELSE 0
    END
    ) AS [Advanced English]
    FROM@tmpTbl
    ) AS [RawData]
    UNPIVOT
    (
    [LangCount]
    FOR [Lang] IN (
    [English]
    , [Advanced English]
    )
    ) AS [Languages];

    You would need to expand on that to cover more languages, but gives you a starting point where you don't need to use a string splitter.  A HUGE advantage of the string splitter method presented by pietliden is that it is dynamic whereas my approach is very manual.  What I mean is apparent when you look at the result based on the input data.  In pietliden's solution, you would get all of the languages in the original source list.  With my approach you only get the ones you are explicitly looking for.  With the sample set I did, there are 5 distinct languages, but my result set only shows 2 and you would need to add more SUM and CASE statements to handle them.

    My intent was not to provide a better solution (mine requires more hand-holding as the language list grows), just to provide a different approach.

    Mind you, my preference is to not store the data that way to begin with.  My opinion, it makes more sense to have a languages table and a persons table and then a table that matches the 2 up.  The way it is now, if you decided that you wanted to add a new languages to a person, you need to check the length of the language list already in place (not sure on the data type you have there), and you may get poor performance on inserts and updates as you may cause page splits.  On top of that, the data is not normalized.  Lastly, it is not very search friendly data.  If you wanted a list of all users who have "English" as a language (for example), you are now needing to do your WHERE statement with something like:

    WHERE [Language] LIKE 'English%'
    OR [Language] LIKE '%, English%'
    OR [Language] LIKE '%,English%

    Now, had you normalized it and had one table for users, one table for languages and one mapping table, you could do something like:

    SELECT User.Name
    FROM User
    Join User_Language ON User.ID = User_Language.UserID
    JOIN Language ON User_Langugae.LanguageID = Language.ID
    WHERE Language.Name = 'English'

    And for your original query, you could end up with something like:

    SELECT Language.Name, COUNT(User_Langugae.LanguageID)
    FROM Language
    JOIN User_Language on Language.ID = User_Language.LanguageID

    With the "User_Language" table you also get the advantage that it is trivial to remove data duplication.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • As noted by Brian, your table design is completely wrong.  You've then had to corrupt the data to fit it into the bad table structure.

    For example, the language is "English".  There is no language named "Advanced English" or "Intermediate English".  Since you need to keep info about the combination of a User and Language, then data modeling rules require that you have a separate table.

    With the proper table structure, you would have something like this:

    CREATE TABLE dbo.User_Languages (

    user_id int NOT NULL ... FOREIGN KEY dbo.users ( user_id )...,

    language_id smallint NOT NULL ... FOREIGN KEY dbo.languages ( language_id )...,

    language_proficiency_id tinyint NOT NULL ... FOREIGN KEY dbo.language_proficiencies ( language_proficiency_id )...,

    /* where, for example, 0=minimal; 1=written, but not conversational; 50=intermediate; 100=advanced; ...; 200=fluent; 240=native speaker */

    ...

    )

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • My answer for MS SQL 2016-2019:

     

    create table User_Language
    (
    PersonName nvarchar(50) not null
    , PersonLang nvarchar(150) not null
    )
    go
    insert into dbo.User_Language ( PersonName, PersonLang )
    values
    (N'Person 1', N'English, Spanish, French')
    , (N'Person 2 ', N'Advanced English, Intermidiate Spanish')
    , (N'Person 3', N'English, Intermidiate English')
    , (N'Person 4', N'Japanese, Fluent English')
    , (N'Person 5', N'Advanced English, Spanish')
    , (N'Person 6', N'Advanced English')
    go
    ; with cte as
    (
    select ul.PersonName, ss.Lang
    from dbo.User_Language as ul
    cross apply
    (
    select ltrim(rtrim(value ))as Lang
    from string_split(ul.PersonLang, N',')
    ) as ss
    )
    select cte.Lang, count(cte.Lang)as CntLang
    from cte
    group by cte.Lang

    • This reply was modified 4 years, 11 months ago by damir_60.
  • Hi guys, Thank you for all the solutions presented and very good explanations.

    Totally agree with you.

    I doing a freelance job and I need to report using such Bad data Structure, in other words I don´t have control,neither my client, since it is a proprietary solution.

    As mentioned by Mr. Brian Gale pietliden's solution is a very nice approach, but I´ll  need to check if it works on IBM Cognos, which is pointed to SQL Server and uses database native functions. If it not works Mr. Brian Gale solution might come in handy, since I can User COGNOS native SQL or T-SQL to retreive the data.

    The good thing about this post it that I can give my client a more assertive argument for its data Structure and the workarounds we need to implement to resolve the issue.

    Again, thank you very much!

    Best Regards,

    Roger NC

     

     

     

     

     

     

  • You can use the built-in STRING_SPLIT function:

     --==== Sample data
    Declare @testTable Table (UserName varchar(20), UserLanguage varchar(255))
    Insert Into @testTable (UserName, UserLanguage)
    Values ('Person 1', 'English,Spanish,French')
    , ('Person 2', 'Advanced English,Intermediate Spanish')
    , ('Person 3', 'English,Intermediate English')
    , ('Person 4', 'Japanese,Fluent English')
    , ('Person 5', 'Advanced English,Spanish')
    , ('Person 6', 'Advanced English');

    --==== Solution, note: Intermediate vs Intermidiate
    Select UserLanguage = ltrim(rtrim(ss.value)) -- trim(ss.value) --SQL Server 2017 or higher
    , Frequency = count(*)
    From @testTable tt
    Cross Apply string_split(tt.UserLanguage, ',') ss
    Group By
    ss.value
    Order By
    Frequency desc;

     

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply