Normalising data in semicolon-delimited format

  • I have some requirements to normalise data in a table for the purpose of reporting (I'm not allowed to modify the original table). At the moment, the databases are a mixture of SQL 7 and 2000 but will all be 2000 in next month or so, so a 2000-only solution would be fine.

    We have a variable number of surveys, each with a variable number of questions, each with a variable number of predefined multiple-choice answers. Unfortunately, the answers are stored in a semicolon-delimited field, in which case the first item is answer #1, second is answer #2, the nth is answer #n (the survey responses are stored with the answer number).

    /*

    If tally table doesn't exist, create it with minimum rows needed

    for this test

    */

    SELECT TOP 260 IDENTITY(INT, 1, 1) n

    INTO Tally

    FROM syscolumns sc1, syscolumns sc2

    SELECT * FROM [Tally]

    /*

    Table and sample data (these aren't actual survey answers - just made up

    for the purpose of this post!)

    */

    CREATE TABLE SurveyAnswer

    (

    SurveyCode INT,

    QuestionCode INT,

    Answers VARCHAR(255)

    )

    INSERT INTO [SurveyAnswer]

    SELECT 1, 1, '0 - 10; 11 - 20; 21 - 30; 30+'

    UNION ALL

    SELECT 1, 2, 'Yes; No; Unknown'

    UNION ALL

    SELECT 1, 3, 'Male; Female'

    UNION ALL

    SELECT 1, 4, 'Football; Rugby'

    UNION ALL

    SELECT 2, 1, 'Under 18; Over 18'

    UNION ALL

    SELECT 2, 2, 'Tea; Coffee'

    UNION ALL

    SELECT 2, 3, 'MS SQL; Oracle'

    UNION ALL

    SELECT 3, 1, 'Smoker; Non-smoker'

    UNION ALL

    SELECT 3, 2, 'Lager; Wine; Spirits; Other'

    UNION ALL

    SELECT 3, 3, 'Development; QA; Production'

    UNION ALL

    SELECT 3, 4, 'Simpsons; Family Guy; Futurama'

    UNION ALL

    SELECT 3, 5, 'Sunday; Monday; Tuesday; Wednesday; Thursday; Friday; Saturday'

    I want to split the lists into individual rows, and add the answer number value to each row. I can split the lists easily enough, but I find I am having to loop and use the IDENTITY function to populate the answer numbers.

    DECLARE @MaxSurveyCode INT

    SELECT @MaxSurveyCode = MAX(SurveyCode) FROM [SurveyAnswer]

    DECLARE @MaxQuestionCode INT

    SELECT @MaxQuestionCode = MAX(QuestionCode) FROM [SurveyAnswer]

    DECLARE @SurveyCode INT

    SET @SurveyCode = 1

    CREATE TABLE SurveyAnswer2

    (

    SurveyCode INT,

    QuestionCode INT,

    AnswerCode INT,

    AnswerString VARCHAR(255)

    )

    WHILE (@SurveyCode <= @MaxSurveyCode)
    BEGIN
    DECLARE @QuestionCode INT
    SET @QuestionCode = 1

    WHILE (@QuestionCode <= @MaxQuestionCode)
    BEGIN

    SELECT
    SurveyCode,
    QuestionCode,
    IDENTITY(INT, 1, 1) AnswerCode,
    LTRIM(RTRIM(SUBSTRING(';' + a.[Answers] + ';', n + 1, CHARINDEX(';', ';' + a.[Answers] + ';', n + 1) - n - 1))) AnswerString
    INTO
    #SurveyAnswerStaging
    FROM
    [SurveyAnswer] a,
    [Tally]
    WHERE
    LEN(';' + a.[Answers] + ';') > n

    AND SUBSTRING(';' + a.[Answers] + ';' , n, 1) = ';'

    AND [SurveyCode] = @SurveyCode

    AND [QuestionCode] = @QuestionCode

    INSERT INTO [SurveyAnswer2]

    SELECT * FROM [#SurveyAnswerStaging]

    DROP TABLE #SurveyAnswerStaging

    SET @QuestionCode = @QuestionCode + 1

    END

    SET @SurveyCode = @SurveyCode + 1

    END

    SELECT * FROM [SurveyAnswer2]

    DROP TABLE [SurveyAnswer]

    DROP TABLE [SurveyAnswer2]

    Can anyone give me any tips to improve this?

    Cheers,

    Simon 🙂

  • This returns the same results as your loop code:

    /*

    If tally table doesn't exist, create it with minimum rows needed

    for this test

    */

    SELECT TOP 260 IDENTITY(INT, 1, 1) n

    INTO Tally

    FROM syscolumns sc1, syscolumns sc2

    /*

    Table and sample data (these aren't actual survey answers - just made up

    for the purpose of this post!)

    */

    CREATE TABLE SurveyAnswer

    (

    SurveyCode INT,

    QuestionCode INT,

    Answers VARCHAR(255)

    )

    INSERT INTO [SurveyAnswer]

    SELECT 1, 1, '0 - 10; 11 - 20; 21 - 30; 30+'

    UNION ALL

    SELECT 1, 2, 'Yes; No; Unknown'

    UNION ALL

    SELECT 1, 3, 'Male; Female'

    UNION ALL

    SELECT 1, 4, 'Football; Rugby'

    UNION ALL

    SELECT 2, 1, 'Under 18; Over 18'

    UNION ALL

    SELECT 2, 2, 'Tea; Coffee'

    UNION ALL

    SELECT 2, 3, 'MS SQL; Oracle'

    UNION ALL

    SELECT 3, 1, 'Smoker; Non-smoker'

    UNION ALL

    SELECT 3, 2, 'Lager; Wine; Spirits; Other'

    UNION ALL

    SELECT 3, 3, 'Development; QA; Production'

    UNION ALL

    SELECT 3, 4, 'Simpsons; Family Guy; Futurama'

    UNION ALL

    SELECT 3, 5, 'Sunday; Monday; Tuesday; Wednesday; Thursday; Friday; Saturday'

    DECLARE @MaxSurveyCode INT

    SELECT @MaxSurveyCode = MAX(SurveyCode) FROM [SurveyAnswer]

    DECLARE @MaxQuestionCode INT

    SELECT @MaxQuestionCode = MAX(QuestionCode) FROM [SurveyAnswer]

    DECLARE @SurveyCode INT

    SET @SurveyCode = 1

    CREATE TABLE SurveyAnswer2

    (

    SurveyCode INT,

    QuestionCode INT,

    AnswerCode INT,

    AnswerString VARCHAR(255)

    )

    INSERT INTO [SurveyAnswer2]

    SELECT

    SurveyCode,

    QuestionCode,

    n,--IDENTITY(INT, 1, 1) AnswerCode,

    LTRIM(RTRIM(SUBSTRING(';' + a.[Answers] + ';', n + 1, CHARINDEX(';', ';' + a.[Answers] + ';', n + 1) - n - 1))) AnswerString

    /* INTO

    #SurveyAnswerStaging

    */FROM

    [SurveyAnswer] a,

    [Tally] T

    WHERE

    LEN(';' + a.[Answers] + ';') > n

    AND SUBSTRING(';' + a.[Answers] + ';' , n, 1) = ';'

    ORDER BY QuestionCode

    SELECT * FROM [SurveyAnswer2] ORDER BY SurveyCode, QuestionCode

    DROP TABLE [SurveyAnswer]

    DROP TABLE [SurveyAnswer2]

    DROP TABLE [tally]

  • Hi Jack,

    I'm getting different result sets when I run your query and my query.

    Just to clarify, for each question of a survey, I want to split the answers out into separate rows, numbered by their position in the list (not their position in the string). So the answer list '0 - 10; 11 - 20; 21 - 30; 30+' would become:

    AnswerCode AnswerString

    ----------- -------------

    1 0 - 10

    2 11 - 20

    3 21 - 30

    4 30+

    With your query, I am getting (I think) the position in the string of each ';'.

    Cheers,

    Simon

  • Oops, you are correct, I only verified the rows returned and the 1st 2 columns assuming I had the others correct. Boy, I wish 2000 had row_number().

  • Using the test data provided, the following does the trick on the split and the answer number. You can take it from there...

    --===== Declare a variable to hold the delimiter

    DECLARE @Delim CHAR(1)

    SET @Delim = ';'

     

    --===== Do the split with a count

    SELECT SurveyCode,

    QuestionCode,

    AnswerNum = t.N-LEN(REPLACE(LEFT(@Delim+sa.Answers,t.N), @Delim, '')),

    Answer = LTRIM(RTRIM(SUBSTRING(@Delim+sa.Answers, t.N+1,

    CHARINDEX(@Delim, sa.Answers+@Delim, t.N)-t.N)))

    FROM dbo.Tally t

    RIGHT OUTER JOIN --Necessary in case Answers is NULL

    dbo.SurveyAnswer sa

    ON SUBSTRING(@Delim+sa.Answers, t.N, 1) = @Delim

    AND t.N < LEN(@Delim+sa.Answers)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thankyou Jeff, I was having real trouble trying to think of a way to work the question number off the number of preceding semi-colons in the string - hadn't thought about essentially removing them and comparing the length of before and after strings. That's yet another tip of yours I've now made note of to solve similar problems in future. 🙂

  • Thanks, Simon. I can't take the credit for the original idea, though.

    As a side-bar, you did make my day. So many people take a bit of code and run, never to be seen again. I not only appreciate the fact that you came back but I'm tickled that you took the time to figure out what the code is actually doing and to share your understanding of it. Very well done, Sir. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply