Splitting one wide column into serveral different ones

  • Ahoi,

    i have a request to deal with that is in my opinion total bullshit, but thats besides the point.

    The data source provides me with a variable layout that depends on the column length (changing the type or structure of this column is out of question). This leads to the following issue:

    • I have one wide column that is a single string
    • This string contains the content serveral different columns
    • The layout is totally messed up though due to the layout problem

     

    Here is example source code for what the problem looks like and what i want to achieve.

    • a, b and c and symbolic for the relevant strings (can by any string, seems like only c can contains | )
    • z and yy are string representatives of columns that are irrelevant and need to be cut off/ignored
    • i trimmed the results (--) for visuals but its not necessary, epsecially for the purpose of sharing ideas/problem
    • permutations of the problems are possible, like yyyy (insert 5) being also part of the last insert
    • String "b",column 2 has always the same structure: |333 --> 3 possibly different integers or a seperator | followed by an empty string by the length of 3 ("|   "),  both the int and empty string structure in combination with 3 empty strings or 3 interegers can not appear in any other column
    create table #strings(testrings nvarchar(256))
    insert into #strings
    values
    ('aaaaaaa |bbbb |ccccccccccc') --aaaaaaa,bbbb,ccccccccccc
    ,('|aaaaaaa |bbbb |ccccccccccc')--aaaaaaa,bbbb,ccccccccccc
    ,('z|aaaaaa |bbbb |ccccccccccc')--aaaaaa,bbbb,ccccccccccc
    ,('zzz|aaaaaa |bbbb |ccccccccccc')--aaaaaa,bbbb,ccccccccccc
    ,('yyyy |xxxx|aaaaaa |bbbb |ccccccccccc')--aaaaaa,bbbb,ccccccccccc
    ,('aaa |bbbbbbb |cc')--aaa,bbbbbbb,cc
    ,('aaa |bbbbbbb |ccc ccccc')--aaa,bbbbbbb,ccc ccccc
    ,('aaa |bbbbbbb |ccc|ccccc')--aaa,bbbbbbb,ccc|ccccc
    ,('aaa |bbbbbbb |ccc|ccccc')--aaa,bbbbbbb,ccc|ccccc
    ,('aaa |bbbbbbb |ccc|ccc|cc')--aaa,bbbbbbb,ccc|ccc|cc
    ,('aaa |bbbbbbb |ccc|ccc |cc')--aaa,bbbbbbb,ccc|ccc |cc
    ,('aaa |bbbbbbb |ccc|ccc | cc')--aaa,bbbbbbb,ccc|ccc | cc

    I am a bit lost of how im supposed to deal with this.

    Here are some of my thoughts i had so far:

    • I was thinking of indexing the seperators (|) and use them for substring

      • but it turns out the strings themselves can contain | aka the seperator, which ruined my idea of starting from the back of the string with charindex
      • not knowing how many and if there are a previous columns (yyyy and or yyyy+zz) that needs to be ignored, ruined my idea of starting from the front of the string with charindex

    • Jumping straight to bbbb since it has the most detectable/linear/distinct structure of all the columns and going from there

    I am happy about anyone having any suggestion/ideas on how to deal with the situation.

    Cheers

     

  • If we were to provide the code which produced the desired results, but using your a,b,c,x,y,z data, it would work fine using this sample data, but probably not for your actual data (because it will probably reference x, y and z as literals). Are you happy with that, or perhaps you are able to provide data which is more realistic (but sanitised if necessary)?

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • That is a painful problem to solve.  The other problem that can arise is how many columns do you need to make to handle current state and future state?

    My opinion - I would try to fix this at the source (if possible) and the database at the same time.

    One option you could do, but would be painful to do, would be to do a string split (there are multiple articles on this site about them). I would avoid the 2019 string_split function because it appears that order is important.  But I would start by doing that and I'd toss it into a temp table with a bunch of columns so you can verify the data.  Since the column AFTER your "bbbb" one is the "last" column (it looks like), you can combine the results from the last column into 1 column with a | separating them.

    Now, what is going to make this exceptionally challenging is that your b's column won't always be at the same place.  What I mean is that using your above sample data, if you split on the | character, the b's will be in column 2, 3, or 4, and if your sample data is not a full sample of all possible scenarios, it could be in 1, 5, 6 and so on as well.  You can't reverse the string as the c's can have | in them too.

    I am thinking splitting the string and tossing it into a "wide enough" table would be a start, but after tossing it into the wide enough table, I think you will be going through row by row trying to find the b's column and then combining all columns after it into 1 column.  Likely going to be using a cursor for this which will make it incredibly slow.

    I would be curious to know how the software determines how to split the strings.  If you have access to that source of the application that is creating that string, you MAY be able to export the results in a nicely formatted format and then re-import from the export.  At least you would know the logic behind it in the application side for how it handles all of the above scenarios.

     

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Yeah situations like this happen all the time tho.  It seems there are unmodeled relationships in the data.  Also, right away I would try to find out if you're supposed to assign a static key to these non-modeled blobs.  Ideally not, and each update of the data would be as a full replacement.

    Regarding non modeled relationships, generally it makes sense imo to normalize and constrain the data before trying to query for the results.  It's tricky to query messed up data.  A little clean up helps.  Are the strings unique within the column?  I would assume not.  So step 1 would be to assign a unique primary key to each row.  What Mr Brian Gale wrote about ordinal splitters applies here imo. Regarding the columns the modern way here would be to use an ordinal splitter, which is a tvf which splits text according to a delimiter and assigns a left-to-right ordinal sequence number to each row result.  Like this one from Jeff Moden because it's NVARCHAR.

    drop FUNCTION if exists [dbo].[DelimitedSplitN4K];
    go
    CREATE FUNCTION [dbo].[DelimitedSplitN4K]
    --===== Define I/O parameters
    (@pString NVARCHAR(4000), @pDelimiter NCHAR(1))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
    -- enough to cover NVARCHAR(4000)
    WITH E1(N) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ), --10E+1 or 10 rows
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
    -- for both a performance gain and prevention of accidental "overruns"
    SELECT TOP (ISNULL(DATALENGTH(@pString)/2,0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    ),
    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
    SELECT 1 UNION ALL
    SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
    ),
    cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
    SELECT s.N1,
    ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,4000)
    FROM cteStart s
    )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
    Item = SUBSTRING(@pString, l.N1, l.L1)
    FROM cteLen l
    ;

    drop table if exists #strings;
    go
    create table #strings(
    id int identity(1, 1) not null,
    testrings nvarchar(256));

    insert into #strings(testrings) values
    ('aaaaaaa |bbbb |ccccccccccc')--aaaaaaa,bbbb,ccccccccccc
    ,('|aaaaaaa |bbbb |ccccccccccc')--aaaaaaa,bbbb,ccccccccccc
    ,('z|aaaaaa |bbbb |ccccccccccc')--aaaaaa,bbbb,ccccccccccc
    ,('zzz|aaaaaa |bbbb |ccccccccccc')--aaaaaa,bbbb,ccccccccccc
    ,('yyyy |xxxx|aaaaaa |bbbb |ccccccccccc')--aaaaaa,bbbb,ccccccccccc
    ,('aaa |bbbbbbb |cc')--aaa,bbbbbbb,cc
    ,('aaa |bbbbbbb |ccc ccccc')--aaa,bbbbbbb,ccc ccccc
    ,('aaa |bbbbbbb |ccc|ccccc')--aaa,bbbbbbb,ccc|ccccc
    ,('aaa |bbbbbbb |ccc|ccccc')--aaa,bbbbbbb,ccc|ccccc
    ,('aaa |bbbbbbb |ccc|ccc|cc')--aaa,bbbbbbb,ccc|ccc|cc
    ,('aaa |bbbbbbb |ccc|ccc |cc')--aaa,bbbbbbb,ccc|ccc |cc
    ,('aaa |bbbbbbb |ccc|ccc | cc');--aaa,bbbbbbb,ccc|ccc | cc

    select s.*, ds.*, len(ds.Item) item_len, len(trim(ds.Item)) item_len_trim,
    row_number() over (partition by id order by ItemNumber) row_num
    from #strings s
    cross apply dbo.DelimitedSplitN4K(s.testrings, '|') ds
    where ds.Item not like '%x%'
    and ds.Item not like '%y%'
    and ds.Item not like '%z%'
    and len(trim(Item))>0;

     

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • The link for the splitter I think is in this article.  For id=7 and ItemNumber=3 the (split) Item equals 'ccc ccccc' which is maybe space delimited and needs further splitting?  Without the WHERE clause the ds.ItemNumber and the row_num would always be equal

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • I see no way to accurately parse that data will all the variations presented (and any number even worse than that that could happen).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • How about this approach?

    Parse out the columns that do not contain an embedded delimiter.

    Use the REPLACE function to strip out the parsed columns.

    What's left, in this example, is the CCC column.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • The "solution" is the following:

    i had another talk with the user, we discussed the relevancy of the individual columns

    • C was the most important out of a,b,c
    • C, as previously stated, can include the seperator within the string, but the proportions are not the same as stated in the example, meaning the number of strings including them is so low, it can be ignored
    • so the solution was, i provide only column C (margin of error including C Strings that have a seperator) by the following logic:

      • Reverse the whole string
      • get charindex of seperator |
      • left the string until this seperator -1
      • rereverse the reversed string to turn it into its original form

        • Basiscally: get the back part of the string from end till the first seperator |

  • That sounds like a suitable temporary solution. If you are in discussions with the customer can you get the data format changed to either XML or JSON or even an EDI format.  At least you can validate the data against a schema.

     

  • I'll go one step further. This is not a problem for SQL and we never intended the language to be use this way. Can you preprocess your raw data? You can write a parser in Snobal or Icon.or anything that was designed for string handling. This thing looks so bad that even consider YACC:)

    Then of course we have to consider rolling over the guy that designed this mess in the parking lot, and look at the rest of his code to see how much else he screwed up.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • As helpful as ever Joe.  It would be lovely if we could all live in the world of perfect 3NF data with proper entity identifiers and considered data types, unfortunately we choose to take on roles where we inherit other peoples crap.  Introducing a new technology into the stack is not the solution; it further limits your ability to find staff who know how to manage the systems after you have gone.

    That said, if you are on one of the later editions of SQL you might be able to clean the data in Python or R.

    SQL is a very powerful tool and (like Excel) it often gets abused because we can get away with a lot.

  • I agree with Joe on this one and would disagree with aaron.reese.  Parsing a string like that in SQL is going to be painful and adding more logic into the SQL side of things is just creating more technical debt to handle a problem that appears to be at the application layer.  Running over the guy in the parking lot is a bit extreme, but I've felt like doing that to developers before... Heck, I thought that about some horrid code I wrote back in the day.  "What kind of idiot develops the code like this... <scroll to the boilerplate> I am the kind of idiot that develops code like this...".

    It could very well be that the original developer of the application had little to no experience with SQL server and didn't know how to design the tables.  So they take the data from the form, concat it together with a delimiter, and dump it to the table.  Add a note that the delimiter is a known limitation of the tool (last field can't contain |'s) and release the app with acceptance from the consumer of the application.  Years go by with no problems, and the consumer starts to mis-use the app and put | in where it shouldn't exist and nobody tells the developer.  Bad data goes in and you no longer have a good way to get the bad data out.

    But, if possible, I would still tackle this in the originating system rather than adding complexity onto the SQL side.  And adding in Python or R is the same problem as adding in Snobal or Icon or YACC - you are still introducing new technology into the stack that somebody has to support.  Now, I agree that Python is a more common language than Snobal, but if nobody on your team knows Python, I wouldn't want to implement that.  Plus using Python or R on the SQL instance eats up more resources on the server side.  I prefer to do my string parsing like that client side and save the server resources for the CRUD and SELECT operations on my database.

    My approach for a future state of this would be to tackle this at the originator of the data (application).  If this isn't possible, I would try to tackle it at the placement of the data (stored procedure).  But either approach you take, you need to handle the current state as well.  The current state needs a "fix" before you can implement a future planning fix.

    IF it is a short-term fix, I would say use SQL or Excel or whatever you are comfortable with to manipulate the data and get the result you need.  If it is a long term fix you want, I would fix it in the current data and fix it in the application.  On the application side, fixing it could be simple or could be complex, but either way you will benefit from having it fixed at the application side rather than implementing a band-aid fix to get it working on the data side.

    I do agree with aaron.reese in that inherited systems can be a pain to support and not all database tables are designed to perfect 3NF with everything being perfect in them.  Far too often I've had discussions about things like "Serial Numbers" at my workplace and been told that they are 1000% certain that they will always be numeric and 10 digits or less... then we buy a company and suddenly the serial numbers contain letters and are up to 18 characters long and our existing systems MUST support it.  Database side, this is just a datatype change which is a pain in the butt, but also not impossible to do.  On the labelling side of things, it became quite a challenge because shrinking the barcodes means they are not scannable, but if you don't shrink them, the barcode spills off the edge of the label.

    I imagine that the original tool didn't allow for | in the c's section until one day it was a requirement.  It gets added to the application side and they probably implemented it with a string split function and a few loops.  Split the string, then find the first 4 digit numeric value which would be the b's from above and everything after that is the c's which can be concatenated together to get your c-string.  In .NET it would be pretty easy and if you are only working with 1 of these strings at a time, the processing on them would be almost instant.

    So I can see how problems like this can come up and how they can be hard to fix.  My solution to these problems though falls into 2 buckets - "quick fix" and "long term support".  Quick fix would be the suggestions like in this thread - "make it work" or "get the data I need now".  Long term support, I want to reduce technical debt, so I am fixing this at the application layer and re-designing the database side as I know that this problem will come up again and if I don't have a good long term fix, I am just going to be addressing this problem again in the future.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • ktflash wrote:

    The "solution" is the following:

    i had another talk with the user, we discussed the relevancy of the individual columns

    • C was the most important out of a,b,c
    • C, as previously stated, can include the seperator within the string, but the proportions are not the same as stated in the example, meaning the number of strings including them is so low, it can be ignored
    • so the solution was, i provide only column C (margin of error including C Strings that have a seperator) by the following logic:

      • Reverse the whole string
      • get charindex of seperator |
      • left the string until this seperator -1
      • rereverse the reversed string to turn it into its original form

        • Basiscally: get the back part of the string from end till the first seperator |

    So why not still use the splitter and filter for rows LIKE '%c%'?

    select s.*, ds.*, len(ds.Item) item_len, len(trim(ds.Item)) item_len_trim
    from #strings s
    cross apply dbo.DelimitedSplitN4K(s.testrings, '|') ds
    where ds.Item like '%c%'
    order by ds.ItemNumber;

    if using Azure SQL there's a built-in ordinal splitter available using (optional) 3rd parameter to the STRING_SPLIT function.  The function doesn't emit the column 'ordinal' unless the 3rd input parameter equals 1.  This was pointed out in another thread.  It's equivalent like this

    select s.*, ss.*, len(ss.value) item_len, len(trim(ss.value)) item_len_trim
    from #strings s
    cross apply string_split(s.testrings, '|', 1) ss
    where ss.value like '%c%'
    order by ss.ordinal;

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • ktflash - Please can you remove the word bullsh*t from the description of this article. It keeps on blocking emails from SQL Central due to this word. Thank you.

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply