Guidance on best string manipulation technique (image included)

  • Hi folks,

    I really hope you folks can offer some advice here.

    I need to convert quite a complicated varchar columns into 4 different new columns and I’m looking for the best approach to do so really.

    Existing Description field:

    Partdescription (varchar)

    Example Table Content: (I tried to pick a subset of different variations of description)

    http://imgur.com/ASzmFvd

    4 New Fields:

    •Length (number)

    •Width (number)

    •Thickness (number)

    •Grade (varchar)

    Now, I understand nothing is going to be perfect in my situation but I want to do the best I can as this will affect multiple thousands of rows and any that aren’t right will require more human correction and time to complete. What i didnt mention in the image above is that some of the spaces are "double spaces" and some even "triple spaces".

    Unfortunately my SQL knowledge is not great, I attempted to start writing a select statement using charindex, and replace to “delimit” the string a bit but it’s ending up being absolutely miles long and almost impossible to decipher now and there’s probably something easier I can do.

    I’d really appreciate any help or guidance you could give folks 🙂

    Many, many thanks!!

    Matt

  • Using CHARINDEX and SUBSTRING would be one option. To simplify it, you might want to use CROSS APPLY.

    If You want more help, please post your data in a consumable format (CREATE TABLE and INSERT INTO statements).

    You might also want to check the following thread: http://www.sqlservercentral.com/Forums/Topic1585850-391-1.aspx

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • While it's best to capture your data broken into individual parts, it isn't always possible. What you're doing may be possible. It depends on the rules for the split. Luis is right - we need to see some consumable data and the rules for the split.

  • Thanks a lot for the hasty reply.

    I generated a table from a subset of the data i'm working with. I've attempted to attach it.

    I'm currently having a read through the post you have linked regarding stripping non numeric data to see if there's anything relevant. This is quite above my standard knowledge at the moment - i'll see if anything stands out as being the right method but I may be in over my head! :w00t:

    Cheers!!

  • Wow, that's a lot of data (to be considered a sample).

    I'll give it a try when I get home in the afternoon, because I want to do some tests that won't be great with SQL Fiddle and I don't have access to a real instance right now.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Just a first crack at it, using a Split String function I have posted on this site as a tool http://www.sqlservercentral.com/scripts/SUBSTRING/124330/:

    SELECT partnum

    , partdescription

    , MAX(CASE WHEN split1.Record = 2 THEN LTRIM(RTRIM(split1.Value)) END) AS [Value1]

    , MAX(CASE WHEN split1.Record = 4 THEN LTRIM(RTRIM(split1.Value)) END) AS [Value2]

    , MAX(CASE WHEN split1.Record = 6 THEN LTRIM(RTRIM(split1.Value)) END) AS [Value3]

    , MAX(CASE WHEN split2.Record = 2 THEN LTRIM(RTRIM(split2.Value)) END) AS [Value4]

    FROM [dbo].[partTable] pt CROSS APPLY

    [dbo].[udf_SplitString](REPLACE(partdescription,' ',' '),' ') as split1 Cross Apply

    [dbo].[udf_SplitString](REPLACE(partdescription,'mm ','|'),'|') as split2

    GROUP BY partnum

    , partdescription

    It picked up a good portion of your records, but not all. This technique could be fine tuned to get to where you need to be though.

  • Christopher Kutsch (7/21/2015)


    Just a first crack at it, using a Split String function I have posted on this site as a tool http://www.sqlservercentral.com/scripts/SUBSTRING/124330/:

    SELECT partnum

    , partdescription

    , MAX(CASE WHEN split1.Record = 2 THEN LTRIM(RTRIM(split1.Value)) END) AS [Value1]

    , MAX(CASE WHEN split1.Record = 4 THEN LTRIM(RTRIM(split1.Value)) END) AS [Value2]

    , MAX(CASE WHEN split1.Record = 6 THEN LTRIM(RTRIM(split1.Value)) END) AS [Value3]

    , MAX(CASE WHEN split2.Record = 2 THEN LTRIM(RTRIM(split2.Value)) END) AS [Value4]

    FROM [dbo].[partTable] pt CROSS APPLY

    [dbo].[udf_SplitString](REPLACE(partdescription,' ',' '),' ') as split1 Cross Apply

    [dbo].[udf_SplitString](REPLACE(partdescription,'mm ','|'),'|') as split2

    GROUP BY partnum

    , partdescription

    It picked up a good portion of your records, but not all. This technique could be fine tuned to get to where you need to be though.

    Have you tested your function against the DelimitedSplit8K? That's the fastest pure T-SQL splitter around, unless proven otherwise. http://www.sqlservercentral.com/articles/Tally+Table/72993/

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • This version cannot keep up with XML splitting or CLRs, but won't fail on reserved characters like ampersands. For a tool that is used in ETL and non-OLTP transactions, it fits the requirements I have had in the past.

  • Christopher Kutsch (7/21/2015)


    This version cannot keep up with XML splitting or CLRs, but won't fail on reserved characters like ampersands. For a tool that is used in ETL and non-OLTP transactions, it fits the requirements I have had in the past.

    I encourage you to try the DelimitedSplit8K. It won't have problems with reserved characters and it's almost as fast as the CLR option and faster than the XML splitter. The article is a great resource to learn as well.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • What OP is asking is next to impossible to achieve.

    Using Jeff Moden splitter may not be the best in this case, as having few more "x" in the strings will make it quite useless.

    Here the code which works for some of the values.

    It is really easy to break if some other "unexpected" variations of values will come.

    DECLARE @t TABLE (partdesc varchar(1000))

    INSERT @t

    SELECT 'ABCD 3000 x 1500 x 6mm some text 1'

    UNION ALL SELECT 'BCDE 2025mm x 150 x 1 some text 2'

    UNION ALL SELECT 'CDEF 2000 x 1000 x 6-7mm some xxx text 3'

    UNION ALL SELECT 'DEFG 525x150x10 some text 4'

    UNION ALL SELECT 'EFGK 1000x200x20mm, some text 5 '

    UNION ALL SELECT 'FGKI 125 x 150mm some text 6'

    ;WITH parsing_q

    AS

    (

    SELECT t.partdesc, DPP.D, DF.V, LTRIM(RTRIM(P.Txt)) Txt

    FROM @t t

    CROSS APPLY (SELECT PATINDEX('%[0-9]%',partdesc)) F1(FirstDigitPos)

    CROSS APPLY (SELECT SUBSTRING(partdesc,F1.FirstDigitPos, 1000)) F2(PartToParse)

    CROSS APPLY (SELECT CHARINDEX('x',F2.PartToParse)) X1(P)

    CROSS APPLY (SELECT CHARINDEX('x',F2.PartToParse,X1.P+1)) X2(P)

    CROSS APPLY (SELECT CHARINDEX('x',F2.PartToParse,X2.P+1)) X3(P)

    CROSS APPLY (VALUES ('L', LEFT(F2.PartToParse, ISNULL(NULLIF(X1.P,0) - 1,0)))

    ,('W', SUBSTRING(F2.PartToParse, NULLIF(X1.P+1,0),ISNULL(NULLIF(X2.P,0),1000)))

    ,('T', SUBSTRING(F2.PartToParse, NULLIF(X2.P+1,0),ISNULL(NULLIF(X3.P,0),1000)))) DPP(D,V)

    CROSS APPLY (SELECT LTRIM(RTRIM(DPP.V)), PATINDEX('%[^0-9]%',LTRIM(RTRIM(DPP.V)))) D(V,FirstNonDigit)

    CROSS APPLY (SELECT LEFT(D.V, ISNULL(NULLIF(D.FirstNonDigit,0) - 1,1000))) DF(V)

    CROSS APPLY (SELECT SUBSTRING(D.V, ISNULL(NULLIF(D.FirstNonDigit,0),1),1000)) P(Txt)

    )

    SELECT partdesc

    ,MAX(CASE WHEN D = 'L' THEN V ELSE NULL END) AS L

    ,MAX(CASE WHEN D = 'W' THEN V ELSE NULL END) AS W

    ,MAX(CASE WHEN D = 'T' THEN V ELSE NULL END) AS T

    ,COALESCE(MAX(CASE WHEN D = 'T' AND V!='' THEN Txt ELSE NULL END)

    ,MAX(CASE WHEN D = 'W' AND V!='' THEN Txt ELSE NULL END)

    ,MAX(CASE WHEN D = 'L' AND V!='' THEN Txt ELSE NULL END)) AS Grade

    FROM parsing_q

    GROUP BY partdesc

    I have not even attempted to remove "mm" from grade, it is possible, but I would suggest to have a table of predefined values to be removed from the start of the "Grade" (eg. mm, cm, ft, in etc.)

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • What we need to do here is to look at the patterns.

    The Length is from the first numeric value to the first non-numeric value.

    The Width is from the next numeric value to the next non-numeric value.

    The Thickness is from the next numeric value to the next non-numeric value.

    And the Grade is from the next space to the end of the string.

    We can only use CHARINDEX for finding that space. For the others, we need to use PATINDEX to look for numerics, and PATINDEX to look for non-numerics.

    So, I submit to you (just change the FROM to dbo.partTable):

    SELECT t.*

    --,caLenS.LengthStart, caLenT.LengthText, caLenE.LengthEnd

    ,CASE WHEN caLenE.LengthEnd > 0 AND caLenS.LengthStart > 0

    THEN LEFT(caLenT.LengthText, caLenE.LengthEnd-1)

    ELSE NULL END AS CalcLength

    ,CASE WHEN caWidthS.WidthStart > 0 AND caWidthE.WidthEnd > 0

    THEN LEFT(caWidthT.WidthText, caWidthE.WidthEnd-1)

    ELSE NULL END AS CalcWidth

    ,CASE WHEN caThickS.ThickStart > 0 AND caThickE.ThickEnd > 0

    THEN LEFT(caThickT.ThickText, caThickE.ThickEnd-1)

    WHEN caThickS.ThickStart > 0

    THEN caThickT.ThickText

    ELSE NULL END AS CalcThick

    ,caThickS.ThickStart, caThickE.ThickEnd, caThickT.ThickText

    ,CASE WHEN caGrade.Pos > 0

    THEN LTRIM(SUBSTRING(caThickT.ThickText, caGrade.Pos+1, LEN(caThickT.ThickText)))

    ELSE NULL END AS CalcGrade

    FROM #partTable t

    -- find the first numeric that follows a space

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', t.partdescription) ) caLenS(LengthStart)

    -- get the first non-numeric that follows this.

    -- CHARINDEX doesn't use the like stuff, so we need to use PATINDEX

    -- PATINDEX doesn't have a starting position, so we need to get the string from the starting position so that PATINDEX can be used

    CROSS APPLY (SELECT SUBSTRING(t.partdescription, caLenS.LengthStart+1, LEN(t.partdescription))) caLenT(LengthText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caLenT.LengthText)) caLenE(LengthEnd)

    -- repeat to find the Width. First, need new text

    CROSS APPLY (SELECT SUBSTRING(caLenT.LengthText, caLenE.LengthEnd, LEN(caLenT.LengthText))) caWidth(StartText)

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', caWidth.StartText)) caWidthS(WidthStart)

    CROSS APPLY (SELECT SUBSTRING(caWidth.StartText, caWidthS.WidthStart+1, LEN(caWidth.StartText))) caWidthT(WidthText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caWidthT.WidthText)) caWidthE(WidthEnd)

    -- repeat to find the Thickness. First, need new text

    CROSS APPLY (SELECT SUBSTRING(caWidthT.WidthText, caWidthE.WidthEnd, LEN(caWidthT.WidthText))) caThick(StartText)

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', caThick.StartText)) caThickS(ThickStart)

    CROSS APPLY (SELECT SUBSTRING(caThick.StartText, caThickS.ThickStart+1, LEN(caThick.StartText))) caThickT(ThickText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caThickT.ThickText)) caThickE(ThickEnd)

    -- Find the first space after this

    CROSS APPLY (SELECT CHARINDEX(' ', caThickT.ThickText, caThickE.ThickEnd+1)) caGrade(Pos)

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (7/21/2015)


    What we need to do here is to look at the patterns.

    The Length is from the first numeric value to the first non-numeric value.

    The Width is from the next numeric value to the next non-numeric value.

    The Thickness is from the next numeric value to the next non-numeric value.

    And the Grade is from the next space to the end of the string.

    We can only use CHARINDEX for finding that space. For the others, we need to use PATINDEX to look for numerics, and PATINDEX to look for non-numerics.

    So, I submit to you (just change the FROM to dbo.partTable):

    SELECT t.*

    --,caLenS.LengthStart, caLenT.LengthText, caLenE.LengthEnd

    ,CASE WHEN caLenE.LengthEnd > 0 AND caLenS.LengthStart > 0

    THEN LEFT(caLenT.LengthText, caLenE.LengthEnd-1)

    ELSE NULL END AS CalcLength

    ,CASE WHEN caWidthS.WidthStart > 0 AND caWidthE.WidthEnd > 0

    THEN LEFT(caWidthT.WidthText, caWidthE.WidthEnd-1)

    ELSE NULL END AS CalcWidth

    ,CASE WHEN caThickS.ThickStart > 0 AND caThickE.ThickEnd > 0

    THEN LEFT(caThickT.ThickText, caThickE.ThickEnd-1)

    WHEN caThickS.ThickStart > 0

    THEN caThickT.ThickText

    ELSE NULL END AS CalcThick

    ,caThickS.ThickStart, caThickE.ThickEnd, caThickT.ThickText

    ,CASE WHEN caGrade.Pos > 0

    THEN LTRIM(SUBSTRING(caThickT.ThickText, caGrade.Pos+1, LEN(caThickT.ThickText)))

    ELSE NULL END AS CalcGrade

    FROM #partTable t

    -- find the first numeric that follows a space

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', t.partdescription) ) caLenS(LengthStart)

    -- get the first non-numeric that follows this.

    -- CHARINDEX doesn't use the like stuff, so we need to use PATINDEX

    -- PATINDEX doesn't have a starting position, so we need to get the string from the starting position so that PATINDEX can be used

    CROSS APPLY (SELECT SUBSTRING(t.partdescription, caLenS.LengthStart+1, LEN(t.partdescription))) caLenT(LengthText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caLenT.LengthText)) caLenE(LengthEnd)

    -- repeat to find the Width. First, need new text

    CROSS APPLY (SELECT SUBSTRING(caLenT.LengthText, caLenE.LengthEnd, LEN(caLenT.LengthText))) caWidth(StartText)

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', caWidth.StartText)) caWidthS(WidthStart)

    CROSS APPLY (SELECT SUBSTRING(caWidth.StartText, caWidthS.WidthStart+1, LEN(caWidth.StartText))) caWidthT(WidthText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caWidthT.WidthText)) caWidthE(WidthEnd)

    -- repeat to find the Thickness. First, need new text

    CROSS APPLY (SELECT SUBSTRING(caWidthT.WidthText, caWidthE.WidthEnd, LEN(caWidthT.WidthText))) caThick(StartText)

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', caThick.StartText)) caThickS(ThickStart)

    CROSS APPLY (SELECT SUBSTRING(caThick.StartText, caThickS.ThickStart+1, LEN(caThick.StartText))) caThickT(ThickText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caThickT.ThickText)) caThickE(ThickEnd)

    -- Find the first space after this

    CROSS APPLY (SELECT CHARINDEX(' ', caThickT.ThickText, caThickE.ThickEnd+1)) caGrade(Pos)

    Wayne, that's just flat-out impressive. Nice work!

  • Ed Wagner (7/21/2015)


    WayneS (7/21/2015)


    What we need to do here is to look at the patterns.

    The Length is from the first numeric value to the first non-numeric value.

    The Width is from the next numeric value to the next non-numeric value.

    The Thickness is from the next numeric value to the next non-numeric value.

    And the Grade is from the next space to the end of the string.

    We can only use CHARINDEX for finding that space. For the others, we need to use PATINDEX to look for numerics, and PATINDEX to look for non-numerics.

    So, I submit to you (just change the FROM to dbo.partTable):

    SELECT t.*

    --,caLenS.LengthStart, caLenT.LengthText, caLenE.LengthEnd

    ,CASE WHEN caLenE.LengthEnd > 0 AND caLenS.LengthStart > 0

    THEN LEFT(caLenT.LengthText, caLenE.LengthEnd-1)

    ELSE NULL END AS CalcLength

    ,CASE WHEN caWidthS.WidthStart > 0 AND caWidthE.WidthEnd > 0

    THEN LEFT(caWidthT.WidthText, caWidthE.WidthEnd-1)

    ELSE NULL END AS CalcWidth

    ,CASE WHEN caThickS.ThickStart > 0 AND caThickE.ThickEnd > 0

    THEN LEFT(caThickT.ThickText, caThickE.ThickEnd-1)

    WHEN caThickS.ThickStart > 0

    THEN caThickT.ThickText

    ELSE NULL END AS CalcThick

    ,caThickS.ThickStart, caThickE.ThickEnd, caThickT.ThickText

    ,CASE WHEN caGrade.Pos > 0

    THEN LTRIM(SUBSTRING(caThickT.ThickText, caGrade.Pos+1, LEN(caThickT.ThickText)))

    ELSE NULL END AS CalcGrade

    FROM #partTable t

    -- find the first numeric that follows a space

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', t.partdescription) ) caLenS(LengthStart)

    -- get the first non-numeric that follows this.

    -- CHARINDEX doesn't use the like stuff, so we need to use PATINDEX

    -- PATINDEX doesn't have a starting position, so we need to get the string from the starting position so that PATINDEX can be used

    CROSS APPLY (SELECT SUBSTRING(t.partdescription, caLenS.LengthStart+1, LEN(t.partdescription))) caLenT(LengthText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caLenT.LengthText)) caLenE(LengthEnd)

    -- repeat to find the Width. First, need new text

    CROSS APPLY (SELECT SUBSTRING(caLenT.LengthText, caLenE.LengthEnd, LEN(caLenT.LengthText))) caWidth(StartText)

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', caWidth.StartText)) caWidthS(WidthStart)

    CROSS APPLY (SELECT SUBSTRING(caWidth.StartText, caWidthS.WidthStart+1, LEN(caWidth.StartText))) caWidthT(WidthText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caWidthT.WidthText)) caWidthE(WidthEnd)

    -- repeat to find the Thickness. First, need new text

    CROSS APPLY (SELECT SUBSTRING(caWidthT.WidthText, caWidthE.WidthEnd, LEN(caWidthT.WidthText))) caThick(StartText)

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', caThick.StartText)) caThickS(ThickStart)

    CROSS APPLY (SELECT SUBSTRING(caThick.StartText, caThickS.ThickStart+1, LEN(caThick.StartText))) caThickT(ThickText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caThickT.ThickText)) caThickE(ThickEnd)

    -- Find the first space after this

    CROSS APPLY (SELECT CHARINDEX(' ', caThickT.ThickText, caThickE.ThickEnd+1)) caGrade(Pos)

    Wayne, that's just flat-out impressive. Nice work!

    Thanks Ed

    MPF - does this work for you?

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (7/21/2015)


    Ed Wagner (7/21/2015)


    WayneS (7/21/2015)


    What we need to do here is to look at the patterns.

    The Length is from the first numeric value to the first non-numeric value.

    The Width is from the next numeric value to the next non-numeric value.

    The Thickness is from the next numeric value to the next non-numeric value.

    And the Grade is from the next space to the end of the string.

    We can only use CHARINDEX for finding that space. For the others, we need to use PATINDEX to look for numerics, and PATINDEX to look for non-numerics.

    So, I submit to you (just change the FROM to dbo.partTable):

    SELECT t.*

    --,caLenS.LengthStart, caLenT.LengthText, caLenE.LengthEnd

    ,CASE WHEN caLenE.LengthEnd > 0 AND caLenS.LengthStart > 0

    THEN LEFT(caLenT.LengthText, caLenE.LengthEnd-1)

    ELSE NULL END AS CalcLength

    ,CASE WHEN caWidthS.WidthStart > 0 AND caWidthE.WidthEnd > 0

    THEN LEFT(caWidthT.WidthText, caWidthE.WidthEnd-1)

    ELSE NULL END AS CalcWidth

    ,CASE WHEN caThickS.ThickStart > 0 AND caThickE.ThickEnd > 0

    THEN LEFT(caThickT.ThickText, caThickE.ThickEnd-1)

    WHEN caThickS.ThickStart > 0

    THEN caThickT.ThickText

    ELSE NULL END AS CalcThick

    ,caThickS.ThickStart, caThickE.ThickEnd, caThickT.ThickText

    ,CASE WHEN caGrade.Pos > 0

    THEN LTRIM(SUBSTRING(caThickT.ThickText, caGrade.Pos+1, LEN(caThickT.ThickText)))

    ELSE NULL END AS CalcGrade

    FROM #partTable t

    -- find the first numeric that follows a space

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', t.partdescription) ) caLenS(LengthStart)

    -- get the first non-numeric that follows this.

    -- CHARINDEX doesn't use the like stuff, so we need to use PATINDEX

    -- PATINDEX doesn't have a starting position, so we need to get the string from the starting position so that PATINDEX can be used

    CROSS APPLY (SELECT SUBSTRING(t.partdescription, caLenS.LengthStart+1, LEN(t.partdescription))) caLenT(LengthText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caLenT.LengthText)) caLenE(LengthEnd)

    -- repeat to find the Width. First, need new text

    CROSS APPLY (SELECT SUBSTRING(caLenT.LengthText, caLenE.LengthEnd, LEN(caLenT.LengthText))) caWidth(StartText)

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', caWidth.StartText)) caWidthS(WidthStart)

    CROSS APPLY (SELECT SUBSTRING(caWidth.StartText, caWidthS.WidthStart+1, LEN(caWidth.StartText))) caWidthT(WidthText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caWidthT.WidthText)) caWidthE(WidthEnd)

    -- repeat to find the Thickness. First, need new text

    CROSS APPLY (SELECT SUBSTRING(caWidthT.WidthText, caWidthE.WidthEnd, LEN(caWidthT.WidthText))) caThick(StartText)

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', caThick.StartText)) caThickS(ThickStart)

    CROSS APPLY (SELECT SUBSTRING(caThick.StartText, caThickS.ThickStart+1, LEN(caThick.StartText))) caThickT(ThickText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caThickT.ThickText)) caThickE(ThickEnd)

    -- Find the first space after this

    CROSS APPLY (SELECT CHARINDEX(' ', caThickT.ThickText, caThickE.ThickEnd+1)) caGrade(Pos)

    Wayne, that's just flat-out impressive. Nice work!

    Thanks Ed

    MPF - does this work for you?

    Hi there,

    sorry its been almost a day - I have been away from the office for the whole morning so i've only just have a chance to look at some of the replies.

    Wayne, what you've made is frankly amazing. I can't believe how accurate that is.

    The SQL used in the is far and beyond my knowledge and despite me really trying to understand how it's doing what it is, I really cant. But the logic is spot on!!

    I can't thank you enough for taking the time to help me here! I'm so incredibly grateful! This is going to save someone a huge amount of time!! 🙂

    What an amazing effort from yourself especially but the whole SQL Server Central forum!

    Many, many thanks!

  • MPF (7/22/2015)


    WayneS (7/21/2015)


    Ed Wagner (7/21/2015)


    WayneS (7/21/2015)


    What we need to do here is to look at the patterns.

    The Length is from the first numeric value to the first non-numeric value.

    The Width is from the next numeric value to the next non-numeric value.

    The Thickness is from the next numeric value to the next non-numeric value.

    And the Grade is from the next space to the end of the string.

    We can only use CHARINDEX for finding that space. For the others, we need to use PATINDEX to look for numerics, and PATINDEX to look for non-numerics.

    So, I submit to you (just change the FROM to dbo.partTable):

    SELECT t.*

    --,caLenS.LengthStart, caLenT.LengthText, caLenE.LengthEnd

    ,CASE WHEN caLenE.LengthEnd > 0 AND caLenS.LengthStart > 0

    THEN LEFT(caLenT.LengthText, caLenE.LengthEnd-1)

    ELSE NULL END AS CalcLength

    ,CASE WHEN caWidthS.WidthStart > 0 AND caWidthE.WidthEnd > 0

    THEN LEFT(caWidthT.WidthText, caWidthE.WidthEnd-1)

    ELSE NULL END AS CalcWidth

    ,CASE WHEN caThickS.ThickStart > 0 AND caThickE.ThickEnd > 0

    THEN LEFT(caThickT.ThickText, caThickE.ThickEnd-1)

    WHEN caThickS.ThickStart > 0

    THEN caThickT.ThickText

    ELSE NULL END AS CalcThick

    ,caThickS.ThickStart, caThickE.ThickEnd, caThickT.ThickText

    ,CASE WHEN caGrade.Pos > 0

    THEN LTRIM(SUBSTRING(caThickT.ThickText, caGrade.Pos+1, LEN(caThickT.ThickText)))

    ELSE NULL END AS CalcGrade

    FROM #partTable t

    -- find the first numeric that follows a space

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', t.partdescription) ) caLenS(LengthStart)

    -- get the first non-numeric that follows this.

    -- CHARINDEX doesn't use the like stuff, so we need to use PATINDEX

    -- PATINDEX doesn't have a starting position, so we need to get the string from the starting position so that PATINDEX can be used

    CROSS APPLY (SELECT SUBSTRING(t.partdescription, caLenS.LengthStart+1, LEN(t.partdescription))) caLenT(LengthText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caLenT.LengthText)) caLenE(LengthEnd)

    -- repeat to find the Width. First, need new text

    CROSS APPLY (SELECT SUBSTRING(caLenT.LengthText, caLenE.LengthEnd, LEN(caLenT.LengthText))) caWidth(StartText)

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', caWidth.StartText)) caWidthS(WidthStart)

    CROSS APPLY (SELECT SUBSTRING(caWidth.StartText, caWidthS.WidthStart+1, LEN(caWidth.StartText))) caWidthT(WidthText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caWidthT.WidthText)) caWidthE(WidthEnd)

    -- repeat to find the Thickness. First, need new text

    CROSS APPLY (SELECT SUBSTRING(caWidthT.WidthText, caWidthE.WidthEnd, LEN(caWidthT.WidthText))) caThick(StartText)

    CROSS APPLY (SELECT PATINDEX('% [0-9]%', caThick.StartText)) caThickS(ThickStart)

    CROSS APPLY (SELECT SUBSTRING(caThick.StartText, caThickS.ThickStart+1, LEN(caThick.StartText))) caThickT(ThickText)

    CROSS APPLY (SELECT PATINDEX('%[^0-9.]%', caThickT.ThickText)) caThickE(ThickEnd)

    -- Find the first space after this

    CROSS APPLY (SELECT CHARINDEX(' ', caThickT.ThickText, caThickE.ThickEnd+1)) caGrade(Pos)

    Wayne, that's just flat-out impressive. Nice work!

    Thanks Ed

    MPF - does this work for you?

    Hi there,

    sorry its been almost a day - I have been away from the office for the whole morning so i've only just have a chance to look at some of the replies.

    Wayne, what you've made is frankly amazing. I can't believe how accurate that is.

    The SQL used in the is far and beyond my knowledge and despite me really trying to understand how it's doing what it is, I really cant. But the logic is spot on!!

    I can't thank you enough for taking the time to help me here! I'm so incredibly grateful! This is going to save someone a huge amount of time!! 🙂

    What an amazing effort from yourself especially but the whole SQL Server Central forum!

    Many, many thanks!

    MPF,

    First, you're welcome.

    and despite me really trying to understand how it's doing what it is, I really cant

    Now, for the important stuff. Perhaps you noticed this little line in my signature?

    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

    I'm serious. If you don't understand what this is doing, then don't use it until you do.

    There's nothing magical about this... it's all pretty simple, and I tried to document it. But I'll go through it again, in more detail. If there is something that you still don't understand, ask for more clarification. But... don't use it until you can explain what it is doing to someone else.

    First was identifying the patterns necessary. Those were all identified previously, so I won't do it again here.

    The first step is to identify the first number following a space. We use the PATINDEX function for that.

    SELECT PATINDEX('% [0-9]%', t.partdescription)

    PATINDEX utilizes LIKE patterns, so the first "%" means anything preceding the search pattern. The space following that specifies that the search expression needs to start with a space. The "[]" specifies a set of values to look for in the next position - here we are specifying the values 0-9, which will get any numeric value. This is followed with another "%" to signify that anything can follow the search string. If the starting "%" were not there, the string would have to START with this search expression; if the ending "%" were not there, then the string would have to END with the search expression. By using both, the search expression has to be within it - and it can start or end with it. PATINDEX will search the partdescription column for this search pattern, and if found it will return the column position where it starts at. If not found, it will return 0.

    The next step is to get the string starting at this position. The SUBSTRING function does this, using the PATINDEX value as the starting position, and the length of the string as the number of characters to retrieve. If you specify to return more characters than the string contains, then it will return to the end of the string, so I'm just using the length of the string. This string is obtained because, unlike CHARINDEX, the PATINDEX function does not allow to specify a starting position. So it needs to search this string for the first non-numeric string. Which is the next step...

    The PATINDEX function is then used again to find the first non-numeric character. Notice that the search expression is slightly different:

    SELECT PATINDEX('%[^0-9.]%', caLenT.LengthText)

    The "%" are used for the same reason as above.

    The space has been removed, so that a space will be considered to be a non-numeric character to match on.

    The "[]" are used again as specified above.

    The first difference is the use of the "^" in the search expression. This specifies to use the set of characters "NOT" including the ones specified, so it will get the first character that is NOT in the specified set. I also included the "." in the set of characters to be allowed, since we can have something like 12.5mm.

    With this, the PATINDEX function returns the first non-numeric value in the string.

    All of these SELECT statements are put into individual CROSS APPLY operations in the FROM clause, essentially making a pseudo-table. This will introduce this expression into the query as a column from a pseudo-table. This pseudo-table is aliased (for instance, caLenS), and then the column names are specified (for instance, LengthStart).

    At this point, three new pseudo-tables have been introduced into the query, along with their columns. Using these columns, we can now utilize the SUBSTRING or LEFT functions in the <column-list> portion of the SELECT statement. But, we only want to do this if the characters were found, so a CASE statement is utilized to verify that the values returned by PATINDEX are > 0. If the values were found, it will get the LEFT characters from the string returned from the SUBSTRING in the CROSS APPLY, up to (but not including) the first non-numeric character.

    So, this explains the first set. All that is left is to repeat this process for each other portion of the string being extracted (Length, Width, Thickness). There is one caveat - since we're using PATINDEX, we first need to get the rest of the string from that last ending position that PATINDEX can use. This is just a repeat of the SUBSTRING performed earlier, with a different starting position (and using the final string).

    To calculate the Grade, we just look for the first space after the end of the Thickness, and take everything from that.

    So, while this "looks" complicated, it's really very simple (it actually took me < 15 minutes to do this). PATINDEX. SUBSTRING. PATINDEX. Use in a CROSS APPLY. Use a CASE statement that utilizes LEFT or SUBSTRING to return the value. Repeat.

    Selfless plug: all of these functions / apply operators are discussed in my newly released book - see the link in my signature below. This seems like it would be a good book for you to get.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply