Parse field with Numeric ranges and comma deliminated values

  • Ok I am a bit stuck

    I have a field that contains a mix of ranged data. All data is numeric; field type is varchar.

    Example of field contents:

    44100-44110, 44150

    I need to derive a result in a NEW table that will contain one record each of

    44100

    44101

    44102

    44103

    44104

    44105

    44106

    44107

    44108

    44109

    44110

    44150

    Can anyone help with a piece of code that would accomplish this?

    J

    <hr noshade size=1 width=250 color=#BBC8E5> Regards,Jeffery Williams http://www.linkedin.com/in/jwilliamsoh

  • This looks interesting, and I'd like to help you out, but there just isn't enough information to begin. So, to start off, please read the first link in my signature and then post some table DDL statements, and some DML to put some representative data into it, and then I'll be able to see what you're working with.

    Will all of the data always have one range, followed by a comma and one additional number? If not, ensure that you include enough variations of data so that we can see what we really need to work with.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Based on the single row of test data you supplied, and assuming that all other rows are in the same exact format, this will work:

    declare @test-2 table (MixedData varchar(50))

    insert into @test-2 values ('44100-44110, 44150')

    ;WITH CTE (RangeData, SingleData) AS

    (SELECT SUBSTRING(MixedData, 1, Charindex(',', MixedData)-1),

    convert(int, SUBSTRING(MixedData, CharIndex(',', MixedData)+1, 50))

    FROM @test-2),

    CTE2 (RangeStart, RangeEnd, SingleData) AS

    (SELECT convert(int, SUBSTRING(RangeData, 1, CharIndex('-', RangeData)-1)),

    convert(int, SUBSTRING(RangeData, CharIndex('-', RangeData)+1, 50)),

    SingleData

    FROM CTE),

    Tens (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),

    Thousands(N) AS (SELECT 1 FROM Tens t1 CROSS JOIN Tens t2 CROSS JOIN Tens t3),

    Millions (N) AS (SELECT 1 FROM Thousands t1 CROSS JOIN Thousands t2),

    Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM Millions)

    SELECT N

    FROM Tally, CTE2

    WHERE N BETWEEN CTE2.RangeStart and CTE2.RangeEnd

    OR N = CTE2.SingleData

    Note that this utilizes a virtual tally table. If you already have your own tally table, you can omit the parts that build it.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Thank you, I was just preparing to post the create statements and a text file of some data.

    I will take a look at your code. With regards to all rows containing the same exact format; not really...

    SOME will have no range in it (44100-44110) and only contain things like 44110,44125,44157, etc (maybe three or so values, maybe 20)...

    Other fields might have multiple ranges

    ex: 441100-44110, 44128, 44524-44555, 44112, 52144

    Things like this... Makes it interesting for sure.

    <hr noshade size=1 width=250 color=#BBC8E5> Regards,Jeffery Williams http://www.linkedin.com/in/jwilliamsoh

  • -- See how this starts off by creating a table

    -- and inserting representative test data into it?

    -- If you do this, it makes it a LOT easier for all

    -- of the volunteers on this site to just copy/paste

    -- this into a query window and start working on it.

    DECLARE @test-2 TABLE (MixedData varchar(50))

    INSERT INTO @test-2

    SELECT '44100-44110, 44150' UNION ALL

    SELECT '44115,44125,44157' UNION ALL

    SELECT ' 44250-44350, 44128, 44524-44555, 44112, 52144'

    ;WITH

    -- This begins a virtual tally table. This query will be a LOT faster (< 1 second) with a permanent one.

    -- See the article below for how to build one.

    Tens (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),

    Thousands(N) AS (SELECT 1 FROM Tens t1 CROSS JOIN Tens t2 CROSS JOIN Tens t3),

    Millions (N) AS (SELECT 1 FROM Thousands t1 CROSS JOIN Thousands t2),

    Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM Millions),

    -- first, need to break down each mixed data row into separate items, either a single item or a range.

    -- See Jeff Moden's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.

    -- at http://www.sqlservercentral.com/articles/T-SQL/62867/ for how a tally table can split strings apart.

    Elements (Items) AS (

    SELECT RTRIM(LTRIM(SUBSTRING(',' + MixedData + ',',N+1,CHARINDEX(',',',' + MixedData + ',',N+1)-N-1)))

    FROM Tally, @test-2

    WHERE N < LEN(',' + MixedData + ',')

    AND SUBSTRING(',' + MixedData + ',',N,1) = ','),

    -- Now, for each item, get the range start/end. A range will have a "-".

    -- If just a single item, use the item for both the range start and end.

    Elements2 (RangeStart, RangeEnd) AS (

    SELECT CONVERT(INT, CASE WHEN CharIndex('-', Items) > 0 THEN LEFT(Items, CharIndex('-', Items)-1)

    ELSE Items

    END),

    CONVERT(INT, CASE WHEN CharIndex('-', Items) > 0 THEN SUBSTRING(Items, CharIndex('-', Items)+1, len(Items))

    ELSE Items

    END)

    FROM Elements)

    -- Finally, return all the individual items

    SELECT N

    FROM Tally, Elements2

    WHERE N BETWEEN Elements2.RangeStart and Elements2.RangeEnd

    Edit: corrected name misspelling

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • nice breakthrough Wayne; I couldn't get my mind around the multiple ranges;

    performance on this solution sux, took my decent dev machine 1:09 to return the 151 rows, we might look at optimizing it , but i'm just happy to add a solution like that to my toolbox.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Same here. I PM'd but should have replied publicly. Yes the results of this script are amazing and I just could not get my head around it either.

    Performance is really bad (BUT IT WORKS)... I do need to tune it however and I am going to be working on that. The performance hit is in the joins. If anyone has a recommendation or modifications that would be great. Otherwise I will post what I come up with here.

    Jeff

    <hr noshade size=1 width=250 color=#BBC8E5> Regards,Jeffery Williams http://www.linkedin.com/in/jwilliamsoh

  • The new code is working well. Need to make changes to include some additional code I need for my purpose.

    <hr noshade size=1 width=250 color=#BBC8E5> Regards,Jeffery Williams http://www.linkedin.com/in/jwilliamsoh

  • Lowell (3/25/2010)


    nice breakthrough Wayne; I couldn't get my mind around the multiple ranges;

    performance on this solution sux, took my decent dev machine 1:09 to return the 151 rows, we might look at optimizing it , but i'm just happy to add a solution like that to my toolbox.

    Yes, the performance isn't very good. Several things contribute to this:

    1. using a virtual (aka in-line or dynamic) tally table vs. a physical tally table with a good clustered index

    2. the tally table is being called twice

    3. we're looking at numbers that require building it into the million-row range

    If you replace the virtual tally table with a physical one, things perform much better. On my system, it goes from 50+ seconds to <1 second.

    To create the permanent tally table with a clustered index:

    ;WITH

    Tens (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),

    Thousands(N) AS (SELECT 1 FROM Tens t1 CROSS JOIN Tens t2 CROSS JOIN Tens t3),

    Millions (N) AS (SELECT 1 FROM Thousands t1 CROSS JOIN Thousands t2),

    Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM Millions)

    SELECT N

    INTO dbo.TALLY

    FROM Tally

    ORDER BY N

    ALTER TABLE dbo.TALLY ALTER COLUMN N int NOT NULL

    ALTER TABLE dbo.TALLY ADD CONSTRAINT [PK_TALLY] PRIMARY KEY CLUSTERED (N)

    Then, change the previously posted code for this solution to:

    -- See how this starts off by creating a table

    -- and inserting representative test data into it?

    -- If you do this, it makes it a LOT easier for all

    -- of the volunteers on this site to just copy/paste

    -- this into a query window and start working on it.

    DECLARE @test-2 TABLE (MixedData varchar(50))

    INSERT INTO @test-2

    SELECT '44100-44110, 44150' UNION ALL

    SELECT '44115,44125,44157' UNION ALL

    SELECT ' 44250-44350, 44128, 44524-44555, 44112, 52144'

    ;WITH

    -- first, need to break down each mixed data row into separate items,

    -- either a single item or a range. See Jeff Moden's article

    -- The "Numbers" or "Tally" Table: What it is and how it replaces a loop.

    -- at http://www.sqlservercentral.com/articles/T-SQL/62867/ for how

    -- a tally table can split strings apart.

    Elements (Items) AS (

    SELECT RTRIM(LTRIM(SUBSTRING(',' + MixedData + ',',N+1,CHARINDEX(',',',' + MixedData + ',',N+1)-N-1)))

    FROM dbo.Tally, @test-2

    WHERE N < LEN(',' + MixedData + ',')

    AND SUBSTRING(',' + MixedData + ',',N,1) = ','),

    -- Now, for each item, get the range start/end. A range will have a "-".

    -- If just a single item, use the item for both the range start and end.

    Elements2 (RangeStart, RangeEnd) AS (

    SELECT CONVERT(INT, CASE WHEN CharIndex('-', Items) > 0 THEN LEFT(Items, CharIndex('-', Items)-1)

    ELSE Items

    END),

    CONVERT(INT, CASE WHEN CharIndex('-', Items) > 0 THEN SUBSTRING(Items, CharIndex('-', Items)+1, len(Items))

    ELSE Items

    END)

    FROM Elements)

    -- Finally, return all the individual items

    SELECT N

    FROM dbo.Tally, Elements2

    WHERE N BETWEEN Elements2.RangeStart and Elements2.RangeEnd

    Lowell and Jeff, I'm interested in how this improves the performance for you.

    My results:

    Table '#07020F21'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'TALLY'. Scan count 13, logical reads 73, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 15 ms, elapsed time = 165 ms.

    Execution plan is attached.

    Edit: added performance results and execution plan

    Edit2: corrected name misspelling

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • From 4.5 hours to get through HALF the records to 13 seconds to process the entire job.

    <hr noshade size=1 width=250 color=#BBC8E5> Regards,Jeffery Williams http://www.linkedin.com/in/jwilliamsoh

  • Sounds like if you blinked, you missed it. See why you should have a permanent tally table?!? 🙂

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Yeah it was really quick. Thanks a TON... I don't think I could have pulled it off.

    <hr noshade size=1 width=250 color=#BBC8E5> Regards,Jeffery Williams http://www.linkedin.com/in/jwilliamsoh

  • Hey everyone, you probably won't believe this, but the following performs pretty well too. Two milliseconds elapsed for the 151 row generating sample data. Nested recursive CTEs wrapped in a TVF:

    -- In-line table-valued function

    CREATE FUNCTION dbo.ListValues

    (

    @data VARCHAR(50)

    )

    RETURNS TABLE

    WITH SCHEMABINDING

    AS RETURN

    WITH Split

    AS (

    SELECT pos = ISNULL(NULLIF(CHARINDEX(',', @data, 0), 0), 50),

    item = SUBSTRING(@data, 1, ISNULL(NULLIF(CHARINDEX(',', @data, 0), 0), 50) - 1)

    UNION ALL

    SELECT pos = ISNULL(NULLIF(CHARINDEX(',', @data, pos + 1), 0), 50),

    item = SUBSTRING(@data, pos + 1, ISNULL(NULLIF(CHARINDEX(',', @data, pos + 1), 0), 50) - pos - 1)

    FROM Split

    WHERE pos < 50

    ),

    Listed

    AS (

    SELECT value =

    CASE

    WHEN CHARINDEX('-', S.item) = 0 THEN CONVERT(INT, S.item)

    ELSE CONVERT(INT, LEFT(S.item, CHARINDEX('-', S.item) - 1))

    END,

    max_value =

    CASE

    WHEN CHARINDEX('-', S.item) = 0 THEN 0

    ELSE CONVERT(INT, SUBSTRING(S.item, CHARINDEX('-', S.item) + 1, 50))

    END

    FROM Split S

    UNION ALL

    SELECT L.value + 1,

    L.max_value

    FROM Listed L

    WHERE L.max_value > L.value

    )

    SELECT value

    FROM Listed;

    GO

    -- Show that the function is deterministic

    SELECT is_deterministic = OBJECTPROPERTYEX(OBJECT_ID('dbo.ListValues', 'IF'), 'IsDeterministic');

    GO

    SET NOCOUNT ON;

    -- Sample table

    DECLARE @Sample

    TABLE (

    data VARCHAR(50)

    COLLATE Latin1_General_BIN

    NOT NULL

    );

    -- Sample data

    INSERT @Sample

    SELECT '44100-44110, 44150' UNION ALL

    SELECT '44115, 44125, 44157' UNION ALL

    SELECT '44250-44350, 44128, 44524-44555, 44112, 52144';

    -- Results

    SET STATISTICS IO, TIME ON;

    SELECT LV.value

    FROM @Sample S

    CROSS

    APPLY dbo.ListValues (S.data) LV;

    SET STATISTICS IO, TIME OFF;

    GO

    -- Tidy up

    DROP FUNCTION dbo.ListValues;

    -- @Sample : scan 1 reads 1

    -- Worktable: scan 8 reads 962

    -- CPU : 0 ms

    -- Elapsed : 2 ms

  • Nice Paul, very nice! No tally table at all, just using CharIndex's ability to get the next occurrence. And, of course, your always-present CROSS APPLY 😀

    Good job. I also learned that you can combine multiple SET STATISTICS on the same statement... I never realized that before.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (3/26/2010)


    And, of course, your always-present CROSS APPLY 😀

    I knew I should have posted a CLR solution - I am becoming type-cast! 😉

    I also learned that you can combine multiple SET STATISTICS on the same statement.

    You can use the same shorthand elsewhere too. Scripts that I write that invoke XML methods, indexed views...or one of any number of features that require a certain configuration of session SET statements...are often prefixed with:

    SET ANSI_NULLS,

    ANSI_PADDING,

    ANSI_WARNINGS,

    ARITHABORT,

    CONCAT_NULL_YIELDS_NULL,

    QUOTED_IDENTIFIER

    ON;

    SET NUMERIC_ROUNDABORT

    OFF;

    Thank you for the kind comments.

    Paul

Viewing 15 posts - 1 through 15 (of 27 total)

You must be logged in to reply to this topic. Login to reply