Parsing large varbinary fields

  • Hello all,

    I have a table with raw scientific test results in a single field, some of which are over 25Mb field. I need to parse into the field to find and aggregate selected values from the field.

    Table structure is

    CREATE TABLE [dbo].[Gxxx_Data](

    [id] [uniqueidentifier] NOT NULL,

    [Status] [nvarchar](50) NULL,

    [GxxxItem_ID] [int] NULL,

    [Stats_Data] [varbinary](max) NULL,

    ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

    The data, after conversion should look something like this:

    fmod.ThreadContext.Inst_Cnt.12120 2040 # 1TS[highlight=#ffff11]mov[/highlight] (1) r3.4<1>:ud r1.2<0;1,0>:ud { Align1, Q1 }

    fmod.ThreadContext.Inst_Cnt.12136 2040 # 1TS[highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }

    fmod.ThreadContext.Inst_Cnt.12152 2040 # 1TS[highlight=#ffff11]mov[/highlight] (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }

    From which I need to parse and summarize the (Assembler) opcodes (MOV,CMPi, SHR etc...)

    I need to parse the large field [Stats_Data] to locate the target data.

    The internal result strings are delimited with Char(10), conservative counts are from 64k to over 100k lines in each record.

    Is there a way to parse the individual lines into another table (temp) that would be queried/regexed ?

    Thank you in advance

    Tom in Sacramento For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • The easiest way would be to do things "on the way in". How do these large Stats_Data values get into the table you posted to begin with?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    I agree that would be easiest however that still leaves the existing records, is there a smarter way of dealing with them?

    Maybe I should build the search in binary format and patindex to the first 'line' which is usually around 55M bytes into the file. I know that cursors are spoken of badly, but does a cte or set based solution come to mind?

    I appreciate your thoughts on this

    Tom in Sacramento For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • The internal result strings are delimited with Char(10), conservative counts are from 64k to over 100k lines in each record. ...

    Is there a way to parse the individual lines into another table (temp) that would be queried/regexed ?

    Since the strings are delimited with a CHAR(10) you could split the strings using Jeff's splitter (referenced in my signature). You would have to modify the splitter to deal with varchar(max). I have a modified (and unofficial) version of it that I use for strings longer that 8K. Using the splitter you could also filter out rows that don't contain the data that you need to aggregate. Then parse the ones that you do accordingly.

    Tom_Sacramento (6/9/2015)


    Hi Jeff,

    ... is there a smarter way of dealing with them?

    Maybe I should build the search in binary format and patindex to the first 'line' which is usually around 55M bytes into the file. I know that cursors are spoken of badly, but does a cte or set based solution come to mind?

    I appreciate your thoughts on this

    Depending on exactly what you are parsing I am confident that you could parse out the data that you need without Regex using a purely set-based solution. Is it possible for you to post an example of what you need to parse? Also are you saying that that data that you need is in the first "line"?

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Hi Alan,

    The converted string (from varbinary) looks like these 3 records (targets around position 40 but may be up to position 73, ergo my thought on REGEX...):

    1 2 3 4 5 6

    01234567890123456789012345678901234567890123456789012345678901234567890

    fmod.ThreadContext.Inst_Cnt.99920 2040 # 1TSmov (1) r3.4<1>:ud r1.2<0;1,0>:ud { Align1, Q1 }

    fmod.ThreadContext.Inst_Cnt.99936 2040 # 1TSshl (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }

    fmod.ThreadContext.Inst_Cnt.99952 2040 # 1TSmov (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }

    From these 'lines' I need the OpCodes (mnemonics @pos 40).

    The row markers are consistent, starting with 'fmod.ThreadContext' of which there are usually 1,000 to 1,500 records of interest

    Not sure how split this much data with the 8k splitter

    Thank you

    Tom in Sacramento For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • Hi Tom,

    Understood on the existing rows.

    Peeling 1 potato at a time, can you use SQL CLR? The reason I ask is because there's a decent SQL CLR splitter that will do the trick on the first pass on these rows. Lemme know on that.

    On the rest of it, I need to know two things, please.

    1) What is the primary key column of the table and

    2) Given the following row, what exactly do you want returned as part of the Assembly Mnemonic (or otherwise)?

    fmod.ThreadContext.Inst_Cnt.99936 2040 # 1 TS shl (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • He Jeff,

    The primary key is a GUID called ID - not my idea, but this is an inherited database...

    What I need is a count of each of the 3 or 4 character OpCodes - they are usually found between position 40 and 73 on a given line. (I put together a list of opcodes in a table in case that is useful since there are only @12 different OpCodes - Add, And, ASL, ASR, Cmp, JmpI, Mov, SHL, SHR...)

    The lines translate from varbinary to varchar like this:

    fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS shl (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }

    Note the first half of the line - all targets start with 'fmod.ThreadContext.Inst_Cnt', followed by more variable text, followed by 5 or more spaces then the OpCode

    Thank you,

    Tom

    Tom in Sacramento For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • ok, using pure string manipulation, can you ASSUME, the value you want is between a tab and the first left parenthesis?

    someone will pretty this up with a cross apply, which looks prettier than the inline stuff, but the concept seems to hold true.

    this would work dynamically, i think.

    /*--Results

    MySubString | MyReversedAndTrimmed SubStringToTab SubStringToTab

    fmod.ThreadContext.Inst_Cnt.99920 2040 # 1 TS mov | vom ST 1 # 0402 02999.tnC_tsnI.txetnoCdaerhT.domf vom mov

    fmod.ThreadContext.Inst_Cnt.99936 2040 # 1 TS shl | lhs ST 1 # 0402 63999.tnC_tsnI.txetnoCdaerhT.domf lhs shl

    fmod.ThreadContext.Inst_Cnt.99952 2040 # 1 TS mov | vom ST 1 # 0402 25999.tnC_tsnI.txetnoCdaerhT.domf vom mov

    */

    --create demo data

    IF OBJECT_ID('tempdb.[dbo].[#Gxxx_Data]') IS NOT NULL

    DROP TABLE [dbo].[#Gxxx_Data]

    CREATE TABLE #Gxxx_Data(

    [id] [uniqueidentifier] NOT NULL,

    [Status] [nvarchar](50) NULL,

    [GxxxItem_ID] [int] NULL,

    [Stats_Data] [varbinary](max) NULL)

    INSERT INTO #Gxxx_Data([id],[Stats_Data])

    SELECT NEWID(),CONVERT(varbinary(max),'fmod.ThreadContext.Inst_Cnt.99920 2040 # 1TSmov (1) r3.4<1>:ud r1.2<0;1,0>:ud { Align1, Q1 }') UNION ALL

    SELECT NEWID(),CONVERT(varbinary(max),'fmod.ThreadContext.Inst_Cnt.99936 2040 # 1TSshl (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }') UNION ALL

    SELECT NEWID(),CONVERT(varbinary(max),'fmod.ThreadContext.Inst_Cnt.99952 2040 # 1TSmov (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }');

    WITH MyCTE

    AS

    (

    SELECT

    id,

    CONVERT(VARCHAR(max),[Stats_Data]) AS [Stats_Data],

    CHARINDEX('(',CONVERT(VARCHAR(max),[Stats_Data]))-1 As FirstLeftParen

    FROM #Gxxx_Data

    )

    SELECT

    LEFT([Stats_Data],FirstLeftParen) AS MySubString,

    RTRIM(REVERSE(LEFT([Stats_Data],FirstLeftParen))) As MyReversedAndTrimmed,

    LEFT(RTRIM(REVERSE(LEFT([Stats_Data],FirstLeftParen))),CHARINDEX(CHAR(9),RTRIM(REVERSE(LEFT([Stats_Data],FirstLeftParen))))) As SubStringToTab,

    REVERSE(LEFT(RTRIM(REVERSE(LEFT([Stats_Data],FirstLeftParen))),CHARINDEX(CHAR(9),RTRIM(REVERSE(LEFT([Stats_Data],FirstLeftParen)))))) As SubStringToTab

    FROM MyCTE

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Tom_Sacramento (6/10/2015)


    He Jeff,

    The primary key is a GUID called ID - not my idea, but this is an inherited database...

    What I need is a count of each of the 3 or 4 character OpCodes - they are usually found between position 40 and 73 on a given line. (I put together a list of opcodes in a table in case that is useful since there are only @12 different OpCodes - Add, And, ASL, ASR, Cmp, JmpI, Mov, SHL, SHR...)

    The lines translate from varbinary to varchar like this:

    fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS shl (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }

    Note the first half of the line - all targets start with 'fmod.ThreadContext.Inst_Cnt', followed by more variable text, followed by 5 or more spaces then the OpCode

    Thank you,

    Tom

    Thanks, Tom. Lowell did pretty much the same thing I would have. I'd give that a try. It's a generic solution that relies on the position of the TAB rather than on textual content, which could change.

    I guess my only question would be, does the data ever contain a Mnemonic that does not include a left parenthesis?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff and Lowell,

    Just to clarify the data, it looks like this in the table:

    0x0A0D302064657070696B5365426F5464656B72614D73766F4D6D754E646E

    From this I need to break it into manageable (VARCHAR) pieces, like this into a temp table:

    fmod.weSamplerEuBypass.weNumEuBypassSendsApplied 0

    fmod.weSamplerEuBypass.weNumEuBypassSendsExecuted 0

    Then search each temp table record for one of the known OpCodes

    Move to the next temp table record and repeat until the EOF

    Jeff - you mentioned a SQLCLR splitter ... since the field is so large(64-100Mb) it sounds like the right tool due to limitations of varchar(max)

    Thanks much

    Tom in Sacramento For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • What I need is a count of each of the 3 or 4 character OpCodes - they are usually found between position 40 and 73 on a given line. (I put together a list of opcodes in a table in case that is useful since there are only @12 different OpCodes

    Are you trying to count number of occurances of some particular codes?

    If theat is what you really want, you don't need to parse/split your values/

    Based on your sample data example you can do just this:

    declare @val varchar(max)

    set @val = 'fmod.ThreadContext.Inst_Cnt.12120 2040 # 1 TS [highlight=#ffff11]mov[/highlight] (1) r3.4<1>:ud r1.2<0;1,0>:ud { Align1, Q1 }

    fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }

    fmod.ThreadContext.Inst_Cnt.12152 2040 # 1 TS [highlight=#ffff11]mov[/highlight] (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }

    fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }

    fmod.ThreadContext.Inst_Cnt.12120 2040 # 1 TS [highlight=#ffff11]mov[/highlight] (1) r3.4<1>:ud r1.2<0;1,0>:ud { Align1, Q1 }

    fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }

    fmod.ThreadContext.Inst_Cnt.12152 2040 # 1 TS [highlight=#ffff11]mov[/highlight] (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }

    fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }

    fmod.ThreadContext.Inst_Cnt.12152 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }

    fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }

    '

    -- That will give you 100,000 lines

    select top 10 @val = @val + @val

    from sys.syscolumns

    -- get count (just apply it to your column for each opcode you have)

    SELECT (LEN(@val ) - LEN(REPLACE(@val, ']shl[', '')))/LEN(']shl[') AS Count_SHL

    ,(LEN(@val ) - LEN(REPLACE(@val, ']mov[', '')))/LEN(']mov[') AS Count_MOV

    ops. small code correction

    even better one:

    -- get count (just apply it to your column)

    SELECT OpCode, (LEN(@val ) - LEN(REPLACE(@val, ChkVal, '')))/LEN(ChkVal) AS Count

    FROM (VALUES ('Add'), ('And'), ('ASL'), ('ASR'), ('Cmp'), ('JmpI'), ('Mov'), ('SHL'), ('SHR')) c(OpCode)

    CROSS APPLY (SELECT ']' + OpCode + '[') f(ChkVal)

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Hi Eugene,

    I wish it were that easy but the first problem is that the data is in a very large varbinary field(64-100Mb) that contains 50-150,000 lines (delimited Char(10)) as hex:

    0x232053696D436F72652E646C6C0D0A232076657273696F6E3A2031352D30342D323820623035393438382028723132383130290D0A23206275696C643A20202031352D30342D3239206C61625F63692052656C656173654943432833322D626974290D0A23204

    From this big piece of data I need to break it into lines, that have standard segment markers. Since I can't get PatIndex to deal with Varbinary(max) I'm thinking that I will have to character-by-character read the field until I find the line break (0x0a) and then if the line starts with my target string then write that into a temp table that I can count the OpCodes (1 per line)

    Hopefully that clarifies my dilemma

    Thank you,

    Tom

    Tom in Sacramento For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • Tom_Sacramento (6/10/2015)


    Thanks Jeff and Lowell,

    Just to clarify the data, it looks like this in the table:

    0x0A0D302064657070696B5365426F5464656B72614D73766F4D6D754E646E

    From this I need to break it into manageable (VARCHAR) pieces, like this into a temp table:

    fmod.weSamplerEuBypass.weNumEuBypassSendsApplied 0

    fmod.weSamplerEuBypass.weNumEuBypassSendsExecuted 0

    Then search each temp table record for one of the known OpCodes

    Move to the next temp table record and repeat until the EOF

    Jeff - you mentioned a SQLCLR splitter ... since the field is so large(64-100Mb) it sounds like the right tool due to limitations of varchar(max)

    Thanks much

    Apologies for the delay. I'll dig it out and see if it will handle multi-character parameters.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Crud. I found it but it won't work because it only accepts a single character delimiter. I'll try something else.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ah. wait a minute. I've apparently gotten two different threads mixed up. The individual internal "rows" in your large binary are delimited by CHAR(10), correct?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply