Need help parsing HTML data stored as varchar(max)!!!!

  • Hi guys,

    I am a bit new to this, but here is my question-

    How do i recursively parse through HTML data stored as one unique record and extract each occcurence of a special character?

    For Example-

    Create table #A(ID int , Content vacrhar(max)

    Insert into #A(1, <a href="Some text #id = 'AB123'> sometext </a> <a href = "Some text #id ='BC234'>sometext</a>)

    Insert into #B(1, <a href="Some text #id = 'XY234'> sometext </a> <a href = "Some text #id ='JKL11'>sometext</a>)

    I only need all the occurences of data after text #id , i.e. AB123 , BC234 etc

    Im able to use the Substring and PATINDEX function to succesfully retrieve the first instance-AB123

    select SUBSTRING(a.content,

    PATINDEX('%text#id=%', a.content) +7, 6)

    from #A a

    But how do 'move' the pointer so that it moves to the next occurence BC234.

    I would like to know if it is possible and if so which functions to use.

    Thanks

  • Before trying to resolve this issue one question:

    Would it be possible to store the data in xml format instead of html?

    Reason:

    String parsing is dangerous since it might lead to unexpected results.

    How would you successfully parse the following html string?

    <a href="Some text #id = 'text #id =123'"> sometext </a> <a href = "Some text #id ='JKL11'">sometext</a>



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • @Imu- Yes i could, i have not done it before, but if it can be simply stored using a cast or convert function then i could as i simply have to retrive the text, i can modify the source data.

    As to your second question-i am not sure i can definitely obtain the first text value retreving the second, third etc is more challenging as i am pretty unfamiliar with using pointers and text.

  • If you could get the source data in xml format it'll be an easy task to do...

    DECLARE @t TABLE (ID INT , Conten XML)

    INSERT INTO @t VALUES(1, '<a id="AB123">sometext </a><a id="BC234">sometext </a>')

    INSERT INTO @t VALUES(2, '<a id="XY234">sometext </a><a id="JKL11">sometext </a>')

    SELECT

    id,

    c.value('@id[1]','varchar(30)') a

    FROM @t t

    CROSS APPLY

    Conten.nodes('a') u(c)

    /* result set

    ida

    1AB123

    1BC234

    2XY234

    2JKL11

    */



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Thankyou..

    But how do i get it into XML format?- when i tried to insert the content (which is html stored as varchar) i get the following error-

    XML parsing: line 1, character 9, A string literal was expected.

  • Based on your statement

    i can modify the source data.

    I assumed you'd be able to convert it into "real" xml format.

    I tried to use the html format and got the following (Please not that I had to double the quotation marks):

    DECLARE @t TABLE (ID INT , Conten XML)

    INSERT INTO @t VALUES(1, '<a href="Some text #id = ''AB123''"> sometext </a> <a href = "Some text #id =''BC234''">sometext</a>')

    Insert into @t values(2, '<a href="Some text #id = ''XY234''"> sometext </a> <a href = "Some text #id =''JKL11''">sometext</a>')

    SELECT

    id,

    c.value('@href[1]','varchar(30)') a

    FROM @t t

    CROSS APPLY

    Conten.nodes('a') u(c)

    /* result set

    ida

    1Some text #id = 'AB123'

    1Some text #id ='BC234'

    2Some text #id = 'XY234'

    2Some text #id ='JKL11'

    */

    Since I had to make some more modification to your sample data to make it valid html data as well as creating a valid SQL statement I'd recommend that you verify the sample data you posted to be correct and we'll work from there.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • I'm sorry what i had meant was- It would be possible to store the source data as XML. There are thousands of these records in the DB. If there is a tool/function to store(convert) the html as XML i could use that. I cannot do it manually though because of the sheer volume. I ahve also not done this conversion before.

    As for the data it is eseentially dummy data that i provided, very simply what i was trying to do in each record was-

    "#####zzfileidzz12####zzfileidzz23#######zzfileidzz44#######" [This is a single record]

    Extract -12

    23

    44

    I am hoping not to post the actual data as i would like it to be secure.

    hope that helps..

  • Pac123 (1/22/2010)


    I'm sorry what i had meant was- It would be possible to store the source data as XML. There are thousands of these records in the DB. If there is a tool/function to store(convert) the html as XML i could use that. I cannot do it manually though because of the sheer volume. I ahve also not done this conversion before.

    As for the data it is eseentially dummy data that i provided, very simply what i was trying to do in each record was-

    "#####zzfileidzz12####zzfileidzz23#######zzfileidzz44#######" [This is a single record]

    ...

    I am hoping not to post the actual data as i would like it to be secure.

    hope that helps..

    I'm totally lost now... :unsure:

    Your statements so far:

    PATINDEX('%text#id=%',

    against sample data like

    <a href="Some text #id = 'AB123'...

    vs.

    It would be possible to store the source data as XML

    vs.

    If there is a tool/function to store(convert) the html as XML i could use that.

    vs.

    "#####zzfileidzz12####zzfileidzz23#######zzfileidzz44#######" [This is a single record]

    - What HTML format is that?????

    Please note that we can't look over your shoulder nor do we know the business case you need to work on. Everything we have is what you post.

    Since it seems like you can't store the data in xml format you should look into the following concept:

    step 1:

    replace the repeating stuff you don't need with a character that's not part of the string pattern you need (e.g. comma).

    Example:

    select replace(replace('#####zzfileidzz12####zzfileidzz23#######zzfileidzz44#######','zzfileidzz',','),'#','')

    -- result set: ,12,23,44

    step 2:

    use a string split function to get the comma separated list into separate rows.

    select * from dbo.DelimitedSplit (replace(replace('#####zzfileidzz12####zzfileidzz23#######zzfileidzz44#######','zzfileidzz',','),'#',''),',')

    /* result set

    ItemIDItem

    1

    212

    323

    444*/

    For details regarding string split functions please see the Tally table link in my signature.

    If that solution won't help you either you'd need to provide sample data that describe your scenario more clearly than what provided so far...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply