Strip HTML Tags

  • Guys,

    Guys I have a table with a column that has html text. The column with html text is pretty big datatye varchar(max)... I wanted to check if any of you have any function that I can use to Strip out the HTML tags... I saw couple of version online, but it was running too slow..

    This is the one I used:

    http://cosier.wordpress.com/2008/10/22/tsql-strip-html-function/

    Any suggestion is helpful.

    Thanks,

    Laura

    Strip Out 🙂

  • Laura_SqlNovice (10/28/2011)


    Guys,

    Guys I have a table with a column that has html text. The column with html text is pretty big datatye varchar(max)... I wanted to check if any of you have any function that I can use to Strip out the HTML tags... I saw couple of version online, but it was running too slow..

    This is the one I used:

    http://cosier.wordpress.com/2008/10/22/tsql-strip-html-function/

    Any suggestion is helpful.

    Thanks,

    Laura

    Strip Out 🙂

    What do you want to do with the characters that remain? Just leave them in the string?

    Also, what about individual lines? If there's a new line in the HTML, do you want a new row in the result set?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • is it HTML or XHTML?

    also, +1 on Jeff's questions...

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Thanks jeff. Yes I just need the string with no HTML tags. IF there is new line I do not need another line, but need one string for each html string. Hope I am making sense. Thanks again.

  • Understood. Let's nail the requirements right to the floor now. If you have the following string, what precisely do you want to get back?

    <data1>somedata1a</data1><data2>somedata2a</data2>

    <data1>somedata1b</data1><data2>somedata2b</data2>

    <data1>somedata1c</data1><data2>somedata2c</data2>

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/30/2011)


    Understood. Let's nail the requirements right to the floor now. If you have the following string, what precisely do you want to get back?

    <data1>somedata1a</data1><data2>somedata2a</data2>

    <data1>somedata1b</data1><data2>somedata2b</data2>

    <data1>somedata1c</data1><data2>somedata2c</data2>

    If these strings are loaded in three rows then I need three rows with no html tags... I need to replace the html tags with '' on a column which has html text. The data type is ntext for that sql column.

    Thanks a lot.

  • Got it. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I refactored the function that I found at the following URL... look for the answer posted by "dudeNumber4"...

    CREATE FUNCTION dbo.StripHtmlTags

    (@HtmlText XML )

    RETURNS NVARCHAR(MAX)

    AS

    BEGIN

    RETURN (

    SELECT contents.value('.', 'nvarchar(max)')

    FROM (

    SELECT contents = chunks.chunk.query('.') FROM @HtmlText.nodes('/') AS chunks(chunk)

    ) doc

    )

    END

    ;

    I tried turning it into an iTVF instead of a scalar function and it took twice as long to run.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks a lot Jeff. This is a good solution for well formatted html. Thanks a lot again Jeff.

  • Jeff Moden (10/30/2011)


    I tried turning it into an iTVF instead of a scalar function and it took twice as long to run.

    I also tried and it took half as long....

    CREATE FUNCTION dbo.StripHtmlTags2

    (@HtmlText XML )

    RETURNS TABLE

    AS

    RETURN

    (

    SELECT

    @HtmlText.value('(.)[1]', 'nvarchar(max)') AS result

    )

    Is this how yours ended up?

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Jeff Moden (10/30/2011)


    I refactored the function that I found at the following URL... look for the answer posted by "dudeNumber4"...

    CREATE FUNCTION dbo.StripHtmlTags

    (@HtmlText XML )

    RETURNS NVARCHAR(MAX)

    AS

    BEGIN

    RETURN (

    SELECT contents.value('.', 'nvarchar(max)')

    FROM (

    SELECT contents = chunks.chunk.query('.') FROM @HtmlText.nodes('/') AS chunks(chunk)

    ) doc

    )

    END

    ;

    I tried turning it into an iTVF instead of a scalar function and it took twice as long to run.

    This is pretty slick. Of course for must of the stuff our marketing department puts together it will fail because they seem incapable of putting together actual valid html. They routinely leave off closing tags which will cause this to fail parsing the html string.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • mister.magoo (10/31/2011)


    Jeff Moden (10/30/2011)


    I tried turning it into an iTVF instead of a scalar function and it took twice as long to run.

    I also tried and it took half as long....

    CREATE FUNCTION dbo.StripHtmlTags2

    (@HtmlText XML )

    RETURNS TABLE

    AS

    RETURN

    (

    SELECT

    @HtmlText.value('(.)[1]', 'nvarchar(max)') AS result

    )

    Is this how yours ended up?

    No... that's even more clever than the one I found and posted. I'll have to give that I try. Thanks, Magoo.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Found this code for removing HTML tags from my data, but I am ashamed to say, I have no clue how it works. Could someone point me in the right direction so I can understand what it is doing?

    create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as

    begin

    declare @textXML xml

    declare @result varchar(max)

    set @textXML = @text;

    with doc(contents) as

    (

    select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)

    )

    select @result = contents.value('.', 'varchar(max)') from doc

    return @result

    end

    go

    What the heck is chunks.chunk.query?? I consider myself to be a Sr. DBA but this has me stumped. I figure it is a xml function of some type but can't seem to find any information on it. Any help greatly appreicated!!!

    Jim

  • I know it has been quite some time since this was posted, but I have a question. The code is looking for an EXACT match to the start tag. However, if you have additional info (e.g. <span style=...> or <div class=...> and </span> or </div> you get a parsing error.

    Is there any way to avoid that? I've found some code that parses character by character but we know that is, like 🙁

  • Jim Youmans-439383 (4/6/2012)


    Found this code for removing HTML tags from my data, but I am ashamed to say, I have no clue how it works. Could someone point me in the right direction so I can understand what it is doing?

    create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as

    begin

    declare @textXML xml

    declare @result varchar(max)

    set @textXML = @text;

    with doc(contents) as

    (

    select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)

    )

    select @result = contents.value('.', 'varchar(max)') from doc

    return @result

    end

    go

    What the heck is chunks.chunk.query?? I consider myself to be a Sr. DBA but this has me stumped. I figure it is a xml function of some type but can't seem to find any information on it. Any help greatly appreicated!!!

    Jim

    chunks.chunk is the name of the derived table and column name created in the FROM clause (as chunks(chunk)), that's all.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Viewing 15 posts - 1 through 15 (of 19 total)

    You must be logged in to reply to this topic. Login to reply