Remove HTML from varchar field?

  • I've got a varchar(max) field which I've just discovered contains data like this:

    Blah blah blah <a href=" /xxx/yy/bbb.aspx?ID=111111 "> Entity Name </a>, blah blah blah...

    I'm trying to retrieve the column minus the html part - i.e. blah blah, entity name, blah blah.

    In this field the above could occur once, multiple times, or nothing. In addition the length of the ID parameter can be different (from 1 to something like 100000).

    Sort of scratching my head on this, thinking of all sorts of replaces, finds, patindex etc and wondering if there's a more elegant solution I've not thought of?!

    Thoughts and suggestions much appreciated!

  • Try something like this:

    SELECT CONVERT( XML, N'Blah blah blah <a href=" /xxx/yy/bbb.aspx?ID=111111 "> Entity Name </a>, blah blah blah' ).value( '.', 'VARCHAR(MAX)' )

  • That will work if ALL your HTML is properly formatted so it can be parsed into xml. This is likely not going to be the case and properly removing all html is then going to be tricky.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • this is where regular expressions will help out;

    can you install a CLR on your server to let you do regular expressions?

    converting a string is a two part process, I think;

    you have to convert HTML entities like [& lt ;] and [& nbsp ;] and all the other possible html escape values, and then strip out everything between < and > brackets.

    here's a VB function i use, and I added one example that describes what i meant:

    *edit] the forum software stripped out my nbsp, which i edited with {nbsp;}, you need to substitute the real html entity

    Friend Function StripHTML(ByVal htmlString As String) As String

    'This pattern Matches everything found inside html tags;

    '(.|) - > Look for any character or a new line

    ' *? -> 0 or more occurrences, and make a non-greedy search meaning

    'That the match will stop at the first available '>' it sees, and not at the last one

    '(if it stopped at the last one we could have overlooked

    'nested HTML tags inside a bigger HTML tag..)

    Try

    If htmlString Is Nothing OrElse htmlString = String.Empty Then Return String.Empty

    htmlString = Replace(htmlString.ToString, "{nbsp;}", " ", , , CompareMethod.Text)

    Dim pattern As String = "<(.|)*?>"

    Return System.Text.RegularExpressions.Regex.Replace(htmlString, pattern, String.Empty)

    Catch ex As Exception

    Return ""

    End Try

    End Function

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Will This do what you need?

    ==========================================================================================================================
    A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratcliffe

  • @Recurs1on: this seemed to work for me on the few examples I tried, however, I considered whether ALL of the html would be correct, as it's users (not developers) entering this there's a high chance it might not be!

    I was playing around with this when a colleague chipped in that he had a function to achieve this, its very similar to the one andyhyslop just linked to, my colleague did admit to finding it on the web!

    Thanks guys, the function seems to solve it.

  • I agree with Lowell - CLR RegEx is the way to go here.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Rob-350472 (2/8/2012)


    I've got a varchar(max) field which I've just discovered contains data like this:

    Blah blah blah <a href=" /xxx/yy/bbb.aspx?ID=111111 "> Entity Name </a>, blah blah blah...

    I'm trying to retrieve the column minus the html part - i.e. blah blah, entity name, blah blah.

    In this field the above could occur once, multiple times, or nothing. In addition the length of the ID parameter can be different (from 1 to something like 100000).

    Sort of scratching my head on this, thinking of all sorts of replaces, finds, patindex etc and wondering if there's a more elegant solution I've not thought of?!

    Thoughts and suggestions much appreciated!

    This could actually be fairly easy. Are you looking for ONLY those tags that start with "<a href" ?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply