Remove HTML from varchar field?

Question

Remove HTML from varchar field?

Rob-350472

SSCertifiable

Points: 7197
More actions
February 8, 2012 at 10:30 am

#256528

I've got a varchar(max) field which I've just discovered contains data like this:
Blah blah blah <a href=" /xxx/yy/bbb.aspx?ID=111111 "> Entity Name </a>, blah blah blah...
I'm trying to retrieve the column minus the html part - i.e. blah blah, entity name, blah blah.
In this field the above could occur once, multiple times, or nothing. In addition the length of the ID parameter can be different (from 1 to something like 100000).
Sort of scratching my head on this, thinking of all sorts of replaces, finds, patindex etc and wondering if there's a more elegant solution I've not thought of?!
Thoughts and suggestions much appreciated!

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Recurs1on SSCommitted Points: 1974 More actions · Answer 1

Try something like this:

SELECT CONVERT( XML, N'Blah blah blah <a href=" /xxx/yy/bbb.aspx?ID=111111 "> Entity Name </a>, blah blah blah' ).value( '.', 'VARCHAR(MAX)' )

Sean Lange SSC Guru Points: 286602 More actions · Answer 2

That will work if ALL your HTML is properly formatted so it can be parsed into xml. This is likely not going to be the case and properly removing all html is then going to be tricky.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Lowell SSC Guru Points: 323527 More actions · Answer 3

this is where regular expressions will help out;

can you install a CLR on your server to let you do regular expressions?

converting a string is a two part process, I think;

you have to convert HTML entities like [& lt ;] and [& nbsp ;] and all the other possible html escape values, and then strip out everything between < and > brackets.

here's a VB function i use, and I added one example that describes what i meant:

*edit] the forum software stripped out my nbsp, which i edited with {nbsp;}, you need to substitute the real html entity

Friend Function StripHTML(ByVal htmlString As String) As String

'This pattern Matches everything found inside html tags;

'(.|) - > Look for any character or a new line

' *? -> 0 or more occurrences, and make a non-greedy search meaning

'That the match will stop at the first available '>' it sees, and not at the last one

'(if it stopped at the last one we could have overlooked

'nested HTML tags inside a bigger HTML tag..)

Try

If htmlString Is Nothing OrElse htmlString = String.Empty Then Return String.Empty

htmlString = Replace(htmlString.ToString, "{nbsp;}", " ", , , CompareMethod.Text)

Dim pattern As String = "<(.|)*?>"

Return System.Text.RegularExpressions.Regex.Replace(htmlString, pattern, String.Empty)

Catch ex As Exception

Return ""

End Try

End Function

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Andy Hyslop SSCrazy Eights Points: 9198 More actions · Answer 4

Will This do what you need?

==========================================================================================================================
A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratcliffe

Rob-350472 SSCertifiable Points: 7197 More actions · Answer 5

@Recurs1on: this seemed to work for me on the few examples I tried, however, I considered whether ALL of the html would be correct, as it's users (not developers) entering this there's a high chance it might not be!

I was playing around with this when a colleague chipped in that he had a function to achieve this, its very similar to the one andyhyslop just linked to, my colleague did admit to finding it on the web!

Thanks guys, the function seems to solve it.

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 6

I agree with Lowell - CLR RegEx is the way to go here.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Jeff Moden SSC Guru Points: 1004605 More actions · Answer 7

Rob-350472 (2/8/2012)
I've got a varchar(max) field which I've just discovered contains data like this:
Blah blah blah <a href=" /xxx/yy/bbb.aspx?ID=111111 "> Entity Name </a>, blah blah blah...
I'm trying to retrieve the column minus the html part - i.e. blah blah, entity name, blah blah.
In this field the above could occur once, multiple times, or nothing. In addition the length of the ID parameter can be different (from 1 to something like 100000).
Sort of scratching my head on this, thinking of all sorts of replaces, finds, patindex etc and wondering if there's a more elegant solution I've not thought of?!
Thoughts and suggestions much appreciated!

This could actually be fairly easy. Are you looking for ONLY those tags that start with "<a href" ?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Remove HTML from varchar field?

Cookies on SQLServerCentral