February 8, 2012 at 10:30 am
I've got a varchar(max) field which I've just discovered contains data like this:
Blah blah blah <a href=" /xxx/yy/bbb.aspx?ID=111111 "> Entity Name </a>, blah blah blah...
I'm trying to retrieve the column minus the html part - i.e. blah blah, entity name, blah blah.
In this field the above could occur once, multiple times, or nothing. In addition the length of the ID parameter can be different (from 1 to something like 100000).
Sort of scratching my head on this, thinking of all sorts of replaces, finds, patindex etc and wondering if there's a more elegant solution I've not thought of?!
Thoughts and suggestions much appreciated!
February 8, 2012 at 10:48 am
Try something like this:
SELECT CONVERT( XML, N'Blah blah blah <a href=" /xxx/yy/bbb.aspx?ID=111111 "> Entity Name </a>, blah blah blah' ).value( '.', 'VARCHAR(MAX)' )
February 8, 2012 at 11:02 am
That will work if ALL your HTML is properly formatted so it can be parsed into xml. This is likely not going to be the case and properly removing all html is then going to be tricky.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
February 8, 2012 at 12:24 pm
this is where regular expressions will help out;
can you install a CLR on your server to let you do regular expressions?
converting a string is a two part process, I think;
you have to convert HTML entities like [& lt ;] and [& nbsp ;] and all the other possible html escape values, and then strip out everything between < and > brackets.
here's a VB function i use, and I added one example that describes what i meant:
*edit] the forum software stripped out my nbsp, which i edited with {nbsp;}, you need to substitute the real html entity
Friend Function StripHTML(ByVal htmlString As String) As String
'This pattern Matches everything found inside html tags;
'(.|) - > Look for any character or a new line
' *? -> 0 or more occurrences, and make a non-greedy search meaning
'That the match will stop at the first available '>' it sees, and not at the last one
'(if it stopped at the last one we could have overlooked
'nested HTML tags inside a bigger HTML tag..)
Try
If htmlString Is Nothing OrElse htmlString = String.Empty Then Return String.Empty
htmlString = Replace(htmlString.ToString, "{nbsp;}", " ", , , CompareMethod.Text)
Dim pattern As String = "<(.|)*?>"
Return System.Text.RegularExpressions.Regex.Replace(htmlString, pattern, String.Empty)
Catch ex As Exception
Return ""
End Try
End Function
Lowell
February 9, 2012 at 4:17 am
Will This do what you need?
==========================================================================================================================
A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratcliffe
February 9, 2012 at 5:02 am
@Recurs1on: this seemed to work for me on the few examples I tried, however, I considered whether ALL of the html would be correct, as it's users (not developers) entering this there's a high chance it might not be!
I was playing around with this when a colleague chipped in that he had a function to achieve this, its very similar to the one andyhyslop just linked to, my colleague did admit to finding it on the web!
Thanks guys, the function seems to solve it.
February 10, 2012 at 5:33 am
I agree with Lowell - CLR RegEx is the way to go here.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
February 11, 2012 at 3:49 am
Rob-350472 (2/8/2012)
I've got a varchar(max) field which I've just discovered contains data like this:Blah blah blah <a href=" /xxx/yy/bbb.aspx?ID=111111 "> Entity Name </a>, blah blah blah...
I'm trying to retrieve the column minus the html part - i.e. blah blah, entity name, blah blah.
In this field the above could occur once, multiple times, or nothing. In addition the length of the ID parameter can be different (from 1 to something like 100000).
Sort of scratching my head on this, thinking of all sorts of replaces, finds, patindex etc and wondering if there's a more elegant solution I've not thought of?!
Thoughts and suggestions much appreciated!
This could actually be fairly easy. Are you looking for ONLY those tags that start with "<a href" ?
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 8 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply
This website stores cookies on your computer.
These cookies are used to improve your website experience and provide more personalized services to you, both on this website and through other media.
To find out more about the cookies we use, see our Privacy Policy