Removing HTML tags

  • I've read many pages on the web about doing this and many come back to this page:

    https://www.sqlservercentral.com/Forums/1197668/Strip-HTML-Tags

    I have a SQL view that is called my the application framework I use. I have no control over the application framework other than being able to create views that screens in the framework can call to populate fields on a screen. A change in that framework a year or so ago caused a little problem for me. Communication data, eg, calls, emails, letters, that sort of stuff, is recorded in the system. Text was put into a Notes field of type nvarchar(max). Now when emails are filed the email data goes into an Email field instead of the Notes field. If the communication is an email, then Notes will be null otherwise everything goes into Notes. However, the HTML code is also stored in there to allow the original email to be displayed. So, in my view I have a calculated field that reads:

    isnull(notes, email) as customnotes

    Then when I call the view I can pull out the notes and email text into a single field as it used to be, but it now contains the HTML. So, I tried one of the many offerings of code that strips HTML. Fine, that works, until after a year or two of use the table is now getting very large and displaying the screen can now take 20 - 30 seconds as it calls the view to strip the HTML. I don't have access to the database. I can't add any more indexing, so any thoughts of that are out of the window. I can't change the framework. All I can do is create SQL views and SQL functions through the front end which are committed to the database. Does SQL 2014 have any inbuilt methods of deleting any HTML markup? Some of the rows contain doctype markup and some are just plain HTML, it depends on the email client that sends the email. All I want is the text of the main email in a calculated field.

    Anything new on this area?

    Regards,
    Lee

  • You may well be out of luck....   nvarchar(max) or varchar(max) is going to cause it's own set of issues for the table involved, as for some rows the data is going to be forced to be stored "off row", in a LOB storage element.   One question - is that notes or e-mail column ever updated, or is it totally static after the initial insert?   If you have absolutely ZERO control over the database, you may not be able to do much, and will have to hand the problem off to those folks that handle the database,    I'm not aware of any built-in functions that strip out HTML tags, and besides, HTML can be constructed in so many ways that such a task is anything but trivial, and it's not as if SQL Server is all that good at string handling to begin with.   I think your problem is more of a database design issue than anything else.   The question is what made folks think that storing that e-mail in a separate column was such a good idea...

    You may want to suggest to your DBA (or point them to this topic) that LOB data types be ALL stored "off row" for the table involved, as there's likely to be some page splits that aren't pretty involved in what you have.   The other thing is whether or not you really need nvarchar vs. varchar, as that just doubles how much storage it takes to keep things without a whole lot of benefit.   Unless you are using national characters from a language other than English, you don't usually ever need nvarchar data types for anything other than dynamic SQL.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • I can't imagine what the advantage would be to capture the markup, sounds like you should try to push the software company to strip before they store it.
    Is there any way to preprocess and store in your own table somewhere?

    -------------------------------------------------------------------------------------------------------------------------------------
    Please follow Best Practices For Posting On Forums to receive quicker and higher quality responses

  • Follow the advice above, but if you need this, a couple things.

    You could use a CLR function, but really I'd suggest that you have a process every night (hour, 4 hours, whatever) that looks to go through and strip our markdown and html. I'd make these separate processes, and run them over a set of rows. Use another table to track what you've processed so that you don't reprocess rows. If you need to fix a particular record, then you can manually call your process.

    This is string manipulation and there are libraries to help. Or you can just look to replace a series of tokens like <p> or <div> with some other character or nothing. One thing to be aware of, there might not be carriage returns or line feeds in HTML, so if you strip tags, add those back.

    This is also complex because some apps won't build well formed HTML, meaning all tags closed in reverse order of opening. Many browsers work with some breakage, but it's a nightmare to strip out if you are looking for ending tags. I'd make sure you just remove all beginning tags, then separately, search and remove ending tags.

  • @sgmunson. Sadly it is an nvarchar as the system is designed to run in multiple languages and handles those with different character sets. The notes field is most likely to stay put after it is set, but it can change. The email content is not editable so that is static once set.

    @Jonathan.crawford. The markup is kept so that the email is displayed properly when viewed at a later date as the viewer uses it to format it as if it was the original. I'd personally like to see the system offer a way to get the text only but I know that isn't a trivial task.

    @steve-2 Jones. I think your point about stripping some tags may remove carriage returns and other standard formatting things puts the nail in the coffin as it then becomes a major task, not just stripping, but replacing etc.

    There are plenty of example functions but to act dynamically when called from a view on large tables makes them a bad solution. I think it might be that I end up having a separate column where I store the stripped text and have it update on a nightly basis. I just can't think of a decent way of doing it within the framework. The major problem is malformed HTML and the different ways markup can be delivered.

    Thanks all for your input.

    /L

    Regards,
    Lee

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply