Blog Post

Removing HTML formatting tag - BR - from XML using XQuery

,

This is a personal pet hate of mine.. Mixing HTML formatting tags directly with XML that aren’t escaped or contained within a CDATA section. Sure it is all perfectly valid XML at the end of the day, but it makes working with it a bit harder as each section of text between each <BR/> tag is treated as a separate text node and has to be factored into any schema for that XML structure.

I’ve been working on an XML migration project recently that has a lot of this going on. Data that when it was inputed into a web based GUI had line breaks added by the user to separate various bits of text to increase readibility, but when stored in the db, the line breaks in the GUI were changed to <BR/> tags. This was because it was “easier”? to render it back to the users when they were viewing the data apparently. Surely that is what the <pre> tag is for isn’t it so that line breaks are preserved? Anyway, I’m not an out and out web developer so I’m sure at the time there were reasons, but I still believe that if <BR/> tags are needed to be stored within your XML solely as HTML formatting, then escape them and not add them as actual XML nodes.

So during the course of this project, I had to convert these <BR/> nodes in the xml to line breaks. The obvious quick and dirty way is to cast the xml to string and then replace the expression "<BR/>" in the string with a line break i.e. CHAR(13)+CHAR(10) before casting back to XML again. Definitely won’t win you any coding awards but fine in some circumstances because it just works and if the job in hand is a one off that you just need to get done quickly and you know the data well enough that this crude method won’t actually damage the XML, then what’s the problem… I’m not ashamed to have used this method once or twice on quick ad-hoc jobs. Why spend good time on single-use throw away code.

But, it was about time that I came up with a better solution purely for my own curiosity using XQuery. Imagine this XML structure containing a top level <Root> node with a single child node <Data> that contains many text nodes because it has lots child <BR> nodes. The <Data> node is actually just one long string that has had line breaks converted to the html tag <BR/>.

<Root>
  <Data>In a strange city lying alone<BR/>Far down within the dim West<BR/><BR/>Where the good and the bad and the worst and the best<BR/>Have gone to their eternal rest.</Data>
<Root>

Now I need to replace the <BR/> nodes with line breaks to make it a single text node containing the entire string. Bizarrely the final code that I came up with to achieve this doesn’t reflect the number of evenings I sat pondering this. I went down the wrong road initially and weren't too happy with results. But finally tonight after putting it down for a couple of days, I found a solution that seems to work well:

SELECT @Xml.query('
<Root>
    <Data>
        {
            for $x in /Root/Data/node()
            return
                if ($x[local-name(.)="BR"]) then
                    text{"
"}
                else
                    $x
        }
    </Data>
</Root>'
)

Basically all the logic is within the flwor statement. It gets all the nodes under <Data> and then proceeds to iterate through them one by one checking the name of the node. If the node name is “BR”, then it returns a new text node that has the value set to a line break, otherwise it will just return the node from that iteration. Pretty simple code really.. When run, the resultant XML looks like:

<Root>
  <Data>In a strange city lying alone
Far down within the dim West

Where the good and the bad and the worst and the best
Have gone to their eternal rest.</Data>
<Root>

Perfect. Curiosity has been satisfied. Now I’m off to work out why it took me 3 or 4 evenings to come up with less than a dozen lines of code!

Enjoy!

Download Code

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating