Blog Post

Using XQuery to remove duplicate values or duplicate nodes from an xml instance

,

Sometimes we need to retrieve a list distinct values from within an xml instance or even distinct nodes and this is relatively straight forward using XQuery. It gets a bit more complicated if you have attributes in the xml, but it is still possible to retrieve unique xml nodes. On the flip side, it is also possible to only return xml nodes that are duplicated in the xml instance. All the examples can be downloaded here

Retrieving Distinct Values

Getting a list of distinct atomic values from a sequence can be done by just using the XQuery function distinct-values(). Imagine this XML instance:

<Data>
  <Colour>Red</Colour>
  <Colour>Green</Colour>
  <Colour>Green</Colour>
  <Colour>Blue</Colour>
  <OtherColour>Red</OtherColour>
  <OtherColour>Green</OtherColour>
  <OtherColour>Green</OtherColour>
  <OtherColour>Yellow</OtherColour>
</Data>

To get a list of distinct atomic values of all the <Colour> nodes, all we need to do is this:

SELECT @XML.query('<Data>{ distinct-values(//Colour/text()) }</Data>')

This query returns a single node XML instance which contains all the distinct values from the sequence of //Colour thanks to the XQuery function distinct-values(). Running this will return the following:

<Data>Red Green Blue</Data>

Because the function distinct-values() takes in a sequence of values, we can pass in values from anywhere in the XML instance. The following example will return all the distinct values of the <Colour> and <OtherColour> nodes.

SELECT @XML.query('<Data>{ distinct-values( (//Colour/text(), //OtherColour/text()) ) }</Data>')

This query is passing in a sequence of values constructed from all the <Colour> and the <OtherColour> nodes into the function. This will return the following XML:

<Data>Red Green Blue Yellow</Data>

I sometimes use the distinct-values() function to feed a sequence of values into a bigger flwor statement. For example, take this following XML instance with a duplicated node data in it:

<Data>
  <Customer>
    <AccountNumber>111</AccountNumber>
    <Name>Green IT</Name>
  </Customer>
  <Customer>
    <AccountNumber>111</AccountNumber>
    <Name>Green IT</Name>
  </Customer>
  <Customer>
    <AccountNumber>222</AccountNumber>
    <Name>Red IT</Name>
  </Customer>
  <Customer>
    <AccountNumber>333</AccountNumber>
    <Name>Blue IT</Name>
  </Customer>
</Data>

If I wanted to return an XML instance minus any duplicated data, I could use the distinct-values() function to get a sequence of distinct <AccountNumber> values and with that, I can iterate through that sequence and return the node that contains the first instance of each <AccountNumber>. Note: In this example, I’m making a big assumption that each <Customer> node is identical based on the <AccountNumber> node alone.

SELECT @XML.query('

<Data>

       {

              for $x in distinct-values(//Customer/AccountNumber/text())

              return

                     (//Customer[AccountNumber = $x])[1]

       }

</Data>

')

The flwor statement is first getting a sequence of distinct values from <AccountNumber> using the distinct-values() function which comprises of the values “111”, “222” and “333”. Then for each of those values, it returns the first instance (denoted by the [1]) of a <Customer> node that has a matching <AccountNumber>. The above query returns:

<Data>
  <Customer>
    <AccountNumber>111</AccountNumber>
    <Name>Green IT</Name>
  </Customer>
  <Customer>
    <AccountNumber>222</AccountNumber>
    <Name>Red IT</Name>
  </Customer>
  <Customer>
    <AccountNumber>333</AccountNumber>
    <Name>Blue IT</Name>
  </Customer>
</Data>

Retrieving Distinct Nodes

The last example relied on me making an assumption that each <Customer> node that is duplicated based on the <AccountNumber> would contain the same data. In real life, something like this can be a big and potentially dangerous assumption unless you are absolutely sure about the data in the XML or if you do not care about the differences as such and the first instance of each duplicated value is good enough for your needs.

In situations where I really wanted to truly eliminate all duplicate nodes, then I would compare the entire XML node itself. Taking the XML from above and extending it a bit to the following:

<Data>
  <Customer>
    <AccountNumber>111</AccountNumber>
    <Name>Green IT</Name>
    <Location>USA</Location>
  </Customer>
  <Customer>
    <AccountNumber>111</AccountNumber>
    <Name>Green IT</Name>
    <Location>USA</Location>
  </Customer>
  <Customer>
    <AccountNumber>111</AccountNumber>
    <Name>Green IT</Name>
    <Location>France</Location>
  </Customer>
  <Customer>
    <AccountNumber>333</AccountNumber>
    <Name>Blue IT</Name>
    <Location>Russia</Location>
  </Customer>
  <Customer>
    <AccountNumber>222</AccountNumber>
    <Name>Red IT</Name>
    <Location>Germany</Location>
  </Customer>
  <Customer>
    <AccountNumber>222</AccountNumber>
    <Name>Red IT</Name>
    <Location>Germany</Location>
  </Customer>
  <Customer>
    <AccountNumber>333</AccountNumber>
    <Name>Blue IT</Name>
    <Location>Australia</Location>
  </Customer>
</Data>

By using a similar method of selecting each <Customer> node within a flwor statement, we can use the node comparison operator is to compare with the existing nodes and only return that node if the node being compared is actually the first instance of any matching nodes. It does sound more complicated than the code itself!!

SELECT @XML.query('

<Data>

       {

              for $x in (//Customer)

              return

                     if ($x is (//Customer)[. = $x][1]) then

                           $x

                     else ()

       }

</Data>

')

Returns.

<Data>
  <Customer>
    <AccountNumber>111</AccountNumber>
    <Name>Green IT</Name>
    <Location>USA</Location>
  </Customer>
  <Customer>
    <AccountNumber>111</AccountNumber>
    <Name>Green IT</Name>
    <Location>France</Location>
  </Customer>
  <Customer>
    <AccountNumber>333</AccountNumber>
    <Name>Blue IT</Name>
    <Location>Russia</Location>
  </Customer>
  <Customer>
    <AccountNumber>222</AccountNumber>
    <Name>Red IT</Name>
    <Location>Germany</Location>
  </Customer>
  <Customer>
    <AccountNumber>333</AccountNumber>
    <Name>Blue IT</Name>
    <Location>Australia</Location>
  </Customer>
</Data>

Note: The “=” operator in the query will only compare the nodes themselves. If you have attributes, then these will not be part of the comparison and are effectively ignored. This can be proved with the following simple xml, note that the only difference between each customer is the value of the attribute @AccountNumber.

<Data>
  <Customer AccountNumber="111">
    <Name>Green IT</Name>
    <Location>USA</Location>
  </Customer>
  <Customer AccountNumber="222">
    <Name>Green IT</Name>
    <Location>USA</Location>
  </Customer>
</Data>

If i was to run the same query from above against this small xml example, the query does not detect any differences with the “=” operator. The attribute was totally ignored. The flwor statement only returned one node when we really wanted it to return both nodes. This is what the query incorrectly returned:

<Data>
  <Customer AccountNumber="111">
    <Name>Green IT</Name>
    <Location>USA</Location>
  </Customer>
</Data>

Not what we wanted… The only way in this scenario that you can compare the nodes + attributes and get correct results (that I know off!) is to include the attributes as part of the predicates. Here is the same query from above that compares the nodes, but it now has an additional test that will compare the attribute @AccountNumber in addition to comparing the entire customer node as before:

SELECT @XML.query('

<Data>

       {

              for $x in (//Customer)

              return

                     if ($x is (//Customer)[. = $x]

                                  [$x/@AccountNumber = ./@AccountNumber][1]) then

                           $x

                     else ()

       }

</Data>

')

Running this query on the small example above will correctly return both nodes and in fact not detect any duplication in the xml, which is correct and what we would expect.

So, there you go. It is really easy to get distinct values from a sequence of nodes in an xml instance by just using the distinct-values() function. Getting distinct or unique nodes is also doable as well with a little bit of XQuery creativity. Even if you have attributes in your xml it is still possible to detect duplicate nodes although the predicate could get quite long if you have a large number of attributes that needed to be part of the comparison.

Another scenario that this could potentially solve is one where you need to return a sequence of distinct nodes of only those nodes that have been duplicated. This is quite possible with the query above. All you would need to do in the query, is change the [1] to a [2]. This effectively tells the query to return the second instance of each node being compared. If there is a [2] instance, then it must be a duplicate node and by only returning the [2] instance, any non-duplicated nodes are filtered out. There is an example of this in the download.

Enjoy!

Rate

5 (1)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (1)

You rated this post out of 5. Change rating