Handling malformed xml,while processing valid xml

  • Everyone, we have an important legacy lob attribute that stores well formed and also malformed xml. From time to time we need data corrections to change nodes in the xml. The malformed xml have caused real problems. This is a legacy TEXT field, so a cast to XML thows a batch terminating excpetion.

    the only solution I have found that works is a cursor with a try catch, and logging the bad ones to a table in the catch block. this is very slow, and unfortunately we have 100-200k wide xml files that number in the millions.

    I have not had luck getting this code to run in any acceptable time. I understand this script is giong to be be slow, I would just love a set based way to deal with malformed xml. Pseduo code is below, anyone having a different approach would be much appreciated. My current attempt is a udf, but the xml parse error is still terminating the batch.

    The app team has been punished for putting bad xml in the db in the first place.

    DECLARE C CURSOR FAST_FORWARD FOR

    SELECT xml, xmlkey

    FROM PhyiscalTemp

    OPEN C_POD_XML

    FETCH NEXT FROM C_POD_XML INTO @xml,@xmlid

    WHILE @@FETCH_STATUS = 0

    BEGIN

    BEGIN TRY

    SELECT CASE WHEN CAST(@strCurrItemConfig AS XML) IS NULL THEN 1 ELSE 1 END AS 'IsParsable' --error throws to catch

    ...xml processing

    END TRY

    BEGIN CATCH

    --deal with malformed xml

    END CATCH

    END

    CLOSE C

    DEALLOCATE C

  • Your team has been punished for putting ANY KIND[/b] of xml in the db in the first place.

    XML is another form of database.

    Slower one, more resource demanding, with significant limitations.

    It's been specifically designed for message exchange between web-services and has nothing to do in relational database.

    You should parse XML into proper data on entry point. Then you may operate the data, alter it or report in any form you like, including XML.

    If XML is not parseable then it must be returned to the point of origin for fixing.

    Just like failed fridge or TV set.

    _____________
    Code for TallyGenerator

  • Sergiy, I could not agree with you more. I am making enroads here on turning our philosphy around. The XML native datatype has been a setback for me, but its problems, per my orginal post, have helped. I will keep preaching the doctorine you spoke of.

  • You may have better luck using a CLR function to assess whether it is a valid fragment or not. This allows you to essentially force the TRY/CATCH one level further down (so you can now run this in a batch), and will give you better control thanks to .NET's better try/catch handling. While it will still need to evaluate against each row, by being compiled you should be able to recoup a lot of the performance otherwise lost to a cursor behavior.

    Be sure to mark the function as precise and deterministic for the highest gain. Pass it the nvarchar object, get a boolean back.

    The only limitation I could see would be if any fragment is over the "traditional" nvarchar character size (8000), since SQLCLR cannot handle nvarchar(max) parameters.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • The way to go here is to retrieve whole XML string from the field, pass it to the application which suppose to parse XML and store parsing results in database (proper way).

    If the application the XML was designed for cannot parse it SQL Server could never do any better.

    _____________
    Code for TallyGenerator

  • How complicated is your malformed XML? Can you paste a sample here? You may have a potential solution just using string manipulation. This is not to argue against what Sergiy said, but I've gotten stuck with problems created by others too.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The xml is very complicated. It is barley xml in my opinion. Name value pair with formatting tags. No scheam can be geneated to benefit from an xml index attempt. I am attempting the CLR function. MY problem is many of these xml are way over 8k with many upwards of 300k 🙁

  • What generates those XML's?

    _____________
    Code for TallyGenerator

  • Without seeing it, there is not much way to see if there is a way to locate the value to be replaced and fix it with string manipulation. Sorry.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • It's perhaps best to start by simply "cullling the herd" (i.e try to instantiate this into an XML reader object, and see which ones bounce out.) Once you get down to the "ugly" ones - then you start worrying about finding the bad files.

    Otherwise - there is a fair amount out there (open source) on various ways to validate well-formedness and/or compliance with canonical XML formats (which should really get you most of the way there). matching starting and ending tags, invalid characters, etc....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply