How to Skip Invalid XML

  • Dear All,

    I have a Prod table with a million on rows having an ntext column to store XML.

    I need to process something from that XML, but my query crashed in between as there are some invalid xml exists in some rows. So, i need to skip those. how can i?.

    create table #temp (id int, xmldoc ntext)

    insert into #temp values (1, '<parent><child></child></parent>')

    insert into #temp values (2, '<parent><child>')

    insert into #temp values (3, '<parent><child></child></parent>')

    select * from #temp

    where convert(xml,convert(nvarchar(max),xmldoc)).exist('(/parent)') = 1

    I expect this to skip 2nd row instead an error.

    Appreciating your helps.

  • This is a cursor-based "RBAR" function so if this is going to be a task that runs often it may be necessary to rewrite it using a Tally Table for improved performance...but I'll leave that up to someone else. I also include an optional XML "cleanup" function (below) that removes tabs and spaces. If that is not necessary, then obviously leaving it out will also improve performance.

    /* Test data */

    IF OBJECT_ID('tempdb..#Temp') IS NOT NULL

    DROP TABLE #Temp

    CREATE TABLE #temp (id INT,xmldoc NTEXT)

    INSERT INTO #temp

    VALUES

    (1,'<parent><child1></child1></parent>')

    INSERT INTO #temp

    VALUES

    (2,'<parent><child>')

    INSERT INTO #temp

    VALUES

    (3,'<parent><child2></child2></parent>')

    /* Process data */

    SET NOCOUNT ON

    IF OBJECT_ID('tempdb..#TempTable') IS NOT NULL

    DROP TABLE #TempTable

    CREATE TABLE #TempTable (

    [id] INT NOT NULL,

    [xmldoc] NVARCHAR(4000) NULL,

    PRIMARY KEY (ID))

    DECLARE

    @idoc INT

    ,@rootnode NVARCHAR(255)

    ,@ID INT

    ,@strXML NVARCHAR(4000)

    ,@errMsg NVARCHAR(255)

    SET @rootnode = '/parent'

    DECLARE ValidateXML CURSOR

    FOR

    SELECT

    id

    ,xmldoc

    FROM

    #temp

    OPEN ValidateXML

    FETCH NEXT FROM ValidateXML INTO

    @ID

    ,@strXML

    WHILE @@FETCH_STATUS = 0

    BEGIN

    /* Optional but recommended to remove excess spaces and tabs. */

    /* However, this may slow down the process considerably so use with care. */

    SET @strXML = dbo.svfRemoveExcessSpacesFromXML(@strXML)

    BEGIN TRY

    EXEC sp_xml_preparedocument @idoc OUTPUT, @strXML

    INSERT INTO #TempTable

    SELECT

    id

    ,xmldoc

    FROM

    #temp

    WHERE

    ID = @ID

    AND CONVERT(XML,CONVERT(NVARCHAR(MAX),@strXML)).exist('(/parent)') = 1

    END TRY

    BEGIN CATCH

    SELECT @errMsg = ERROR_MESSAGE()

    /* Comment out this insert to just skip rows with invalid XML */

    INSERT INTO #TempTable

    SELECT

    @ID AS id

    ,'Invalid XML: '+@errMsg AS xmldoc

    END CATCH

    SET @strXML = NULL

    FETCH NEXT FROM ValidateXML INTO

    @ID

    ,@strXML

    END

    CLOSE ValidateXML

    DEALLOCATE ValidateXML

    SELECT

    id

    ,xmldoc

    FROM

    #TempTable

    OUTPUT:

    idxmldoc

    1<parent><child1></child1></parent>

    2Invalid XML: The error description is 'The following tags were not closed: parent, child.'.

    3<parent><child2></child2></parent>

    Optional XML cleanup function:

    CREATE FUNCTION dbo.svfRemoveExcessSpacesFromXML

    (

    @strXML NVARCHAR(MAX)

    )

    RETURNS NVARCHAR(MAX)

    BEGIN

    DECLARE

    @STR NVARCHAR(MAX)

    ,@xml XML

    SET @STR = @strXML

    SET @STR = REPLACE(@str,CHAR(9),' ') -- convert tabs to spaces

    SET @STR =

    REPLACE(

    REPLACE(

    REPLACE(

    @STR

    ,' ',' '+CHAR(7))

    ,CHAR(7)+' ','')

    ,CHAR(7),'')

    SET @STR = REPLACE(@str,'> <','><')

    RETURN @STR

    /*

    SELECT dbo.svfRemoveExcessSpacesFromXML(' hi thisxxxx is test for onr jh1 lljkl l l jl l ')

    SELECT dbo.svfRemoveExcessSpacesFromXML('<soapenv:Body><BackgroundReports xmlns="http://www.cpscreen.com/schemas"> <BackgroundReportPackage type="report"><ProviderReferenceId>WPS-6308077</ProviderReferenceId><PackageInformation> <ClientReferences />')

    */

    END

  • SQL Mad Rafi (11/12/2012)


    Dear All,

    I have a Prod table with a million on rows having an ntext column to store XML.

    I need to process something from that XML, but my query crashed in between as there are some invalid xml exists in some rows. So, i need to skip those. how can i?.

    create table #temp (id int, xmldoc ntext)

    insert into #temp values (1, '<parent><child></child></parent>')

    insert into #temp values (2, '<parent><child>')

    insert into #temp values (3, '<parent><child></child></parent>')

    select * from #temp

    where convert(xml,convert(nvarchar(max),xmldoc)).exist('(/parent)') = 1

    I expect this to skip 2nd row instead an error.

    Appreciating your helps.

    No offense but why in the world would you use ntext to hold xml data? We have an xml datatype and ntext is deprecated. You need to get this data into a table that will be supportable in the long run. The RBAR example is exactly what I would come up with for this. I would change the datatype to XML for your xml data. Run that to get your data in the proper datatypes, fix up the garbage that can't convert and change the process that loads this data.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply