INSERT SELECT with large XML Dataset - XML DTD stripped...

  • Let me begin by saying I'm not very well versed in pulling XML files into SQL using a straight T-SQL solution so I have pieced together pieces of information from various sites to get this far. The document I am processing is ~2.6 million lines ( icecat free product feed for those interested ). Now, I am able to SELECT the data out of the XML file using the following process:

    USE IceCatBiz

    DECLARE @data XML;

    CREATE TABLE #Xml ( Data XML );

    INSERT INTO #Xml

    ( Data

    )

    SELECT CONVERT(XML,BulkColumn,2)

    FROM OPENROWSET(BULK N'\icecat\files.xml', SINGLE_BLOB) O;

    SELECT @data = Data FROM #Xml ;

    DROP TABLE #Xml;

    SELECT filenode.value('@path', 'VARCHAR(300)') AS filepath,

    filenode.value('@Product_ID', 'INT') AS Product_ID,

    filenode.value('@Supplier_id', 'INT') AS Supplier_id,

    filenode.value('@Prod_ID', 'VARCHAR(50)') AS Prod_ID,

    filenode.value('@Quality', 'VARCHAR(50)') as Quality,

    filenode.value('@Catid', 'INT') as Catid,

    filenode.value('@On_Market', 'INT') as On_Market,

    filenode.value('@Model_Name', 'VARCHAR(255)') as Model_Name,

    filenode.value('@Product_View', 'INT') as Product_View,

    filenode.value('@HighPic', 'VARCHAR(400)') as HighPic,

    filenode.value('@HighPicSize', 'INT') as HighPicSize,

    filenode.value('@HighPicWidth', 'INT') as HighPicWidth,

    filenode.value('@HighPicHeight', 'INT') as HighPicHeight,

    filenode.value('@Updated', 'VARCHAR(50)') as Updated,

    filenode.value('@Date_Added', 'VARCHAR(50)') as Date_Added,

    filenode.value('EAN_UPCS[1]/EAN_UPC[1]/@Value', 'VARCHAR(50)') as EAN_UPC,

    filenode.value('Country_Markets[1]/Country_Market[1]/@Value', 'VARCHAR(10)') as Country_Market

    FROM @data.nodes('ICECAT-interface/files.index/file') files ( filenode );

    So that returns 296k rows back and the data looks spot on with what I expect. Now, I'm trying to take that data and throw and INSERT INTO with the select. The code is:

    USE IceCatBiz

    IF EXISTS (SELECT * FROM sysobjects WHERE type = 'U' AND name = 'filesindex')

    BEGIN

    DROP Table filesindex

    END

    CREATE TABLE filesindex (

    Product_ID INT NOT NULL Primary Key,

    filepath VARCHAR(300) NOT NULL,

    Quality VARCHAR(255) NOT NULL,

    Supplier_ID INT NOT NULL,

    Prod_ID VARCHAR(255) NOT NULL,

    CatID INT NOT NULL,

    On_Market BIT NOT NULL,

    Model_Name VARCHAR(500) NOT NULL,

    Product_View INT NOT NULL,

    HighPic VARCHAR(500) NULL,

    HighPicSize INT NULL,

    HighPicWidth INT NULL,

    HighPicHeight INT NULL,

    Date_Added VARCHAR(40) NOT NULL,

    Updated VARCHAR(40) NULL,

    EAN_UPC VARCHAR(50) NULL,

    Country_Market VARCHAR(4) NULL

    )

    CREATE INDEX Product_ID on filesindex (Product_ID)

    CREATE INDEX Supplier_ID ON filesindex (Supplier_ID)

    CREATE INDEX CatID ON filesindex (CatID)

    CREATE INDEX prodcat ON filesindex( Product_ID, CatID)

    DECLARE @data XML;

    CREATE TABLE #Xml ( Data XML );

    INSERT INTO #Xml

    ( Data

    )

    SELECT CONVERT(XML,BulkColumn,2)

    FROM OPENROWSET(BULK N'\icecat\files.xml', SINGLE_BLOB) O;

    SELECT @data = Data FROM #Xml ;

    DROP TABLE #Xml;

    INSERT INTO filesindex (

    filepath, Product_ID, Supplier_id, Prod_Id, Quality, Catid, On_Market,

    Model_Name, Product_View, HighPic, HighPicSize, HighPicWidth, HighPicHeight,

    Date_added, Updated, EAN_UPC, Country_Market

    )

    SELECT filenode.value('@path', 'VARCHAR(300)') AS filepath,

    filenode.value('@Product_ID', 'INT') AS Product_ID,

    filenode.value('@Supplier_id', 'INT') AS Supplier_id,

    filenode.value('@Prod_ID', 'VARCHAR(50)') AS Prod_ID,

    filenode.value('@Quality', 'VARCHAR(50)') as Quality,

    filenode.value('@Catid', 'INT') as Catid,

    filenode.value('@On_Market', 'INT') as On_Market,

    filenode.value('@Model_Name', 'VARCHAR(255)') as Model_Name,

    filenode.value('@Product_View', 'INT') as Product_View,

    filenode.value('@HighPic', 'VARCHAR(400)') as HighPic,

    filenode.value('@HighPicSize', 'INT') as HighPicSize,

    filenode.value('@HighPicWidth', 'INT') as HighPicWidth,

    filenode.value('@HighPicHeight', 'INT') as HighPicHeight,

    filenode.value('@Updated', 'VARCHAR(50)') as Updated,

    filenode.value('@Date_Added', 'VARCHAR(50)') as Date_Added,

    filenode.value('EAN_UPCS[1]/EAN_UPC[1]/@Value', 'VARCHAR(50)') as EAN_UPC,

    filenode.value('Country_Markets[1]/Country_Market[1]/@Value', 'VARCHAR(10)') as Country_Market

    FROM @data.nodes('ICECAT-interface/files.index/file') files ( filenode );

    Now I have let this statement execute for several hours and it just keeps cranking away. I never see any files inserted into the filesindex table. When I finally stop the execution it returns:

    XML DTD has been stripped from one or more XML fragments. External subsets, if any, have been ignored.

    (1 row(s) affected)

    I don't know if I'm using outdated techniques or just flat out doing this wrong but any assistance or insights you could provide would be greatly appreciated. As I stated before I'm not very spun up on importing XML so my knowledge of proper terminology is lacking.

    Before I go a quick sample of the XML data....

    <?xml version="1.0" encoding="UTF-8"?>

    <!DOCTYPE ICECAT-interface SYSTEM "http://data.icecat.biz/dtd/files.index.dtd">

    <!-- source: Icecat.biz 2013 -->

    <ICECAT-interface xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://data.icecat.biz/xsd/files.index.xsd">

    <files.index Generated="20130219013338">

    <file path="export/freexml.int/US/133.xml" Product_ID="133" Updated="20110801235612" Quality="ICECAT" Supplier_id="5" Prod_ID="CPD-E430" Catid="221" On_Market="1" Model_Name="Multiscan E430 19" Di

    splay" Product_View="8747" HighPic="http://images.icecat.biz/img/norm/high/133.jpg" HighPicSize="2014" HighPicWidth="116" HighPicHeight="120" Date_Added="20050627000000">

    <EAN_UPCS>

    <EAN_UPC Value="4901780729524" />

    </EAN_UPCS>

    <Country_Markets>

    <Country_Market Value="IT" />

    </Country_Markets>

    </file>

    <file path="export/freexml.int/US/385.xml" Product_ID="385" Updated="20120621143343" Quality="ICECAT" Supplier_id="3" Prod_ID="06P5754" Catid="219" On_Market="1" Model_Name="Ultra 160 SCSI Hot-Swap HS

    SL HARD DRIVE" Product_View="16470" HighPic="" HighPicSize="0" HighPicWidth="0" HighPicHeight="0" Date_Added="20041201000000">

    <EAN_UPCS>

    <EAN_UPC Value="5019170744060" />

    <EAN_UPC Value="4968665597388" />

    </EAN_UPCS>

    <Country_Markets>

    <Country_Market Value="UK" />

    </Country_Markets>

    </file>

    <file path="export/freexml.int/US/824.xml" Product_ID="824" Updated="20120207154716" Quality="ICECAT" Supplier_id="2" Prod_ID="PA3156E-1PRP" Catid="152" On_Market="1" Model_Name="Port Replicator" Produ

    ct_View="13645" HighPic="http://images.icecat.biz/img/norm/high/824-5078.jpg" HighPicSize="7980" HighPicWidth="200" HighPicHeight="200" Date_Added="20051104000000">

    <Country_Markets>

    <Country_Market Value="UK" />

    </Country_Markets>

    </file>

    <file path="export/freexml.int/US/830.xml" Product_ID="830" Updated="20101217151007" Quality="SUPPLIER" Supplier_id="2" Prod_ID="PA3000U-1BRS" Catid="106" On_Market="1" Model_Name="Battery Pack (Li-Ion

    Graphite, 6 cell, 3000mAh)" Product_View="13978" HighPic="http://images.icecat.biz/img/norm/high/830.jpg" HighPicSize="15177" HighPicWidth="140" HighPicHeight="140" Date_Added="20051104000000">

    <EAN_UPCS>

    <EAN_UPC Value="4026203287532" />

    <EAN_UPC Value="2000004672447" />

    </EAN_UPCS>

    <Country_Markets>

    <Country_Market Value="UK" />

    </Country_Markets>

    </file>

    <file path="export/freexml.int/US/837.xml" Product_ID="837" Updated="20111111181015" Quality="SUPPLIER" Supplier_id="2" Prod_ID="PA3005U" Catid="911" On_Market="1" Model_Name="128 MB Memory Expansion"

    Product_View="28483" HighPic="http://images.icecat.biz/img/norm/high/837-2012.jpg" HighPicSize="11908" HighPicWidth="140" HighPicHeight="140" Date_Added="20051104000000">

    <EAN_UPCS>

    <EAN_UPC Value="3517986100794" />

    <EAN_UPC Value="4026203111653" />

    <EAN_UPC Value="4026203287662" />

    <EAN_UPC Value="7612758146416" />

    <EAN_UPC Value="5705965638798" />

    </EAN_UPCS>

    <Country_Markets>

    <Country_Market Value="NL" />

    <Country_Market Value="BE" />

    <Country_Market Value="UK" />

    </Country_Markets>

    </file>

    </files.index>

    </ICECAT-interface>

  • To get this to convert to XML I had to strip out the first 3 lines due to an encoding error. For me, this is not uncommon and I have to strip out the "junk" in the doc headers.

    That got me closer, but I kept getting an error on:

    Model_Name="Multiscan E430 19" Display" Product_View="8747"

    Can you see why? It's because of the non-escaped quotation mark [19"]. I just deleted that and the XML converted correctly. So you will have to account for that in your code.

    DECLARE @strXML AS NVARCHAR(MAX)

    SET @strXML =

    --'<?xml version="1.0" encoding="UTF-8"?>

    --<!DOCTYPE ICECAT-interface SYSTEM "http://data.icecat.biz/dtd/files.index.dtd">

    --<!-- source: Icecat.biz 2013 -->

    '<ICECAT-interface xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://data.icecat.biz/xsd/files.index.xsd">

    <files.index Generated="20130219013338">

    <file path="export/freexml.int/US/133.xml" Product_ID="133" Updated="20110801235612" Quality="ICECAT" Supplier_id="5" Prod_ID="CPD-E430" Catid="221" On_Market="1" Model_Name="Multiscan E430 19 Display" Product_View="8747" HighPic="http://images.icecat.biz/img/norm/high/133.jpg" HighPicSize="2014" HighPicWidth="116" HighPicHeight="120" Date_Added="20050627000000">

    <EAN_UPCS>

    <EAN_UPC Value="4901780729524" />

    </EAN_UPCS>

    <Country_Markets>

    <Country_Market Value="IT" />

    </Country_Markets>

    </file>

    <file path="export/freexml.int/US/385.xml" Product_ID="385" Updated="20120621143343" Quality="ICECAT" Supplier_id="3" Prod_ID="06P5754" Catid="219" On_Market="1" Model_Name="Ultra 160 SCSI Hot-Swap HS SL HARD DRIVE" Product_View="16470" HighPic="" HighPicSize="0" HighPicWidth="0" HighPicHeight="0" Date_Added="20041201000000">

    <EAN_UPCS>

    <EAN_UPC Value="5019170744060" />

    <EAN_UPC Value="4968665597388" />

    </EAN_UPCS>

    <Country_Markets>

    <Country_Market Value="UK" />

    </Country_Markets>

    </file>

    <file path="export/freexml.int/US/824.xml" Product_ID="824" Updated="20120207154716" Quality="ICECAT" Supplier_id="2" Prod_ID="PA3156E-1PRP" Catid="152" On_Market="1" Model_Name="Port Replicator" Product_View="13645" HighPic="http://images.icecat.biz/img/norm/high/824-5078.jpg" HighPicSize="7980" HighPicWidth="200" HighPicHeight="200" Date_Added="20051104000000">

    <Country_Markets>

    <Country_Market Value="UK" />

    </Country_Markets>

    </file>

    <file path="export/freexml.int/US/830.xml" Product_ID="830" Updated="20101217151007" Quality="SUPPLIER" Supplier_id="2" Prod_ID="PA3000U-1BRS" Catid="106" On_Market="1" Model_Name="Battery Pack (Li-Ion Graphite, 6 cell, 3000mAh)" Product_View="13978" HighPic="http://images.icecat.biz/img/norm/high/830.jpg" HighPicSize="15177" HighPicWidth="140" HighPicHeight="140" Date_Added="20051104000000">

    <EAN_UPCS>

    <EAN_UPC Value="4026203287532" />

    <EAN_UPC Value="2000004672447" />

    </EAN_UPCS>

    <Country_Markets>

    <Country_Market Value="UK" />

    </Country_Markets>

    </file>

    <file path="export/freexml.int/US/837.xml" Product_ID="837" Updated="20111111181015" Quality="SUPPLIER" Supplier_id="2" Prod_ID="PA3005U" Catid="911" On_Market="1" Model_Name="128 MB Memory Expansion" Product_View="28483" HighPic="http://images.icecat.biz/img/norm/high/837-2012.jpg" HighPicSize="11908" HighPicWidth="140" HighPicHeight="140" Date_Added="20051104000000">

    <EAN_UPCS>

    <EAN_UPC Value="3517986100794" />

    <EAN_UPC Value="4026203111653" />

    <EAN_UPC Value="4026203287662" />

    <EAN_UPC Value="7612758146416" />

    <EAN_UPC Value="5705965638798" />

    </EAN_UPCS>

    <Country_Markets>

    <Country_Market Value="NL" />

    <Country_Market Value="BE" />

    <Country_Market Value="UK" />

    </Country_Markets>

    </file>

    </files.index>

    </ICECAT-interface>'

    SELECT CONVERT(XML,@strXML)

    Now as a general comment, a 2 million line XML document is going to take forever to read and probably fail or you will give up on it. I'd recommend that you break it up and run it in smaller batches.

    Another option is to shred the XML directly into a table with a parsing procedure. Once the data is in a regular table you will get HUGELY better performance when trying to query it. I've attached a procedure below that will shred every node into a table. It uses a recursive CTE so it is no speed demon and running it in batches is still a good idea. You can speed the CTE up a bit by commenting out the part that builds the node path--that's not absolutely necessary and for large files I usually comment it out. Note that I do some prep work within the procedure to get rid of various parts of the header. You will have to change those parts to work with your particular file header structure.

    This method uses sp_xml_preparedocument and OPENXML rather than using XQUERY. To see XML in tabular form is a different way of working with an XML document and some people like it and some don't. But as a SQL person I like seeing the results as a set rather than the typical XML flat file.

     

  • Thank you for taking the time to look at my scenario and write up your response. I will definitely be using the SP you attached in the future.

    I tried one last time to run the whole data set ( only have to do the big one once and then it's just incremental ones from there ). It failed after an hour but splitting it up things are starting to roll into the system.

    Thank you again for your help!

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply