Stored procedure to read XML

  • Some help needed here.

    I've got a store procedure that reads an XML file, and then uses it to update a given table where a match exists (see INNER JOIN below). The code is:

    --vars

    DECLARE @Publication TABLE

    (

    PublicationID Varchar(50),

    XML_Data XML

    )

    DECLARE @xml XML

    --reading the XML file

    INSERT INTO @Publication ( XML_Data )

    SELECT UserData.query('.')

    FROM @xml.nodes('/PubmedArticleSet/PubmedArticle') AS Data ( UserData)

    --updating matches

    UPDATE dst

    SET

    ....................

    AuthorAddress = left(CONVERT(VARCHAR(max), src.XML_Data.query('data(//Affiliation)')),2000),

    ...........

    FROM @Publication src INNER JOIN tblPublication dst ON src.PublicationID = dst.PubMedID

    where XML_Data is an XML variable. The problem is the XML source file has changed. Initially it had this structure:

    <Author ValidYN="Y">

    <LastName>his f. name</LastName>

    <ForeName>his f. name</ForeName>

    <Initials>AJ</Initials>

    <Affiliation>whaterver the affil is</Affiliation>

    </Author>

    whereas now it is like this:

    <Author ValidYN="Y">

    <LastName>his f. name</LastName>

    <ForeName>his f. name</ForeName>

    <Initials>AJ</Initials>

    <AffiliationInfo>

    <Affiliation>XXX.</Affiliation>

    </AffiliationInfo>

    </Author>

    The line in bold is causing the procedure to crash; what I'm missing is what the 'XML_Data.query' and specially the data(//...) exactly do.

    Do I simply need to add a longer path to data(), to be able to read the amended tag <AffiliationInfo>? Any explanation of how this works appreciated.

    Thanks in advance, p.

  • I don't really understand your question and could not help you without any DDL.

    That said, note the XML below. I created two variables, the first with the old XML structure and the second using the new XML structure you mentioned. Note the XPath expressions used to extract the data.

    -- old structure

    DECLARE @xml1 xml =

    '<Author ValidYN="Y">

    <LastName>his f. name</LastName>

    <ForeName>his f. name</ForeName>

    <Initials>AJ</Initials>

    <Affiliation>whaterver the affil is</Affiliation>

    </Author>'

    -- new structure

    DECLARE @xml2 xml =

    '<Author ValidYN="Y">

    <LastName>his f. name</LastName>

    <ForeName>his f. name</ForeName>

    <Initials>AJ</Initials>

    <AffiliationInfo>

    <Affiliation>XXX.</Affiliation>

    </AffiliationInfo>

    </Author>';

    -- method 1

    SELECT @xml1.value('(/Author/Affiliation/text())[1]','varchar(20)');

    SELECT @xml2.value('(/Author/AffiliationInfo/Affiliation/text())[1]','varchar(20)');

    -- method 2

    SELECT @xml1.value('(//Affiliation/text())[1]','varchar(20)');

    SELECT @xml2.value('(//Affiliation/text())[1]','varchar(20)');

    Note that query is not the optimal choice for what you are doing, you need value instead.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thanks Alan. I get you've got a much higher XML level, but I simply cannot to redesign the whole SP (lacking expertise on that !!). What I'd appreciate is some help to understand the code I have to modify.

    My SP does this:

    --reads an XML file, tblXmlImport has one field for the file

    select @xml = t.[xml_data] from dbo.tblXmlImport t Where t.FileID=@FileID

    --copies the XML in a variable table? Don't know what UserData.query command is. I take

    --@xml.nodes are the tags/nodes of the file, but don't know how to find them

    INSERT INTO @Publication ( XML_Data )

    SELECT UserData.query('.') FROM @xml.nodes('/PubmedArticleSet/PubmedArticle') AS Data ( UserData )

    --Update table based on an INNER JOIN

    UPDATE <table1>

    ...........

    AuthorAddress = left(CONVERT(VARCHAR(max), src.XML_Data.query('data(//Affiliation)')),2000)

    FROM <table2>src INNER JOIN <table1> t1

    Now, I guess that UserData.query reads all the XML content (or its nodes) in one go (as opposed to a normal select which reads a field) and it in a variable called 'Data ( UserData )'.

    Is this right? Then it refers that variable in the insert with 'XML_Data.query(), which extracts nodes from the XML.

    I guess my question is how XML_Data.query depends on the nodes, as right now I don't see it.

    Thanks,

  • a_ud (3/11/2015)


    Thanks Alan. I get you've got a much higher XML level, but I simply cannot to redesign the whole SP (lacking expertise on that !!). What I'd appreciate is some help to understand the code I have to modify.

    My SP does this:

    --reads an XML file, tblXmlImport has one field for the file

    select @xml = t.[xml_data] from dbo.tblXmlImport t Where t.FileID=@FileID

    --copies the XML in a variable table? Don't know what UserData.query command is. I take

    --@xml.nodes are the tags/nodes of the file, but don't know how to find them

    INSERT INTO @Publication ( XML_Data )

    SELECT UserData.query('.') FROM @xml.nodes('/PubmedArticleSet/PubmedArticle') AS Data ( UserData )

    --Update table based on an INNER JOIN

    UPDATE <table1>

    ...........

    AuthorAddress = left(CONVERT(VARCHAR(max), src.XML_Data.query('data(//Affiliation)')),2000)

    FROM <table2>src INNER JOIN <table1> t1

    Now, I guess that UserData.query reads all the XML content (or its nodes) in one go (as opposed to a normal select which reads a field) and it in a variable called 'Data ( UserData )'.

    Is this right? Then it refers that variable in the insert with 'XML_Data.query(), which extracts nodes from the XML.

    I guess my question is how XML_Data.query depends on the nodes, as right now I don't see it.

    Thanks,

    You're obviously simplifying some items, or you didn't quite grab enough content to paste in, since the last statement is missing some stuff and so far - the 2nd statement doesn't relate to the third one. All I can see right now is that you're populating some table variable based on pieces of an incoming XML file, and one or more tables are being updated after that - but the update right now doesn't relate to the other statements you posted.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • OK, so if the problem is not having the scripts, here's the SP. I was trying to help.

    Now: I know what my code does (read a file, put it into a table variable, update other table). What I don't get is how the SP refers to each bit of the XML file, particularly what comes after the XML_Data.query:

    src.XML_Data.query('data(//Affiliation)

    Given that the XML structure has changed, and now has another 'node' or level (see my previous messages), how should the line:

    AuthorAddress=left(CONVERT(VARCHAR(max),src.XML_Data.query('data(//Affiliation)')),2000)

    be changed to adapt to the new XML structure?

    OLD structure:

    <Author ValidYN="Y">

    <LastName>A</LastName>

    <ForeName>B</ForeName>

    <Initials>C</Initials>

    <Affiliation>D</Affiliation>

    </Author>

    NEW structure:

    <Author ValidYN="Y">

    <LastName>A</LastName>

    <ForeName>B</ForeName>

    <Initials>C</Initials>

    <AffiliationInfo>

    <Affiliation>D.</Affiliation>

    </AffiliationInfo>

    </Author>

    S.P:

    USE [mydatabase]

    GO

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    ALTER PROCEDURE [dbo].[spLoadPubmedXML]

    @FileID [int]

    AS

    BEGIN

    --Create variables to manipulate incoming data

    DECLARE @Publication TABLE

    (

    PublicationID Varchar(50),

    XML_Data XML

    )

    DECLARE @xml XML

    --fetch the imported xml from the storage table

    select @xml = t.[xml_data] from dbo.tblXmlImport t Where t.FileID=@FileID

    --Create publication records for importing/updating

    INSERT INTO @Publication ( XML_Data )

    SELECT UserData.query('.')

    FROM @xml.nodes('/PubmedArticleSet/PubmedArticle') AS Data ( UserData )

    -- Extract the PubMedID for each publication record

    UPDATE @Publication

    SET PublicationID = CONVERT(Varchar(50), XML_Data.query('data(//PMID)[1]'))

    --Insert publications where a match does not exist

    INSERT INTO tblPublication (PubMedID)

    (SELECT PublicationID FROM @publication as p

    WHERE NOT EXISTS

    ( SELECT 1

    FROM tblPublication

    WHERE PubMedID = p.PublicationID

    ))

    --Update all Publications where a match exists

    UPDATE dst

    SET PublicationTitle = CONVERT(VARCHAR(1000), src.XML_Data.query('data(//ArticleTitle)')),

    PublicationYear = coalesce(pubmed.value(N'(//ArticleDate/Year)[1]', N'int'),pubmed.value(N'(//Journal/JournalIssue/PubDate/Year)[1]',N'varchar (20)')),

    PublicationMonth = coalesce(pubmed.value(N'(//ArticleDate/Month)[1]', N'int'),DatePart(mm,CAST(pubmed.value(N'(//Journal/JournalIssue/PubDate/Month)[1]',N'varchar (20)')+' 1900' AS DATETIME))),

    PublicationVolume = CONVERT(VARCHAR(50), src.XML_Data.query('data(//Journal/JournalIssue/Volume)')),

    PublicationPages = CONVERT(VARCHAR(50), src.XML_Data.query('data(//Article/Pagination/MedlinePgn)')),

    Summary = CONVERT(NVARCHAR(MAX), src.XML_Data.query('data(//AbstractText)')),

    --PublicationNotes = CONVERT(VARCHAR(MAX), src.XML_Data.query('data(//notes)')),

    --THIS IS THE LINE CAUSING TROUBLE on the NEW FORMAT !!!

    AuthorAddress = left(CONVERT(VARCHAR(max), src.XML_Data.query('data(//Affiliation)')),2000),

    ISBN_ISSN = CONVERT(VARCHAR(25), src.XML_Data.query('data(//ISSN)')),

    --URL = CONVERT(VARCHAR(500), src.XML_Data.query('data(//urls/related-urls)')),

    Journal = CONVERT(VARCHAR(1000), src.XML_Data.query('data(//Journal/Title)')),

    PublicationNumber = CONVERT(VARCHAR(50), src.XML_Data.query('data(//PMID)[1]')),

    --PublicationFile = CONVERT(nVARCHAR(500), src.XML_Data.query('data(//custom3)')),

    PublicationType = CONVERT(VARCHAR(50), src.XML_Data.query('data(//PublicationTypeList/PublicationType[1])')),

    PubMed = src.XML_Data,

    --Added 29Nov2012

    PublicationStatus = CONVERT(VARCHAR(100), src.XML_Data.query('data(//MedlineCitation/@Status)')),

    pmcID = CONVERT(VARCHAR(50), src.XML_Data.query('data(//PubmedData/ArticleIdList/ArticleId[@IdType="pmc"])')),

    OpenAccess = CASE WHEN CONVERT(VARCHAR(50), src.XML_Data.query('data(//PubmedData/ArticleIdList/ArticleId[@IdType="pmc"])')) =''

    THEN 0

    ELSE 1

    END,

    CopyInEpMC = CASE WHEN CONVERT(VARCHAR(50), src.XML_Data.query('data(//PubmedData/ArticleIdList/ArticleId[@IdType="pmc"])')) =''

    THEN 0

    ELSE 1

    END,

    Final = CASE WHEN CONVERT(VARCHAR(100), src.XML_Data.query('data(//MedlineCitation/@Status)')) = 'MEDLINE'

    THEN 1

    ELSE 0

    END,

    PubMedlastUpdated = getdate()

    FROM @Publication src

    INNER JOIN tblPublication dst ON src.PublicationID = dst.PubMedID

    --Update the year and Month where it exists

    UPDATE dst

    SET PublicationYear = coalesce(pubmed.value(N'(//ArticleDate/Year)[1]', N'int'),pubmed.value(N'(//Journal/JournalIssue/PubDate/Year)[1]',N'varchar (20)')),

    PublicationMonth = coalesce(pubmed.value(N'(//ArticleDate/Month)[1]', N'int'),DatePart(mm,CAST(pubmed.value(N'(//Journal/JournalIssue/PubDate/Month)[1]',N'varchar (20)')+' 1900' AS DATETIME)))

    FROM @Publication src

    INNER JOIN tblPublication dst ON src.PublicationID = dst.PubMedID

    --Update the Open Access due date where a match exists

    --calculate as six months after the publication date, the first day of the following month

    UPDATE dst

    SETOADueBy = CASE WHEN Coalesce([PublicationYear],0) > 0 AND Coalesce([PublicationMonth],0) > 0

    THEN Dateadd(mm,7,CAST(CAST([publicationyear]as varchar(4))+'-'+CAST([PublicationMonth] as varchar(4))+'-'+'1'As DATETIME))

    ELSE NULL

    END

    FROM @Publication src

    INNER JOIN tblPublication dst ON src.PublicationID = dst.PubMedID

    END

    Thnx, p.

  • So as far as I can tell from the XML snippet you posted - the line you're worried about would in fact return content and should return "D." (assuming there is only one author)

    Using the query construct is fairly dangerous in this scenario, since as written, the query would pull (and concatenate) the contents from every instance of the Affiliation node. So if you were to have multiple PubMed articles and multiple authors it would pull and concatenate each and every author's affiliation into one single entry.

    Of course - since you're using that syntax everywhere through this procedure, you could have this issue happening on many entities.

    Any chance you could provide a cleansed copy of the offending larger XML (the one you're feeding into the proc) and a mockup of what you'd expect to get out of it?

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Thanks for your help, Matt !

    I take your advice on risk. However, the source being Pubmed, chances that the file is corrupt are tiny. In other words, you cannot seriously expect that Pubmed is messing up with your Db, as you wouldn't from, say, Google, UN or Facebook. And I take PM is way more serious than some of these companies.

    Anyway, this is not the point. I think my question was very simple, almost trivial, and it seems it's not.

    I have attached 2 files: the first (right.xml, old format) is read correctly, the second has the new format (wrong.xml). Come on ! All I'm asking and all that happens is that there is *just* ONE new tag or level in the XML file. This means the Affiliation tag is replaced by AffiliationInfo (and then the Affiliation tag comes, untouched).

    If there are any other relevant changes, the S.P. code doesn't notice.

    The old one gets read fine; the new one crashes, so the stored procedure is not interpreting that correctly. I miss what changes I need in the S.P in order to 'get' this new tag.

    I know this is causing the messbecause if I ignore this section (comment the SP. code), everything ets imported fine (of course the field remains blank).

  • a_ud (3/19/2015)


    Thanks for your help, Matt !

    I take your advice on risk. However, the source being Pubmed, chances that the file is corrupt are tiny. In other words, you cannot seriously expect that Pubmed is messing up with your Db, as you wouldn't from, say, Google, UN or Facebook. And I take PM is way more serious than some of these companies.

    Anyway, this is not the point. I think my question was very simple, almost trivial, and it seems it's not.

    I have attached 2 files: the first (right.xml, old format) is read correctly, the second has the new format (wrong.xml). Come on ! All I'm asking and all that happens is that there is *just* ONE new tag or level in the XML file. This means the Affiliation tag is replaced by AffiliationInfo (and then the Affiliation tag comes, untouched).

    If there are any other relevant changes, the S.P. code doesn't notice.

    The old one gets read fine; the new one crashes, so the stored procedure is not interpreting that correctly. I miss what changes I need in the S.P in order to 'get' this new tag.

    I know this is causing the messbecause if I ignore this section (comment the SP. code), everything ets imported fine (of course the field remains blank).

    Well when I attempted to bring in the "wrong" XML, the error I got was because of trying to dump content into a VARCHAR rather NVARCHAR. Just importing the file was interesting since there is also some DTD markup in the file, which has some notations that SQLXML isn't fond of (which might be another source of issues). Change the CONVERT to leverage NVARCHAR rather than VARCHAR and I think it might stop crashing.

    That said - I think you're taking my previous comments in the wrong way. It's not that I mistrust PubMed's format, it's that the handling of the XML in the procedure is taking a broadaxe to the content, and dropping out a LOT of content. This would be happening on abstract AND affiliation AND notes just to name a few.

    As a specific example - the .QUERY notation is a bit dangerous in that it takes the affiliation info from ALL authors on a given article and dumps them together, then truncates that to your acceptable length. In the "right" XML, only one uthor on each article has an affilition; in the wrong one, they EACH do, and since you have 5-6 per article, you're dropping out about half of that affiliation content. Same on abstract, etc....

    I'm hoping this helps to some degree. If not - please post the nature of the error reported by the proc.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Thanks Matt !!

    I've tried this and, to my surprise, it seems to work. Very happy and a big thanks!!

    I can't mark it as a defintiive solution as I still don't know what was causing the problem. I may have misunderstood one of your entries, English being my 2nd language it's easy to make mistakes.

    I'm still baffled at how SQL Server reads XML; I think my code (even if not mine..) uses XPath to 'get' to a node, and my suprise is that even when changes are added (like the new 'AffiliationInfo' level in the files) it manages to read the content and import it.

    It is true, that some content is eliminated. My char fields have ~2,000 characters, so some authors or comments, the very long ones, might be capped. Still that's a minor problem for us, compared to the procedure crashing. And it's kind of ienvitable (unless you use Memo, Varchar max, etc), because you have to set a size for 'some' of your fields. But I get your point now.

    Thanks again, p.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply