Iterating thru XML entries

Question

Post reply

Iterating thru XML entries

Tim Pain

Ten Centuries

Points: 1136
More actions
July 19, 2010 at 12:49 pm

#97320

I'm new to this XML stuff, so please excuse my lack of knowledge.
Below is part of the daily file provided by the US Government Printing Office. This is a large file and I'm only interested in a few small parts of it. So what I'm trying to do is pull out the relevant parts, and load it into a tables that I can then report on.
I have worked out how to find the relevant sections, but I can figure out how to iterate thru the entries within the section.
Attached is an example of the output I'm hopeing to achieve.
- <GPOTABLE CDEF="s25,r20,r20,r45,r75,r75" COLS="6" OPTS="L2,i1">
- <TTITLE>
<E T="04">I.—44 Premanufacture Notices Received From: 6/15/09 to 7/3/09</E>
</TTITLE>
- <BOXHD>
<CHED H="1">Case No.</CHED>
<CHED H="1">Received Date</CHED>
<CHED H="1">Projected Notice End Date</CHED>
<CHED H="1">Manufacturer/Importer</CHED>
<CHED H="1">Use</CHED>
<CHED H="1">Chemical</CHED>
</BOXHD>
- <ROW>
<ENT I="01" O="xl">P-09-0424</ENT>
<ENT O="xl">06/12/09</ENT>
<ENT O="xl">09/09/09</ENT>
<ENT O="xl">CBI</ENT>
<ENT O="xl">(G) Thermoset adhesive additive</ENT>
<ENT O="xl">(S) Siloxanes and silicones, di-me, bu group- and 3-[2-(3-carboxymethylene-1-oxopropoxy)ethoxy]propyl group-terminated</ENT>
</ROW>
- <ROW>
<ENT I="01" O="xl">P-09-0425</ENT>
<ENT O="xl">06/15/09</ENT>
<ENT O="xl">09/12/09</ENT>
<ENT O="xl">Grafil, Inc.</ENT>
<ENT O="xl">(S) Precursor for carbonization/production of carbon fiber</ENT>
<ENT O="xl">(S) 2-propenoic acid, 2-methyl-, polymer with 2-propenamide and 2-propenenitrile</ENT>
</ROW>
Any pointers would be much appreciated.

Tim

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

LutzM SSC Guru Points: 107049 More actions · Answer 1

Would the following help you to get closer to your expected result?

SELECT

row.value('1+count(for $a in . return $a/../*[. << $a])','INT')-2 AS row,

col.value('1+count(for $a in . return $a/../*[. << $a])','INT') AS col,

col.value('text()[1]','VARCHAR(10)') AS val

FROM @xml.nodes('GPOTABLE/ROW') AS x1(row)

CROSS APPLY row.nodes('ENT') AS x2(col)

I'm not sure how it will perform against a large xml file... If the code performance cannot be accepted we might need an intermediate table to assign at least the row number per ROW element.

Based on the result you could either use PIVOT or the CrossTab method to get the data in your expected result.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Tim Pain Ten Centuries Points: 1136 More actions · Answer 2

Lutz,

I won't pretend I understand how it works, but it does just what I asked for.

But can I now be a real nuisance and add an additional requirement, to include the "TITLE" value as part of the returned value. This way I can determine from which section of the original source, the details are referring to.

Many thanks for you help.

Tim

LutzM SSC Guru Points: 107049 More actions · Answer 3

You would need to add the reference to the TTITLE section:

SELECT

title.value('E[1]','VARCHAR(100)') AS title,

row.value('1+count(for $a in . return $a/../*[. << $a])','INT')-2 AS ROW,

col.value('1+count(for $a in . return $a/../*[. << $a])','INT') AS col,

col.value('text()[1]','VARCHAR(10)') AS val

FROM @xml.nodes('GPOTABLE/ROW') AS x1(ROW)

CROSS APPLY row.nodes('ENT') AS x2(col)

CROSS APPLY @xml.nodes('GPOTABLE/TTITLE') AS x3(title)

If you're new to XQuery I'd recommend Jacob Sebastians blog. It's a great resource for XML related questions that I refer to frequently.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Tim Pain Ten Centuries Points: 1136 More actions · Answer 4

Lutz,

Thank you once again for the great help, it worked beautifully.

However now reviewing the source data, I have found the a very few records are structured slightly differently, with an extra node (PRTPAGE) (see record 2 below) before the start of the ENT nodes, when this happens it throws the subsequent columns out for that specific record. Is there any way of telling the code to skip the PRTPAGE node, if it's present?

- <ROW>

<ENT O="xl">(G) Papermaking process aid</ENT>

<ENT O="xl">(G) Polyamide epichlorohydrin resin salt</ENT>

</ROW>

- <ROW>

<ENT O="xl">(G) Site limited intermediate</ENT>

<ENT O="xl">(G) Aromatic diamine</ENT>

</ROW>

- <ROW>

<ENT O="xl">(S) Anti-scaling agent for water treatment; co-builder for detergents; drilling mud additive; dispersant for solid materials in water based coatings</ENT>

<ENT O="xl">(G) Polyitaconic acid</ENT>

</ROW>

I had not anticipated needing any help with the Pivot part of the problem, but at present don't seem to be able to get this to work either. So any pointers anybody can give on this would also be appreciated.

I had already come across Jacob Sabastains blog and had found them helpful.

Many thanks

Tim

LutzM SSC Guru Points: 107049 More actions · Answer 5

I would wrap the XML shredding into a CTE or subquery and use ROW_NUMBER to get it sorted as required:

;WITH cte AS

(

SELECT

title.value('E[1]','VARCHAR(100)') AS title,

row.value('1+count(for $a in . return $a/../*[. << $a])','INT')-2 AS ROW,

col.value('1+count(for $a in . return $a/../*[. << $a])','INT') AS col,

col.value('text()[1]','VARCHAR(10)') AS val

FROM @xml.nodes('GPOTABLE/ROW') AS x1(ROW)

CROSS APPLY row.nodes('ENT') AS x2(col)

CROSS APPLY @xml.nodes('GPOTABLE/TTITLE') AS x3(title)

)

SELECT

title,

ROW,

ROW_NUMBER() OVER(PARTITION BY ROW ORDER BY col ) AS col,

val

FROM cte

Regarding pivoting the data I'd recommend you have a look at the CrossTab article referenced in my signature.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Tim Pain Ten Centuries Points: 1136 More actions · Answer 6

Lutz,

Making some good progress now with your great help.

But I have found 2 issues with the current code.

1) In the original specification of the problem I omited to explain that the source data can have a variable number of sections that I'm interested in. (In the test data I'm using there are 6 relevant sections).

In the code for the "row"

row.value('1+count(for $a in . return $a/../*[. << $a])','INT')-2)

this resets to 1 at the start of each new section. The existing "row" counter is useful for identifying the number of rows per section, but it's stopping the Pivot working. So is it possible to add another row counter, that does not reset on change of section

2) This is to do with the way that the "TTILE" node is added, currently this adds all 6 section TTITLES to each row (taking the total row count from 620 to 3720 (ie: 620 * 6 = 3720). If I remove the TTILE section (and manually fix the "row" issue above), then the data is correct and the Pivot works just fine. I did try to add the TTILE after selecting the rest of the data, by using an Update query, but I can't see any way of selecting the correct TTILE value based on the value of the ROW/ENT.

I'm sorry to be such a nuisance on this, but any further help you are able to give, would be very much appreciated.

Tim

LutzM SSC Guru Points: 107049 More actions · Answer 7

I think the samples provided should help you to resolve your issues.

To add another counter, look what ROW_NUMBER actually does and try to modify it to your needs.

Regarding the PIVOT: based on the code and sample data you provided I don't see any reason why the last code I provided won't work.

Please try to understand and modify the code. It's a much better learning experience than simply copy and paste. If you get stuck, post your code, provide related sample data and your expected result.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Tim Pain Ten Centuries Points: 1136 More actions · Answer 8

Lutz,

This is the code I currently have and it mostly works.

-- Use a real table so that I can check what is happening

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].') AND type in (N'U'))

DROP TABLE [dbo].

CREATE TABLE b

(

IID INT IDENTITY(1,1),

b_date varchar(25),

b_title varchar(100),

b_row INT,

b_row2 INT,

b_col INT,

b_val varchar(200)

)

-- Take out the "Titles" for the present as they screw thing up

INSERT INTO b(b_date, /*b_title,*/ b_row, b_col, b_val)

SELECT

(SELECT a.b.value('DATE[1]','varchar(25)') AS s_Date

FROM @xml.nodes('FEDREG') AS a(b)) AS R_Date,

--title.value('E[1]','VARCHAR(100)') AS title,

row.value('1+count(for $a in . return $a/../*[. << $a])','INT')-2 AS row,

col.value('1+count(for $a in . return $a/../*[. << $a])','INT') AS col,

col.value('text()[1]','VARCHAR(200)') AS val

FROM @xml.nodes('FEDREG/NOTICES/NOTICE/SUPLINF/GPOTABLE/ROW') AS x1(row)

CROSS APPLY row.nodes('ENT') AS x2(col)

--CROSS APPLY @xml.nodes('FEDREG/NOTICES/NOTICE/SUPLINF/GPOTABLE/TTITLE') AS x3(title)

DECLARE @row_int INT, @row_b INT, @col_b INT

SET @row_int = 1

-- Use a Cursor to allocate the Row numbers over the whole result set.

-- Messy, but there are only a few hundred rows in the set, so it runs fast

DECLARE curs CURSOR FOR

SELECT b_row, b_col FROM dbo.b;

OPEN curs

FETCH NEXT FROM curs INTO @row_b, @col_b

WHILE @@FETCH_STATUS = 0

BEGIN

IF @row_b <> @row_int AND @col_b = 1

BEGIN

SET @row_int = @row_int + 1

UPDATE b SET b_row2 = @row_int WHERE CURRENT OF curs

END

ELSE

BEGIN

UPDATE b SET b_row2 = @row_int WHERE CURRENT OF curs

END

FETCH NEXT FROM curs INTO @row_b, @col_b

END

CLOSE curs

DEALLOCATE curs

-- Pivot

SELECT b_row2,

[1] as [Case No.],

[2] as [Received],

[3] as [End],

[4] as [Manufacture],

[5] as [Use],

[6] as [Chemical]

FROM (SELECT b_row2, b_col, b_val FROM b) AS src

PIVOT (MIN(b_val) FOR b_col IN ([1],[2],[3],[4],[5],[6])) AS pvt

ORDER BY b_row2

The 2 areas that are still giving difficulty are -

1) The extra (PRTPAGE) nodes referred to in my post of the 21st, where I would like to be able to skip those extra nodes (where they are present)

2) The TTITLE section, referred to in my post of the 22nd. I tried this code to pick up the correct TTITLE, but not surprisingly it returns all TTILE's from the source data, rather than just the one I'm interested in.

SELECT Title.value('E[1]','varchar(100)') AS title

FROM @xml.nodes('FEDREG/NOTICES/NOTICE/SUPLINF/GPOTABLE/TTITLE') AS x3(title)

WHERE @xml.exist('FEDREG/NOTICES/NOTICE/SUPLINF/GPOTABLE/TTITLE/E[./text()="II. 63 Notices of Commencement From: 09/7/09 to 10/23/09"]') = 1

You can get a copy of the test source XML file that I'm using here http://www.gpo.gov/fdsys/bulkdata/FR/2009/11/FR-2009-11-12.xml

Any help or guidance you can give in solving these last 2 issues, would be very much appreciated.

Many thanks

Tim

LutzM SSC Guru Points: 107049 More actions · Answer 9

Please post your expected result set based with reference to your sample data (either the first sample, the second or the data from the link you mentioned.

Regarding the c.u.r.s.o.r. *cough*: what's wrong with the ROW_NUMBER approach I posted on 7-21 ?? (I wouldn't have to ask if I had a result set (based on provided sample data) to compare to... 😉 )

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Tim Pain Ten Centuries Points: 1136 More actions · Answer 10

Yes I know all about the very widely held views on cursors, but personally I'm not to worried about this aspect, as at least I have something that works! Your suggested solution, resets back to one on each change of (TTITLE) group.

I have attached a output.bmp that shows a fake up of the type of output I would like to be able to achieve. This is based on the data in the url mentioned earleir today.

The attachment output2.bmp show part of the pdf that is currently published by the US government. This runs to hundreds of pages, but we are only interested in a few, hence the desire to automatically extract only the parts we are interested in.

If you want a full fake up of the data based on the URL, just let me know.

Many thanks

Tim

Tim Pain Ten Centuries Points: 1136 More actions · Answer 11

Sorry, should have also said, that we are only interested in data from the source file, where the TTITLE contains the following text -

Premanufacture Notices Received From

Notices of Commencement From

Test Marketing Exemption Notices Received From

Tim

LutzM SSC Guru Points: 107049 More actions · Answer 12

I had a look at the bmp file you posted. That's not what I expected.

Please provide table structure (CREATE TABLE) and some INSERT INTO ... SELECT ... UNION ALL statements that will actually show your populated target table based on a source to either be provided or referenced.

Regarding resetting to one with the ROW_NUMBER approach: yes, that's intentional, since I assumed that's what you've been looking for. Instead of using some sort of a loop I strongly recommend you look up ROW_NUMBER in BOL (BooksOnLine, the SQL help system usually installed together with SQL Server) and make the changes you need.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Tim Pain Ten Centuries Points: 1136 More actions · Answer 13

I hope this is what you expected.

The table structure for storing the pivoted data will look like this -

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[c]') AND type in (N'U'))

DROP TABLE [dbo].[c]

GO

CREATE TABLE [dbo].[c](

[IID] [int] IDENTITY(1,1) NOT NULL,

[c_date] [varchar](25) NULL,

[Title] [varchar](100) NULL,

[CaseNo] [varchar](200) NULL,

[Received] [varchar](200) NULL,

[EndDate] [varchar](200) NULL,

[Manufacture] [varchar](200) NULL,

[Usage] [varchar](200) NULL,

[Chemical] [varchar](200) NULL

) ON [PRIMARY]

GO

And the insert would look something like this (but would need to include the Title part) -

INSERT INTO c(c_date, /*Title,*/ CaseNo, Received, EndDate, Manufacture, Usage, Chemical)

SELECT b_row2,

[1] as [CaseNo],

[2] as [Received],

[3] as [EndDate],

[4] as [Manufacture],

[5] as [Usage],

[6] as [Chemical]

FROM (SELECT b_row2, b_col, b_val FROM b) AS src

PIVOT (MIN(b_val) FOR b_col IN ([1],[2],[3],[4],[5],[6])) AS pvt

ORDER BY b_row2

I'm sorry for the confusion on the Row_number issue, and your assumption was very reasonable. But I don't see how ROW_NUMBER is going to help to solve the problem - BOL says "Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition", but I need a number that is repeated n times for each group. In theory you could use ROW_NUMBER together with PARTITION, but I don't see anything to Partition against in the current data.

Tim

MacMcAleer Old Hand Points: 328 More actions · Answer 14

Couldn't you use XSLT to transform the original large XML file to one containing just the XML elements you want and then load this into SQL Server?