xml in subselect and performance => alternatives?

Question

xml in subselect and performance => alternatives?

WolfgangE

SSCertifiable

Points: 6328
More actions
November 21, 2012 at 9:12 am

#393641

hi there,
many times there is the requirement of selecting multiple records into one xml-string. Usually you do this using a subselect. Here's a simplified example for showing what i mean (most of you will be familiar with this kind of query):
declare @Parcel table (
ParcelId int,
Shipper varchar(50),
Consignee varchar(50)
);
declare @ParcelDetail table(
ParcelId int,
ScannedAtStation varchar(50),
ScannedAtTime datetime
);
insert into @Parcel values
(1, 'Romeo', 'Julia' ),
(2, 'Donald', 'Daisy' ),
(3, 'someone', 'nobody' )
insert into @ParcelDetail values
( 1, 'Austria', '20120101' ),
( 1, 'Germany', '20120103' ),
( 1, 'UK', '20120110' ),
( 2, 'Spain', '20120101' ),
( 2, 'Portugal', '20120201' ),
( 3, 'Italy', '20120301' ),
( 3, 'Switzerland', '20120310' )
select
p.ParcelId,
p.Shipper,
p.Consignee,
TransportRoute =
(
select *
from @ParcelDetail detail
where detail.ParcelId = p.ParcelId
for xml auto
)
from @Parcel p
The problem in my case is the performance if the number of records raises. Of course the last statement is a loop: make a table scan (or clustered index seek if possible) on the inner table for each outer table.
My target is to eliminate the loop. I think if I scan each table once I have all the data i need to serve the whole query.
My idea is to join the two tables and let the XML be generated by a kind of aggregate. For example I join the two tables and make a group by, so both tables are only scanned once, like this:
select
p.ParcelId,
p.Shipper,
p.Consignee,
TransportRoute =
(
select detail.*
for xml auto
)
from @Parcel p inner join
@ParcelDetail det on
det.ParcelId = p.ParcelId
group by
p.ParcelId,
p.Shipper,
p.Consignee
This doesn't work. The question is: can we realize this query?
Another try would be following: as the "for xml auto" expects a table where the data is selected from I tried to simulate a table using the values-clause. Without success.
select
p.ParcelId,
p.Shipper,
p.Consignee,
(
select *
from (
values(
det.ParcelId,
det.ScannedAtStation,
det.ScannedAtTime
)) as x(col1)
with xml auto
)
from @Parcel p inner join
@ParcelDetail det on
det.ParcelId = p.ParcelId
group by
p.ParcelId,
p.Shipper,
p.Consignee
By the way: the xml is requestet this way, there is no sense in discussing whether there are alternatives to the xml string itself or not 😉 only the way getting there is the question 🙂
Thank's a lot for any suggestions,
Wolf

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

GSquared SSC Guru Points: 260824 More actions · Answer 1

The way you're doing it is pretty standard. You could switch from an inline sub-query to a Cross/Outer Apply in the From clause, but it won't really do much except change how the query reads.

In your actual query, as opposed to the sample provided, is it actually using table variables, or does it use persisted tables? If so, check the indexing on them. If it's done right, that kind of query can be very fast, unless the amount of data being returned is simply huge.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

WolfgangE SSCertifiable Points: 6328 More actions · Answer 2

hello GSquared,

thank's for your answer.

Using the subselect as shown in the example or switching it to a cross apply ends up in exactly the same execution plan.

The table variables are only for making a quick example for understanding.

In our real database the two tables are permanent tables, each containing several 10thousands rows.

The query above is executed several times within a daily process. The outer table gets filtered by an Id and returs just a few rows most of the time. For some cases the outer table returns about 20.000 to 40.000 rows. The inner select does a clustered index seek, which is quite well. But the high number of index seeks results in a query duration of more than 10 seconds. That's too long.

I personally think making 40.000 index seeks is much slower than 1 partial index scan. For this I try to find a solution where each table is scanned only once.

In any other query the optimizer surely would switch to an index scan instead of making that much lookups but with the subquery I force him to loop. And that's the target: eliminate the loop.

Good evening, Wolf

GSquared SSC Guru Points: 260824 More actions · Answer 3

Can you save the execution plan as a .sqlplan file and post that here? That would definitely help.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

WolfgangE SSCertifiable Points: 6328 More actions · Answer 4

hi. here is the execution plan. some explanations:

@InsertedCumulatedDataIds:

A table filled at the beginning of the procedures. For this Ids we have to write records in table #tmpProductHistory.

The tables DetailledDataANP and Product are joined to the @InsertedCumulatedDataIds to get all the information necessary for the later insert.

In the subselect the content of table DetailledDataANP and CumulatedData are stored as XML for each of the records.

The #tmpProductHistory is copied to the permanent table later.

This is the query:

insert into #tmpProductHistory( ProductId, IdentCode, PartitionKey, ProductHistoryTypeId, ProductHistoryReasonId, EventContent )

select

anp.ProductId,

anpp.Identcode,

anp.PartitionKey,

1,

(

select *

from billing.DetailledDataANP as Innen join

billing.CumulatedData as f on

f.CumulatedDataId = Innen.CumulatedDataId

Where Innen.DetailledDataANPId = anp.DetailledDataANPId

for XML Auto

) As EventContent

from billing.CumulatedData fact inner join

billing.DetailledDataANP anp on

fact.CumulatedDataId = anp.CumulatedDataId inner join

dbo.Product anpp on

anp.ProductId = anpp.ProductId

where fact.CumulatedDataId in ( Select ids.Id From @InsertedCumulatedDataIds as ids );

GSquared SSC Guru Points: 260824 More actions · Answer 5

Approximately how many rows does that table variable end up having in it when you run this?

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

WolfgangE SSCertifiable Points: 6328 More actions · Answer 6

The table variable contains just a few rows, maybe 200 for the "large" processes. But the join sometimes results in more than 30.000 rows.

GSquared SSC Guru Points: 260824 More actions · Answer 7

Can you replace the table variable with a temp table and see what that does?

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

WolfgangE SSCertifiable Points: 6328 More actions · Answer 8

not really, restricted access and so on...

But I don't think this makes sense anyway. It won't change the fact that the subselect makes a lookup for any outer row.

Maybe I manage to build a similar example...

WolfgangE SSCertifiable Points: 6328 More actions · Answer 9

ok, I could make an example. As expected, the where- and from-task has changed a little bit, but not the loop join for the subselect.

GSquared SSC Guru Points: 260824 More actions · Answer 10

Against small datasets, Nested Loop Joins are often the most efficient. SQL Server pretty much assumes table variables will have 1 row in them. Since it doesn't maintain stats on the contents of table variables, it's pretty much stuck with that. So, table variables often cause a Nested Loops Join when they interact with other tables. That's usually okay if they only have a few rows in them, but can seriously tank performance if they have lots of rows.

That's why I asked about converting to a temp table. Those maintain stats, which gives the query optimizer a better chance of picking a more efficient join method (Hash or Merge) on larger datasets.

Table variables are a notorious performance-killer. They can be very useful if you need to persist data changes past a rollback (since they aren't subject to most rollback operations), or if you know you're going to be dealing only with very small datasets, or if you need to use one as an input parameter (SQL 2008 and beyond). If you go past those limited uses, you're usually much better off using a temp table.

Since you can't use one, you may not be able to do much about tuning this query. You may be stuck with leaving it as-is.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

WolfgangE SSCertifiable Points: 6328 More actions · Answer 11

The reason for the loop join is not the table variable. It is the subselect for the XML-output. It doesn't matter which kind of tables you take, when using the XML-output like shown in the first example there will always be a loop join like with nearly any other subselect I know.

The table variable in my case is only part of the where-condition of the outer select.

WolfgangE SSCertifiable Points: 6328 More actions · Answer 12

Here is the same example using large, permanent tables.

create table dbo.Parcel (

ParcelId int,

Shipper varchar(50),

Consignee varchar(50)

);

create clustered index cix_dboParcel_ParcelId on dbo.Parcel ( ParcelId asc );

create table dbo.ParcelDetail(

ParcelId int,

ScannedAtStation varchar(50),

ScannedAtTime datetime

);

create clustered index cix_dboParcelDetail_ParcelId on dbo.ParcelDetail ( ParcelId asc );

declare @counter int = 0;

set nocount on;

while @counter < 50000

begin

insert into dbo.Parcel values

(1*@counter, 'Romeo', 'Julia' ),

(2*@counter, 'Donald', 'Daisy' ),

(3*@counter, 'someone', 'nobody' )

insert into dbo.ParcelDetail values

( 1*@counter, 'Austria', '20120101' ),

( 1*@counter, 'Germany', '20120103' ),

( 1*@counter, 'UK', '20120110' ),

( 2*@counter, 'Spain', '20120101' ),

( 2*@counter, 'Portugal', '20120201' ),

( 3*@counter, 'Italy', '20120301' ),

( 3*@counter, 'Switzerland', '20120310' )

set @counter += 1;

end

select

p.ParcelId,

p.Shipper,

p.Consignee,

TransportRoute =

(

select *

from dbo.ParcelDetail detail

where detail.ParcelId = p.ParcelId

for xml auto

)

from dbo.Parcel p

where p.ParcelId between 10 and 20

select

p.ParcelId,

p.Shipper,

p.Consignee,

TransportRoute =

(

select *

from dbo.ParcelDetail detail

where detail.ParcelId = p.ParcelId

for xml auto

)

from dbo.Parcel p

where p.ParcelId between 1 and 100000

drop table dbo.Parcel;

drop table dbo.ParcelDetail;

When looking at the execution plan of the two selects we see that the index seek is done on both tables, but the subselect still causes the loop join which I want to eliminate.

GSquared SSC Guru Points: 260824 More actions · Answer 13

With that same structure, try this:

SELECT *

FROM dbo.Parcel AS P

INNER JOIN dbo.ParcelDetail AS PD

ON P.ParcelId = PD.ParcelId

WHERE P.ParcelId BETWEEN 10 AND 20;

You'll still get a Nested Loops Join.

Has nothing to do with it being a sub-select. Has to do with data distributions.

The second table, in this sample, has between 2 and 7 rows per ParcelID value. The most efficient way to handle that, in this case, is a Nested Loops Join.

You might be able to force this into a Merge Join if the indexes are right and the data volume is high enough, but odds are SQL Server would pick a Merge Join if it would actually be more efficient.

Your test script also produces multiple identical ParcelID values in the Parcel table. Does that parallel your actual environment, or is ParcelID unique in the table that is meant to parallel? If it isn't unique, that will also complicate the join. When I deduped that table and set the clustered index to Unique, the execution plan remains the same, but the execution time on a 10k-row query went from 452 ms CPU time and 1.217 seconds total time, down to 200 ms CPU and 700 ms total. Almost twice as fast. Of course, it returns less rows (10k vs 18k) once the primary table is deduped, which accounts for some of the speed-up, but some of it is also the stats on the table are treated differently with a unique index than with a non-unique one.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon