How to avoid duplicating rows in query results when joining 2 tables

Question

Post reply

How to avoid duplicating rows in query results when joining 2 tables

b.miller

SSC Journeyman

Points: 77
More actions
May 5, 2012 at 1:10 pm

#268382

The below scenario represents my question in it's simplest form.
I am trying to join an Item History table to an Invoice History table. Unfortunately the columns I have available to me in the Item History table for join criteria do not always define a unique record in the Invoice History table. I get repeating results included when there are multiple invoices generated for the same item against the same order and line for the same quantity on the same day. Each Invoice History transaction REALLY only corresponds to ONE Item History transaction, but I have insufficient information in Item History to be able to identify the specific corresponding Invoice History transaction
How can I limit the rows returned by this query so that the Item History quantity change is only represented once even if there are multiple Invoice History transactions that meet the join criteria? It seems as though there should be a simple solution, but I am having a difficult time finding it.
create table dbo.ItemHistory
(Item varchar(30)
,OrderNo numeric(9,0)
,OrderLine numeric(4,0)
,TxDate datetime
,QtyChg numeric(10,0)
)
go
create table dbo.InvoiceHistory
(Item varchar(30)
,OrderNo numeric(9,0)
,OrderLine numeric(4,0)
,InvoiceNo numeric(9,0)
,InvoiceDate datetime
,InvoiceQty numeric(10,0)
)
go
insert into
dbo.ItemHistory
values
('123',1,5,'2012-01-01',10)
,('123',1,5,'2012-01-01',10)
,('124',2,5,'2012-01-01',5)
,('125',3,5,'2012-01-01',25)
go
insert into
dbo.InvoiceHistory
values
('123',1,5, 1,'2012-01-01',10)
,('123',1,5, 2, '2012-01-01',10)
,('124',2,5, 3, '2012-01-01',5)
,('125',3,5, 4, '2012-01-01',25)
go
Here is the query I am struggling with. Note that the problem is with Item 123.
select
itm.Item
,itm.OrderNo
,itm.OrderLine
,CONVERT(varchar, itm.TxDate, 101) as TxDate
,itm.QtyChg
,inv.InvoiceNo
,CONVERT(varchar, inv.InvoiceDate, 101) as InvoiceDate
,inv.InvoiceQty
from
dbo.ItemHistory itm
join
dbo.InvoiceHistory inv
on
inv.Item = itm.Item
and inv.OrderNo = itm.OrderNo
and inv.OrderLine = itm.OrderLine
and inv.InvoiceDate = itm.TxDate
and inv.InvoiceQty = itm.QtyChg
With the columns available to me for joining, the first two Item History rows each match to the first two Invoice History rows, giving me the cartesian product result of 4 rows returned. Rows 5 and 6 represent a one to one correspondence between the 2 tables and are not part of the problem.
The result I am after is to have Invoice Numbers 1 and 2 each represented only once.
ItemOrderNoOrderLineTxDateQtyChgInvoiceNoInvoiceDateInvoiceQty
1231501/01/201210101/01/201210
1231501/01/201210101/01/201210
1231501/01/201210201/01/201210
1231501/01/201210201/01/201210
1242501/01/20125301/01/20125
1253501/01/201225401/01/201225

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

vicdileo SSC Veteran Points: 293 More actions · Answer 1

Thank you for providing tables and data - this takes out a lot of the guesswork.

Here is a relatively new and efficient way to solve this kind of problem.

To your query I added

selectitm.Item, itm.OrderNo ,itm.OrderLine... as rownum

that adds a row, called rownum, that contains a sequence that re-starts with 1 whenever the Item, OrderNo and InvoiceNo grouping changes.

Then I wrapped it in an outer query that selects only rows that have rownum = 1.

Put all that together and here's the query that solves your problem:

select Item, OrderNo, OrderLine, TxDate, QtyChg, InvoiceNo, InvoiceDate, InvoiceQty, rownum

from (

selectitm.Item, itm.OrderNo ,itm.OrderLine

,CONVERT(varchar, itm.TxDate, 101) as TxDate

,itm.QtyChg, inv.InvoiceNo

,CONVERT(varchar, inv.InvoiceDate, 101) as InvoiceDate

,inv.InvoiceQty

,ROW_NUMBER() OVER (PARTITION BY itm.Item, itm.OrderNo, inv.InvoiceNo

ORDER BY itm.Item, itm.OrderNo, inv.InvoiceNo ) as rownum

from dbo.ItemHistory itm join dbo.InvoiceHistory inv on

(inv.Item = itm.Item

and inv.OrderNo = itm.OrderNo

and inv.OrderLine = itm.OrderLine

and inv.InvoiceDate = itm.TxDate

and inv.InvoiceQty = itm.QtyChg) ) X

WHERE rownum = 1

This solution uses windows functions (because the ROW_NUMBER, OVER, and PARTITION BY are applied the the result set, or "window", after the data has been retrieved).

This and other ways of solving this problem can be found here:

http://www.simple-talk.com/content/print.aspx?article=646

The solution used here is near the bottom of the article in the section titled, "New Techniques for Removing Duplicate Rows in SQL Server 2005".

Google "sql server windows functions" for more info on this solution.

- victor di leo

vicdileo SSC Veteran Points: 293 More actions · Answer 2

Oops.... where I wrote

To your query I added

select itm.Item, itm.OrderNo ,itm.OrderLine... as rownum

I actually meant to write

To your query I added

,ROW_NUMBER() OVER (PARTITION BY ... as rownum

b.miller SSC Journeyman Points: 77 More actions · Answer 3

Victor,

I believe this will absolutely resolve my issue. I will test it out shortly.

I have been working with sql for a few months now and have learned one thing so far. There is ALWAYS more to learn.

Thank you very much for the assistance.

Brian Miller

Paul White SSC Guru Points: 150467 More actions · Answer 4

If you omit the rownum column from the result, the query plan can use the ANY aggregate:

SELECT

X.Item,

X.OrderNo,

X.OrderLine,

X.TxDate,

X.QtyChg,

X.InvoiceNo,

X.InvoiceDate,

X.InvoiceQty

FROM

(

SELECT

itm.Item,

itm.OrderNo,

itm.OrderLine,

TxDate = CONVERT(char(10), itm.TxDate, 101),

itm.QtyChg,

inv.InvoiceNo,

InvoiceDate = CONVERT(char(10), inv.InvoiceDate, 101),

inv.InvoiceQty,

rownum =

ROW_NUMBER() OVER (

PARTITION BY itm.Item, itm.OrderNo, inv.InvoiceNo

ORDER BY itm.Item, itm.OrderNo, inv.InvoiceNo)

FROM dbo.ItemHistory itm

JOIN dbo.InvoiceHistory inv ON

inv.Item = itm.Item

AND inv.OrderNo = itm.OrderNo

AND inv.OrderLine = itm.OrderLine

AND inv.InvoiceDate = itm.TxDate

AND inv.InvoiceQty = itm.QtyChg

) AS X

WHERE

X.rownum = 1;

You could also just add a DISTINCT to the original query. Both of these approaches are imperfect though; the real solution is to fix the schema. Using a properly relational design will help avoid these issues arising in the first place.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi