February 23, 2006 at 1:03 pm
My sales data table.
TransactionID Product Quantity SalesDollars StoreID
100101 A 1 $10.00 40
100101 B 2 $15.00 40
100101 C 1 $50.00 40
100101 D 3 $18.00 40
100102 B 3 $21.50 40
100102 D 6 $36.00 40
100103 A 1 $10.00 40
100103 D 10 $70.50 40
100103 B 2 $15.00 40
100105 A 1 $10.00 53
100105 B 2 $15.00 53
100105 C 1 $50.00 53
100105 D 3 $18.00 53
100106 B 3 $21.50 53
100106 D 6 $36.00 53
100107 A 1 $10.00 53
100107 D 10 $70.50 53
100107 B 2 $15.00 53
This is the character of the dataset I am dealing with. However in real life I have 400 products, thousands of transactions, and 20 million records for a year sales. Transactions are from 40 stores and hundreds of customers.
I have my own ideas how to solve the following data problem, but I am after differing points of view.
I need to determine a data relationship query or matrix between table rows. To answer questions like when Product A is sold how often does product B in the same transaction in same store? Then the same question has the following permutations and beyond.
Store, Customer, Product
Store, Customer, Product Category
Customer, Product Category
Customer Group, Product
Etc.
I thought I build a Transaction summary table, that turns the data information from a row to a column, for SQL searches, So I can search field Product LIKE ‘B’
Like
TransactionID Product Quantity SalesDollars StoreID
100107 A,B,D 13 $95.50 53
100106 B,D 9 $57.50 53
etc
February 24, 2006 at 3:41 am
I'm not 100% sure what you are asking here, but I don't like your summary table idea at the bottom. Merging the rows, by transaction will be much harder to retrieve. Using a LIKE instead of = is also much slower. I think you want to normalize your tables like this with these relationships: I put Primary Keys in Bold and Foreign Keys in Italics:
Create Table PRODUCT_CATEGORY (CatId int not null, Description ....)
Create Table CUSTOMER_GROUP (GroupId int not null, Description ...)
Create Table STORE (StoreId int not null, other_store_col ...)
Create Table CUSTOMER (CustomerId int not null, GroupId int not null, ...)
Create Table PRODUCT (ProductId int not null, CatId int not null, Name ...)
Create Table TRANSACTION (TransId int not null, CustomerId int not null, StoreId int not null, Date datetime not null)
Create Table TRANS_ITEM(TransId int not null, ProductId int not null, Quantity int not null, SalesDollars money not null)
Now create the Primary and Foreign Keys that I've mentioned and you can run all the queries you want !! For example if you want "if Product A is sold how often does product B in the same transaction in same store? "
select *
from TRANS_ITEM i inner join TRANS_ITEM ii on i.TransId = ii.TransId
inner join TRANSACTION t on i.TransId = t.TransId
inner join TRANSACTION tt on ii.TransId = tt.TransId
where i.ProductId = 'A'
and ii.ProductId = 'B'
and tt.StoreId = 'Mystore'
and t.StoreId = 'Mystore'
If you can also use the GROUPBY functions to groups your data by CUSTOMER_GROUP or PRODUCT_CATEGORY or lots of things. I put a date in the TRANSACTION table so you can ask questions like "how many PRODUCT A did we sell on mondays?" "How many did we sell this month"? THis quarter? etc etc
The trick here is normalizing your data so that the tables have ONLY the columns that apply to that table. Once the data is in the right place, you can run all the queries you want, with great speed.
Hope this helps,
February 24, 2006 at 3:43 am
OOPs my BOLD and Italics got lost !! ugggh
See if my above post makes sense anyway.
February 24, 2006 at 10:55 am
All the tables except TRANS_ITEM have a unique ID, however the TRANS_ITEM table have many TransID to ProductID, thus TransID is not unique in the table. And thats my point.
So if I wanted to add up transactions with product A and B in same transaction. I would get in the scase of TransID 100107
100107 A 1 $10.00 53
100107 B 2 $15.00 53
So a question like how many transactions have sold A and B products, this resultset answers a count of 2 ( ie 2 rows), when the TransID count is really 1 of products A and B in same transaction.. I am trying to avoid DISTINCT as it slow on a big table. So how to get a result set of only ONE row for ONE transaction that has multiple rows of data per transaction.
Any ideas on this ?
February 27, 2006 at 3:32 am
Sorry for the slow reply, I don't surf on the weekends!
If you want to count how many transactions where are that contain both Product A and B, I'd try
select count( distinct t.transid) from TRANSACTION t inner join TRANS_ITEM a on a.transid = t.transid
inner join TRANS_ITEM b on b.transid = t.transid
where a.ProductId = 'A' and b.ProductId = 'B'
but this uses a distinct, which you said you didn't want. I don't see why this would be slow though if you have primary keys defined with indexs. You could also try:
select count(*) from TRANSACTION where transid in
(select a.TransId from TRANS_ITEM a inner join TRANS_ITEM b on a.transid = b.transid
where a.ProductId = 'A' and b.ProductId = 'B')
The slow thing here is the "in" clause. I don't know which is faster. I don't have a million row table to test it. If you have indexes on the PKs and FKs, these should be ok.
Hope this helps!
Viewing 5 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply