Looking for some design advice

  • I have a file that I pick up every day using a script I wrote. I had to use the script because it's hitting an SFTP site rather than an FTP site. That part is working great.

    What I need advice on is what to do with it. The way the system works, there is a three-day window during which they can make changes, so what we did was have the import file they prepare for us every night include six days' worth of data. So the file for 4/28 runs at 3 am and contains the data for 4/21 - 4/27.

    My first thought was to just throw the data into a table and look for the max date, using two other columns to differentiate rows. What I'm realizing is that this table is going to get HUGE very quickly.

    The table has 7 columns:

    date, facilityID, measureID, score1, score2, score3, score4

    What I'm thinking now is that I can build in some logic that will go through the file line by line, see if there is already a row with that date/facilityID/measureID combo, update it with the new scores if there is, or just insert it if not.
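
    Roughly, for each line of the file I'd do something like this (the table name dbo.Scores and the @ parameters are just placeholders for my real table and the parsed values):

    IF EXISTS (SELECT 1 FROM dbo.Scores
               WHERE [date] = @date AND facilityID = @facilityID AND measureID = @measureID)
    BEGIN
        -- row already there: overwrite the scores
        UPDATE dbo.Scores
        SET score1 = @score1, score2 = @score2, score3 = @score3, score4 = @score4
        WHERE [date] = @date AND facilityID = @facilityID AND measureID = @measureID;
    END
    ELSE
    BEGIN
        -- new row: insert it
        INSERT INTO dbo.Scores ([date], facilityID, measureID, score1, score2, score3, score4)
        VALUES (@date, @facilityID, @measureID, @score1, @score2, @score3, @score4);
    END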

    Am I taking the long way around, or does this sound like a good way to go?

    thanks for any advice.

  • Why not update all at once, and then do an insert of the rows that aren't in the target table? Two transactions. Should be pretty efficient, much more so than a line-by-line process.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (4/30/2009)


    Why not update all at once, and then do an insert of the rows that aren't in the target table? Two transactions. Should be pretty efficient, much more so than a line-by-line process.

    Don't I have to parse through the tables to check them in the db? I thought I would have to look the rows up line by line to see what matched and what didn't. The text file is a CSV. I'm not sure what you mean. Sorry, I'm just getting started with SSIS and they expect miracles 🙂

  • Sorry, assuming a bit here, aren't I?

    What I'm suggesting is importing the CSV into a holding table, then using that to batch-update the real table.

    You'd start out by downloading the CSV, which you've already got working (or so I gather).

    Then, truncate the holding table, and follow that by importing the file into it.

    From what you're describing, you just want the most recent row for each facility ID and measure ID. Is that correct? If so, then you could drop the older rows for each of those, either by inserting only the rows you want into a stage 2 table, or by deleting the rows you don't want from the holding table.

    Once you have that done, you have a set of current data in the staging table (or staging 2 table), and it's ready to be moved into the main table.

    You can do an Update...From to update the existing data in the main table, and then follow that with an Insert...Select based on a Left Outer Join, which will get all your data into the main table. Alternatively, you could delete the rows in the main table that match the facility IDs and measure IDs in the staging table and then insert the whole thing into the main table. Either one will get you all your updated data.

    The basic idea is to get the data all cleaned up and ready to go, then do what's called an upsert (update plus insert) in two steps.
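
    Something along these lines, for example (the table and column names here are just placeholders for your real schema):

    -- step 1: update rows that already exist in the main table
    UPDATE m
    SET m.score1 = s.score1,
        m.score2 = s.score2,
        m.score3 = s.score3,
        m.score4 = s.score4
    FROM dbo.MainTable m
    INNER JOIN dbo.HoldingTable s
        ON s.[date] = m.[date]
       AND s.facilityID = m.facilityID
       AND s.measureID = m.measureID;

    -- step 2: insert rows from the holding table that aren't in the main table yet
    INSERT INTO dbo.MainTable ([date], facilityID, measureID, score1, score2, score3, score4)
    SELECT s.[date], s.facilityID, s.measureID, s.score1, s.score2, s.score3, s.score4
    FROM dbo.HoldingTable s
    LEFT OUTER JOIN dbo.MainTable m
        ON s.[date] = m.[date]
       AND s.facilityID = m.facilityID
       AND s.measureID = m.measureID
    WHERE m.facilityID IS NULL;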

    If you need more detailed help on this, I'd need the create statement for your final (main) table, and a small sample of the file you're importing, like a couple of hundred rows or so. With those, I can help write code for this, if you want that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I had the first option in mind, but I had a feeling you were about five steps beyond me.

    Thanks for the offer. I'd like to try to figure it out myself, but I'd also like to take you up on it at some point if it gets too crazy.

    So if I understand you correctly, I have two tables:

    Table 1 - the holding table for the full file.

    Table 2 - the final table that the production data is in.

    1) Simply dump the file into the table, no clean up.

    2) Compare Table 1 with Table 2 and see which rows are in there, and what aren't.

    3) Remove the duplicate rows from Table 2

    4) Insert the rows from Table 1 into Table 2

    Is this correct? The way I'm thinking of it is that if you compare the holding and final tables, and remove from the final table the rows that are contained in the holding table, you don't have to worry about an update; it's a straight insert, right?

    thanks

    M@

  • Sounds correct, except (hopefully) the rows that you are removing are 'matching', not 'duplicate'.

    The 'Upsert' idea, though slightly more complicated, should perform better, as it requires much less disk activity. If a lot of data is being deleted and recreated by the simpler method, the upsert may be worth the additional effort.

    Phil

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • So if I understand you correctly, I have two tables:

    Table 1 - the holding table for the full file.

    Table 2 - the final table that the production data is in.

    1) Simply dump the file into the table, no clean up.

    2) Compare Table 1 with Table 2 and see which rows are in there, and what aren't.

    3) Remove the duplicate rows from Table 2

    4) Insert the rows from Table 1 into Table 2

    That is definitely one way to do it. It's not necessarily the most efficient as far as the database is concerned, but it sure is simple.

    I think, though, from what you wrote in your original posts, that you'd want to remove older data from the import first, before you insert it into the final table.

    One way to do that would be something like this:

    ;with Dupes as
    (select row_number() over (partition by facilityid, measureid order by [date] desc) as Row,
        facilityid,
        measureid
    from dbo.MyHoldingTable)
    delete from Dupes
    where Row > 1;

    Of course, you could just leave that alone, but I think it'll end up being easier to do the final insert or upsert correctly if you do something like this first.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (5/1/2009)


    --

    One way to do that would be something like this:

    ;with Dupes as
    (select row_number() over (partition by facilityid, measureid order by [date] desc) as Row,
        facilityid,
        measureid
    from dbo.MyHoldingTable)
    delete from Dupes
    where Row > 1;

    nice idea!

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • There may be another way: if you use the JET engine (provider) in the OLE DB source and read the CSV file as a table (don't dump it into a table first), you can query the CSV file directly through the JET provider. I have used this to load only specific columns from a CSV file into a table, with conditions.
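
    A related way to try the same idea straight from T-SQL is OPENROWSET with the Jet text driver; the folder, file name, and filter below are made up, 'Ad Hoc Distributed Queries' has to be enabled on the server, and delimited text may also need a schema.ini:

    SELECT t.[date], t.facilityID, t.measureID, t.score1, t.score2, t.score3, t.score4
    FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
                    'Text;Database=C:\Import\;HDR=Yes;FMT=Delimited',
                    'SELECT * FROM scores.csv') AS t
    WHERE t.[date] >= DATEADD(day, -7, GETDATE());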

  • GSquared (5/1/2009)


    So if I understand you correctly, I have two tables:

    Table 1 - the holding table for the full file.

    Table 2 - the final table that the production data is in.

    1) Simply dump the file into the table, no clean up.

    2) Compare Table 1 with Table 2 and see which rows are in there, and what aren't.

    3) Remove the duplicate rows from Table 2

    4) Insert the rows from Table 1 into Table 2

    That is definitely one way to do it. It's not necessarily the most efficient as far as the database is concerned, but it sure is simple.

    I think, though, from what you wrote in your original posts, that you'd want to remove older data from the import first, before you insert it into the final table.

    One way to do that would be something like this:

    ;with Dupes as
    (select row_number() over (partition by facilityid, measureid order by [date] desc) as Row,
        facilityid,
        measureid
    from dbo.MyHoldingTable)
    delete from Dupes
    where Row > 1;

    Of course, you could just leave that alone, but I think it'll end up being easier to do the final insert or upsert correctly if you do something like this first.

    I think I gave the wrong impression. The application that is sending all this information updates it for up to four days after the original entry. So what I would want to do is first delete from the db table any row that has an update in the file, so I can import the updated data. I don't want to delete anything from the file itself; that's all either new data or data to be updated.

  • Makes sense. In that case, yeah, your process should be fine. Test it on a dev/test copy of the database, make sure it does exactly what you want, then you should be good to go.
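
    A bare-bones sketch of that delete-then-insert step could look like this (the table names are placeholders; join on whatever uniquely identifies a row for you):

    -- remove rows from the main table that the file is about to replace
    DELETE m
    FROM dbo.MainTable m
    INNER JOIN dbo.HoldingTable s
        ON s.[date] = m.[date]
       AND s.facilityID = m.facilityID
       AND s.measureID = m.measureID;

    -- then bring everything in from the holding table
    INSERT INTO dbo.MainTable ([date], facilityID, measureID, score1, score2, score3, score4)
    SELECT [date], facilityID, measureID, score1, score2, score3, score4
    FROM dbo.HoldingTable;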

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
