• Chad Crawford (4/29/2016)


    Solomon Rutzky (4/29/2016)


    TomThomson (4/28/2016)


    Jeff Moden (4/28/2016)


    Grant Fritchey (4/28/2016)


    Technical question:

    I want to export large amounts of data. However, I don't want to use SSIS; I'm going to use SQLCMD.EXE. I need to be able to script these exports so that I do, let's say, 10,000 rows at a time, in chunks. I'm doing this so that I can easily restart, and so that I can start migrating or moving the exported files as each chunk finishes.

    What's the best way to, in a relatively generic fashion, efficiently divide a table into these 10,000 row chunks?

    Articles, blog posts, some documentation to get this going in the right direction. Please!

    Oh, and, what the heck... URGENT!

    Ha!

    BCP and xp_CmdShell. Simple and very fast.
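    For what it's worth, a minimal sketch of that approach might look something like the following, assuming xp_cmdshell is enabled, an ever-increasing integer key, and made-up names (MyServer, MyDb, dbo.SomeBigTable, Id, C:\Export\). It carves the table into ranges of 10,000 key values rather than exactly 10,000 rows:

    DECLARE @FromId  INT = 1,
            @MaxId   INT,
            @ChunkNo INT = 1,
            @Cmd     VARCHAR(4000);

    SELECT @MaxId = MAX(Id) FROM dbo.SomeBigTable;

    WHILE @FromId <= @MaxId
    BEGIN
        -- Each chunk goes to its own file so it can be moved as soon as it's done.
        SET @Cmd = 'bcp "SELECT * FROM MyDb.dbo.SomeBigTable WHERE Id BETWEEN '
                 + CAST(@FromId AS VARCHAR(20)) + ' AND '
                 + CAST(@FromId + 9999 AS VARCHAR(20))
                 + '" queryout "C:\Export\Chunk_' + CAST(@ChunkNo AS VARCHAR(10))
                 + '.dat" -n -T -S MyServer';

        EXEC master.dbo.xp_cmdshell @Cmd;

        SELECT @FromId  = @FromId + 10000,
               @ChunkNo = @ChunkNo + 1;
    END;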

    Yes, old school is often best. It worked a decade and a half ago, and it can still work today.

    If the stuff contains duplicate rows (:sick:), you need to take care of that when recording where you are up to, and it may mean you can't use precise 10,000-row chunks. In any case you need to make sure your copying and your recording of where you are at happen in the same transaction. I don't recall there being any problems apart from that (but it's a long time since I last did it).

    As an alternative to xp_cmdshell and BCP, do it with a SQL Agent job: a job step copies 10,000 rows with an insert-select statement (plus, via another select, some extra rows if the last of those rows is not the last (or only) row with a particular key value), then writes a where-I'm-up-to record. After that it can delay for a while and then perhaps pick itself as the next step (I can't remember if you need two steps to do this - it's all so long ago), or it can put the copying, place-marking, and delay code in a while loop within the step, or it can just exit (and take the next 10,000 rows when next it's scheduled). That too is old school that will still work.
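    A rough sketch of that bookmark idea, with placeholder names (dbo.SourceTable, dbo.TargetTable, dbo.ExportBookmark, key column Id) rather than anything from a real system, could go like this. TOP ... WITH TIES is one way to pick up the "extra rows" when the 10,000th row shares its key value with the rows that follow it:

    DECLARE @LastId BIGINT, @NewLastId BIGINT;

    SELECT @LastId = LastIdCopied FROM dbo.ExportBookmark;

    BEGIN TRANSACTION;

        -- Grab the next chunk; WITH TIES pulls in the extra rows when the
        -- 10,000th row shares its Id with rows that would otherwise be split off.
        SELECT TOP (10000) WITH TIES Id, Col1, Col2
        INTO   #Chunk
        FROM   dbo.SourceTable
        WHERE  Id > @LastId
        ORDER  BY Id;

        INSERT INTO dbo.TargetTable (Id, Col1, Col2)
        SELECT Id, Col1, Col2 FROM #Chunk;

        -- Record where we are up to in the same transaction as the copy.
        SELECT @NewLastId = MAX(Id) FROM #Chunk;

        IF @NewLastId IS NOT NULL
            UPDATE dbo.ExportBookmark SET LastIdCopied = @NewLastId;

    COMMIT TRANSACTION;

    DROP TABLE #Chunk;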

    Just out of curiosity, how did you (or anyone else) solve the problem of consistency across multiple queries / transactions? I mean, without duplicating the entire table at the very start of the export process, wouldn't this process be subject to most of the pitfalls of sensitive / dynamic (i.e. non-STATIC) cursors and non-Serializable transactions? There will be rows that disappear between queries, new rows added (possibly at positions earlier in the sequence than records already exported), and rows updated after being written to file but before the end of the operation. Each of these scenarios (and possibly others) could leave the total set of export files in conflict with itself, depending on the business rules for how that particular data works. Hence the files might not be importable.

    I have an idea or two, but I don't think that they would work in all situations so I am wondering how you overcame this in the past.

    I had the same concern you did about consistency. The only items I came up with were Serializable Isolation (please no!), Snapshot Isolation (single connection), backup/restore an offline copy, CDC (+ some complex reconciliation scripts), and Database Snapshots. It seems like you either need to block everyone from making changes, or have some way of capturing and reconciling changes like snapshot isolation does.
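    Of those, the database-snapshot route is probably the simplest to sketch. Something along these lines (the names and file path are made up) gives every chunked export the same point-in-time view of the data, at the cost of the sparse-file I/O while the snapshot exists:

    -- Create a point-in-time snapshot; MyDb_Data is assumed to be the logical data file name of MyDb.
    CREATE DATABASE MyDb_ExportSnap
    ON ( NAME = MyDb_Data,
         FILENAME = 'C:\Snapshots\MyDb_ExportSnap.ss' )
    AS SNAPSHOT OF MyDb;

    -- Point all of the chunked SQLCMD/BCP queries at MyDb_ExportSnap instead of MyDb,
    -- so every chunk reads the same consistent state no matter how long the export runs.

    -- Drop the snapshot once the last chunk has been exported.
    DROP DATABASE MyDb_ExportSnap;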

    Chad

    It's all too long ago, and I'm not sure what we did. I have a vague recollection of two quite different things that were used in two different cases.

    The second one was pretty trivial - the stuff being exported was history, and essentially the records in the original db that were old enough to be of interest didn't get updated or deleted (actually they eventually did get deleted; we cleared stuff out of the active db once it had been included in the history db's full backup at least twice), and the cluster keys were strictly increasing, so there was no problem with inserts. The apps there ran with either repeatable read or serializable, and I think the export used repeatable read.

    The first (much earlier) one was rather more complicated, and it was on a strictly non-critical system (research, not production - not product development). But it was history oriented in a sense. The active database was such that its state at any time in the past was recoverable; an app could choose to read the database as it had been at a certain date. The initial export was done by reading the state at a past instant and exporting it; subsequent updates to the exported data were done by reading a newer state and updating anything that was different from the exported state or was not in it. There were of course timestamps all over the place; it was at least partly based on some of Date's ideas (I think - they might have been someone else's; I've been out of touch with database research for too long to be sure that they came from him) about a normal form for databases that would permit complete knowledge of database history.

    It strikes me as very possible that neither of those history-based approaches would be applicable to Grant's problem. But the idea of trying to do it using just isolation levels doesn't seem likely to work either. Serializable is probably OTT, since repeatable read may be adequate (maybe that depends on what sort of updating is going on; I haven't thought it through), but I suspect it isn't going to deliver high performance; nor is snapshot isolation - tempdb is probably going to grow too much when a vast collection of export chunks has to be mutually consistent and snapshot isolation is used to ensure that by building them all in one enormous transaction.
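    For concreteness, the snapshot-isolation variant would have to look roughly like this (all names are made up), with every chunk read on the same connection inside the one transaction - which is exactly why the version store in tempdb can balloon:

    ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;

    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    BEGIN TRANSACTION;

        -- All chunk reads have to happen on this one connection, inside this one
        -- transaction, for the chunks to be mutually consistent. The row versions
        -- backing that view accumulate in tempdb until the COMMIT.
        SELECT Id, Col1, Col2 FROM dbo.SomeBigTable WHERE Id BETWEEN 1     AND 10000;
        SELECT Id, Col1, Col2 FROM dbo.SomeBigTable WHERE Id BETWEEN 10001 AND 20000;
        -- ... one such query per chunk ...

    COMMIT TRANSACTION;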

    Tom