RE: Importing - best approach to compare Excel row counts against the table created

SSC Guru

Points: 1003857

October 21, 2016 at 8:14 am

#1907973

Phil Parkin (10/21/2016)
Jeff Moden (10/21/2016)
Phil Parkin (10/20/2016)
Jeff Moden (10/20/2016)
Phil Parkin (10/20/2016)
Jeff Moden (10/20/2016)
Phil Parkin (10/20/2016)
First point to note is that a CSV file is a far far better thing to be importing from than an Excel file.
"It Depends". If someone exported a spreadsheet as a CSV file, you have all of the same problems plus the addition problem of dealing with only occasional quoted identifiers on text-based items that have a comma in them. Huge PITA compared to reading from the Excel file directly using ACE Drivers.
Actually, even then, you don't have all the same problems.
Knowing your SSIS proficiency level, you'll have to trust me on this:-) An important difference is that, with flat files in SSIS, you can set the datatype of all the underlying columns manually, rather than have the Excel driver sometimes insist that you are importing a numeric() when you know it should be a varchar(), then throw an error when it hits the (valid) string on row 5,000. That Excel driver is a steaming purulent carbuncle.
The IMEX parameter is the solution for that particular problem.
IMEX = 1? Still doesn't work if your string data is way down the file, unless you hack the registry to increase Excel's 'number of rows to check' sample size. Or maybe you know something I don't.
I'll have to try longer spreadsheets because I've not run into that problem using IMEX=1, which is claimed to force everything in the spreadsheet to character-based values.
Jeff, please read this thread, particularly Todd McDermid's response, in which he explains the problem with IMEX=1.
The only near-bulletproof way I have found of doing this is to ensure that
a) IMEX=1 appears in the connection string, and
b) I read the column headers as data
Doing this forces Excel to treat all of the columns as text (assuming your column headers are not numeric!).
This brings its own overheads, of course (have to dispose of column headers, have to name columns manually based on ordinal, have to specifically convert to numeric/date where appropriate). But at least all of the data actually gets pulled in this way.

Good stuff, Phil. Thanks for the link. That explains why I've never had a problem with such a thing. I do both of those things most of the time and, when I don't, the spreadsheets are well-formed and don't have mixed column types except for the header.

As with (apparently) many folks, I had a basic misunderstanding of what IMEX actually does. According to Todd's line (which is broken and I had to do a little search on DougBert's site to find it), IMEX does NOT treat everything like text. It only treats mixed-type columns as text and suffers from the same 8 row sample limit as everything else.

As I recalled, there is a connection string setting to change the sample row limit (MaxScanRows), Supposedly, it can be set from 1-16 but some claim that if you set it to 0, that it will sample all rows in the spreadsheet. Of course, that will also take some extra time because that would be the first of two complete reads of the spreadsheet. Unfortunately, some claim the MaxScanRows extended is broken even in the current version of the driver. You've just got to know I'll be testing that. 😀

There's also the claim that the ACE driver has a problem with spreadsheets over 32k rows. Most of that claim is has to do with VB scripts where the driver doesn't recognize an EOF when using a VB loop to read the rows and, therefor, loops forever. Of course, I'll check that out, as well.

Again, thanks for the heads-up about even IMEX=1 not working because of the default MaxScanRows ("TypeGuesRows" setting in the registry). I've given a couple of presentations on how to auto-magically overcome the problem of importing some pretty complex spreadsheets that have columns and rows added every month (including overcoming the problem of having to manual name columns). I'll add my test findings to that and try to get an article out on all of it.

As you might also guess, it also demonstrates that SSIS might be the wrong choice for importing spreadsheets. 😛

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)