NPI Data

  • That particular problem is because there's a head row that you're not skipping.  Add the following to the options in the BULK INSERT command...

    FIRSTROW = 2,

    I tried the same thing an a different error showed up complaining that the data in column 330 (the certification date column) was too long.  I'm still trying to figure that out but give the FIRSTROW = 2 thing a try.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ok... I got this working BUT...

    The bastards lied on their record layout.  The "HEALTHCARE_PROV_TAXONOMY_GROUP_xx" columns aren't 10 wide in the data... they're up to 43 characters wide.  I changed the table for those 15 columns to VARCHAR(50) and it's working but the data has another surprise there.  It looks like they blended both the code and the description for HEALTHCARE_PROV_TAXONOMY_GROUP's together.  You folks might want to change things to separate the code from the description in your destination table.

    Here's what those entries look like...  the code on the left is 10 characters long... including the space after that, the description looks to be up to 40 characters (to be on the safe side.

    193200000X MULTI-SPECIALTY GROUP
    193400000X MULTIPLE SINGLE SPECIALTY GROUP

    There's also an anomaly in the data somewhere (probably right at the start) that causes the following error.  Very fortunately, we can fix it but it takes a bit of extra time because we have to use PowerShell to make a repaired copy of the file.

    Once fixed, it takes 00:03:34 to load all 6,448,360 rows on my laptop.  The PowerShell fix for the file copy takes about 00:03:02 to make the file copy on my laptop.

    Ironically, I found that fix by total luck.  The file is way too big to look at in Word, Excel, Notepad, and even NotePad++ so I went search for something that would copy just the first thousand rows and did all my import testing with that.  When I switched back to the bigger file, it kept giving me the following error.

    So. with all that being said, here's the BULK INSERT I used (with some comments).  I didn't use 3 part naming because I don't have the databases you do.  Add the database name back in if you need to.

    TRUNCATE TABLE dbo.SBHN_NPI_DATA
    ;
    BULK INSERT dbo.SBHN_NPI_DATA
    FROM 'C:\Temp\NPPES_Data_Dissemination_July_2020\npidata_pfile_20050523-20200712.csv'
    WITH (
    CODEPAGE = 'RAW' --For a little extra speed... like not having any collation
    ,DATAFILETYPE = 'char' --I always include this out of habit.
    --,BATCHSIZE = 50000 --This just gives you something to watch so you know it's working. I don't use it because I want Minimal Logging.
    ,FIRSTROW = 2 --This essentially skips the header row, which they didn't tell us about. It actually does it by counting delimniters.
    ,FORMAT = 'CSV' --New for 2017 and I'm liking it. It only took the 3 bloody decades to do it.
    ,FIELDQUOTE = '"' --Same here... new for 2017.
    ,FIELDTERMINATOR = ',' --CSV field delimiter
    ,ROWTERMINATOR = '\n' --Use to shift the control to next row... NOPE! It's just a delimiter. BULK INSERT does actually know what a row is.
    ,TABLOCK
    )
    ;

    I've attached the modfied CREATE TABLE as a .txt file (still amazed that an SQL Forum doesn't allow attaching an SQL file).

    And here's the PowerShell code I used to fix the bad file...

    powershell -command "& {get-content '%~1' -TotalCount 2000000000|set-content -path '%~dp1%~n1_FIXED%~x1'}"

    The "-TotalCount 2000000000" thing is just to remind me that if I change the number on that, I can quickly create a partial copy of the file to work with like the first 1000 rows or whatever.

    Let me know how it works out for you.  This was fun... seriously.  And, yes... Calling that fun means that I don't actually have a life. 😀

     

     

     

     

    Attachments:
    You must be logged in to view attached files.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks. For the powershell part, do I need to modify it for my PC or what do I need to do?

  • The Powershell code will work on any PC.  Although it doesn't need to be, I installed it as a .bat file in the same directory as the file.  That makes is so can execute it by providing the file name (and just the file name with extension) as part of the call to the .bat file or, from the file explorer in Windows, you can literally  click and drag the CSV file onto the .bat file and that will work, as well.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • p.s.  Try it with my changes to the table and the BULK INSERT command first without doing the powershell stuff.  I don't know if I just had a bit of a bad download or if it's actually necessary.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • BTW... I left a bit of a test in the table creation code and you should change it.  The first column should be a BIGINT and not VARCHAR(11).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • It still didn't work without the powershell. Is the attached what I need to do for the powershell? It doesn't seem to do anything.

  • It seemed to work if I change the rowterminator to below. Does the sample look right?

     

    ,ROWTERMINATOR = '0x0a'

    • This reply was modified 3 years, 8 months ago by  smattiko83.
    Attachments:
    You must be logged in to view attached files.
  • It worked find for me as just \ n (without the space).  The 0x0a thing will also work.  For those that are reading this and don't know, that is the same as CHAR(10), which is the "LineFeed" or "NewLine" character, which \ n also represents.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks again for all your help. Did the sample look ok?

  • smattiko83 wrote:

    Thanks again for all your help. Did the sample look ok?

    A cursory glance says it looks fine.  I didn't do a deep dive on it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    smattiko83 wrote:

    Thanks again for all your help. Did the sample look ok?

    A cursory glance says it looks fine.  I didn't do a deep dive on it.

    I'm shocked!! A CURSORy glance? From you? Surely a set-based glance? 😀

    "Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it. When we enquire into any subject, the first thing we have to do is to know what books have treated of it. This leads us to look at catalogues, and at the backs of books in libraries."
    — Samuel Johnson
    I wonder, would the great Samuel Johnson have replaced that with "GIYF" now?

  • david.edwards 76768 wrote:

    Jeff Moden wrote:

    smattiko83 wrote:

    Thanks again for all your help. Did the sample look ok?

    A cursory glance says it looks fine.  I didn't do a deep dive on it.

    I'm shocked!! A CURSORy glance? From you? Surely a set-based glance? 😀

    😀

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • try adding firstrow option to start data at row 2 - row 1 is headers so it would blow the size of the columns you defined on that table.


    BULK INSERT DE_IDENTIFIED.DBO.SBHN_NPI_DATA
    FROM 'C:\SQL\NPI_DATA\NPI_DATA.csv'
    WITH
    (
    FORMAT = 'CSV',
    FIELDQUOTE = '"',
    FIRSTROW = 2,
    FIELDTERMINATOR = ',', --CSV field delimiter
    ROWTERMINATOR = '\n', --Use to shift the control to next row
    TABLOCK
    )
  • frederico_fonseca wrote:

    try adding firstrow option to start data at row 2 - row 1 is headers so it would blow the size of the columns you defined on that table.


    BULK INSERT DE_IDENTIFIED.DBO.SBHN_NPI_DATA
    FROM 'C:\SQL\NPI_DATA\NPI_DATA.csv'
    WITH
    (
    FORMAT = 'CSV',
    FIELDQUOTE = '"',
    FIRSTROW = 2,
    FIELDTERMINATOR = ',', --CSV field delimiter
    ROWTERMINATOR = '\n', --Use to shift the control to next row
    TABLOCK
    )

    Agreed... I made that suggestion at the top of page 2 of this thread but was also a shedload of other issues including stray characters in the original file and mis-documented column sizes in the documentation.  I got all that worked out and posted the correct target table and code to fix the stray character issue that was causing a most unusual error.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 30 total)

You must be logged in to reply to this topic. Login to reply