Huge number of inserts vs number of reads

  • Hello to all

    I have a system that receives XML files to import 24 hours a day. Each of these files generates a few dozen new records or updates existing ones. The problem is that I get over 6,000 files per day, and at this moment I can't keep up with the rate at which the files arrive. I know this is a bit vague, but is there anything I can do on SQL Server, in terms of configuration, to speed up the process?
    The number of reads is very low because I "discard" all the data after 3 days. This problem is making the website very slow because the tables are locked by transactions in the import process.

    Thanks in advance
    Henrry

  • How, exactly, are the files being imported?   Does every file have the same xml structure?  Are you using SSIS?  What kind, if any, of automation do you use to get each file to start processing via your import process?   We'll need all the details you can supply...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

    It's 3 different XML files. I wrote software that does some calculations and inserts/updates the data. I've never used SSIS before. Is it flexible (and fast) enough that I can write some logic in it?

  • henrrypires - Wednesday, November 1, 2017 2:07 PM

    It's 3 different XML files. I wrote software that does some calculations and inserts/updates the data. I've never used SSIS before. Is it flexible (and fast) enough that I can write some logic in it?

    Using SSIS will just make things different.  It may not solve your problem.

    You say you wrote some "software" to do all of this.  What does the "software" actually do, and what language was it written in?

    Also, 6000 files in 24 hours is only 250 files per hour.  How big are these files in bytes, columns, and rows?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I wrote the software in C#. It's kinda simple: it just deserializes the XML file and checks if the data already exists. If it does, it updates it; otherwise it inserts it. After that it just updates one table with accumulated values. The files are between 2 MB and 6 MB. One of the tables it handles has over 30 columns.

  • Really vague, so it's hard to tell what the issue really is.
    No need to change to SSIS - but you may need to change your process a bit depending on how you are doing it at the moment.

    When you say 3 different XML files -
    how many tables do these transform into?
    And how many records?

    And are you performing all the checks for whether the data is already there one row at a time, and doing the updates/inserts per row, per batch of rows, or per file?

    Is the transaction you are creating per record, per batch, per file, or per set of files?

    How many records in total does each transaction process - both inserts and updates?
    And do you run further C# code between inserts/updates while still inside a single transaction?

    Are you doing the check/insert/update all in C#, or are you loading the data into a staging table and then using T-SQL to do the inserts/updates?

    Things that may speed things up:

    Enable RCSI (Read Committed Snapshot Isolation) on the database.
    - Your readers will not take locks while you are doing any updates/inserts.
    (This will have an impact on your tempdb - depending on your server spec it may not be noticeable at all, but it may grow a bit more than what you have it at now.)
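    For reference, switching it on is a single statement (the database name below is just a placeholder):

        ALTER DATABASE MyImportDb              -- hypothetical database name
        SET READ_COMMITTED_SNAPSHOT ON
        WITH ROLLBACK IMMEDIATE;               -- rolls back open transactions so the switch can complete

    After that, readers see the last committed version of each row instead of blocking behind the import transactions.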

    Load all your files into a staging table and do the validation in T-SQL, if that's not what you are doing already - and do it in set (batch) mode, not on a per-record basis. See the sketch below.
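    Once the rows are staged, the check/insert/update can be done in two set-based statements, roughly like this (table and column names are only placeholders):

        -- Update the rows that already exist in the target
        UPDATE tgt
        SET    tgt.DeliveredItems = stg.DeliveredItems,
               tgt.StatusCode     = stg.StatusCode
        FROM   dbo.Jobs        AS tgt
        JOIN   dbo.JobsStaging AS stg
               ON stg.JobId = tgt.JobId;

        -- Insert the rows that are new
        INSERT dbo.Jobs (JobId, DeliveredItems, StatusCode)
        SELECT stg.JobId, stg.DeliveredItems, stg.StatusCode
        FROM   dbo.JobsStaging AS stg
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.Jobs AS tgt WHERE tgt.JobId = stg.JobId);

    Two statements handle the whole file instead of one round trip per record.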

  • henrrypires - Wednesday, November 1, 2017 2:49 PM

    I wrote the software in C#. It's kinda simple: it just deserializes the XML file and checks if the data already exists. If it does, it updates it; otherwise it inserts it. After that it just updates one table with accumulated values. The files are between 2 MB and 6 MB. One of the tables it handles has over 30 columns.

    My recommendation would be to use the "old ways".  Import each file as a single blob and shred the XML in the database.  That way you save a huge amount of time, because you're cutting out loading each row into an app and then transmitting one row of shredded data to the server at a time.
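    A rough sketch of that approach, assuming a made-up file path, element names, and staging table (we haven't seen the real XML yet):

        DECLARE @x xml;

        -- Load the whole file as a single blob, then cast it to xml
        SELECT @x = CAST(BulkColumn AS xml)
        FROM   OPENROWSET(BULK 'C:\Import\file0001.xml', SINGLE_BLOB) AS src;

        -- Shred the hierarchy into a staging table in one set-based pass
        INSERT dbo.JobsStaging (RouteId, StopId, JobId, ExpectedItems, DeliveredItems)
        SELECT r.n.value('@RouteId', 'int'),
               s.n.value('@StopId',  'int'),
               j.n.value('@JobId',   'int'),
               j.n.value('@ExpectedItems',  'int'),
               j.n.value('@DeliveredItems', 'int')
        FROM   @x.nodes('/Route')            AS r(n)
        CROSS APPLY r.n.nodes('Stops/Stop')  AS s(n)
        CROSS APPLY s.n.nodes('Jobs/Job')    AS j(n);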

    --Jeff Moden



  • henrrypires - Wednesday, November 1, 2017 2:07 PM

    It's 3 different XML files. I wrote software that does some calculations and inserts/updates the data. I've never used SSIS before. Is it flexible (and fast) enough that I can write some logic in it?

    Okay, so you have 3 different variations that any one of the 6,000 files could be.  If you wrote your "software" in C#, you can certainly continue to use it, but the question is whether the software is the cause of the slowness.  Just using that same C# code within SSIS won't make it run any faster, and usually the only way to make things faster is to have something that processes things more in bulk, and perhaps in parallel.  As I said before, and as Jeff Moden has asked, we need a lot more detail about exactly how your "software" operates, and exactly what each variation of the XML structure looks like.  If there's a naming standard in effect that keeps one file from overwriting another because of name collisions, we need that information as well.  There are lots of tools in the toolbox, but we don't quite know what, EXACTLY, we are dealing with, other than that we have 6,000 XML files.  We need a LOT more detail than that.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Do you have an easy way to identify what format each XML file is, like a file naming convention, or does each format go to a separate directory, or the like?

    One option, depending on the XML format, is to grab all the files of the same format, run some kind of preprocessing step in C# or your language of choice, dump each format out into a single CSV or some other format SQL Server can load easily, and just import that using SSIS or bcp or whatever.
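    If the preprocessing step dumps out a CSV per format, the load itself can then be a single statement (the path and staging table below are just placeholders):

        BULK INSERT dbo.JobsStaging
        FROM 'C:\Import\jobs_batch.csv'        -- hypothetical output of the preprocess step
        WITH (FIELDTERMINATOR = ',',
              ROWTERMINATOR   = '\n',
              FIRSTROW        = 2,             -- skip the header row
              TABLOCK);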

  • The 6K+ files arrive slowly throughout the whole day.

    The files contain an hierarchical structure
    Route
    -> Stops
          -> Jobs
               -> Items

    File 1 brings all levels of data (just a few hundred at the beginning of the day).
    File 2 brings corrections to Routes and Stops (just a few hundred at the beginning of the day).
    File 3 brings corrections to all levels of data (throughout the whole day, and I get over 5K of these).

    File 3 may be something as simple as new jobs, or more complicated scenarios like new/corrected Routes/Stops or Jobs being moved from one Stop to another.

    Processing these jobs involves a lot of logic. I have to check if the job is completed based on the expected vs. delivered items of each job (among other calculations that need to be done).  If the job had any problem, I need to create a record in another table so the users can handle the problem (missing or damaged items).
    So I need to cross-check the values that come in the XML file with the data that may already be in the database, because the XML file may contain corrections or new jobs.

    If the job is completed, I have to check if all Jobs in the Stop are completed to mark the Stop as completed. The same goes for the Route: when all Stops are completed, the Route is marked as completed as well.

    So I start a transaction as soon as I deserialize the XML file, and because files are arriving the whole day, the tables are under transactions the whole time.
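    As an aside, that completion roll-up translates naturally into set-based T-SQL once a file has been staged; a sketch with placeholder table and column names:

        -- Mark a Stop completed once none of its Jobs are left incomplete
        UPDATE s
        SET    s.IsCompleted = 1
        FROM   dbo.Stops AS s
        WHERE  s.IsCompleted = 0
          AND  NOT EXISTS (SELECT 1
                           FROM   dbo.Jobs AS j
                           WHERE  j.StopId = s.StopId
                             AND  j.IsCompleted = 0);

        -- Same pattern one level up: a Route is completed when all its Stops are
        UPDATE r
        SET    r.IsCompleted = 1
        FROM   dbo.Routes AS r
        WHERE  r.IsCompleted = 0
          AND  NOT EXISTS (SELECT 1
                           FROM   dbo.Stops AS s
                           WHERE  s.RouteId = r.RouteId
                             AND  s.IsCompleted = 0);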

  • henrrypires - Thursday, November 2, 2017 2:43 PM

    Processing these jobs involves a lot of logic. I have to check if the job is completed based on the expected vs. delivered items of each job (among other calculations that need to be done).  If the job had any problem, I need to create a record in another table so the users can handle the problem (missing or damaged items).
    So I need to cross-check the values that come in the XML file with the data that may already be in the database, because the XML file may contain corrections or new jobs.

    If the job is completed, I have to check if all Jobs in the Stop are completed to mark the Stop as completed. The same goes for the Route: when all Stops are completed, the Route is marked as completed as well.

    So I start a transaction as soon as I deserialize the XML file, and because files are arriving the whole day, the tables are under transactions the whole time.

    Ouch.......

    So first, do you actually have an SLA to load the files as soon as they arrive?  Second, are you handling all that logic in your custom code - as in, for each record, querying the table in the database, pulling the results into your code, then sending back an insert or update?  If so, consider loading the entire data set into a staging table, then handling the upserts, error handling, reporting, etc. in SQL.

  • The problem is that I need to get the data ready for users to handle the job problems within a maximum of 1 hour. At this moment I am barely above the water line; really soon I'll be drowning. The total number of files may double within 8 weeks, when the software is fully implemented across all the departments.

  • Yes, that was the point of my second question: for each record in each XML file, are you querying the database, determining what action to take, and then sending it back in your custom code?  If so, that is certainly killing performance.

  • So you're suggesting I load the data I handle into memory?
    I'd just need to keep it synchronized, but it may work.

  • henrrypires - Thursday, November 2, 2017 3:33 PM

    So you're suggesting I load the data I handle into memory?
    I'd just need to keep it synchronized, but it may work.

    Not in memory, but in a staging table, where you can run a stored procedure that encompasses all the logic and can process an entire file at once, without using a cursor.  If we can get the XML formats, along with the detailed logic and the table structures in the form of CREATE TABLE statements, we have a shot at getting you something that performs considerably better than what you're doing right now.  Row by agonizing row (aka RBAR) is no way to run a business...
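    A skeleton of that kind of procedure might look like the following (every name is a placeholder; the point is the shape - one short transaction per staged file, with all the work done in sets):

        CREATE PROCEDURE dbo.usp_ProcessStagedFile   -- hypothetical procedure name
            @FileId int
        AS
        BEGIN
            SET NOCOUNT ON;

            BEGIN TRANSACTION;   -- one short transaction per staged file, not per row

            -- Upsert Jobs and roll completion up to Stops and Routes here (set based),
            -- then raise the problem records for the users in a single pass:
            INSERT dbo.JobProblems (JobId, ReportedAt)
            SELECT stg.JobId, SYSDATETIME()
            FROM   dbo.JobsStaging AS stg
            WHERE  stg.FileId = @FileId
              AND  stg.DeliveredItems < stg.ExpectedItems
              AND  NOT EXISTS (SELECT 1 FROM dbo.JobProblems AS p WHERE p.JobId = stg.JobId);

            COMMIT TRANSACTION;
        END;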

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)
