CURSOR speed - SQL 2000 vs. SQL 2005

  • Currently I am converting 10 years' worth of customer and order data from an old FileMaker database into MS SQL 2000, and the best way we have found to do this is with T-SQL cursors. I have it all working fine now, but had a few bumps along the way.

    With SQL 2000 it seems that there is major memory paging going on when a large table is being read from within a cursor. I have done quite a bit of research on this in the last couple of weeks, but didn't find out much, as everyone seems to hate cursors LOL

    Anyways, I imported a 1,000,000+ line CSV file into a temp table in SQL 2000. From this I built a cursor to read it one row at a time, do one select query to find a product id number from the new product table, insert the data into the new table, then mark the row as imported. Pretty basic, but this cursor was taking days to run!

    I did some testing and used a smaller 100,000 row table to run the cursor from, and there was a huge performance gain. Where I was processing around 200 rows per minute on my older and overworked SQL server, I was suddenly getting 4000+ rows a minute!

    So I did some more testing. I tried making smaller cursors with a WHILE loop that only grabs 10,000 rows at a time from my 1,000,000-row table, but that only got me up to 400 rows a minute. Then I expanded the WHILE loop to read 10k rows, create a new table and populate it with them, open the cursor against that temp table, and drop it when done. Rinse and repeat. This got me up to 5000+ rows a minute, which I am quite happy with given my server.

    Now I ask you all, does SQL 2005 have these same types of issues with cursors on large tables?

    I just didn't find any info like this on any site in the last 3 weeks of searching, so I thought it would be good to throw this out as it might help others speed up their cursors.

    Perhaps this way of breaking down a cursor will help someone...

  • I think that you'll find the same problems in SQL Server 2005 as well. SQL Server is designed to work on sets of data. Row-by-row processing is terribly inefficient, slow, and resource intensive. For the example you gave, why do you feel you need to insert the rows one at a time? If you've already imported from CSV into a table, why can't you do your INSERT as a set-based transaction?
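
    For illustration only (the table and column names here are made up, not your schema), a set-based insert moves the whole batch in one statement rather than touching one row at a time:

    -- hypothetical example: move every qualifying staging row in a single statement
    INSERT INTO dbo.TargetTable (ColA, ColB, ColC)
    SELECT s.ColA, s.ColB, s.ColC
    FROM dbo.StagingTable s
    WHERE s.ColA IS NOT NULL   -- whatever filter defines the rows to move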

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • For all of the cursors I have written for this data conversion, I need to do one or more separate selects or inserts beyond the row I'm working with. Sometimes I need to do some math with the data before it is inserted; sometimes I need to break an address or phone number out to be inserted into a different table.

    The above example is the simplest cursor I have done so far; it is for order line items and does not write to more than one table. So yeah, this one might be able to be done another way.

    I just have not found another way to do this other than a cursor.

    I'm not quite sure what you mean by "INSERT as a set based transaction"

    I'm still kinda new to MS SQL 2000. I have 6 years' experience with MySQL and PHP on smaller e-commerce databases, but this large data conversion stuff is new to me... Still got a lot to learn.

  • Set-based means that you are working on 'sets' of data instead of rows. Anytime you can apply logic or business rules across a set of common data, you'll way outperform cursoring through each row and working one at a time. Why don't you give us some more detail on your setup and schema and then post your cursors? If you provide enough background and setup info, you'll have several experts jump in and help you. What you are trying to do sounds like a pretty basic ETL (Extract-Transform-Load) operation, and I've yet to find one that could not be solved without the use of a cursor.

    So for your case, you are using a cursor because you have to do different actions depending on how the data looks, right? This tells me that you have some sort of business rules that tell you what type of activity you need to perform on that data. Instead of cursoring through each row and deciding what to do with the data based on those rules, why not run one operation for each grouping or set of rows? I may not be coming across real clear, but take this example. Say you have a table that looks like this:

    MyTable

    RowNum   RowValue
    1        A
    2        B
    3        C
    4        B
    5        B
    6        A

    Let's say that the RowValues define which set of business rules you are going to use to load this data. The cursor approach says to process all rows one at a time, starting with row 1. As soon as the cursor pulls RowNum 1, it says 'Ah ha, this is a type A. I'll run the business logic for this type of row.' It will process RowNum 1 and then move on to RowNum 2 to determine what to do there. This process continues until there are no more rows. Pretty simple, right? Well, with set-based logic, you would apply your business logic to each set of data. In this case, your code would process RowNum 1 and 6 at the same time, rows 2, 4, and 5 at the same time, and then row 3. Something like the sketch below.
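
    As a rough sketch using the MyTable example above (dbo.ProcessedData and the '-ruleX' transformations are just stand-ins for whatever each business rule really does), the set-based version is one statement per type instead of one pass per row:

    -- type A rows (RowNum 1 and 6) handled in one statement
    INSERT INTO dbo.ProcessedData (RowNum, RowValue)
    SELECT RowNum, RowValue + '-ruleA'   -- placeholder for the real type A logic
    FROM MyTable
    WHERE RowValue = 'A'

    -- type B rows (RowNum 2, 4, and 5) in one statement
    INSERT INTO dbo.ProcessedData (RowNum, RowValue)
    SELECT RowNum, RowValue + '-ruleB'
    FROM MyTable
    WHERE RowValue = 'B'

    -- type C (RowNum 3)
    INSERT INTO dbo.ProcessedData (RowNum, RowValue)
    SELECT RowNum, RowValue + '-ruleC'
    FROM MyTable
    WHERE RowValue = 'C'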

    Post your cursors and you'll get some help. If you really are open-minded and want to learn, having a cursor rewritten as set-based for you will get you going down the right path.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • OK, here is the cursor I am using for the above example. This is the simplest cursor I have used so far, so it is probably the easiest to work with.

    The 'posted_order_items_from_filemaker2' table is 1,027,000 rows long and has all of the data from the old database. The 'posted_order_items_from_filemaker' table is being used as a temp table for each cursor run-through. I am marking each row as 'done' in the 'posted_order_items_from_filemaker2' table in case the cursor needs to be stopped (or stops on its own), as I had SQL Query Analyzer freeze up when I was importing the main order table and breaking it down with a much more complex cursor. The 'imported' field is '0' when not imported, then changed to '1' after the new row is written to the new 'OrderItems' table.

    By breaking this cursor down into smaller ones, it took 2 hours 51 minutes to finish. When it was one large cursor, it took something like 3 days! I consider this a big improvement, but if there is a better way then I'm all ears! Or is that eyes? hmmm

    DECLARE @comments varchar(255)
    DECLARE @customer_key varchar(255)
    DECLARE @discount varchar(255)
    DECLARE @item_code varchar(255)
    DECLARE @price_per_item money
    DECLARE @order_key varchar(255)
    DECLARE @quantity_received varchar(255)
    DECLARE @quantity_to_ship varchar(255)
    DECLARE @cost money
    DECLARE @price_per_item_after_discount money
    DECLARE @line_id int
    DECLARE @customer_id int
    DECLARE @product_id int
    DECLARE @quantity int
    DECLARE @start_num int
    DECLARE @end_num int
    DECLARE @ImportCursor CURSOR

    SET @start_num = 1
    SET @end_num = 10000

    WHILE @start_num < 1030000
    BEGIN
        -- copy the next 10,000-row slice into a working table
        SELECT *
        INTO posted_order_items_from_filemaker
        FROM posted_order_items_from_filemaker2
        WHERE imported = '0'
          AND id >= @start_num
          AND id <= @end_num

        SET @ImportCursor = CURSOR FAST_FORWARD
        FOR
            SELECT comments, customer_key, discount, item_code, price_per_item, order_key,
                   quantity_received, quantity_to_ship, cost, price_per_item_after_discount, id
            FROM posted_order_items_from_filemaker

        OPEN @ImportCursor

        FETCH NEXT FROM @ImportCursor
        INTO @comments, @customer_key, @discount, @item_code, @price_per_item, @order_key,
             @quantity_received, @quantity_to_ship, @cost, @price_per_item_after_discount, @line_id

        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- look up the new product id for this line's item code
            SET @product_id = ''
            SELECT @product_id = id
            FROM Products
            WHERE ProductCode = @item_code

            SET @quantity = @quantity_received + @quantity_to_ship

            SET NOCOUNT ON

            INSERT INTO OrderItems (
                OrderID, ItemID, Qty, QtyDelivered, QtyToSend,
                item_price, item_cost, comments, filemaker_discount,
                filemaker_price_per_item_after_discount, import_item, filemaker_id
            ) VALUES (
                @order_key, @product_id, @quantity, @quantity_received, @quantity_to_ship,
                @price_per_item, @cost, @comments, @discount,
                @price_per_item_after_discount, '1', @line_id
            )

            -- mark the source row as imported
            UPDATE posted_order_items_from_filemaker2
            SET imported = '1'
            WHERE id = @line_id

            FETCH NEXT FROM @ImportCursor
            INTO @comments, @customer_key, @discount, @item_code, @price_per_item, @order_key,
                 @quantity_received, @quantity_to_ship, @cost, @price_per_item_after_discount, @line_id
        END

        CLOSE @ImportCursor
        DEALLOCATE @ImportCursor

        SET @start_num = @start_num + 10000
        SET @end_num = @end_num + 10000

        DROP TABLE posted_order_items_from_filemaker
    END

  • This will do everything:

    INSERT INTO OrderItems (
        OrderID, ItemID, Qty, QtyDelivered, QtyToSend,
        item_price, item_cost, comments, filemaker_discount,
        filemaker_price_per_item_after_discount, import_item, filemaker_id
    )
    SELECT
        F.order_key,
        P.id,
        F.quantity_received + F.quantity_to_ship,
        F.quantity_received,
        F.quantity_to_ship,
        F.price_per_item,
        F.cost,
        F.comments,
        F.discount,
        F.price_per_item_after_discount,
        '1',
        F.id
    FROM dbo.posted_order_items_from_filemaker2 F
    INNER JOIN dbo.Products P ON P.ProductCode = F.item_code

    _____________
    Code for TallyGenerator

  • Thanks! That definitely sheds some light on how to avoid a cursor.

    I tried to do an INNER JOIN in my cursor, but unfortunately some of the input data rows have blank (null) item_codes or ones that are in error (not in the Products table), so those ended up getting skipped completely. That is why I broke that part out into a separate select statement, so as to allow it to be blank and still get inserted into the new table. It's sloppy, but I still need to get all of the incomplete rows imported to convert all the old orders into the new system.

    I have a lot of 'bad' data that has to be allowed for 🙁

    Also, on my more complex cursors I need to check whether an address is already in the address table; if it is, I need to pull the address_id, otherwise I need to create a new address and then use its address_id. How would I do something like this? I have several other more complex cursors which all do selects and then use IF statements to determine the next step, i.e. INSERT or SELECT or maybe even an UPDATE.

  • I have a lot of 'bad' data that has to be allowed for 🙁

    LEFT JOIN to dbo.Products will do just that.

    But can you explain what that "bad" data means?

    Bad data should not be allowed in a relational database.

    Also, on my more complex cursors I need to check whether an address is already in the address table; if it is, I need to pull the address_id, otherwise I need to create a new address and then use its address_id. How would I do something like this? I have several other more complex cursors which all do selects and then use IF statements to determine the next step, i.e. INSERT or SELECT or maybe even an UPDATE.

    INSERT INTO dbo.Address
    SELECT ....
    WHERE NOT EXISTS (select 1 from dbo.Address
                      where {check unique key existence})

    Then just join dbo.Address into the final insert.
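
    Spelled out a bit more, with made-up names (Address1, City, and Zip stand in for whatever the real unique key on dbo.Address is, and the staging table name is only an example):

    -- insert only the addresses that are not already in dbo.Address
    INSERT INTO dbo.Address (Address1, City, Zip)
    SELECT DISTINCT F.Address1, F.City, F.Zip
    FROM dbo.posted_orders_from_filemaker F
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Address A
                      WHERE A.Address1 = F.Address1
                        AND A.City = F.City
                        AND A.Zip = F.Zip)

    -- then pick up each row's address_id by joining back on the same key
    SELECT F.order_key, A.address_id
    FROM dbo.posted_orders_from_filemaker F
    INNER JOIN dbo.Address A
        ON A.Address1 = F.Address1
       AND A.City = F.City
       AND A.Zip = F.Zip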

    _____________
    Code for TallyGenerator

  • By 'bad data' I mean data that is not in the correct format. For instance, in the above line item import, I ran across 3 rows which had '.5' for the quantity. As this is a float value and not an integer, it made the cursor stop each time. I altered the original data to '1' and restarted the cursor. I am running into a lot of data that is blank/null when it should not be, or has invalid values.

    The data was pulled from a FileMaker Pro 4.0 database, which seems to have extremely simplistic data types. A 'Number' field can be an integer or a float, for example. A 'Text' field can be a varchar of any length. FileMaker does not make you specify data lengths anywhere, it seems, so I'm having to find out through trial and error where I need to tweak the lengths of my varchars.

    If I'm processing this garbage data one row at a time and marking each row as 'done', then when it stops I can quickly find where it stopped and look at the data in that row to see what is happening.

    I'll try implementing the suggestions above to see if I can allow for all the variables I'm dealing with. Thanks for the input! 🙂

  • Steve Stout (12/10/2007)


    By 'bad data' I mean data that is not in the correct format. For instance, in the above line item import, I ran across 3 rows which had '.5' for the quantity. As this is a float value and not an integer, it made the cursor stop each time. I altered the original data to '1' and restarted the cursor. I am running into a lot of data that is blank/null when it should not be, or has invalid values.

    The data was pulled from a FileMaker Pro 4.0 database, which seems to have extremely simplistic data types. A 'Number' field can be an integer or a float, for example. A 'Text' field can be a varchar of any length. FileMaker does not make you specify data lengths anywhere, it seems, so I'm having to find out through trial and error where I need to tweak the lengths of my varchars.

    If I'm processing this garbage data one row at a time and marking each row as 'done', then when it stops I can quickly find where it stopped and look at the data in that row to see what is happening.

    I'll try implementing the suggestions above to see if I can allow for all the variables I'm dealing with. Thanks for the input! 🙂

    You have your data types, you have your business rules regarding the imported data.

    Right?

    So, just put the checks for those rules in the WHERE clause:

    WHERE Quantity NOT LIKE '%[^0-9]%'  -- this will leave only rows with an integer quantity
      AND EXISTS (SELECT 1 FROM Products P WHERE P.ProductCode = F.item_code)
      AND {whatever else you need to check}

    If you want to look at the rejected lines, store them in a "Failed_FileMaker" table:

    INSERT INTO dbo.Failed_FileMaker (...)
    SELECT ...
    WHERE Quantity LIKE '%[^0-9]%' OR Quantity = '' OR Quantity IS NULL
       OR NOT EXISTS (SELECT 1 FROM Products P WHERE P.ProductCode = F.item_code)
       OR .... {whatever else you need to check}
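
    As a more concrete sketch using the table names from earlier in the thread (the Failed_FileMaker columns and the reason text here are assumptions, not a real schema, and quantity_to_ship is just one example of a column to validate):

    -- capture the rows that break the rules so the data owners can review them
    INSERT INTO dbo.Failed_FileMaker (filemaker_id, item_code, quantity_to_ship, fail_reason)
    SELECT F.id, F.item_code, F.quantity_to_ship,
           CASE
               WHEN F.quantity_to_ship IS NULL OR F.quantity_to_ship = '' THEN 'missing quantity'
               WHEN F.quantity_to_ship LIKE '%[^0-9]%' THEN 'non-integer quantity'
               WHEN P.id IS NULL THEN 'item_code not in Products'
           END
    FROM dbo.posted_order_items_from_filemaker2 F
    LEFT JOIN dbo.Products P ON P.ProductCode = F.item_code
    WHERE F.quantity_to_ship IS NULL
       OR F.quantity_to_ship = ''
       OR F.quantity_to_ship LIKE '%[^0-9]%'
       OR P.id IS NULL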

    _____________
    Code for TallyGenerator

  • If you want to learn SSIS, and it is worth the investment in time, you could do all of this in there and implement complex rules for handling this.

    However Sergiy and John have given you some great advice for working with this in T-SQL.

  • Wow, you guys were working late yesterday. Thanks for jumping in, Sergiy. Steve (Stout), the example Sergiy gave you is exactly what I was describing. He was able to replace all of the cursor logic and variable usage with one simple INSERT INTO...SELECT statement. Feel free to post some of your more complex cursors here and we'll see what can be done. Also, Steve (Jones) is correct in that all of this can be done in DTS or SSIS if you have the time to play around with it. I would say that your best bet would be to do this in straight T-SQL just so you can start to develop the right coding habits. The faster you are able to change your thinking from writing procedural T-SQL code to set-based T-SQL code, the better it will be for you in the long run.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • This is an example of what you are trying to do:

    DECLARE @start_num int
    DECLARE @end_num int

    SET @start_num = 1
    SET @end_num = 10000

    SET NOCOUNT ON

    INSERT INTO OrderItems
        ( OrderID
        , ItemID
        , Qty
        , QtyDelivered
        , QtyToSend
        , item_price
        , item_cost
        , comments
        , filemaker_discount
        , filemaker_price_per_item_after_discount
        , import_item
        , filemaker_id
        )
    SELECT F.order_key
        , P.id AS product_id
        , (F.quantity_received + F.quantity_to_ship) AS quantity
        , F.quantity_received
        , F.quantity_to_ship
        , F.price_per_item
        , F.cost
        , F.comments
        , F.discount
        , F.price_per_item_after_discount
        , '1'
        , F.id AS line_id
    FROM posted_order_items_from_filemaker2 F
    LEFT JOIN Products P
        ON P.ProductCode = F.item_code
    WHERE F.imported = '0'
      AND F.id >= @start_num
      AND F.id <= @end_num

    UPDATE posted_order_items_from_filemaker2
    SET imported = '1'
    FROM posted_order_items_from_filemaker2
    WHERE imported = '0'
      AND id >= @start_num
      AND id <= @end_num

  • i have done these types of conversion projects and almost always use a staging table (a permanent table used temporarily) with a cursor and RBAR.

    why? because you can go back and easily query which rows did not load. yes, you can write a query with the opposite WHERE clause but that has to be maintained too and is subject to human error.

    in my situations, i often added several control columns to the staging table, such as the incoming filename that was the source of the data, a batch number (defined by the user) for backwards tracking of where the data came from (only important in multi-file or multi-source situations), and most importantly a column to store why the row failed to load.
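
    for example, a staging table with those control columns might look something like this (all the names here are just illustrative, not from any real system):

    CREATE TABLE dbo.Staging_Orders (
        staging_id      int IDENTITY(1,1) PRIMARY KEY,
        source_filename varchar(255) NULL,              -- which incoming file the row came from
        batch_number    int NULL,                       -- user-defined batch for backwards tracking
        load_status     char(1) NOT NULL DEFAULT 'N',   -- N = not loaded, Y = loaded, E = error
        load_error      varchar(500) NULL,              -- why the row failed to load
        -- plus the raw data columns from the source file, e.g.
        order_key       varchar(255) NULL,
        item_code       varchar(255) NULL,
        quantity        varchar(255) NULL
    )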

    those error rows can be easily queried and presented to the data owners for a decision on how to correct or skip the data.

    besides foreign key issues (items not found, etc.), the most common problem in my experience is having input data that contains combined columns that have to be split apart, such as names (prefix, first, mi, last, suffix) and address blocks. and it gets worse when the original system allowed free-form entry and the new system has separate columns and proper business rules.

    just my two cents,

    joe

  • If you have a staging table, why not use that instead of a cursor to scrub/update/correlate/fill in your data? Even with the extra insert - it's going to be 20 to 30 times faster, consume fewer resources, free up the table faster, etc...

    If you need to know what didn't get updated, then hold onto the staging table and compare it to the master table (using the staging PK). Anything in staging that didn't get ported over needs to be manually reviewed.
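
    Using the tables from this thread as an example, that check is a single anti-join (assuming OrderItems.filemaker_id carries the staging row's id, as in the code posted above):

    -- staging rows with no matching row in the destination table
    SELECT F.*
    FROM dbo.posted_order_items_from_filemaker2 F
    LEFT JOIN dbo.OrderItems O
        ON O.filemaker_id = F.id
    WHERE O.filemaker_id IS NULL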

    Everything else you mention (tracking info, including the file name and date) is easily implemented using something set-based.

    I just did something like that today - using a staging table - the 1M or so records took 13 seconds to finish. You'll be lucky if you can get a cursor solution to finish in 10 minutes (more like an hour).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
