Decrease query execution time to retrieve records from huge table

  • Hi there,

    I have a table that contains financial information in text format for each company, along with each company's ID. This table gets updated every two weeks; it has around 25 million records and changes with each update. I take this table and build a Lucene index from it using a C# program. Instead of pulling all the records at once, I thought I would do this:

    1. declare @firstpart int;
       select @firstpart = count(*)/4 from dbo.CompanyFinance;

    2. with Records as
       (
           select row_number() over (order by id) as [row], *
           from dbo.CompanyFinance
       )
       select id, FinancialInfo
       from Records
       where [row] <= @firstpart;

    To get the next 1/4th of the records, the way I thought of doing it was:

    3. select count(*)/2 into another variable (say, @secondpart).

    4. Get the id for the 12.5 millionth record (just like step 2), and

    5. then select id, FinancialInfo from dbo.CompanyFinance where id between @firstpart + 1 and the id from step 4 (rough sketch below).
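
    Roughly what I have in mind for steps 3-5, as an untested sketch (the variable names are just made up for illustration):

    declare @firstpart int, @secondpart int, @secondpart_id int;

    -- steps 1 and 3: the 1/4 and 1/2 row counts
    select @firstpart = count(*)/4, @secondpart = count(*)/2
    from dbo.CompanyFinance;

    -- step 4: id of the 12.5 millionth row (same row_number() trick as step 2)
    select @secondpart_id = id
    from (
        select id, row_number() over (order by id) as rn
        from dbo.CompanyFinance
    ) x
    where rn = @secondpart;

    -- step 5: the second quarter of the table
    select id, FinancialInfo
    from dbo.CompanyFinance
    where id between @firstpart + 1 and @secondpart_id;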

    My problem is the query execution time: steps 1, 2, 3, and 4 all take a very long time to run. Is there a better query you can suggest?

    Create table dbo.CompanyFinance
    (
        id int not null,
        FinancialInfo nvarchar(max),
        CONSTRAINT [pk_id] PRIMARY KEY CLUSTERED ([Id] ASC)
    );

    Insert into dbo.CompanyFinance values (1, 'This is a test');
    Insert into dbo.CompanyFinance values (2, 'I have very large financial info in this field');

    Thanks

    Rash

  • You're attempting to query 6.25 million records out of 25 million. That's going to scan the table, and indexes won't help. You can add faster or more disks to speed this up, or you can attempt to move smaller amounts of data so that an index will be of assistance.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • When you say "you can attempt to move smaller amounts of data so that an index will be of assistance. " are you saying I should get select top 1000 rows at a time instead of several million.

  • The issue is that there's no way, within code or structure, to speed up moving 1/4 of a 25 million row table. You are completely dependent on hardware, because a request that size will lead to a scan of the table. To get the access-speed benefits that indexes provide, you have to deal with substantially smaller amounts of data. But there's a balancing act: yes, 1,000 rows will most likely result in a seek, but multiply 1,000-row seeks often enough to move 6+ million rows and suddenly that may no longer be desirable. So you have to balance the benefit of being able to use the index on the table against the amount of data you need to move.

    I'd suggest questioning the need to move the whole table. Is there a way to incrementally update this thing you're doing rather than doing a complete replace each time? Incremental data moves can then take advantage of indexes to assist your queries.
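
    If you do have to keep pulling large amounts, one option is to walk the clustered index in smaller batches. Purely as a sketch (the 50,000 batch size and the names are arbitrary, and in practice the paging loop would more likely live in the C# indexing code than in T-SQL):

    declare @last_id int = 0;            -- assumes ids start above 0, as in the sample data
    declare @batch_size int = 50000;

    declare @batch table (id int primary key, FinancialInfo nvarchar(max));

    while 1 = 1
    begin
        delete from @batch;

        -- seek on the clustered PK, then an ordered range scan of one batch
        insert into @batch (id, FinancialInfo)
        select top (@batch_size) id, FinancialInfo
        from dbo.CompanyFinance
        where id > @last_id
        order by id;

        if @@ROWCOUNT = 0 break;

        select @last_id = max(id) from @batch;

        -- hand the rows in @batch to the indexing program here
    end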

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • What is(are) the clustering column(s) on the table? You need to read each 1/4 of the table at a time using the appropriate clustering key range.

    What nonclustered index(es) are there? (In case they are needed to determine the clustering key ranges.)

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (8/15/2014)


    What is(are) the clustering column(s) on the table? You need to read each 1/4 of the table at a time using the appropriate clustering key range.

    What nonclustered index(es) are there? (In case they are needed to determine the clustering key ranges.)

    Based on the structure posted above, it's an ID/Value table and the cluster is on the ID. With that kind of structure and the data volumes we're talking about, I'm still back to questioning the need for moving the entire table around, let alone tuning that.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (8/16/2014)


    ScottPletcher (8/15/2014)


    What is(are) the clustering column(s) on the table? You need to read each 1/4 of the table at a time using the appropriate clustering key range.

    What nonclustered index(es) are there? (In case they are needed to determine the clustering key ranges.)

    Based on the structure posted above, it's an ID/Value table and the cluster is on the ID. With that kind of structure and the data volumes we're talking about, I'm still back to questioning the need for moving the entire table around, let alone tuning that.

    Sorry, didn't see that.

    Assuming you can accept "close enough" to 1/4, rather than needing an exact 1/4, you could get the min and max, take the diff, divide by 4, and read each 1/4 range of ids. Be sure to explicitly specify id BETWEEN @id_start_of_fourth AND @id_end_of_fourth.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.
