Parsing millions of records without an index

  • Hello All,

    I have a view that has around 100 million records. I need to get a count of the data that was loaded the previous day. I could have done a left outer join on the PK, but that's out of the question, since the same records have multiple updates too and I need to catch those as well.

    I have the query ready. The problem is that it's taking too long in the absence of an index, and I can't create one since it's a production server.

    Can somebody guide me in optimizing the query? I have only one join in the query, and I compare by the timestamps since that's the only way to get a count of the records that have been updated/inserted in the past 24 hours. But my query takes around 77-80 minutes to complete, and that sucks! Any help will be appreciated.

    Thanks

    RV.

  • Please provide the CREATE statement for the table structure, and some sample data showing exactly what you're trying to do.

    Thanks!

  • I have 4 columns, CC, YY, MM, and DD, and they hold records like

    '20', '10', '5', '23'

    To make these work as a timestamp, I use the following query:

    WITH TEMP AS (
        SELECT ID,
               (isnull(CONVERT(char(2),CC), (CASE WHEN YY BETWEEN 80 AND 99 THEN '19' ELSE '20' END)) +
               CASE WHEN YY < 10 THEN '0' + CONVERT(char(1),YY) ELSE CONVERT(char(2),YY) END + '-' +
               CASE WHEN MM < 10 THEN '0' + CONVERT(char(1),MM) ELSE CONVERT(char(2),MM) END + '-' +
               CASE WHEN DD < 10 THEN '0' + CONVERT(char(1),DD) ELSE CONVERT(char(2),DD) END) AS DayNumber)
    SELECT COUNT(*) FROM TEMP
    LEFT OUTER JOIN table where CONVERT(DATETIME, temp.DayNumber,121) >= startdate and DayNumber <= enddate
    GROUP BY DayNumber

    (consider the startdate and enddate values to be 2010-05-23 00:00:00.000 and 2010-05-24 00:00:00.000 respectively; they are stored in a physical table)

    This is the query that takes almost 70-80 minutes to complete on the number of records we have.
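    A syntactically complete sketch of what the fragment above appears to intend (source_view, @startdate, and @enddate are assumed names, not taken from the actual system; RIGHT() with a '0' pad stands in for the CASE expressions):

    ```sql
    -- Hypothetical reconstruction; the real query's table/view names were not posted.
    WITH TEMP AS (
        SELECT ID,
               CASE WHEN YY BETWEEN 80 AND 99 THEN '19' ELSE '20' END
             + RIGHT('0' + CONVERT(varchar(2), YY), 2) + '-'
             + RIGHT('0' + CONVERT(varchar(2), MM), 2) + '-'
             + RIGHT('0' + CONVERT(varchar(2), DD), 2) AS DayNumber
        FROM source_view                      -- assumed source
    )
    SELECT DayNumber, COUNT(*) AS Cnt
    FROM TEMP
    WHERE CONVERT(datetime, DayNumber, 121) >= @startdate
      AND CONVERT(datetime, DayNumber, 121) <  @enddate
    GROUP BY DayNumber;
    ```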

  • the startdate and end enddates have table preceding them in the query.

  • The result of this query is in line with expectations, but the query time is ridiculous. I need it to run in under 20 minutes. That's not a hard requirement, but running queries for an hour on production is something that I don't want to do 🙂

  • rajiv.varma (5/24/2010)


    I have 4 columns CC, YY, MM, DD and they have records like

    '20', '10', '5', '23'

    This is why we need the CREATE TABLE statement (along with the related indexes, triggers, etc.) for all tables involved in the query. I need to know a couple of things about the "real, permanent" table, and the CREATE TABLE statement with all the goodies would answer all of my questions.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The source for the above query is a view that has approximately 96 columns and doesn't have any other object associated with it. It's just your "plain jane" view, which is totally useless to me since it doesn't have any sort of index/trigger/constraint. Dumb, I know. :-D

  • So let's ask the question, then... does the underlying table in the view have the CC, YY, MM, DD columns, or does it have a viable DATE or DATETIME column instead? What I'm getting at is this: if the view has viable dates, then you should get your temp table into such a condition that it also has viable dates instead of splitting them up into 4 columns. THEN you could quickly apply an index on your temp table to speed things up a bit.

    You could also make a secondary temp table that holds ONLY what you need from the view using SELECT/INTO and that would really speed things up.

    If you have the Enterprise Edition, you should have folks in your shop consider partitioning the underlying table by date which would not only make code like this run faster, but also solve a wealth of index and table maintenance woes (and maybe even some "Restore" woes).

    Also, (just to be sure), a view will use the indexes on the underlying table.
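    Jeff's SELECT/INTO suggestion might be sketched like this (all names here are illustrative assumptions, not from the actual system):

    ```sql
    -- Pull only the columns the count actually needs into a narrow temp table...
    SELECT ID, CC, YY, MM, DD
    INTO   #Slim
    FROM   dbo.source_view;   -- assumed view name

    -- ...then index the temp table, which is allowed even when the
    -- production base tables can't be touched.
    CREATE CLUSTERED INDEX IX_Slim_Date ON #Slim (CC, YY, MM, DD);
    ```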

    --Jeff Moden



  • By the way, I have to ask... the code you posted in your original post has so many errors in it that it would never have come close to working. For example... the missing column names in the CONVERTs really make the code example difficult to follow.

    I realize you probably stripped the code down for posting purposes, but could you come a little closer to the real thing so we can see what you're actually trying to do? Converting to a 4-column date in the CTE and then converting it back to join with the view just doesn't make any sense in the current context.

    --Jeff Moden



  • rajiv.varma (5/24/2010)


    I have the query ready with me. But the problem is its taking too long in the absence of an index. I cant create one since its production server.

    In general terms, this statement would settle the issue, wouldn't it? 😀

    Let's try to find a way out of this conundrum 😉

    Considering the poster is concerned with performance, I assume this is not a one-shot process but something that is expected to be implemented in production.

    Also, considering I'm reading things like "100M rows" and "previous day's load", I assume this is related to a DSS environment.

    Question... Is this an after-the-fact attempt to capture ETL statistics in a DWH environment?

    Wouldn't it be easier to generate the statistics at ETL time?

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • @Jeff

    I agree I stripped the code down for posting purposes, but the actual code definitely works. The whole reason behind combining data from 4 columns is that if there is an insert to the underlying table, it will definitely show up in my query result. But if it's an update to an existing record (we have Type 2 updates), then I have to go by a timestamp column stored in the table itself to pick up the latest one. I inherited this monstrosity from a previous consultant (BTW, he did a great job with the code, given we have a really big *** DW). Oh, and the data source is some old machine being replicated to SQL Server.

    @PaulB

    That's a really nice suggestion, to perform this at ETL time. But we have 3 loads every day, and that's the kicker (I joined just 3 weeks ago, so I'm still getting familiar with their processes). After we finish the 3rd load in the morning, I run these scripts to get a measure of what records have been inserted/updated in the past day. I think it would be too cumbersome to capture the counts at 3 different times and then combine them the next morning to produce one final result. Definitely doable, but won't it look ugly?
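    Capturing the counts at ETL time, as PaulB suggests, could be as simple as one audit row per load; the table and column names below are invented for illustration only:

    ```sql
    -- Hypothetical audit table, appended to at the end of each of the 3 daily loads.
    INSERT INTO dbo.LoadAudit (LoadDate, LoadNumber, RowsInserted, RowsUpdated)
    VALUES (CAST(GETDATE() AS date), 3, @ins, @upd);  -- @ins/@upd captured by the ETL step

    -- Next morning, one query rolls the three loads into a single result.
    SELECT LoadDate,
           SUM(RowsInserted) AS Inserted,
           SUM(RowsUpdated)  AS Updated
    FROM   dbo.LoadAudit
    WHERE  LoadDate = CAST(GETDATE() AS date)
    GROUP BY LoadDate;
    ```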

  • rajiv.varma (5/25/2010)


    @Jeff

    I agree I stripped the code down for posting purposes, but the actual code definitely works. The whole reason behind combining data from 4 columns is that if there is an insert to the underlying table, it will definitely show up in my query result. But if it's an update to an existing record (we have Type 2 updates), then I have to go by a timestamp column stored in the table itself to pick up the latest one. I inherited this monstrosity from a previous consultant (BTW, he did a great job with the code, given we have a really big *** DW). Oh, and the data source is some old machine being replicated to SQL Server.

    Heh... understood, but that still doesn't explain why the temp table has a 4-column date in it to start with. Where is the data in the temp table coming from?

    Also, none of that explains what the data types of the 4 columns in the temp table actually are. I'm asking all of these questions because I think we can do better for you. At the very least, I need to know the data types of the related columns in both the temp table and the underlying table of the view.

    --Jeff Moden



  • Allow me to give you an example of why it's important for us to have all the information we ask for. We're not asking you this stuff to rake you over the coals; we ask because we need the info to come up with the best solution. For example... instead of doing all the nasty, slow CONVERTs you're doing, you could gain a fair bit of performance by doing something like the following instead...

    DECLARE @CC INT, @YY INT, @MM INT, @DD INT;

    SELECT @CC = 20, @YY = 10, @MM = 12, @DD = 31;

    SELECT DATEADD(dd, @DD-1, DATEADD(mm, @MM-1, DATEADD(yy, (@CC-19)*100 + @YY, 0)));

    SELECT @CC = 19, @YY = 8, @MM = 5, @DD = 9;

    SELECT DATEADD(dd, @DD-1, DATEADD(mm, @MM-1, DATEADD(yy, (@CC-19)*100 + @YY, 0)));

    Notice that it's all high speed math instead of character based conversions.
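    Applied inline against the view, the same math-based conversion might look like the following (source_view and its column names are assumptions based on this thread, not confirmed):

    ```sql
    -- Hypothetical: source_view and its CC/YY/MM/DD columns are assumed.
    SELECT COUNT(*)
    FROM dbo.source_view
    WHERE DATEADD(dd, DD-1, DATEADD(mm, MM-1, DATEADD(yy, (CC-19)*100 + YY, 0))) >= '20100523'
      AND DATEADD(dd, DD-1, DATEADD(mm, MM-1, DATEADD(yy, (CC-19)*100 + YY, 0))) <  '20100524';
    ```

    Note this still can't use an index on the view, but it replaces the per-row character conversions with integer math.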

    Now, if you can provide me with the other information I asked for, someone might actually be able to help you with your performance problem. The information we need is...

    1. the datatype for all columns involved in joins, criteria, and formulas in the Temp table.

    2. the datatype for all columns involved in joins, criteria, and formulas in the underlying permanent table(s) for the view.

    3. a list of any and all indexes involved in the columns from 1 and 2 above.

    4. a list of all clustered indexes for all the tables involved in 1 and 2 above whether or not they're involved with the columns in 1 or 2 above.

    5. the code for any and all triggers on any of the tables involved.

    6. confirmation that the view is NOT an indexed view.

    --Jeff Moden



  • @Jeff

    It's not the temp table that has 4 columns. The source (which is a view being populated in real time, with no constraints/joins whatsoever) has the 4 columns CC, YY, MM, DD. I don't know why they chose to store it like this, but this is what I have to work with 🙁

    For the sake of example:

    SELECT CC, YY, MM, DD FROM source_view

    will return values such as

    20, 10,  5, 24  (after my conversion: 2010-05-24)
    20, 10,  5, 23  (after my conversion: 2010-05-23)
    19, 98,  2,  6  (after my conversion: 1998-02-06)
    20,  2,  6,  9  (after my conversion: 2002-06-09)

    The data type for these columns is numeric(2,0). I don't know if there are any joins or not, but the source (as I already mentioned) is a view on a replicated SQL Server. The legacy system is IBM iSeries (which I don't have access to).

    There are no indexes on the view, and that's seriously stupid considering there are 100 million records.

    There are no triggers, joins, or any other objects associated with this view either. Not even a freakin' nonclustered index. Plus, there are no partitions available.

    In simple terms, imagine building a single view over your entire AdventureWorks database without any index/constraint/join/trigger or any other DB object associated with it. No stored procedures or UDFs. NOTHING.

  • rajiv.varma (5/24/2010)


    the startdate and end enddates have table preceding them in the query.

    I'm not sure what "have table preceding them in the query" means.

    --Jeff Moden


