SSIS Lookup feature waste of time???

I am not sure if this is a question or a rant.

    As you may know, in SSIS you can have a lookup as part of your data flow.

The idea is brilliant. In practice, however, it falls just short of moronic.

    I am trying to import 1000 rows from a text file.

    Two of these columns need to be validated against the main data to obtain the key.

The claimed selling point of the lookup is lightning-fast lookups thanks to caching.

When the package executes it builds up its cache. It does this by selecting everything(!) from the lookup table. Everything being 28 million rows! This, one of two lookups, causes the DTSExec service to consume a massive amount of memory. My dev box has 2GB installed: 900MB for SQL and 600MB for DTS. I can only assume this process has warmed up my page file as well.

     

Now, there is a feature which limits the memory the cache can use. What good is this? I am importing 1000 rows, which is roughly 0.0036% of 28 million. If it blindly loads up data, what are the chances it'll get the right ones?

    I assume if a row is not found, it will look for it in the table. How much better is this than having a select in a loop?

     

    Surely it's better to join my temp table to the dimension?

     

    Am I wrong or is this a hot air rant?

     

     

Cheers,
Crispin
I can't die, there are too many people who still have to meet me!
It's not a bug, SQL just misunderstood me!

  • You should use the results of a query to narrow down the lookup reference set.

    Select the "Use results of an SQL query:" to only bring in the 1000 records in which you're interested.

    K

But this defeats the object. I am loading 1000 records from a text file. At the time of doing the lookup I do not know what those records are. They're still sitting in the "pipeline". Is there a way to join a table onto a pipeline?

Cheers,
Crispin

  • Crispin,

I think yours is a valid point. What I guess Kirk means is that if you can filter the data (either in the cache or the pipeline) then do it. By filtering I mean both horizontally and vertically. One very important lesson is to only bring in the columns that are used - do not bring in the whole table, because that will just consume memory. I have seen massive perf improvements from doing this.
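As a rough sketch (table and column names invented for the example), the difference is between caching every column of the dimension and caching just the join column plus the key you want back:

    -- caches everything: wide rows, lots of memory
    SELECT * FROM dbo.DimCustomer;

    -- caches only what the lookup actually needs
    SELECT CustomerCode, CustomerKey FROM dbo.DimCustomer;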

I've had similar problems where I was importing 20 million+ rows and looking up against 800 million+. I wrote up the experience here: http://blogs.conchango.com/jamiethomson/archive/2005/07/22/1825.aspx which may or may not be of use to you.

One thing that I think IS a major problem with the LOOKUP transform is that the query to populate the cache is hard-coded at design-time. I can think of many scenarios where we know things at runtime that would enable us to restrict the data in the cache. I requested this to be enhanced here: http://lab.msdn.microsoft.com/productfeedback/viewfeedback.aspx?feedbackid=5ee6da3e-4d8b-4ada-a526-1034c11d6763 and you can vote on it if you think it's a good idea (I'd appreciate it if you did).

The best advice I can give is to test out different scenarios and measure the results - see what works best for you. When you have a lot of data the biggest hindrance IS the amount of data, and you have to find the best way of dealing with it. If a MERGE JOIN works best - use that. If you can use an uncached LOOKUP followed by a cached LOOKUP - use that. Test and measure, test and measure, test and measure!!

    Oh, and if you find out anything useful, reply up here. I for one would be keen to hear your experiences.

    -Jamie

     

  • Yippee. Someone who shares my problems.

     

Jamie, 20 million against 830 million. Nice. Must have hurt when you first ran it.

Currently I am working with a small set of the data. The 1000-row import will turn into around a million rows against around 75 million. Not quite your numbers but....

I think the whole lookup thing is flawed as it currently sits. It's one of those things that worked well in dev with a few hundred rows, or works really well on a 64-proc machine with 64GB RAM (mine's on order).

Your suggestion of dynamically altering the query will have an effect, but it's still limited. How do you propose to limit the lookup (fact?) data? You cannot do it by time, because you might have a shopper who last bought 3 years ago, so you'll miss that. (BTW: these are the arguments I am facing as well.)

My first thought was to cache successful queries, i.e. the lookup does a lookup and, when found, caches that record, then moves on to the next one. In some cases this would be a huge advantage, but in others a waste of time as well. Both you and I sit in retail. The chances of a customer coming to a shop 3 times a day are slim, so caching those records in the hope you'll come across them again is a waste of time. Although, this method would get around the "hang in there while I cache 23 million records for you" 10-minute caching.

With the scale of the data, the only way I can see to get good results is to dump it to a table and do a join. A merge or loop join would be streets ahead of putting everything in a bucket and sifting through it.
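A rough sketch of what I mean (table and column names made up):

    -- 1. bulk load the rows from the text file into a staging table
    -- 2. one set-based join to the dimension picks up the surrogate keys
    SELECT s.*, d.CustomerKey
    FROM dbo.StagingImport AS s
    JOIN dbo.DimCustomer AS d
        ON d.CustomerCode = s.CustomerCode;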

     

Where I can see the lookup being brilliant is if you're loading 20 million records against 300k, i.e. purchases against stock (which will be more than 300k). Even then, a loop or merge join might actually beat it. It too is cached data.

    Which brings me to another idea. Maybe I am wrong here...

    I just queried a table containing 20 mil rows. This would have flooded the SQL cache as well. Have I just wiped my cache of other tables / queries in order to do this?

     

Ahh well, the joys of working with bleeding edge tech.

    May the force be with you!

     

     

Cheers,
Crispin

  • I understand better now.

    I think you have some valid points.

Saying the lookup is useless, or whatever wording you used, is inaccurate.

    Saying that it doesn't meet every need in every situation might be more accurate.

    But then, that's why we have an extensibility model.

    Peace.

Point taken, but in my original post I was having a bit of a rant.

But look at it realistically. MS is doing a brilliant job of moving SQL forward. They claim it's to be the best thing for massive warehouses since sliced bread, and that the new ETL tool (SSIS) is the answer to all our woes.

Yes, the SQL engine is good and fast, bla bla bla. I will root for SQL over Oracle for large installations any day. I used to work on a 12TB (now 18TB) warehouse and it was a beaut. They actually decommissioned Oracle due to cost and performance etc.

But there are just some things that you come across where, once you get past the icing, you think "What were they thinking??" This is one of them. It's a lovely thing for small / medium DBs, but for anything in the way of VLDBs (enterprise scale) it falls short.

    True, it's early days for this so maybe it'll change with SP1 or so.

    Jamie, still here?

Have you read about Project REAL? Barnes & Noble used SSIS on a large scale. Read about any issues they had with it? I would imagine the reason I can't find any gripes with it is because it was babysat by MS. ???

     

    Oh well....

     

     

Cheers,
Crispin

  • Don't talk to me about B&N and Project REAL. Conchango did the original proof of concept for that project and then lost it to IntelliNet. Me? Bitter? no way!!!!! 

Seriously though, I haven't heard of many issues, and I know some of the guys that built it (Don Farmer, Grant Dickinson) fairly well now. Then again, they're hardly going to publicise any problems, are they???

    Good luck Crispin. Sounds like you've got a really interesting project going on there!

    -Jamie

     

Well, you do have the partial cache feature, which sounds like a better approach than full caching in your circumstance, since it only caches each row as it is looked up. It's not as efficient for smaller reference sets, but much better for small input sets against large reference sets.

I'm interested in whether you tried this and, if so, why you found it lacking.

    Thanks,

    K

  • hmmm, hit a nerve then...

I've actually got a Conchango SQL 2005 edition of Snakes and Ladders sitting on my desk. Never played it.

     

     

Cheers,
Crispin

  • I've got two of them and my parents have another. I don't know anyone, in or out of Conchango, that has played it.

    To be fair...I don't think that was the point

    -Jamie

     

  • But it's not...

To be honest, I have not delved into what the partial cache does, but the flaw would still exist.

    Yes, I have tried it and it's far slower.

    Two things I can see happening:

    1)

It caches until the nMB of memory has been used. What does this do? Load data top-down until the cached data hits the limit? Thereafter it does a lookup per row. This is an unlikely method.

    2)

It does a lookup for each row and caches its result. This in theory is a better way, but it's like putting a select in a while loop or cursor. SQL is a set-based language(?), and joining n million onto 0.01n, or n thousand onto n million, would be (if you have the right indexes) far quicker than: 1) is it in my cache? 2) No, ask SQL. 3) SQL checks its own cache, etc etc etc.
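In other words (names invented for the example), a partial-cache miss boils down to running something like this once per pipeline row:

    -- one round trip per row on a cache miss; @CustomerCode is supplied per row by the transform
    SELECT CustomerKey
    FROM dbo.DimCustomer
    WHERE CustomerCode = @CustomerCode;

versus resolving the whole batch with a single set-based join to a staging table, as above.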

     

I hope I am way off here and someone with a big gun will come shoot me down and tell me to shut up because I got it all wrong.

    Here's another thing I have been playing with:

While caching 20 million records, I can freely update them (both the key and the column I am returning) while the cache is, well, caching. Still playing with it though, so I may be wrong on this one. If not, you could be sitting with a dirty cache.

    Watch this space

     

Cheers,
Crispin
