Best Strategy to Create a Million Files using SSIS

  • Hi All,

We have a requirement to create a file for each unique record in the database, and we have nearly a million records to process. This package will be scheduled and will re-create a file if any of the fields in its record change.

I can use a Foreach Loop Container with a Data Flow Task inside it to create a file for each record, but this will hit the DB every time I pass a unique ID into the Data Flow Task.

I am looking for the best strategy to opt for.

    Please give me your valuable suggestion.

  • I think a For Each loop is what you'll need. Can't think of any other way to do it.

    Might want to split them up into smaller chunks in multiple directories. A million files in one directory will create problems if you ever need to view the contents of that folder.
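
    For instance, something along these lines in a script (a rough, untested sketch - the 1,000-bucket count, the numeric ID, and the .txt extension are just placeholders):

    ```csharp
    // Rough sketch (untested): derive a subfolder from a numeric record ID
    // so that no single directory ends up holding a million files.
    using System.IO;

    class BucketDemo
    {
        static string PathForRecord(string rootPath, int recordId)
        {
            int bucket = recordId % 1000;                     // ~1,000 files per folder
            string dir = Path.Combine(rootPath, bucket.ToString("D4")); // e.g. "0456"
            Directory.CreateDirectory(dir);                   // no-op if it already exists
            return Path.Combine(dir, recordId + ".txt");      // e.g. ...\0456\123456.txt
        }
    }
    ```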

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • This is quite an unusual requirement - do you mind expanding a little on the reasons behind it?

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Phil Parkin (8/13/2012)


    This is quite an unusual requirement - do you mind expanding a little on the reasons behind it?

I wonder whether the OP misunderstood the requirement. Maybe the request was "we need a file with all the records in table x"? I'll let the OP expound further, but I agree, this sounds bizarre.

  • Hey Phil,

    Thanks for the reply.

Our project involves storing every detail of a product (for example). When users search for a product from the UI, they get this file, which has all the info. We are using an indexing strategy to retrieve the file quickly.

Now, I need to create a file for each product using ETL.

    I think we will split the files into multiple folders based on the product category.

Looking for the best possible method to go with.

    Thanks

  • sradez_DBA (8/13/2012)


    Hey Phil,

    Thanks for the reply.

Our project involves storing every detail of a product (for example). When users search for a product from the UI, they get this file, which has all the info. We are using an indexing strategy to retrieve the file quickly.

Now, I need to create a file for each product using ETL.

    I think we will split the files into multiple folders based on the product category.

Looking for the best possible method to go with.

    Thanks

    Isn't this what databases were built for? 🙂

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • sradez_DBA (8/13/2012)


    Hey Phil,

    Thanks for the reply.

Our project involves storing every detail of a product (for example). When users search for a product from the UI, they get this file, which has all the info. We are using an indexing strategy to retrieve the file quickly.

Now, I need to create a file for each product using ETL.

    I think we will split the files into multiple folders based on the product category.

Looking for the best possible method to go with.

    Thanks

    When the users search for the product from the UI, they're searching your database. So why not generate the "file which has all the info" as the output from this search?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

The DB comes into the picture only for creating the files and maintaining the ETL log.

The rest is all done by the application, which I can't talk about. This is an interesting project. If you search for Microsoft Index Search, you will learn more.

Anyway, my question is about the best ETL method for creating the files, one where I don't have to hit the DB a million times.

They are not searching the database from the UI. The search is on the files.

  • sradez_DBA (8/13/2012)


They are not searching the database from the UI. The search is on the files.

    Granted I don't know all the facts, but I really don't like the sound of what you are building here!

    Your only obvious option is an RBAR solution. You may be able to inject some parallelism by creating multiple foreach loops to process different categories at the same time - I'd start with four and see how it performs.
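
    If part of the work ends up in a script rather than in SSIS containers, the same idea could be sketched in C# with Parallel.ForEach - to be clear, this is an illustration only, not your SSIS package, and the category list and per-category writer are placeholders:

    ```csharp
    // Sketch only (untested): four parallel streams, one per category,
    // expressed in C# instead of multiple SSIS Foreach Loops.
    using System;
    using System.Threading.Tasks;

    class ParallelByCategory
    {
        static void Main()
        {
            // In practice, pull the distinct categories from the database.
            string[] categories = { "Electronics", "Books", "Toys", "Garden" };

            Parallel.ForEach(
                categories,
                new ParallelOptions { MaxDegreeOfParallelism = 4 },
                category => WriteFilesForCategory(category));
        }

        // Placeholder for the real work: write one file per product.
        static void WriteFilesForCategory(string category)
        {
            Console.WriteLine("Processing category: " + category);
        }
    }
    ```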

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Sounds like a serious step backwards in data management technology.

    Don't call the database for each row of data. Use a script object to write each row to a file.

    You can do this inside a CLR proc in the database (you'd have to set it to unsafe, if I'm not mistaken), or you can do this in a script object in SSIS. VB.NET can write text files directly: http://msdn.microsoft.com/en-us/library/1t4kyezf(v=VS.80).aspx. Pretty sure C# can too.

    Pull the dataset, use that for your For Each loop, either in SSIS or in a CLR proc that steps through the data. Write the data from the dataset directly, not using a Data Flow task. One call to the database (a really big call, which may take a while to return a dataset that's a million or so rows long), lots of loops through a datawriter script or DLL.
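
    To make that concrete, here's a rough, untested C# sketch of the single-call version (connection string, table and column names are placeholders for your schema; a SqlDataReader streams the rows, so the million-row result never has to sit in memory all at once):

    ```csharp
    // One call to the database, then a loop that writes each row to its own file.
    using System.Data.SqlClient;
    using System.IO;

    class OneCallFileWriter
    {
        static void Main()
        {
            const string connStr = "Server=.;Database=Products;Integrated Security=true;";
            const string rootPath = @"C:\ProductFiles";

            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                "SELECT ProductID, Category, Details FROM dbo.Product", conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())   // the one big call
                {
                    while (reader.Read())
                    {
                        int id = reader.GetInt32(0);
                        string category = reader.GetString(1);
                        string details = reader.GetString(2);

                        // One folder per category, as suggested earlier in the thread.
                        string dir = Path.Combine(rootPath, category);
                        Directory.CreateDirectory(dir);

                        File.WriteAllText(Path.Combine(dir, id + ".txt"), details);
                    }
                }
            }
        }
    }
    ```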

    Of course, breaking it up into smaller database calls, one per row, will probably be faster, easier to write, easier to manage, easier to debug, and won't put anywhere near the strain on your server that a million-row dataset will, but if you don't want to do it that way, do it this way and you can make just one call.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • sradez_DBA (8/13/2012)


    They are not searching the Database from the UI.The search is on the files.

    It will be very interesting to see how this approach compares with the more conventional one, of searching a database and generating output from the database content. Superficially it's quite similar - interrogate an index for the address of a bit of data.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • GSquared (8/13/2012)


    Sounds like a serious step backwards in data management technology.

    Don't call the database for each row of data. Use a script object to write each row to a file.

    You can do this inside a CLR proc in the database (you'd have to set it to unsafe, if I'm not mistaken), or you can do this in a script object in SSIS. VB.NET can write text files directly: http://msdn.microsoft.com/en-us/library/1t4kyezf(v=VS.80).aspx. Pretty sure C# can too.

    Pull the dataset, use that for your For Each loop, either in SSIS or in a CLR proc that steps through the data. Write the data from the dataset directly, not using a Data Flow task. One call to the database (a really big call, which may take a while to return a dataset that's a million or so rows long), lots of loops through a datawriter script or DLL.

    Of course, breaking it up into smaller database calls, one per row, will probably be faster, easier to write, easier to manage, easier to debug, and won't put anywhere near the strain on your server that a million-row dataset will, but if you don't want to do it that way, do it this way and you can make just one call.

    +1 this is good advice.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

Thank you all for sharing your thoughts.

As I said, the project is interesting. What makes it interesting is that there are 9 different companies who want to share their products; they each have their own file system and DB, but they share a common UI.

As Phil says, I will have to try an SSIS-plus-parallelism approach and do some testing on how it works.
