I need help improving the performance of my ETL. My data flow task is very simple in nature:
- an OLE DB source (an MS SQL Server table), feeding
- an OLE DB destination (an MS SQL Server table) with the TABLOCK hint
The target table is uncompressed and has no indexes or other constraints (primary keys, unique constraints), so it is a heap. The database uses the SIMPLE recovery model.
According to the Microsoft docs (e.g. https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd425070(v=sql.100)?redirectedfrom=MSDN) I have met all requirements for minimal logging, and I have checked the transaction log to confirm this.
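For reference, this is roughly the check I ran against the log (sys.fn_dblog is undocumented, so treat it as a sketch; a minimally logged load shows mostly page-level records instead of one LOP_INSERT_ROWS record per row):

```sql
-- Rough check for minimal logging. sys.fn_dblog is undocumented,
-- so operation names and columns may vary between versions.
-- A fully logged load produces one LOP_INSERT_ROWS record per row;
-- a minimally logged bulk load writes mostly page-level allocation
-- records (e.g. LOP_FORMAT_PAGE) instead.
SELECT Operation, Context, COUNT(*) AS LogRecordCount
FROM   sys.fn_dblog(NULL, NULL)
GROUP  BY Operation, Context
ORDER  BY LogRecordCount DESC;
```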
After starting the data flow task, I realised that only one CPU core is utilised; the other 7 cores sit idle. The document above actually gives the answer, I guess:
> Parallelism and Partitioning Input
>
> To optimize throughput during bulk operations, parallelism is essential. This is best achieved by running several bulk load commands at the same time. Because each bulk operation is single threaded, multiple copies must be run to utilize all CPU cores.
I found posts on other forums that address the problem, too:
Following those discussions, I need many BULK INSERTs running in parallel.
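As I understand it, that means something like the following, with each statement run in its own session against pre-split source files (file paths and table names are placeholders):

```sql
-- Session 1 of 8: each session bulk loads its own pre-split file.
-- With TABLOCK, concurrent bulk loads into a heap take compatible
-- BU locks, so the sessions do not block each other.
-- The file path and table name are placeholders.
BULK INSERT dbo.TargetTable
FROM 'D:\load\table01_chunk1.dat'
WITH (TABLOCK);

-- Sessions 2..8 run the same statement with chunk2.dat .. chunk8.dat,
-- giving one bulk-load stream per core.
```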
But HOW do I achieve this automatically? Both links suggest using a Conditional Split transformation in SSIS to split the stream into several streams and load them concurrently.
I have about 20 tables with roughly 400 million rows each per month. Building that split manually in every data flow task doesn't scale. Furthermore, my development machine has 8 cores while my production machine has 48; maintaining different ETL packages for DEV and PROD is bad design.
Is there a clean, efficient way to make use of all cores? I mean, why should I use SSIS at all when a SELECT INTO or an INSERT ... SELECT can do the job 23x faster (I tested it)?
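For comparison, this is essentially the statement I benchmarked (table names are placeholders). On SQL Server 2016 and later, an INSERT ... SELECT with TABLOCK into a heap is eligible for a parallel insert plan, which presumably explains the gap:

```sql
-- T-SQL equivalent of the data flow task; names are placeholders.
-- On SQL Server 2016+ this single statement can get a parallel
-- insert plan (heap target + TABLOCK), using all cores at once.
INSERT INTO dbo.TargetTable WITH (TABLOCK)
SELECT *
FROM   dbo.SourceTable;
```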
How have you solved problems like this?
Thanks a lot 🙂