I too have hit this error from time to time in an SSAS job, and I do not accept 'There is no real solution for this error.' as an answer. Why, you ask?
- This job normally runs daily without a problem. In the history I have available (since 2014-01-26), the job has failed only four times: three times with this error and once with a deadlock error. That is a total failure rate of roughly 24% (4 failures / 17 executions), or about 18% (3 / 17) if I exclude the deadlock error.
- This job runs during off hours (daily at 03:55), when the server is barely used, if at all. Judging by the data collected in Confio Ignite and NIMSoft, the server is NOT stressed at this time. CPU tops out below 20% on the failure dates; on successful days it spikes to 80%, but the spike is NOT sustained. The server has 16 CPUs with 2 cores each (32 cores) and runs Windows Server 2008 R2 with SQL Server 2008 R2; it should be able to handle a maximum of 960 worker threads.
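For anyone wondering where the 960 figure comes from, here is a minimal sketch of the documented default for 'max worker threads'. It assumes the setting is left at 0 (auto-configured), a 64-bit instance, and no more than 64 logical CPUs. The function name `default_max_worker_threads` is my own, not a SQL Server API.

```python
def default_max_worker_threads(logical_cpus: int, is_64bit: bool = True) -> int:
    """Estimate SQL Server's default 'max worker threads' value.

    Assumptions: the setting is 0 (auto) and the machine has at most
    64 logical CPUs. Newer versions use a different rule above 64 CPUs.
    """
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")

    # 64-bit: 512 base plus 16 per CPU beyond the fourth.
    # 32-bit: 256 base plus 8 per CPU beyond the fourth.
    base, per_cpu = (512, 16) if is_64bit else (256, 8)

    if logical_cpus <= 4:
        return base
    return base + (logical_cpus - 4) * per_cpu


# 16 CPUs x 2 cores each = 32 logical CPUs on this server
print(default_max_worker_threads(16 * 2))
```

The value the instance is actually using can be compared against this estimate by looking at `max_workers_count` in `sys.dm_os_sys_info`.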
So, I dug a little deeper. I noticed a pattern in the metrics recorded by Ignite, but I cannot explain the cause (screenshots are attached in a Word document):
- When the job succeeds, Ignite records one main hash executing for approximately five minutes ("hash" in this context is an internal Ignite term, not related to query hashes or query plan hashes). The main hash appears to finish, and then between 5 and 15 additional hashes are executed.
- When the job fails due to the thread resource error, Ignite records 10+ hashes (including the main hash) executing at once during the first minute of the job, and then the job spends the next four minutes processing the main hash. At the 'sixth' minute, the main hash appears to have finished, and several other hashes start and promptly quit due to the thread resource issue.
So it appears that SQL Server sometimes uses parallelism up front and sometimes does not. Perhaps the estimated cost of going parallel is occasionally lower, so SQL Server chooses a parallel plan, and that causes the job to fail?
Anyway, I am looking for additional meaningful guidance. Any help would be appreciated. Thank you.