CXPACKET wait differences for the same query on different cluster nodes

  • Hey guys,

    We are running a 5-node Windows Server cluster with multiple SQL Server 2008 instances. One instance in particular, our data warehouse instance, takes longer to run the same incremental ETL on 3 of the 5 nodes than on the other 2. I have already ruled out this being simple coincidence related to the data involved in that night's ETL: I can reproduce the slowdown every time I run the ETL on one of those 3 nodes, and I never see it when running on the other 2.

    I have compared all of the wait stats, disk IO stats, CPU and network stats, etc. The only difference I can see between the good and bad executions is a higher CXPACKET wait time for the "bad" ETL (a sketch of the kind of wait snapshot I compared is at the end of this post). Does anyone have ideas for a next step in troubleshooting this? I am at the point where I feel I have eliminated every SQL configuration parameter as a potential cause.

    All nodes are HP DL585 servers with 64 procs and 512 GB RAM, same HBAs, same NICs; we are very careful about keeping the server nodes in sync with regard to hardware and software. Any ideas or feedback would be greatly appreciated.
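    For reference, the wait comparison was along these lines: snapshot sys.dm_os_wait_stats before and after each run and diff the two (a generic sketch, not our exact script):

    ```sql
    -- Snapshot cumulative instance-level waits; capture once before and
    -- once after the ETL run, then diff the two result sets
    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;
    ```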

    Mike

    [font="Courier New"]Michael Connolly
    [/font]

I would check the driver versions on the HBA adapters. Also, if this is an active-active setup where two instances can end up on any one node after a failover, you will inherently have memory and performance degradation.

    Thanks

    John.

Thanks John, I am having our sysadmins check the HBA drivers and settings. It is active-active, but during these tests we are running a single instance per node to eliminate the possibility of another instance's resource usage impacting performance.

    [font="Courier New"]Michael Connolly
    [/font]

Confirmed the same HBA driver version on all nodes (QLogic 9.1.8.25).

    [font="Courier New"]Michael Connolly
    [/font]

Hi, you mentioned the 'bad' runs having a different degree of parallelism. Is it worth controlling the parallelism with MAXDOP for the ETL process (if possible) and running it on both the good and bad nodes to see whether you can achieve consistent results? (A sketch of the hint is below.)
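    Something along these lines, purely as an illustration (the query and table names are made up):

    ```sql
    -- Hypothetical ETL statement with the degree of parallelism pinned,
    -- so the optimiser cannot pick a different DOP on different nodes
    SELECT f.SaleKey, d.CalendarDate
    FROM dbo.FactSales AS f
    JOIN dbo.DimDate AS d ON d.DateKey = f.DateKey
    OPTION (MAXDOP 4);
    ```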

If the ETL process does run in the same time across each of the nodes with the MAXDOP hint, is each node patched the same? It seems odd that the optimiser would choose a different degree of parallelism when the instance is running on different nodes.

I agree. I would also check sp_configure and ensure that MAXDOP is the same for each instance. It's unwise to set it to 0 (unlimited); pick a solid number such as 4 or 16. Some processes work better with MAXDOP 1 hinted in.
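    For example, something like this to check and pin the instance-wide setting (4 is only an illustrative value):

    ```sql
    -- 'max degree of parallelism' is an advanced option, so expose it first
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Check the current value, then fix it to the same number on every instance
    EXEC sp_configure 'max degree of parallelism';
    EXEC sp_configure 'max degree of parallelism', 4;
    RECONFIGURE;
    ```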

    Thanks

    John.

Just to confirm: I realized from your original post that it is the same instance, and that the times differ depending on which node of the cluster the instance is running on. But as you state that the only difference you see is parallelism, this is why I'm suggesting (if possible) using the MAXDOP hint to set a value and running it across the different nodes.

If the ETL process then runs in the same duration, you know parallelism is the cause and can look at why it differs; different patch levels (as I said before) at the O/S and/or SQL Server level.

    It'll be interesting to know what you find and if you get to the bottom of the problem - keep us posted.

    Thanks for the suggestions, guys. The problem is that the ETL is a combination of stored procedures (which I can control and add MAXDOP hints to) and application code (Informatica) which I cannot control at that level. I see where you were going with the MAXDOP idea, and in fact one of the larger stored procedures in this ETL already runs with MAXDOP 4; it has always been run like this. I checked the execution times for just that procedure, and they differ between the good and bad nodes (the kind of comparison I ran is sketched below).

    Granted, while this procedure is running, other ETL mappings may also be running at the same time, so that information by itself is not completely conclusive. But the pattern of good versus bad executions is rock solid, with no randomness: every time the ETL runs on node A it runs OK, and every time it runs on node B it runs long. Since there is no variance to this pattern, I believe this is something hardware, network, or driver related; something outside of the actual SQL instance.
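    For what it's worth, the per-procedure timing came from the plan-cache DMV, roughly like this (a sketch only; the database name is a placeholder):

    ```sql
    -- Average elapsed and CPU time per execution for cached procedures
    -- (counters reset when the plan is evicted or the instance fails over)
    SELECT OBJECT_NAME(ps.object_id, ps.database_id) AS proc_name,
           ps.execution_count,
           ps.total_elapsed_time / ps.execution_count AS avg_elapsed_us,
           ps.total_worker_time  / ps.execution_count AS avg_cpu_us
    FROM sys.dm_exec_procedure_stats AS ps
    WHERE ps.database_id = DB_ID(N'DW')  -- placeholder database name
    ORDER BY ps.total_elapsed_time DESC;
    ```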

    Thanks again for the feedback. I would love to hear any other ideas, or whether anyone has ever run into anything like this before.

    [font="Courier New"]Michael Connolly
    [/font]

My last suggestion: with Extended Events you can capture wait statistics solely for a particular spid/session. It might be worth capturing this information for the ETL process when it is run on node A and on node B; you have already mentioned the CXPACKET wait type, but there might be something else causing the difference in the elapsed time of the process. (A sketch of such a session is below.)
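    A minimal sketch of such a session, using SQL 2008 syntax (the session name, spid, and file paths are just examples):

    ```sql
    -- Capture every wait for one session to a file target
    CREATE EVENT SESSION etl_waits ON SERVER
    ADD EVENT sqlos.wait_info
        (WHERE sqlserver.session_id = 61)        -- the ETL session's spid
    ADD TARGET package0.asynchronous_file_target
        (SET filename     = N'C:\temp\etl_waits.xel',
             metadatafile = N'C:\temp\etl_waits.xem');

    ALTER EVENT SESSION etl_waits ON SERVER STATE = START;
    -- ...run the ETL, then:
    ALTER EVENT SESSION etl_waits ON SERVER STATE = STOP;
    ```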

Thanks Vinod, running an Extended Events trace may get me some more data on the problem; I am going to look into setting one up.
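    If I get it working, reading the captured file back should look something like this (SQL 2008 syntax; the paths are the examples from the session above):

    ```sql
    -- Each row is one captured event as XML; shred it for wait_type/duration
    SELECT CAST(event_data AS XML) AS event_xml
    FROM sys.fn_xe_file_target_read_file
         (N'C:\temp\etl_waits*.xel', N'C:\temp\etl_waits*.xem', NULL, NULL);
    ```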

    [font="Courier New"]Michael Connolly
    [/font]
