Preventing attempted kerberos connection from sql agent job/SSIS package

  • Hi all,

    We have a large number of SSISDB packages running happily, connecting to our SQL Servers using ADO.Net or Sql Native Client, making their connection using NTLM. (We don't have our SQL Server SPNs correctly configured to support Kerberos).

    The SSISDB packages are hosted on and run on a dedicated SQL server, different to the SQL Servers they are connecting to.

    Very occasionally, the connection attempt is made using Kerberos instead of NTLM, and the connection attempt to sql server fails. (This is going by the Windows Security event log, which reveals a Kerberos login - a successful one at the Windows level - at the precise time that the calling agent job is informed of a connection timeout and fails, approx 23 seconds after the job starts).

    The correct configuration of our SPNs is something we may wish to look into for security best practice, and would of course fix this. However, that may not be my decision to make.

    In any case, I am curious ...... where is the decision made to try Kerberos over NTLM? Would this be in the sql client connectivity layer, sql agent, the OS hosting sql agent? And is there any way to prevent the attempted use of Kerberos, which is destined to fail?

    Many thanks

    J

  • Bit of an update on this one.

    I suggested that Kerberos was incorrectly being used. Well, that's not entirely accurate. I now believe that an NTLM (using TCPIP) connection is being attempted initially and, when that doesn't work, Kerberos is then being attempted. That's why I see a Kerberos successful login (at the OS level) at the time the SSIS package and calling job fail - because this login attempt using Kerberos immediately fails and marks the end of the overall connection attempt.

    So, the issue is really that the NTLM connection attempt fails as if the sql server is not running. "A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections."

    Anyway, I have reproduced the error outside of SSIS. I have a job running every minute to make the identical connection attempt using a Powershell job step running Invoke-SqlCmd. Occasionally this, too, fails for the same reason.

    I must point out a key detail i omitted in my original post - the target SQL Server being connected to by the occasionally failing SSIS package is an Always On availability group listener. I have today adapted my test connectivity job so that it attempts a connection to the primary replica's sql server name as well as a connection to the listener name - I am hoping to rule out any AG-specific involvement in this issue.

  • One step closer. The evidence is mounting that this is an AG listener issue.

    My test connectivity job occasionally times out trying to make a sql server connection using the listener name. The equivalent job using the actual sql server name, i.e. the primary replica, works perfectly well at all times.

    It's worth noting also that sometimes the real jobs (the ones that occasionally fail and prompted me to start this investigation) fail but my test job doesn't; or sometimes my test job (the AG listener version) fails but the real jobs don't; occasionally the test job and one or more real jobs fail at the same time.

    So this connectivity issue is very transient in nature, but does seem to affect the listener name only.

  • An update for anyone who has followed this or one day stumbles across it.

    Essentially, the problem is still unresolved and I have found it is NOT an AG listener issue. It does seem to affect connection attempts made specifying the listener more than it does those specifying the actual sql server name, but not exclusively so.

    I have managed to reproduce the problem outside of sql server tools and sqlclient connectivity.

    I have been running a basic TCPIP connection test using something based on this http://www.powershelladmin.com/wiki/Check_for_open_TCP_ports_using_PowerShell, and can confirm that connections to the target server - on the port our sql server is listening on - intermittently time out at 23 seconds.

    This seems to affect about 1 in 500 connection attempts.

    Sometimes 2 different nodes in the same cluster (this is just a standard Windows Failover Cluster to support an Always On Availability Group) experience the same connectivity issue at precisely the same time. (The fact that they are simply in the same network/domain may be far more significant than the fact they are in a cluster).

    As for the original jobs that are affected by this issue, I have added a timeout of 60 seconds to the connection strings and almost all failures have now gone away.

    Meanwhile, finally I am reaching out to our network team now to investigate the transient inability to connect to the sql server port on the target servers (without such a timeout).

    As it turns out, the title of this thread should really be something more like "Transient sql server connectivity failures"!

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply