A Kerberos puzzle double hop thorugh srv A to srv B fails. A to C succeeds. C to B succeeds. B to A succeeds.

  • For the last few years I've been pretty confident I understood what I needed to understand regarding double hops. But today I find myself stuck.

    The setup:

    • There is a server dedicated to run 2012 project model SSIS packages. Let's call this server "srv_integration".
    • srv_integration hosts many packages which point to many different targets.
    • One of the packages failed validation on a connection a couple of nights ago, the error being a login failure for everyone's favourite account, NT AUTHORITY\ANONYMOUS LOGON. The package was executed by a DBA running SSMS on their workstation. The connection was to a server which I'll call "srv_target"
    • If the package is executed from srv_integration then it runs successfully.

    I can make this simpler: Rather than going through SSIS, I can reproduce the error via linked servers. An attempt to create a SQL linked server from srv_integration to srv_target using the credentials of the current user fails if it is created through SSMS from a workstation, the error being a login failure for good ol' anon. If the linked server is created from srv_integration directly, then the creation is successful but any attempt to use it through a double hop fails.

    Now things start to get weird. Note: All instances involved are clustered, and are part of the same windows cluster at the same site.

    • SPN's of the form MSSQLSvc/computername:port domain\SQLengineServiceAccount and MSSQLSvc/FQDN:port domain\SQLengineServiceAccount exist for both srv_integration and srv_target.
    • The ports on both instances are static and match those registered in the SPN's.
    • Querying sys.dm_exec_connections on both servers shows many connections being made with an auth_scheme of kerberos.
    • Linked servers from srv_integration to any server other than srv_target are OK.
    • Linked servers to srv_target from any server other than srv_integration are OK.
    • A linked server from srv_target to srv_integration is OK!

    A domain account is used for the database engine service on all instances. This domain account is trusted for delegation to any service via kerberos. The cluster virtual host machine accounts are trusted. The cluster node machine accounts are trusted.

    There are no duplicate SPN's in the domain.

    I've even tried klist purge for the database engine account on the integration server (but not on srv_target because that runs real production databases, not just an IS catalog, so I'm hesitant to do it there).

    :crazy: :crazy: :crazy: :crazy: :crazy: :crazy:

  • I was in similar situation few months ago, after lot of testing i think i was able to crack it. I blogged it here, hopefully it works for you:) .

  • Cheers. I have already covered the stuff you've talked about in the blog. I did run the config tool and it detected a few missing SPN's across the cluster, but only those of the form Service/machine:instancename (as opposed to Service/machine:port). I added these for completeness but, as expected, it does not change the outcome.

    I'm left with the impression that this must be something to do with DNS and clusters, although I am a SQL and software guy with a bit of windows and network knowledge, not the other way around, so I don't know what to look for in this regard.

  • I have found Adam Saxton's blogs on Kerberos useful when t-shooting, starting with My Kerberos Checklist…, but not ending with Kerberos Configuration Manager updated for Reporting Services (don't let RS in the name dissuade you from trying it :). A more complete list of his blogs: https://social.msdn.microsoft.com/Search/en-US?query=saxton%20kerberos&pgArea=header&emptyWatermark=true&ac=4#refinementChanges=109&pageNumber=1&showMore=false

  • did you make changes to the account in AD?

  • curious_sqldba (12/2/2015)


    did you make changes to the account in AD?

    Do you mean changes immediately prior to the failure that may have cause it? Nope. It's also the same account as used on other servers, which don't have a problem. Only this one specific double hop, from <client> through srv_integration to srv_target, has the issue.

  • Don Halloran (12/2/2015)


    curious_sqldba (12/2/2015)


    did you make changes to the account in AD?

    Do you mean changes immediately prior to the failure that may have cause it? Nope. It's also the same account as used on other servers, which don't have a problem. Only this one specific double hop, from <client> through srv_integration to srv_target, has the issue.

    For that account you will need to do the following in AD:

    In active directory, right click properties of the account under which sql services are running. Click on Delegation and select second option ‘Trust this user for delegation to any service (Kerberos only).’ You could select the 3rd option also to be more specific and add individual server and services.

  • curious_sqldba (12/3/2015)


    In active directory, right click properties of the account under which sql services are running. Click on Delegation and select second option ‘Trust this user for delegation to any service (Kerberos only).’ You could select the 3rd option also to be more specific and add individual server and services.

    Right, per my OP:

    A domain account is used for the database engine service on all instances. This domain account is trusted for delegation to any service via kerberos. The cluster virtual host machine accounts are trusted. The cluster node machine accounts are trusted.

    In any case, if there was anything wrong with the account (or indeed, the SPN) for either service I would not be able to double hop to/from srv_integration or srv_target to other servers, but i can. Again, it's only this one particular double hop from integration to target that fails.

  • An attempt to create a SQL linked server from srv_integration to srv_target using the credentials of the current user fails if it is created through SSMS from a workstation, the error being a login failure for good ol' anon.

    Which means srv_integration does not trust credentials from that workstation.

    If the linked server is created from srv_integration directly then the creation is successful

    Which means srv_integration trusts itself (that's a good thing :)).

    any attempt to use it through a double hop fails.

    Use it where? Do you mean from a workstation whose credentials srv_integration does not trust (in my first quote), or elsewhere? Please elaborate if elsewhere:).

    Have those in charge of the domain been engaged?

    In no particular order, upon all relevant workstations, upon both failing SQL Servers, and upon a successful SQL Server (one of the other nodes, for baseline comparison):

    Run xp_logininfo for a failing user (on just the three SQL Servers :).

    Run SET LOGONSERVER, and compare.

    Run klist query_bind, and compare.

    Run eventvwr.msc and inspect Windows Logs | System Event Log. Initially filter out all informational events.

    Run eventvwr.msc and inspect Windows Logs | Security Event Log. Don't filter this log. Instead search for concurrent errors that concern 'anonymous', users' account names, trust, and (of course) Kerberos.

    Run ipconfig/displaydns, and later ipconfig/flushdns. Then test creation of a linked server.

  • SoHelpMeCodd (12/3/2015)


    Use it where? Do you mean from a workstation whose credentials srv_integration does not trust (in my first quote), or elsewhere? Please elaborate if elsewhere:)

    From any third party through srv_integration to srv_target fails. This includes the case where the client is srv_target itself. For example, if I remote desktop to srv_target (either as myself or as the SQL engie service account), open SSMS, connect to the srv_integration instance and test the linked server connection to srv_target, the login fails (anonymous).

    Have those in charge of the domain been engaged?

    To a point. Despite my limited windows and networking knowledge, I am usually able to solve double-hop related problems for SQL and applicaitons and thus have been given domain admin membership and am, by default, "the guy who fixes this stuff". As we lean further towards the DNS side of things we leave my area of expertise, but unfortunately there's a bit of a knowledge gap before we cross into areas that the infrastructure team is comfortable with.

    Run xp_logininfo for a failing user (on just the three SQL Servers :).

    Run SET LOGONSERVER, and compare.

    Run klist query_bind, and compare.

    Run eventvwr.msc and inspect Windows Logs | System Event Log. Initially filter out all informational events.

    Run eventvwr.msc and inspect Windows Logs | Security Event Log. Don't filter this log. Instead search for concurrent errors that concern 'anonymous', users' account names, trust, and (of course) Kerberos.

    Run ipconfig/displaydns, and later ipconfig/flushdns. Then test creation of a linked server.

    Cheers for those steps, some I have already run, status on each:

    xp_logininfo 'login', 'all' output is identical on all instances.

    srv_integraiton and srv_target are looking at different DC's. I will take that one over to the infrastructure team.

    All servers are win 2k8R2 so no query_bind available.

    I turned on kerberos error logging yesterday on integration and target (added LogLevel 1 to the registry) and infra is looking through the logs today, although I saw no specific errors appear in the log after turning this on and then causing the linked server error to occur.

    I will pass the dns operations over to infra.

    Edit: Just to explicitly state the obvious: If I remote to srv_integraiton, open a query against srv_target and check sys.dm_exec_connections, the auth scheme for the spid is NTLM.

  • After much (intermittent) hair pulling I was about to suggest we go to premier support with this but decided to do one more round of googling. I finally found a post on the "CSS SQL Server Engineers" blog describing the exact same symptoms with the exact same diagnostic and resolution attempts, at least down to the point where they used "an internal tool called SSPIClient which makes direct calls to the InitializeSecurityContext API call". From the page ( http://blogs.msdn.com/b/psssql/archive/2009/07/01/when-in-doubt-reboot.aspx ):

    Unfortunately, this was one of those issues that just escaped us. This tends to happen with odd Kerberos cases. We had the Directory Services team engaged as well and they did not know what else we could do in terms of data collection outside of a Kernel Dump to see what may be going on <...> the thought is that something was cached and invalid causing the issue. Rebooting cleared that out and allowed us to work as expected

    So I guess we will be going down the same route!

  • Seems worth a Premier incident to me. Bugs are free. Ensure with your TAM that no root cause is free (if you are offering to help MS catch a kernel dump before rebooting... :)). Ask to be escalated. Mention Adam's name - odds are you will get him ??.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply