Picoscope

Q: Kerberos Conundrum

Greetings all,

 

First off: we're running OS X Server 10.8.5, providing OD, AFP, DNS, DHCP and a few other services to about 50 clients on 10.8.5 and 10.9.x, all bound to the server.

 

The issue I’d like help with began when I tried to fix a different problem: unrecoverable hangs on network accounts.

 

Users logged in to their network accounts have been reporting a lot of application freezes (rainbow wheel) that often escalate to a full unrecoverable lock up of their session, requiring forced shut down of the client computer using the power button.  Multiple sessions digging into client and server logs have led me to believe that all of these failures are traceable to failures involving reading/writing to various sqlite databases. This then leads to various applications and daemons freezing up, including Firefox, Safari, Mail, Contacts and tccd. A friend of mine who's been at this much longer than I suggested that a likely cause was an authentication failure in the Kerberos stack.  This appeared to pan out, as I was seeing a lot of errors in the server's kerberos log that looked like this:

 

2016-09-09T10:15:41 AS-REQ davidi@MYSERVER.MYDOMAIN.NET from 10.23.20.1:50969 for krbtgt/MYSERVER.MYDOMAIN.NET@MYSERVER.MYDOMAIN.NET

2016-09-09T10:15:41 UNKNOWN -- davidi@MYSERVER.MYDOMAIN.NET: no such entry found in hdb

 

And these:


2016-09-09T10:16:11 TGS-REQ erde-suns-macbo$@MYSERVER.MYDOMAIN.NET from 10.242.4.10:56266 for host/erde-suns-macbook-pro.local@MYSERVER.MYDOMAIN.NET

2016-09-09T10:16:11 Server not found in database: host/erde-suns-macbook-pro.local@MYSERVER.MYDOMAIN.NET: no such entry found in hdb

2016-09-09T10:16:11 Failed building TGS-REP to 10.242.4.10:56266

2016-09-09T10:16:11 tgs-req: sending error: -1765328377 to client

2016-09-09T10:16:11 TGS-REQ erde-suns-macbo$@MYSERVER.MYDOMAIN.NET from 10.242.4.10:55163 for ldap/myserver.mydomain.net@MYSERVER.MYDOMAIN.NET [canonicalize]

2016-09-09T10:16:11 TGS-REQ authtime: 2016-09-09T10:16:11 starttime: 2016-09-09T10:16:11 endtime: 2016-09-09T20:16:11 renew till: unset

2016-09-09T10:16:45 TGS-REQ leahm@MYSERVER.MYDOMAIN.NET from 10.23.20.12:54625 for vnc/PLC-HMI.local@MYSERVER.MYDOMAIN.NET [canonicalize, forwardable]

2016-09-09T10:16:45 Searching referral for PLC-HMI.local

2016-09-09T10:16:45 Returning a referral to realm LOCAL for server vnc/PLC-HMI.local@MYSERVER.MYDOMAIN.NET that was not found

2016-09-09T10:16:45 Server not found in database: krbtgt/LOCAL@MYSERVER.MYDOMAIN.NET: no such entry found in hdb

2016-09-09T10:16:45 Failed building TGS-REP to 10.23.20.12:54625

2016-09-09T10:16:45 tgs-req: sending error: -1765328377 to client
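For anyone reading along, excerpts like the ones above can be summarized rather than eyeballed. What follows is only a rough sketch: the function names `hdb_misses` and `krb5_protocol_code` are made up for illustration, and the log path is left as an argument because it differs between setups. The first function counts "no such entry found in hdb" lines per principal. The second subtracts the krb5 error-table base (-1765328384) from the number in "sending error", which gives the RFC 4120 protocol code. For -1765328377 that works out to 7, KDC_ERR_S_PRINCIPAL_UNKNOWN, which agrees with the "Server not found in database" lines.

```shell
#!/bin/sh
# Hypothetical helpers for reading a KDC log; the names are illustrative only.

# Count "no such entry found in hdb" errors per principal.
# $1 = path to the KDC log (an assumption: pass whatever path your server uses).
hdb_misses() {
    grep 'no such entry found in hdb' "$1" |
        # Strip the trailing error text so the principal becomes the last field.
        sed -e 's/: no such entry found in hdb$//' |
        awk '{ print $NF }' |
        sort | uniq -c | sort -rn
}

# Base of the krb5 error table; library codes are this base plus the
# RFC 4120 protocol error number.
KRB5_ERROR_BASE=-1765328384

# $1 = the number from a "sending error: N to client" log line.
krb5_protocol_code() {
    echo $(( ($1) - ($KRB5_ERROR_BASE) ))
}
```

Calling `hdb_misses` on the log prints one line per missing principal with its count, most frequent first, which helps separate user principals (AS-REQ failures) from host and service principals (TGS-REQ failures).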

 

 

 

I found one thing in our server’s AFP conf that looked awry:

 

afp:kerberosPrincipal = "afpserver/LKDC:SHA1.AEB9DE7C32B710BB5552569A00257A54B0DF9F58@LKDC:SHA1.AEB9DE7C32B710BB5552569A00257A54B0DF9F58"

 

According to several articles like this one: https://discussions.apple.com/thread/6037923?tstart=0

That entry really ought to look like this:

 

afp:kerberosPrincipal = "afpserver/myserver.mydomain.net@MYSERVER.MYDOMAIN.NET"

 

I went ahead and changed this using serveradmin and rebooted.
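To make the difference between the two values concrete, here is a minimal sketch of the check involved. Both helper names are hypothetical and nothing here calls serveradmin itself. It only builds the principal string expected for an Open Directory realm and tests whether a given value still points at a local KDC (LKDC) realm.

```shell
#!/bin/sh
# Hypothetical helpers; they only inspect strings and change nothing on the server.

# Build the AFP service principal expected for a managed realm.
# $1 = server FQDN, $2 = Kerberos realm
expected_afp_principal() {
    printf 'afpserver/%s@%s\n' "$1" "$2"
}

# Succeed if the principal's realm is a local KDC realm (LKDC:SHA1.<hash>).
# $1 = principal string, e.g. the value of afp:kerberosPrincipal
is_lkdc_principal() {
    case "$1" in
        *@LKDC:SHA1.*) return 0 ;;
        *)             return 1 ;;
    esac
}
```

With the values from this thread, `expected_afp_principal myserver.mydomain.net MYSERVER.MYDOMAIN.NET` prints the replacement value above, and `is_lkdc_principal` succeeds for the original LKDC entry.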

 

The Kerberos log was then populated with a lot of messages like this one:

 

2016-09-13T09:47:45 Got a canonicalize request for a LKDC realm from local-ipc

2016-09-13T09:47:45 Asked for LKDC, but there is none

 

So I then ran:


sudo rm -rf /var/db/krb5kdc

sudo /usr/libexec/configureLocalKDC

 

And rebooted.

 

This broke the binding for all the client machines, but the bindings were reestablished once I rebooted the machines that had been up when I made the change.

 

However, now things in the Kerberos log are a total mess.

 

I’m now getting TONS of these:


2016-09-16T13:37:28 AS-REQ davidr@MYSERVER.MYDOMAIN.NET from 10.23.5.3:52077 for krbtgt/MYSERVER.MYDOMAIN.NET@MYSERVER.MYDOMAIN.NET

2016-09-16T13:37:28 UNKNOWN -- davidr@MYSERVER.MYDOMAIN.NET: no such entry found in hdb

 

And also seeing a bunch of forwards that look like this:

 

Got a canonicalize request for a LKDC realm from local-ipc

2016-09-16T13:37:19 LKDC referral to the real LKDC realm name

2016-09-16T13:37:19 AS-REQ teaa@LKDC:SHA1.AEB9DE7C32B710BB5552569A00257A54B0DF9F58 from local-ipc for krbtgt/LKDC:SHA1.AEB9DE7C32B710BB5552569A00257A54B0DF9F58@LKDC:SHA1.AEB9DE7C32B710BB5552569A00257A54B0DF9F58

 

I am tempted to just restore /var/db/krb5kdc from a TM backup and reboot, but that won’t necessarily address the fundamental issue here.

 

I’d like to ensure that the hdb contains all user, group and computer records and that all krbtgt requests are being referred to the proper db in the proper realm. At least I think that’s what’s gone awry here.

 

Ready to be thrashed by my betters… all suggestions welcome.

 

Thanks much,

 

Paul

MAC MINI SERVER (LATE 2012), OS X Server, 10.8.5

Posted on Sep 16, 2016 4:15 PM


  • by Picoscope,

    Sep 16, 2016 4:17 PM in response to Picoscope

    Oh - I should mention that end users don't appear to be noticing any of this. Users are logging in and traversing AFP shares normally even with all this falderal happening in the background.

  • by Picoscope,

    Sep 18, 2016 10:17 PM in response to Picoscope

    I should probably also mention that slaptest returns:

     

    57df7489 ldif_read_file: Permission denied for "/etc/openldap/slapd.d/cn=config.ldif"

    slaptest: bad configuration file!


    :-(.
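One possible reading of that "Permission denied" is that slaptest was run as an ordinary user who cannot read the slapd configuration, not that the configuration is broken. That is an assumption, not something the output proves. A minimal sketch for checking it, using a hypothetical helper named `check_readable`:

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user can read a file.
# $1 = path to check, e.g. the cn=config.ldif file named in the error
check_readable() {
    if [ -r "$1" ]; then
        echo "readable: $1"
    else
        echo "not readable as $(id -un): $1 (try again with sudo)"
        return 1
    fi
}
```

If the file turns out to be unreadable for the current user, rerunning slaptest with sudo would show whether the configuration itself is the problem.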


    p

  • by StoneSoup,

    Sep 18, 2016 10:53 PM in response to Picoscope

    Hi, Paul,

     

    Eek.  I haven't dug deep into Kerberos-related setups at this level before, but there are at least a few things that look odd to me on first principles:

     

    A. The base "no such entry found in hdb" error

     

    B. The weird afp:kerberosPrincipal entry

     

    C. A bad LDIF read is also troubling

     

    Intermittent problems suck.  I've found that the key is to find a reproducible test case that generates the error condition every time. I can't see that in the description above... yet.

     

    So I'd pause for a moment before digging too deeply into the resulting mess in the Kerberos logs.  Stop digging when you're in a hole, and all that...  My first instinct would be to restore everything to the previous state and attempt to find a reproducible test case, using only a single intermittently-failing client machine if possible.

     

    Generally, when I hear about issues where something used to work but has now stopped working (or is working intermittently), my automatic response would be to ask what the users (or sysadmin) did differently. So, I'd first go high-level with the following questions:

     

    1. When did the user-visible odd timeouts start happening?  Was there a server patch cycle around that time?  [side question: do you have a log of server-side patching activities?]

     

    2. Is it isolated to a few users?  If so, what do they have in common?  Did they patch their client-side machines before the problem appeared?

     

    3. Presumably, some users are *not* seeing this problem (and their requests are not appearing as errors in the logs).  What do those users have in common?

     

    4. I found this (very old) thread with the "no such entry found in hdb" error.  At the risk of stating the obvious, have you run through the basic maintenance routines on the server?
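Questions 2 and 3 above can be partly answered from the KDC log itself. The sketch below is a rough aid under stated assumptions: the function name `failing_clients` is made up, the log path is passed in as an argument, and each "no such entry found in hdb" line is attributed to the most recent AS-REQ or TGS-REQ line before it. That attribution can be wrong when requests from several clients interleave.

```shell
#!/bin/sh
# Hypothetical helper: list which clients and addresses hit hdb lookup failures.
# $1 = path to the KDC log (an assumption: pass whatever path your server uses)
failing_clients() {
    awk '
        # Request lines look like: <time> AS-REQ <client> from <ip:port> for <service>
        ($2 == "AS-REQ" || $2 == "TGS-REQ") && $4 == "from" {
            client = $3
            split($5, addr, ":")    # addr[1] is the address without the port
            ip = addr[1]
            next
        }
        # Attribute the failure to the most recent request seen.
        /no such entry found in hdb/ && client != "" {
            print client, ip
            client = ""             # count each failed request only once
        }
    ' "$1" | sort | uniq -c | sort -rn
}
```

Comparing this list with the users who report freezes, and with those who do not, would show whether the log errors track the user-visible problem at all.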

     

    Cheers,

    Steven