Kerberos Conundrum

Question

Greetings all,

First off: we're running OS X Server 10.8.5, providing OD, AFP, DNS, DHCP and a few other services to about 50 clients on 10.8.5 and 10.9.x, all bound to the server.

The issue I'd like help with began when I attempted to fix a problem that was causing unrecoverable hangs on network accounts.

Users logged in to their network accounts have been reporting frequent application freezes (rainbow wheel) that often escalate to a full, unrecoverable lock-up of their session, requiring a forced shutdown of the client computer with the power button. Several sessions of digging through client and server logs have led me to believe that all of these are traceable to failed reads and writes on various sqlite databases, which in turn freeze applications and daemons, including Firefox, Safari, Mail, Contacts and tccd. A friend of mine, who's been at this much longer than I have, suggested that a likely cause was an authentication failure in the Kerberos stack. This appeared to pan out, as I was seeing a lot of errors in the server's Kerberos log that looked like this:

2016-09-09T10:15:41 AS-REQ davidi@MYSERVER.MYDOMAIN.NET from 10.23.20.1:50969 for krbtgt/MYSERVER.MYDOMAIN.NET@MYSERVER.MYDOMAIN.NET

2016-09-09T10:15:41 UNKNOWN -- davidi@MYSERVER.MYDOMAIN.NET: no such entry found in hdb

And these:

2016-09-09T10:16:11 TGS-REQ erde-suns-macbo$@MYSERVER.MYDOMAIN.NET from 10.242.4.10:56266 for host/erde-suns-macbook-pro.local@MYSERVER.MYDOMAIN.NET

2016-09-09T10:16:11 Server not found in database: host/erde-suns-macbook-pro.local@MYSERVER.MYDOMAIN.NET: no such entry found in hdb

2016-09-09T10:16:11 Failed building TGS-REP to 10.242.4.10:56266

2016-09-09T10:16:11 tgs-req: sending error: -1765328377 to client

2016-09-09T10:16:11 TGS-REQ erde-suns-macbo$@MYSERVER.MYDOMAIN.NET from 10.242.4.10:55163 for ldap/myserver.mydomain.net@MYSERVER.MYDOMAIN.NET [canonicalize]

2016-09-09T10:16:11 TGS-REQ authtime: 2016-09-09T10:16:11 starttime: 2016-09-09T10:16:11 endtime: 2016-09-09T20:16:11 renew till: unset

2016-09-09T10:16:45 TGS-REQ leahm@MYSERVER.MYDOMAIN.NET from 10.23.20.12:54625 for vnc/PLC-HMI.local@MYSERVER.MYDOMAIN.NET [canonicalize, forwardable]

2016-09-09T10:16:45 Searching referral for PLC-HMI.local

2016-09-09T10:16:45 Returning a referral to realm LOCAL for server vnc/PLC-HMI.local@MYSERVER.MYDOMAIN.NET that was not found

2016-09-09T10:16:45 Server not found in database: krbtgt/LOCAL@MYSERVER.MYDOMAIN.NET: no such entry found in hdb

2016-09-09T10:16:45 Failed building TGS-REP to 10.23.20.12:54625

2016-09-09T10:16:45 tgs-req: sending error: -1765328377 to client
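For anyone decoding the numeric error in those tgs-req lines, here is a minimal sketch. It assumes the value is the usual krb5 com_err code, that is, the table base -1765328384 plus the RFC 4120 protocol error number. The function name and the two-entry lookup table are illustrative only and are not part of any Kerberos tool.

```python
# Assumption: krb5 library error codes are offset from the com_err
# table base -1765328384; the offset is the RFC 4120 error number.
KRB5_ERROR_TABLE_BASE = -1765328384

# Small, deliberately incomplete subset of RFC 4120 KDC error numbers.
KDC_ERRORS = {
    6: "KDC_ERR_C_PRINCIPAL_UNKNOWN",  # client not found in Kerberos database
    7: "KDC_ERR_S_PRINCIPAL_UNKNOWN",  # server not found in Kerberos database
}


def decode_krb5_error(code):
    """Return (protocol error number, symbolic name) for a krb5 error code."""
    offset = code - KRB5_ERROR_TABLE_BASE
    return offset, KDC_ERRORS.get(offset, "unknown (not in this subset)")


print(decode_krb5_error(-1765328377))
```

Under that assumption, -1765328377 maps to error 7, which matches the "Server not found in database" lines logged just before it.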

I found one thing in our server’s AFP conf that looked awry:

afp:kerberosPrincipal = "afpserver/LKDC:SHA1.AEB9DE7C32B710BB5552569A00257A54B0DF9F58@LKDC:SHA1.AEB9DE7C32B710BB5552569A00257A54B0DF9F58"

According to several articles like this one (https://discussions.apple.com/thread/6037923?tstart=0), that entry really ought to look like this:

afp:kerberosPrincipal = "afpserver/myserver.mydomain.net@MYSERVER.MYDOMAIN.NET"

I went ahead and changed this using serveradmin and rebooted.

The Kerberos log was then populated with a lot of messages like this one:

2016-09-13T09:47:45 Got a canonicalize request for a LKDC realm from local-ipc

2016-09-13T09:47:45 Asked for LKDC, but there is none

So I then ran:

sudo rm -rf /var/db/krb5kdc

sudo /usr/libexec/configureLocalKDC

And rebooted.

This broke the binding for all the client machines, but the bindings were re-established once I rebooted the machines that had been up when I made the change.

However, now things in the Kerberos log are a total mess.

I’m now getting TONS of these:

2016-09-16T13:37:28 AS-REQ davidr@MYSERVER.MYDOMAIN.NET from 10.23.5.3:52077 for krbtgt/MYSERVER.MYDOMAIN.NET@MYSERVER.MYDOMAIN.NET

2016-09-16T13:37:28 UNKNOWN -- davidr@MYSERVER.MYDOMAIN.NET: no such entry found in hdb

And also seeing a bunch of forwards that look like this:

Got a canonicalize request for a LKDC realm from local-ipc

2016-09-16T13:37:19 LKDC referral to the real LKDC realm name

2016-09-16T13:37:19 AS-REQ teaa@LKDC:SHA1.AEB9DE7C32B710BB5552569A00257A54B0DF9F58 from local-ipc for krbtgt/LKDC:SHA1.AEB9DE7C32B710BB5552569A00257A54B0DF9F58@LKDC:SHA1.AEB9DE7C32B710BB5552569A00257A54B0DF9F58

I am tempted to just restore /var/db/krb5kdc from a TM backup and reboot, but that won’t necessarily address the fundamental issue here.

I’d like to ensure that the hdb contains all user, group and computer records and that all krbtgt requests are being referred to the proper db in the proper realm. At least I think that’s what’s gone awry here.

Ready to be thrashed by my betters… all suggestions welcome.

Thanks much,

Paul

MAC MINI SERVER (LATE 2012), OS X Server, 10.8.5

Posted on Sep 16, 2016 4:15 PM

Answer 1

Picoscope Author

Sep 16, 2016 4:17 PM in response to Picoscope

Oh - I should mention that end users don't appear to be noticing any of this. Users are logging in and traversing AFP shares normally even with all this falderal happening in the background.

Answer 2

Picoscope Author

Sep 18, 2016 10:17 PM in response to Picoscope

I should probably also mention that slaptest returns:

57df7489 ldif_read_file: Permission denied for "/etc/openldap/slapd.d/cn=config.ldif"

slaptest: bad configuration file!

:-(.

p

Answer 3

Sep 18, 2016 10:53 PM in response to Picoscope

Hi, Paul,

Eek. I haven't dug deep into Kerberos-related setups at this level before, but there are at least a few things that look odd to me on first principles:

A. The base "no such entry found in hdb" error

B. The weird afp:kerberosPrincipal entry

C. A bad LDIF read is also troubling

Intermittent problems suck. I've found that the key is to find a reproducible test case that generates the error condition every time. I can't see that in the description above... yet.

So I'd pause for a moment before digging too deeply into the resulting mess in the Kerberos logs. Stop digging when you're in a hole, and all that... My first instinct would be to restore everything to the previous state and attempt to find a reproducible test case, using only a single intermittently-failing client machine if possible.

Generally, when I hear about issues where something used to work but has now stopped working (or is working intermittently), my automatic response would be to ask what the users (or sysadmin) did differently. So, I'd first go high-level with the following questions:

1. When did the user-visible odd timeouts start happening? Was there a server patch cycle around that time? [side question: do you have a log of server-side patching activities?]

2. Is it isolated to a few users? If so, what do they have in common? Did they patch their client-side machines before the problem appeared?

3. Presumably, some users are *not* seeing this problem (and their requests are not appearing as errors in the logs). What do those users have in common?

4. I found this (very old) thread with the "no such entry found in hdb" error. At the risk of stating the obvious, have you run through the basic routine checks on the server?
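To help answer questions 2 and 3, one option is to tally which principals the KDC reports as missing. This is a rough sketch that assumes the log lines have exactly the shape of the excerpts quoted in the question; the function name and regular expression are my own, and the sample lines are copied from the post.

```python
import re
from collections import Counter

# Matches the two "missing principal" line shapes quoted in the question.
MISSING = re.compile(
    r"^\S+ (UNKNOWN --|Server not found in database:) (\S+): "
    r"no such entry found in hdb$"
)


def tally_missing_principals(lines):
    """Count 'no such entry found in hdb' lines per (kind, principal)."""
    counts = Counter()
    for line in lines:
        m = MISSING.match(line.strip())
        if m:
            # "UNKNOWN --" lines name the requesting client; the other
            # shape names the service the client asked for.
            kind = "client" if m.group(1) == "UNKNOWN --" else "server"
            counts[(kind, m.group(2))] += 1
    return counts


sample = [
    "2016-09-09T10:15:41 AS-REQ davidi@MYSERVER.MYDOMAIN.NET from 10.23.20.1:50969 for krbtgt/MYSERVER.MYDOMAIN.NET@MYSERVER.MYDOMAIN.NET",
    "2016-09-09T10:15:41 UNKNOWN -- davidi@MYSERVER.MYDOMAIN.NET: no such entry found in hdb",
    "2016-09-09T10:16:11 Server not found in database: host/erde-suns-macbook-pro.local@MYSERVER.MYDOMAIN.NET: no such entry found in hdb",
    "2016-09-09T10:16:45 Server not found in database: krbtgt/LOCAL@MYSERVER.MYDOMAIN.NET: no such entry found in hdb",
    "2016-09-16T13:37:28 UNKNOWN -- davidr@MYSERVER.MYDOMAIN.NET: no such entry found in hdb",
]

for (kind, principal), n in sorted(tally_missing_principals(sample).items()):
    print(kind, principal, n)
```

Run against the full log rather than the sample, the "client" entries give a list of affected accounts to compare with the users who are actually reporting hangs.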

Cheers,

Steven

Answer 4

Oct 7, 2016 4:36 AM in response to Picoscope

Hi Paul

I'm one of the contributors to the discussion you've referred to.

With OS X Server, DNS is extremely important, and it must be as 'perfect' as it can be before you do anything else. AFP has always been prone to the errors you've seen if the services are not set up in the right order. The way I would configure OS X Server (any version) is DNS first, followed by Open Directory and/or Profile Manager (it starts OD anyway), followed by AFP. That way AFP stands a good chance of being kerberized. However, even this procedure does not always guarantee a kerberized AFP service. If things are working OK, I would ignore the logs, to be honest, as they are not necessarily an indicator of a problem. Some logs are simply verbose, and some of the language used can seem alarming even when nothing is wrong.
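One way to sanity-check the DNS side is a forward and reverse round trip: resolve the server's name to an address, resolve that address back to a name, and see whether the two agree. This is a minimal sketch, not a full DNS audit. The function name is my own, the lookups are injectable so the demo runs offline, and the address 10.0.0.2 in the demo tables is hypothetical.

```python
import socket


def dns_round_trip(hostname,
                   forward=socket.gethostbyname,
                   reverse=lambda ip: socket.gethostbyaddr(ip)[0]):
    """Resolve hostname -> IP -> hostname and report whether they agree."""
    ip = forward(hostname)
    name = reverse(ip)
    # Compare case-insensitively and ignore a trailing root dot.
    same = name.rstrip(".").lower() == hostname.rstrip(".").lower()
    return ip, name, same


# Offline demo with made-up lookup tables (hypothetical address).
fake_forward = {"myserver.mydomain.net": "10.0.0.2"}
fake_reverse = {"10.0.0.2": "myserver.mydomain.net"}

print(dns_round_trip("myserver.mydomain.net",
                     forward=fake_forward.__getitem__,
                     reverse=fake_reverse.__getitem__))
```

On the server itself you would call dns_round_trip("myserver.mydomain.net") with the default lookups, which use the system resolver; a False in the last position suggests the forward and reverse records disagree.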

Not that this is a problem based on what you've given here, but if you're basing the internal domain around .local then you will have problems, sooner rather than later, regardless of what you do.
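That said, the log excerpts in the question do show clients requesting tickets for Bonjour-style .local host names, one of which ends in a referral to a LOCAL realm that does not exist. Whether those requests matter here is an assumption on my part. A small sketch to list them, assuming the AS-REQ/TGS-REQ line shape quoted in the question (the function name and pattern are illustrative):

```python
import re

# Captures the host part of "service/host@REALM" in a request line.
REQUEST = re.compile(
    r"(?:AS-REQ|TGS-REQ) \S+ from \S+ for [^/\s]+/([^@\s]+)@"
)


def local_service_hosts(lines):
    """Return the set of requested service hosts that end in .local."""
    hosts = set()
    for line in lines:
        m = REQUEST.search(line)
        if m and m.group(1).lower().endswith(".local"):
            hosts.add(m.group(1))
    return hosts


sample = [
    "2016-09-09T10:16:11 TGS-REQ erde-suns-macbo$@MYSERVER.MYDOMAIN.NET from 10.242.4.10:56266 for host/erde-suns-macbook-pro.local@MYSERVER.MYDOMAIN.NET",
    "2016-09-09T10:16:11 TGS-REQ erde-suns-macbo$@MYSERVER.MYDOMAIN.NET from 10.242.4.10:55163 for ldap/myserver.mydomain.net@MYSERVER.MYDOMAIN.NET [canonicalize]",
    "2016-09-09T10:16:45 TGS-REQ leahm@MYSERVER.MYDOMAIN.NET from 10.23.20.12:54625 for vnc/PLC-HMI.local@MYSERVER.MYDOMAIN.NET [canonicalize, forwardable]",
]

for host in sorted(local_service_hosts(sample)):
    print(host)
```

If the list is long, those requests may account for much of the "Server not found in database" noise without pointing to a fault in the directory itself.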
