Kerberos stopped - giant kdc.log

Question

I have a 10.5.8 OD server that has been in service for almost 2 years. The last couple of OS updates have evolved this system into a relatively stable unit (the early 10.5 versions required frequent reboots).

Recently, after losing the ability to authenticate against the domain, I found Server Admin indicating that Kerberos was stopped. The DNS is on that same server, and was the first place I looked. I noted that the reverse lookup for the server's IP was pointing to the name server's host name, so I changed it to the primary host name of the OD server.

I'm not sure how the PTR record was altered, as very few people have the needed access. However, once changed and the server rebooted, user authentication AND single sign-on were working again. Oddly, Server Admin still reports that Kerberos is stopped.
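
Since the fix here came down to the PTR record pointing at the wrong name, a quick forward/reverse comparison may be worth scripting. The following is a minimal sketch, not something from the original post: ptr_matches is a hypothetical helper, and the host name and IP address in the usage comment are placeholders.

```shell
#!/bin/sh
# Succeed (exit 0) when a PTR answer names the expected host.
# Both values are lowercased and a trailing dot is dropped before comparing,
# because reverse lookups normally return a fully qualified name ending in ".".
ptr_matches() {
    expected=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//')
    actual=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//')
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage (placeholder host name and IP; substitute the OD master's own):
# if ptr_matches od1.domain.lan "$(dig +short -x 10.0.0.10)"; then
#     echo "PTR record matches the OD server's host name"
# else
#     echo "PTR record does not match; Kerberos may refuse to start" >&2
# fi
```

If the reverse zone holds more than one PTR record for the address, the comparison fails, which is probably the safer outcome for a Kerberos host.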

Days passed and I've already ordered a couple of Mac Mini Servers to replace our aging Xserve G4 OD system. Even so, I still want to know what happened and how to fix it without resorting to the archive-demote-promote-restore routine. While I was copying one of the OD server's backup drives onto a testbed Xserve, I noticed that one file was taking a very long time to copy: /var/log/krb5kdc/kdc.log

So, I'm thinking 3.08GB is a bit much for a log file. While Googling, I found the following synopsis of the krb5.conf file:
http://docs.sun.com/app/docs/doc/816-0219/6m6njqb94?a=view
Of specific interest is this:
"kdc_rotate:
A relation subsection that enables kdc logging to be rotated to multiple files based on a time interval. This can be used to avoid logging to one file, which may grow too large and bring the KDC to a halt."

Being uncertain that kdc.log would simply regenerate, I initially chose to set up krb5.conf with the appropriate settings. Lesson 1: /etc/krb5/krb5.conf does not exist in Leopard Server; look for /Library/Preferences/edu.mit.Kerberos instead. Ignoring the warning that the file is automatically generated, I added the period and versions parameters and rebooted. Nada. Oh, yeah, forgot about the periodic routines...

sudo periodic daily weekly monthly - no joy; the 3GB kdc.log file was still there.

I then noticed that the file /etc/periodic/daily/601.daily.server.krb5kdc did not have the executable bit set (same as on the other 10.5 servers I checked). I set the 601 script as executable and tried the daily again. This time, the kdc.log file was renamed kdc.1.log. After rebooting, a new kdc.log file didn't appear. This could have been due to my messing with the permissions on the krb5kdc log directory. Or, could this be why the 601 daily script is set as no-execute? It doesn't work?
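
To see at a glance which periodic scripts are being skipped for lack of the execute bit, something like the following may help. It is a minimal sketch: list_nonexecutable is a hypothetical helper, and it assumes periodic only runs scripts that are executable, which matches the behavior described above.

```shell
#!/bin/sh
# Print every regular file in a periodic directory that lacks the execute bit.
# The default directory is the one mentioned in the post; pass another to override.
list_nonexecutable() {
    dir="${1:-/etc/periodic/daily}"
    for f in "$dir"/*; do
        # Skip anything that is not a regular file (including an unmatched glob).
        [ -f "$f" ] || continue
        [ -x "$f" ] || printf '%s\n' "$f"
    done
}

# Usage:
# list_nonexecutable /etc/periodic/daily
# sudo chmod +x /etc/periodic/daily/601.daily.server.krb5kdc   # only if you decide to enable it
```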

At this point, I restored to the backup and began again. Before booting to the restored drive, I removed the 3GB log file and crossed my fingers. This time the kdc.log file was created. But, for every step forward... Now LDAP is stopped and slapd keeps crashing with this log entry repeating:
------------------------
Nov 6 18:12:16 od1 slapd[573]: @(#) $OpenLDAP: slapd 2.3.27 (Jan 26 2009 09:49:21) $
Nov 6 18:12:17 od1 slapd[573]: overlay_config(): warning, overlay "dynid" already in list
Nov 6 18:12:17: --- last message repeated 4 times ---
Nov 6 18:12:17 od1 slapd[573]: bdb dbopen: unclean shutdown detected; attempting recovery.
Nov 6 18:12:17 od1 slapd[573]: bdb(dc=od1,dc=domain,dc=lan): Ignoring log file: /var/db/openldap/openldap-data/log.0000000470: magic number 0, not 40988
Nov 6 18:12:17 od1 slapd[573]: bdb(dc=od1,dc=domain,dc=lan): Invalid log file: log.0000000470: Invalid argument
Nov 6 18:12:17 od1 slapd[573]: bdb(dc=od1,dc=domain,dc=lan): First log record not found
Nov 6 18:12:17 od1 slapd[573]: bdb(dc=od1,dc=domain,dc=lan): PANIC: Invalid argument
Nov 6 18:12:18 od1 slapd[573]: bdb dbopen: Database cannot be recovered, err -30978. Restore from backup!
Nov 6 18:12:18 od1 slapd[573]: bdb(dc=od1,dc= domain,dc=lan): DB ENV->lock_idfree interface requires an environment configured for the locking subsystem
Nov 6 18:12:18 od1 slapd[573]: bdb(dc=od1,dc= domain,dc=lan): txn_checkpoint interface requires an environment configured for the transaction subsystem
Nov 6 18:12:18 od1 slapd[573]: bdb dbclose: txn_checkpoint failed: Invalid argument (22)
Nov 6 18:12:18 od1 slapd[573]: backend startupone: bi dbopen failed! (-30978)
Nov 6 18:12:18 od1 slapd[573]: bdb dbclose: alock_close failed
Nov 6 18:12:18 od1 slapd[573]: slapd stopped.
Nov 6 18:12:18 od1 slapd[573]: connections_destroy: nothing to destroy.
------------------------

Before I head down the archive-demote-promote-restore path, does anyone have any ideas? Additionally, what mechanism is there to keep this log file from reaching 3GB if the periodic maintenance routines don't work?
Does anyone else have a huge kdc.log file?
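
On the question of keeping the log from growing unnoticed: one option, separate from fixing the 601 script, is a small size check run from cron or launchd. A minimal sketch follows; log_exceeds and the 100 MB threshold are hypothetical, and only the log path comes from the post. It reports the condition and does not rotate anything itself.

```shell
#!/bin/sh
# Succeed (exit 0) when the file exists and is larger than max_bytes.
log_exceeds() {
    file="$1"
    max_bytes="$2"
    [ -f "$file" ] || return 1
    # wc -c prints the byte count; strip the space padding some wc builds add.
    size=$(wc -c < "$file" | tr -d ' ')
    [ "$size" -gt "$max_bytes" ]
}

# Usage (100 MB is an arbitrary example limit):
# if log_exceeds /var/log/krb5kdc/kdc.log 104857600; then
#     echo "kdc.log is over 100 MB; check that the daily rotation is running" >&2
# fi
```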

Thanks for any thoughts or feedback.
Lyle M

Xserve G4 1.3DP, Mac OS X (10.5.8)

Posted on Nov 6, 2009 3:14 PM

Answer 1

Nov 6, 2009 7:20 PM in response to Lyle Millander

Well, I cannot help with actual OD errors (Not yet anyhow)....

You mentioned that you edited /Library/Preferences/edu.mit.Kerberos despite the auto-generation warning: "I added the period and versions parameters and rebooted. Nada." That is expected. Files that state they are auto-generated are usually regenerated at fairly predictable intervals, as well as on any HUP or reboot. Most such files include a note along the lines of 'if you need this to not be generated, delete such-and-such line(s)', which is possibly why it did not work as you expected.

For me, 601.daily.server.krb5kdc is set up and runs fine. My logs are rotating once a day. I do not recall whether I had to fix it the way you did, and I have no notes saying that I did, so I have to assume that I did not. Just an FYI.

Hope you can fix your other issues. I have lived through several archive-and-demote cycles myself, and still not all of my PHDs work.

Good luck,
Peter

Answer 2

Nov 11, 2009 2:00 PM in response to Peter Scordamaglia

Hi Peter,
Thanks for chiming in. I've learned a few things since this started:

1. It looks like 601.daily.server.krb5kdc is the only supported mechanism for rotating the kdc log.
2. Apple does allow for the edu.mit.Kerberos file to be modified, as you mention: the header contains instructions on what should be deleted from the file in order to make any changes. It didn't really matter, as the period setting for the log doesn't appear to be used (at least not on my server). Also worth noting: that file hadn't changed since 2008, when the server was first promoted to a master. It did regenerate on the test box during demote/promote.
3. When kdc.log failed to re-appear after the daily periodic ran, it was due to my messing with permissions (I added admin read to the log folder). With permissions restored and ACLs removed, the new kdc.log appeared.

For related background, see my post in this thread:
http://discussions.apple.com/thread.jspa?messageID=10556534

I keep hammering on the test box to determine the best path. This is how my workflow looks so far:

1. Archive the OD database using Server Admin (in my case, the database from the live server).
2. Demote the test OD server to standalone.
3. Delete the 3GB log file.
4. Set the execute bit on 601.daily.server.krb5kdc.
5. Promote the test OD server to master using original settings (realm, search base, etc) and a diradmin that does not overlap with the one in the archive.
6. Use sudo slapd -mergedb <path to archive> (created in step 1)
7. Use sudo slapd -settopasswordserver on each user to reset all their passwords and clean up their password server slot and Kerberos principal (I'll likely do this with a shell script). This step seems to be my best option due to problems with my original merge from my 10.4 OD server.
8. Set up the new 10.6 server (a Mini, no less) and use its migration tools to grab from the test system.
-- I'm a little unsure how to proceed from here. Can I just shut off the live 10.5 server? Then use changeip on the 10.6 server to give it the IP of the server it's replacing? If I do that, will all the bound systems just hop on board the new 10.6 unit? Or, should I give the new server a different IP and gradually 'unbind old/bind new' to all the clients?

Additional insights are very welcome.

Regards

Answer 3

Nov 23, 2009 7:40 AM in response to Lyle Millander

Sorry, those commands were supposed to be:

6. Use sudo slapconfig -mergedb <path to archive> (created in step 1)
7. Use sudo slapconfig -settopasswordserver user
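
For the per-user pass in step 7, a loop along these lines might do. It is only a sketch: reset_users, the RUN switch, and the one-name-per-line user list file are hypothetical, and the slapconfig syntax is taken as written above rather than verified. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# For each user listed one per line in a file, print the reset command,
# or execute it when RUN=1 is set in the environment.
# Blank lines and lines starting with # are skipped.
reset_users() {
    list="$1"
    # Read the list on file descriptor 3 so the command's own stdin is left alone.
    while IFS= read -r user <&3 || [ -n "$user" ]; do
        case "$user" in
            ''|'#'*) continue ;;
        esac
        if [ "${RUN:-0}" = "1" ]; then
            sudo slapconfig -settopasswordserver "$user"
        else
            printf 'sudo slapconfig -settopasswordserver %s\n' "$user"
        fi
    done 3< "$list"
}

# Usage (users.txt is a placeholder file name):
# reset_users users.txt          # dry run: print the commands
# RUN=1 reset_users users.txt    # actually run them
```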

-Lyle
