Hello folks,
I'm not sure if this discussion is still active, but I've been experiencing the same problems and troubleshooting them is not easy.
I'm experiencing the failing services as well, most noticeably the fact that postgres shuts down and won't restart. I've been able to narrow it down to the point that it is in fact an issue with launchd, which I will explain later.
What happens is that after a reboot most services work fine, as long as no Server.app or Server Admin tool tries to connect to the server. Connecting with one of these tools often, but not always immediately, results in "Shutting down idle postgres...." messages, after which postgres stops. These messages are not the core of the problem; the real problem is that postgres won't restart.
On a working server idle processes are stopped as well, but new ones are created when needed. On the failing server (a 2010 Mac mini with a clean install) postgres can't be started again.
When I try to start postgres from the command line, a 'cannot start postgres' timeout message appears. Without postgres there is not much running on the system. Once this happens I can't use Server Admin to manage DNS, NetBoot or DHCP, although these services are still running.
Diving deeper into the system, I noticed that servermgrd is the common factor, so I decided to investigate the behaviour of this process.
While managing a service from the command line, I used dtruss to follow the execution flow and learned that servermgrd depends on launchd.
And that is where my problem is. Launchd is working fine for all users but root. As root I get the message launch_msg(): Socket is not connected when typing something like launchctl load -w /System/Library/LaunchDaemons/com.apple.collabd.plist
Even a 'simple' launchctl limit command fails with the same error, while launchctl list works fine.
The above commands work fine when I enter them as admin with sudo. The problem then is that the services start with the wrong credentials, which results in failure as well, since the log files can't be accessed, and so on.
I've compared all the permissions that I can think of with a working system. I also ran Repair Disk from Disk Utility, but I still can't figure out why launchd can't be used as root.
Below is the dtruss trace that shows the program hierarchy. Hopefully somebody will speak the magic words that help me fix launchd, or can at least help me troubleshoot it.
/mac-mini:Public root# (dtruss -f "serveradmin start postgres")
PID/THRD SYSCALL(args) = return
...
77524/0xe0324: open("/usr/share/servermgrd/bundles/servermgr_postgres.bundle/Contents/MacOS/servermgr_postgres\0", 0x0, 0x1FF) = 4 0
...
77524/0xe0324: open("/private/var/servermgrd//servermgr_postgres.lock\0", 0x2, 0x1F8) = 4 0
...
77524/0xe0324: access("/bin/launchctl\0", 0x1, 0x400) = 0 0
77524/0xe0324: geteuid(0x7FFF73BA2C40, 0x7FFF73BA1CA0, 0x7FFF73BA2C40) = 0 0
77524/0xe0324: pipe(0x7FFF6BEED440, 0x7F8ACA0173F0, 0x10) = 5 0
77524/0xe0324: pipe(0x7FFF6BEED438, 0x7F8ACA0173F0, 0x6) = 7 0
77524/0xe0324: pipe(0x7FFF6BEED430, 0x7F8ACA0173F0, 0x8) = 9 0
77524/0xe0324: pipe(0x7FFF6BEED428, 0x7F8ACA0173F0, 0xA) = 11 0
= 0 0
77524/0xe0324: open("/usr/share/servermgrd/bundles/servermgr_postgres.bundle/Contents/Info.plist\0", 0x0, 0x1B6) = 3 0
…
77524/0xe0324: open("/usr/sbin/serveradmin\0", 0x0, 0x1FF) = 3 0
...
77525/0xe033b: open("/var/db/launchd.db/com.apple.launchd/overrides.plist\0", 0x220, 0x180) = 3 0
77525/0xe033b: open("/var/db/launchd.db/com.apple.launchd/overrides.plist\0", 0x0, 0x1B6) = 4 0
77525/0xe033b: read(0x4, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"http://www.apple.com/DTDs/PropertyList-1.0.dtd\">\n<plist version=\"1.0\">\n<dict>\n\t<key>com.apple.AEServer</key>\n\t<dict>\n\t\t<key>Disabled</key>\n\t\t<false/>\n\t</dict>\n\t<ke", 0x19CD) = 6605 0
... = 0 0
77525/0xe033b: open("/System/Library/LaunchDaemons/org.postgresql.postgres.plist\0", 0x0, 0x1B6) = 4 0
77525/0xe033b: read(0x4, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"http://www.apple.com/DTDs/PropertyList-1.0.dtd\">\n<plist version=\"1.0\">\n<dict>\n\t<key>GroupName</key>\n\t<string>_postgres</string>\n\t<key>Label</key>\n\t<string>org.post", 0x4D1) = 1233 0
...
77525/0xe033b: socket(0x1, 0x1, 0x0) = 4 0
77525/0xe033b: fcntl_nocancel(0x4, 0x2, 0x1) = 0 0
77525/0xe033b: connect_nocancel(0x4, 0x7FFF65119DB0, 0x6A) = -1 Err#61
77525/0xe033b: close_nocancel(0x4) = 0 0
77525/0xe033b: getrlimit(0x1008, 0x7FFF65119900, 0x80) = 0 0
77525/0xe033b: write_nocancel(0x2, "launch_msg(): Socket is not connected\n\0", 0x26) = 38 0
...
77525/0xe033b: open("/bin/launchctl\0", 0x0, 0x1FF) = 4 0
...
77525/0xe033b: open("/var/db/launchd.db/com.apple.launchd/overrides.plist\0", 0x601, 0x1B6) = 4 0
77525/0xe033b: write(0x4, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"http://www.apple.com/DTDs/PropertyList-1.0.dtd\">\n<plist version=\"1.0\">\n<dict>\n\t<key>com.apple.AEServer</key>\n\t<dict>\n\t\t<key>Disabled</key>\n\t\t<false/>\n\t</dict>\n\t<ke", 0x19CD) = 6605 0
77524/0xe0324: select(0xA, 0x7FFF6BEED540, 0x7FFF6BEED4C0, 0x0, 0x0) = 2 0
77524/0xe0324: read(0x9, "abauthd</key>\n\t<dict>\n\t\t<key>Disabled</key>\n\t\t<false/>\n\t</dict>\n\t<key>com.apple.collabcored1</key>\n\t<dict>\n\t\t<key>Disabled</key>\n\t\t<false/>\n\t</dict>\n\t<key>com.apple.collabcored2</key>\n\t<dict>\n\t\t<key>Disabled</key>\n\t\t<false/>\n\t</dict>\n\t<key>com.apple.collab", 0x1000) = 0 0
77524/0xe0324: close(0x9) = 0 0
77524/0xe0324: kill(0xFFFED12B, 0x9, 0x1) = -1 Err#1
77524/0xe0324: wait4(0x12ED5, 0x7FFF6BEED45C, 0x0) = 77525 0
77524/0xe0324: geteuid(0x7FFF6BEEDB90, 0x400, 0x7F8AC8D00010) = 0 0
77524/0xe0324: socket(0x1, 0x1, 0x0) = 5 0
..
= 1 0
77524/0xe0324: read(0x9, "launch_msg(): Socket is not connected\n\0", 0x1000) = 38 0
...
77524/0xe0324: write_nocancel(0x1, "postgres:error = \"CANNOT_START_SERVICE_TIMEOUT_ERR\"\n\0", 0x34) = 52 0
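
In case it helps anyone reading a trace like the one above: here is a minimal Python sketch that picks out the failing syscalls (the lines ending in "-1 Err#N") and names the error codes. It is only a reading aid, not part of any Apple tool. The function names failed_calls and to_signed32 are my own, and the small errno table covers just the two codes that appear in this trace, using the Darwin (OS X) numbering; other platforms number these differently.

```python
import re

# Darwin (OS X) errno numbers for the two failures in the trace above.
# Assumption: only these two are listed; anything else shows as "unknown".
DARWIN_ERRNO = {1: "EPERM", 61: "ECONNREFUSED"}

# Matches dtruss lines such as:
#   77525/0xe033b: connect_nocancel(0x4, 0x7FFF65119DB0, 0x6A) = -1 Err#61
FAILED_CALL = re.compile(
    r"^(?P<pid>\d+)/0x[0-9a-fA-F]+:\s+"
    r"(?P<syscall>\w+)\((?P<args>.*)\)\s+=\s+-1\s+Err#(?P<errno>\d+)\s*$"
)


def to_signed32(text):
    """Interpret a hex argument printed by dtruss as a signed 32-bit integer."""
    value = int(text, 16) & 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def failed_calls(lines):
    """Return (pid, syscall, errno, name) for every syscall that returned -1."""
    results = []
    for line in lines:
        match = FAILED_CALL.match(line.strip())
        if match is None:
            continue
        code = int(match.group("errno"))
        results.append(
            (
                int(match.group("pid")),
                match.group("syscall"),
                code,
                DARWIN_ERRNO.get(code, "unknown"),
            )
        )
    return results


if __name__ == "__main__":
    sample = [
        "77525/0xe033b: connect_nocancel(0x4, 0x7FFF65119DB0, 0x6A) = -1 Err#61",
        "77524/0xe0324: kill(0xFFFED12B, 0x9, 0x1) = -1 Err#1",
        "77524/0xe0324: close(0x9) = 0 0",
    ]
    for pid, syscall, code, name in failed_calls(sample):
        print(pid, syscall, code, name)
    # The first argument of the kill() call, read as a signed value.
    print(to_signed32("0xFFFED12B"))
```

Read this way, the trace has two failures: the connect_nocancel() call in the launchctl child (PID 77525) fails with error 61, and the later kill() in the parent fails with error 1. The first argument of that kill(), 0xFFFED12B, is -77525 as a signed 32-bit value, which matches the child PID that wait4() returns on the next line.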