Xserve System Crash

We have an Xserve (dual 1.33 GHz, 2 GB RAM) running 10.2.8 as a file server, on a 1500 VA UPS.

Our unit freezes without provocation. No panic information appears on the screen when this happens. Either the machine waits for me to reboot it, or it reboots itself because the option to restart automatically when the system freezes is enabled. Usually the machine then comes back to life and does its thing. Other times the system status lights blink four times, indicating a RAM problem. When that happens, I open up the unit, pop out the DIMMs, and reseat them. That does the trick, and we are back up and running until the next time. I'm not familiar enough with the logs to find anything noted in them. I have run ASD 2.1.5 and it passes all tests. The open.crash.log has several identical entries (see below).

Any help would be greatly appreciated.

Thanks
Michael



Date/Time: 2007-02-01 09:13:28 -0500
OS Version: 10.2.8 (Build 6R73)
Host: ProductionServer

Command: open
PID: 810

Exception: EXC_BAD_ACCESS (0x0001)
Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x00000020

Thread 0 Crashed:
#0 0x90164454 in CFMessagePortCreateRunLoopSource
#1 0x931b54dc in -[NSApplication _createWakeupPort]
#2 0x9313b6a0 in -[NSApplication init]
#3 0x9313a6e8 in +[NSApplication sharedApplication]
#4 0x9313a588 in +[NSWorkspace sharedWorkspace]
#5 0x000036b0 in main
#6 0x000033b8 in _start
#7 0x00003238 in start

PPC Thread State:
srr0: 0x90164454 srr1: 0x0200f030 vrsave: 0x00000000
xer: 0x00000000 lr: 0x901643f0 ctr: 0x901643e4 mq: 0x00000000
r0: 0x00000000 r1: 0xbffff7d0 r2: 0x24000280 r3: 0x00000000
r4: 0x00000000 r5: 0x00000000 r6: 0x00000080 r7: 0xffffffff
r8: 0x00047010 r9: 0xa0001048 r10: 0x00047290 r11: 0xa3073610
r12: 0x901643e4 r13: 0x00000000 r14: 0x00000000 r15: 0x00000000
r16: 0x00000000 r17: 0xa306a63c r18: 0xa307a63c r19: 0xa309a63c
r20: 0x906d6a7c r21: 0xa309a63c r22: 0xa309da3c r23: 0x00000000
r24: 0x0005fe80 r25: 0xa01343f0 r26: 0x00000000 r27: 0x00000000
r28: 0x00000000 r29: 0x00000008 r30: 0x00000000 r31: 0x901643f0

Posted on Feb 5, 2007 7:56 AM


Feb 5, 2007 4:14 PM in response to mymacsolutions

Welcome to discussions!

The open command shouldn't be enough to bring down the machine. If it's a kernel panic, it should be writing to /Library/Logs/panic.log (10.3 and 10.4 did/do, at least). The latest couple of entries from that log, and maybe /var/log/system.log for the minute or so leading up to the problem, would be much more telling.
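A minimal Python sketch for pulling those lines out of system.log, assuming each reboot is marked by a line containing "syslogd: restart" (the line syslogd writes when it starts up). It returns the last n lines logged before each restart rather than a fixed time window, because after a hang the last entry can be hours or days older than the reboot. The function names lines_before_restarts and read_log are hypothetical.

```python
def lines_before_restarts(log_lines, n=5):
    """Return (restart_line, preceding_lines) for each syslogd restart.

    log_lines is a list of system.log lines; each reboot is assumed to be
    marked by a line containing "syslogd: restart".
    """
    results = []
    for i, line in enumerate(log_lines):
        if "syslogd: restart" in line:
            # Keep up to n lines before the restart, whatever their timestamps
            results.append((line, log_lines[max(0, i - n):i]))
    return results


def read_log(path):
    """Read a log file into a list of lines, replacing undecodable bytes."""
    with open(path, errors="replace") as f:
        return f.read().splitlines()
```

For example, loop over lines_before_restarts(read_log("/var/log/system.log"), n=10) and print each restart line followed by its context lines.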

Roger

Feb 6, 2007 5:52 AM in response to Community User

Roger,

Thanks for your response. I looked at panic.log, and it has no entries for the times the server went down. Here is the information from the system.log files:

Jan 31 10:57:13 ProductionServer slpd: SR: service: afp request from 192.168.1.144
Jan 31 10:57:13 ProductionServer slpd: SR: service: nfs request from 192.168.1.144
Jan 31 10:57:14 ProductionServer slpd: SR: service: smb request from 192.168.1.144
Jan 31 10:57:14 ProductionServer slpd: SR: service: ftp request from 192.168.1.144
//server rebooted
Jan 31 11:04:33 ProductionServer syslogd: restart
Jan 31 11:04:33 ProductionServer mach_kernel: standard timeslicing quantum is 10000 us
Jan 31 11:04:33 ProductionServer mach_kernel: vm_page_bootstrap: 510211 free pages
-- --- -- --- --

Feb 2 16:30:00 ProductionServer CRON[4376]: (admin) CMD ("/Library/Application Support/Norton Solutions Support/Scheduler/schedLauncher" -n 0 "/Library/Application Support/Norton Solutions Support/Norton AntiVirus/scheduledScanner" " " "NVsi" "SCae" "path" "/Volumes/Production")
Feb 2 16:30:02 ProductionServer -n[4376]: Try #0, for , result = 0
Feb 2 16:30:02 ProductionServer -n[4376]: launch for id = 0 event = NVsi result = 0
Feb 2 16:30:05 ProductionServer /usr/libexec/fix_prebinding: /Library/Application Support/Norton Solutions Support/Scheduler/schedLauncher could not be launched prebound.
Feb 2 16:30:06 ProductionServer /usr/libexec/fix_prebinding: /Library/Application Support/Norton Solutions Support/Scheduler/schedLauncher couldn't be prebound in the past, and probably can't be prebound now.
Feb 2 16:30:06 ProductionServer /usr/libexec/fix_prebinding: 2007-02-02 16:30:06 -0500: prebinding for schedLauncher done.
Feb 2 16:30:37 ProductionServer slpd: SR: service: afp request from 192.168.1.53
Feb 2 16:30:37 ProductionServer slpd: SR: service: smb request from 192.168.1.53
Feb 2 16:30:37 ProductionServer slpd: SR: service: nfs request from 192.168.1.53
Feb 2 16:30:37 ProductionServer slpd: SR: service: ftp request from 192.168.1.53
Feb 2 16:30:44 ProductionServer slpd: SR: service: afp request from 192.168.1.53
Feb 2 16:30:44 ProductionServer slpd: SR: service: smb request from 192.168.1.53
Feb 2 16:30:44 ProductionServer slpd: SR: service: nfs request from 192.168.1.53
Feb 2 16:30:44 ProductionServer slpd: SR: service: ftp request from 192.168.1.53
Feb 2 16:32:00 ProductionServer slpd: SR: service: afp request from 192.168.1.144
Feb 2 16:32:00 ProductionServer slpd: SR: service: smb request from 192.168.1.144
Feb 2 16:32:00 ProductionServer slpd: SR: service: nfs request from 192.168.1.144
Feb 2 16:32:00 ProductionServer slpd: SR: service: ftp request from 192.168.1.144
Feb 2 16:32:06 ProductionServer slpd: SR: service: afp request from 192.168.1.144
Feb 2 16:32:06 ProductionServer /usr/libexec/fix_prebinding: fix_prebinding quitting for now.
Feb 2 16:32:07 ProductionServer slpd: SR: service: smb request from 192.168.1.144
Feb 2 16:32:07 ProductionServer slpd: SR: service: nfs request from 192.168.1.144
Feb 2 16:32:07 ProductionServer slpd: SR: service: ftp request from 192.168.1.144
//server hung
Feb 5 09:08:32 ProductionServer syslogd: restart

-----
I've also seen these messages a few times, but not when the machine crashed:
Jan 30 03:15:02 ProductionServer sendmail[16188]: My unqualified host name (ProductionServer) unknown; sleeping for retry
Jan 30 03:16:02 ProductionServer sendmail[16188]: unable to qualify my own domain name (ProductionServer) -- using short name
Jan 30 03:16:02 ProductionServer sendmail[16188]: NOQUEUE: SYSERR(root): /etc/mail/sendmail.cf: line 93: fileclass: cannot open '/etc/mail/local-host-names': Group writable directory

Don't know if any of this is really useful.

Michael

Feb 6, 2007 4:21 PM in response to mymacsolutions

Interesting logs. What machine is 192.168.1.144? Is it a client Mac, another server, or an appliance? Both of those excerpts show slpd (the Service Location Protocol daemon) logging those requests. It's interesting to me that slpd would be taking "requests" from another machine. The man page makes it sound like slpd is just another blathering daemon causing unnecessary network traffic. That said, when I checked the man page on my own server, I found out that slpd is running there too. I've never seen a log message from it, FWIW.

Norton's history with OS X hasn't been too great. I'll admit that I haven't seen posts about Norton recently, so they might have gotten their act together, but I'm still the suspicious type. Since you're running 10.2.8, I'd be REALLY suspicious of Norton.

One other observation: both of those hangs were around the half-hour. Is Norton kicking off every half hour? I'm not anti-Norton, but I am suspicious.

The sendmail messages, at worst, would keep the email system from working. They don't look serious enough to hang a machine.

Roger

Feb 7, 2007 5:46 AM in response to DaddyPaycheck

DaddyPaycheck,

1.144 is a 10.4 box running Retrospect. It has all of the servers mounted and backs up the shares daily.

1.53 is a 10.4 box running Acrobat Distiller. It checks a watch folder every few minutes.

Norton: ah, I was reluctant to post anything that would show NAV. I don't like the product myself, but here we are. It was scheduled to run at 4:30 on the local machine. That could explain the crash at 4:30, but not the other times. I have not yet been able to find the uninstaller on the Symantec site. The version is the 9.x client.

Michael




Feb 7, 2007 6:50 AM in response to mymacsolutions

mymacsolutions-

No worries. The important thing, IMHO, is that you are getting the situation addressed.

I am not particularly against any software unless it causes particular problems.

At least disable NAV for a while and see if that mitigates the problem. Check your Retrospect script and see if there is something about what to do if a client can't be found. I haven't used Retrospect in a while, but there is a bell clamoring in my head saying it is worth a mention...

As you may have discovered, you can run into problems running Distiller that way. Is there any reason Distiller isn't running on the Xserve itself? It might be worth digging into a little deeper.

Luck-

-DaddyPaycheck

Feb 7, 2007 4:34 PM in response to mymacsolutions

Since Norton is being started by cron, its entry will appear either in /etc/crontab or in the output of crontab -l -u admin. Either way, it's simple enough to remove the entry and send cron a HUP signal to stop Norton from running.
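If you'd rather disable the Norton entry than delete it outright, here is a minimal Python sketch that comments out every active crontab line mentioning a given string. It assumes you have saved the crontab text to a file first; the function name disable_matching_entries and the "#DISABLED# " marker are hypothetical, chosen so the lines are easy to find and restore later.

```python
def disable_matching_entries(crontab_text, needle="Norton"):
    """Comment out every active crontab line that mentions `needle`.

    Works on the plain text of /etc/crontab or of `crontab -l -u admin`
    output. Lines that are already comments are left untouched.
    """
    out = []
    for line in crontab_text.splitlines():
        stripped = line.lstrip()
        if needle in line and stripped and not stripped.startswith("#"):
            # Hypothetical marker so disabled jobs are easy to grep for later
            out.append("#DISABLED# " + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

For a user crontab, save the output of crontab -l -u admin to a file, run it through the function, and reinstall the result with crontab -u admin followed by the file name. For /etc/crontab, write the result back and HUP cron.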

I have to admit that I'm curious why the Retrospect machine is making so many requests during business hours. A backup machine like that usually only makes its presence known in the wee hours of the morning.

Reseating the RAM may be taking care of the flashing lights, but if you have an overheating problem, the few minutes without power and with the case open may be letting things cool down enough to work again. I'd keep an eye on the temperatures and fans in ServerMonitor.app to make sure they're not running too close to their maximums.

With all the periodic and cron jobs you have running, noting the times of these problems and knowing exactly when those jobs run may point to what's actually causing the problems.
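As a rough way to line the two up, here is a Python sketch that takes crontab lines and the time of the last log entry before a hang (for example Feb 2 16:32:07 from the excerpt above) and lists the jobs that would have fired in the preceding window. Assumptions: the standard five time fields using numbers, *, ranges, comma lists, and */n steps; no month or day names and no @-shortcuts; syslog omits the year, so you supply it. For /etc/crontab lines, which carry a user column, the user name simply stays at the front of the reported command. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_field(field, lo, hi):
    """Expand one cron time field (e.g. "30", "*/15", "1-5", "0,30") to a set."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/", 1)
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            a, b = part.split("-", 1)
            start, end = int(a), int(b)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values


def jobs_fired_before(crontab_lines, event, window_minutes=30):
    """List (time, command) for cron jobs scheduled in the window before event."""
    hits = []
    for raw in crontab_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # e.g. SHELL=/bin/sh or other environment settings
        try:
            minutes = parse_field(fields[0], 0, 59)
            hours = parse_field(fields[1], 0, 23)
            days = parse_field(fields[2], 1, 31)
            months = parse_field(fields[3], 1, 12)
            # cron allows 0 or 7 for Sunday; normalize both to 0
            weekdays = {d % 7 for d in parse_field(fields[4], 0, 7)}
        except ValueError:
            continue  # names like "mon" or @-shortcuts are not handled here
        dom_star = fields[2].startswith("*")
        dow_star = fields[4].startswith("*")
        command = fields[5]

        # Walk backwards one minute at a time from the event
        t = event.replace(second=0, microsecond=0)
        for _ in range(window_minutes + 1):
            cron_dow = (t.weekday() + 1) % 7  # Python: Monday=0; cron: Sunday=0
            if dom_star or dow_star:
                day_ok = t.day in days and cron_dow in weekdays
            else:
                # Classic cron: if both day fields are restricted, either may match
                day_ok = t.day in days or cron_dow in weekdays
            if t.minute in minutes and t.hour in hours and t.month in months and day_ok:
                hits.append((t, command))
            t -= timedelta(minutes=1)
    return hits
```

For example, jobs_fired_before(["30 16 * * * schedLauncher"], datetime(2007, 2, 2, 16, 32, 7)) returns [(datetime(2007, 2, 2, 16, 30), "schedLauncher")], matching the 4:30 Norton run just before the Feb 2 hang.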

Roger
