Something's wrong there. My gut says hardware. When it shuts down, is it graceful or pow-and-off?
Did you enable any other services when you attached the RAID? Are you doing anything radically different?
Odd that nothing is in the panic log. Do you have redundant power supplies? One of those heading south and no backup PS could cause an immediate shutdown.
Memtest is good for rooting out errors, but you should run it at least 10 times or so just to be sure.
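That repeat-until-failure loop is easy to script. A minimal Python sketch; the commented-out `memtest` invocation at the bottom is a hypothetical example, so check your memtest build for its real arguments:

```python
import subprocess

def run_passes(cmd, passes=10):
    """Run `cmd` repeatedly, stopping at the first non-zero exit.

    Returns the number of clean passes completed, so a return value
    equal to `passes` means every run succeeded."""
    for done in range(passes):
        if subprocess.run(cmd).returncode != 0:
            return done
    return passes

# Hypothetical usage -- verify the flags against your memtest install:
# run_passes(["sudo", "memtest", "all", "1"], passes=10)
```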
Something's wrong there. My gut says hardware. When it shuts down, is it graceful or pow-and-off?
I'm not sure of the exact answer. We were testing the migration to the Xserve and RAID when suddenly we noticed the network connection to the server was gone. I went to the server room; the display on the Xserve was off and the two RAIDed system drives were busy for a couple of minutes, then the system rebooted.
Did you enable any other services when you attached the RAID? Are you doing anything radically different?
I think the RAID is the only change. I've been reading up on how to do Open Directory, DHCP, DNS and that stuff, but I don't know enough yet to actually do them.
Odd that nothing is in the panic log. Do you have redundant power supplies? One of those heading south and no backup PS could cause an immediate shutdown.
The Xserve does not have redundant power. Is that an upgrade I can do myself?
Memtest is good for rooting out errors, but you should run it at least 10 times or so just to be sure.
I just started a 10 times test.
Later today I'm going to test taking the RAID out of the configuration. We are trying to replace two FireWire drives connected to one of our systems with shares on the RAID so everyone can access the files, so I plan to connect those FireWire drives to the Xserve and share them from there. Seems like a good test.
Any chance of idiot-interference (somebody hit the button?) on the power down? Is the server room on UPS? I would cover that before anything else.
Any chance your AC supply side got overloaded? A RAID and an Xserve are a bit of a load, and your circuit should be rated for it.
I am fairly certain that you can purchase the power supplies separately. Plug and pray, I would guess, but I haven't done this myself.
Go slow when connecting peripherals. Make sure everything is working correctly before proceeding to the next step. Consider troubleshooting this problem first: find the cause, and make sure things are stable before going to the next step.
Any chance of idiot-interference (somebody hit the button?) on the power down? Is the server room on UPS? I would cover that before anything else.
UPS, yes. Chance of button-pressing? No.
Any chance your AC supply side got overloaded? A RAID and an Xserve are a bit of a load, and your circuit should be rated for it.
We've got massive power (relatively speaking) running into our "server room". We just added a 30-amp circuit to connect the rack-mounted UPS we got for the RAID. Thinking back, I hadn't yet moved the RAID to the new UPS; it was still connected to the two existing UPSs. They aren't showing overload on their indicators, but they're at about 60% load. When we next test the RAID it will certainly be on its own UPS.
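The headroom arithmetic is easy to get wrong in the heat of the moment. A quick sketch; every VA figure below is a made-up placeholder, not this setup's actual hardware:

```python
def ups_headroom_va(capacity_va, load_pct, extra_va):
    """Remaining volt-amps on a UPS already at load_pct% load
    after adding a device drawing extra_va."""
    in_use = capacity_va * load_pct / 100.0
    return capacity_va - in_use - extra_va

# e.g. a (hypothetical) 1500 VA unit at 60% load has 600 VA spare,
# so adding a 500 VA device leaves only 100 VA of headroom.
```

A negative result means the added device would push the UPS past its rating.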
I am fairly certain that you can purchase the power supplies separately. Plug and pray, I would guess, but I haven't done this myself.
I'll look into it.
Go slow when connecting peripherals. Make sure everything is working correctly before proceeding to the next step. Consider troubleshooting this problem first: find the cause, and make sure things are stable before going to the next step.
I've been trying to do just that; it's part of the reason the Xserve RAID came a couple of months after the Xserve. I should have tested the FireWire drives on the Xserve beforehand; that was an oversight.
I'll post more later today.
More data points, though I may have to get on the horn with Apple support. I am interested in any more thoughts people have, of course.
This morning, as suggested, I ran memtest in a loop of 10. No problems reported.
Later in the day we moved the two FireWire drives we're trying to replace with the Xserve/Xserve RAID over to the Xserve, disconnected the RAID, and shared the two FireWire drives on the network. The user of these drives mounted them and ran the test programs that use the files on them. Within a couple of minutes the Xserve was dead again: the network connection to the system was gone, and a few minutes after that the Xserve rebooted itself.
Still no panic.log file.
I found it fairly interesting that the crashing didn't seem to be RAID-related. I mounted the RAID again and turned on sharing for the folders I need to access. However, I cannot now get to the Xserve over the network at all!
I cannot get from other machines to the Xserve, either via its .local Bonjour name or via its IP address. From the Xserve I cannot get out to computers on the LAN, and I cannot access web sites outside our network either. I also cannot see the RAID via RAID Admin.
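When both the .local name and the raw IP fail, it helps to separate name-resolution failures from plain unreachability. A small sketch; the hostname and the AFP port in the usage comment are illustrative assumptions:

```python
import socket

def probe(host, port, timeout=2.0):
    """Classify a TCP connection attempt as 'dns-failure',
    'unreachable', or 'ok'."""
    try:
        # Resolution step: a .local/Bonjour name failing here is a
        # different problem than an IP that resolves but won't answer.
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return "dns-failure"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        return "unreachable"

# probe("xserve.local", 548)  # 548 = AFP; expect 'ok' on a healthy server
```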
This makes me question the Ethernet controller. It seems dead now; maybe until now it was only almost dead?
I'm still going to call Apple, but I discovered something else just now:
This morning I had switched the Xserve to a "Link Aggregate" Ethernet configuration, which I read about in my handy Administering OS X Server book. It was fine until this last reboot. In my attempts to bring life back to the system I turned off the Link Aggregate port and turned Ethernet 1 back on, but that didn't work (as I posted). I had to delete the Link Aggregate configuration (not just disable it), and then network life returned to the system.
But still, I don't think that our demands should be rebooting an Xserve.
I still haven't called Apple, I keep looking for clues.
Latest: I have been able to reproduce the server crash on my own, so I've been letting it happen as often as I can to gather more clues, and I think I have a good one...
As this particular test script of mine runs, the amount of wired memory grows until it practically takes up everything. In fact the server crashed between 5 and 10 seconds after I took the above screenshot.
So more RAM is needed, I'd say? Am I in danger of filling up 4GB just as easily as we're filling up 2GB?
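If it's the wired figure that climbs, it can also be watched from the shell with `vm_stat` instead of Activity Monitor. A minimal parsing sketch, assuming `vm_stat`'s textual output format and the 4 KiB page size of this era of OS X:

```python
import re

PAGE_SIZE = 4096  # assumed: vm_stat on this era of OS X counts 4 KiB pages

def wired_bytes(vm_stat_output):
    """Pull the 'Pages wired down' count out of `vm_stat` output
    and convert it to bytes."""
    m = re.search(r"Pages wired down:\s+(\d+)", vm_stat_output)
    if m is None:
        raise ValueError("no 'Pages wired down' line found")
    return int(m.group(1)) * PAGE_SIZE

# On the server you might poll subprocess.check_output(["vm_stat"])
# every few seconds and log wired_bytes() to watch the growth.
```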
Wow, pretty conclusive evidence of the problem. If a single app is using this much memory, it should be obvious in the Activity Monitor above. Sort by memory usage and see what bubbles to the top.
It will probably be Apple File Server or something opaque like kernel_task, but you never know. Grabbing samples may or may not help analysis; I think selecting the thread and "inspecting" it is the best you can do with the regular tools. Apple's dev tools have several interesting debugging tools, Big Top and Shark for example, that you can attach to virtually any running process to see where it is spending its time and resources. I've never tried attaching them to system processes, but they are technically no different from regular processes.
Out of interest, how do you cause the problem? Is it just overloading it, or is it something specific? I was thinking that you could hit the server and see if it recovers over time... any process that eventually swallows all memory always suggests memory leakage to me. If you could watch a process's memory increment every time you do X, that is pretty clear evidence of a leak. On the PPC, AFP now seems to free memory far better than it ever did before the Intel versions came out. They clearly dug deeply into the guts of AFP for the Intel port and the PPC side gained some benefits, but perhaps there is still a missed "free" on the Intel side...
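That watch-it-increment test is easy to mechanize: sample the suspect process's resident size after each repetition of the trigger and check whether it only ever grows. A sketch of the check itself (how you collect the samples, e.g. from `ps` or Activity Monitor, is up to you):

```python
def looks_like_leak(samples):
    """True if successive memory readings only ever grow.

    `samples` are resident-size readings taken after each repetition
    of the suspect operation; a well-behaved process should plateau
    (or shrink) at some point, not climb on every single repetition."""
    return len(samples) >= 2 and all(
        later > earlier for earlier, later in zip(samples, samples[1:]))
```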
I had the same thought to see who is using up all my memory...
[screenshot: Activity Monitor process list]
...Doesn't tell me much, though. Certainly none of our programs are running on the Xserve. We have some utilities that we run on our own computers that search for and write files, and these programs don't have memory issues locally. But when we mount the shares on the Xserve and our programs search the shares for files, the Xserve quickly runs out of memory.
Change the pop-up after the search field to either Active Processes or All Processes to get at the system-level stuff, and sort by memory. Clearly it's nothing you're running in your space that's triggering it.
There's no guarantee this will reveal which process it is, because it could be an underlying process, but memory is much easier to trace than CPU consumption, so I would be surprised if you can't find it.
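The same sort-by-memory view is available from the command line, which is handy when the GUI is sluggish or you're logged in remotely. A rough Python sketch over `ps`; the `rss`/`comm` column keywords are standard on both OS X and other Unixes, but double-check on your system:

```python
import subprocess

def top_memory_processes(n=5):
    """Return (rss_kb, command) pairs for the n largest resident sets,
    roughly what Activity Monitor shows sorted by memory."""
    # rss= / comm= suppress the header line so every row is data
    out = subprocess.run(["ps", "axo", "rss=,comm="],
                         capture_output=True, text=True, check=True).stdout
    procs = []
    for line in out.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            procs.append((int(parts[0]), parts[1]))
    procs.sort(reverse=True)  # biggest resident set first
    return procs[:n]
```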
Thanks, didn't know about that pop-up.
[screenshot: Activity Monitor, all processes sorted by memory]
Armed with the confidence that I had finally figured out what was causing the reboots (running out of RAM) and what seemed to be causing the RAM consumption (network file searches from connected clients), I called Apple tech support and provided them with a load of data. Fingers crossed for a fix; our server will remain dormant until it arrives.