Xserve G5 shuts down - hardware error? How to verify?

I am experiencing a strange problem: from time to time (the interval varies between days and weeks, and it has been happening more often recently) the server reboots. And not just the server: the router that the server connects to reboots as well.

At first I suspected a DoS attack, but I increasingly feel that it is a hardware issue.

Below is the notification mail the server sends out. Maybe you can spot something strange there and give me a hint.

I noticed two things:

The system controller ambient temperature is unusually high. Could this be the reason why the server restarts? Are the sensors connected with cables that I could check? The number itself looks strange, though, like something "out of range": 65536/1024 = 64.

Is it possible to turn off rebooting in case of temperature problems?

But then why is the router affected as well? I also suspect that the UPS is malfunctioning, but it has no way to check itself for that. Anyway, I have removed the UPS and will see how it turns out.

How can I check the hardware of the server? I know that there is diagnostic software on the system DVD... can I run it without booting from the DVD? Are there any other tools (besides Apple's own) that could help solve this issue?
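One way to test the "out of range" idea is to ask what the reported 65535.50 C would be if it were really a small negative number that wrapped around. The sketch below assumes (this is a guess, not something the report or Apple documentation confirms) that the sensor value is a 32-bit 16.16 fixed-point number that was printed as unsigned; the function name is made up for illustration.

```python
def reinterpret_as_signed_16_16(reported_value: float) -> float:
    """Re-read a value as signed, ASSUMING it was an unsigned 16.16 fixed-point number."""
    # Recover the hypothetical raw 32-bit register contents.
    raw = round(reported_value * 65536) & 0xFFFFFFFF
    # Two's-complement: values with the top bit set are negative.
    if raw >= 0x80000000:
        raw -= 0x100000000
    return raw / 65536


print(reinterpret_as_signed_16_16(65535.50))  # the suspicious reading
print(reinterpret_as_signed_16_16(27.00))     # a normal reading is unchanged
```

Under that assumption the reading would be -0.5 C, which is still not a believable ambient temperature inside a running server, so either way the value looks like a sensor or readout fault rather than real heat.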

Any help is appreciated.

Here is the mail:

"Reason for notification: Power"

OS version : Mac OS X Server 10.4.11 (8S169)
Processor : 1 x 2000 MHz
Memory : 1024 MB
BootROM : $0005.17f2
Serial : CK638000WD2

Memory:
Memory Slot "DIMM0/J11" : 512MB, ECC DDR SDRAM, PC3200U-30330
Memory Slot "DIMM1/J12" : 512MB, ECC DDR SDRAM, PC3200U-30330

Drives:
Drive 1 (disk0) : Normal

Network:
en0 Normal
fw0 (inactive) : Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal
en0 Normal

Sensors:
CPU 1 Power : 19.33 watts
System Controller Ambient : 65535.50 C
CPU 1 Inlet : 27.00 C
System Controller Internal : 47.00 C
Behind the DIMMs : 39.00 C
CPU 1 Ambient : 33.00 C
Between the Processors : 30.00 C
PCI Slots : 33.50 C
CPU 2 Inlet : 26.50 C
DDR IO : 2.63 volts
CPU 1 Internal : 44.79 C
1.5v : 1.49 volts
3.3v Trickle : 3.28 volts
5v : 5.07 volts
CPU 1 12v : 3.55 amps
12v Trickle : 12.00 volts
1.2v : 1.18 volts
1.8v : 1.81 volts
System Bus IO VDD : 1.47 volts
CPU 1 Vcore : 1.27 volts
DDR IO Sleep : 2.63 volts
3.3v : 3.29 volts
3.3v Sleep : 3.27 volts
CPU 1 Core : 15.26 amps
5v Sleep : 5.10 volts
12v : 12.00 volts
1.2v Sleep : 1.18 volts
System Controller Vcore : 1.70 volts
1.5v Sleep : 1.52 volts

Controls:
Clock Slew :
CPU A1 : 4002 RPMs
CPU A2 : 3990 RPMs
CPU A3 : 4031 RPMs
CPU B1 : 4013 RPMs
CPU B2 : 4027 RPMs
CPU B3 : 3996 RPMs
System Controller : 5095 RPMs
PCI Slot : 3603 RPMs

Security:
Enclosure Intrusion : No

============================================================

Posted on Aug 1, 2009 3:31 PM

Aug 1, 2009 5:58 PM in response to max_shawn

The system controller ambient temperature certainly does look odd, and points towards a faulty sensor, but I don't think that's your problem.

The beginning of the email warns about a Power problem. Since your router also resets I'm going to guess that you have a power voltage drop that causes everything to reboot. Removing the UPS (at least for testing) may validate this.

The other reason I think it's power related is because, unless you have a very strange setup, there's nothing happening on the server that could affect the router. Power is about the only common link.

As for the diagnostics DVD, it requires that you boot from it in order to run the tests. That's because it needs the machine in a known state and if the machine is active it cannot tell if measurements outside of normal are caused by a hardware fault or by some updated OS version, or some process that you're running on the server.

Aug 2, 2009 2:54 AM in response to Camelot

I removed the UPS for testing... do you have a recommendation for a UPS that monitors itself and is not very expensive? It just has to take the load of two servers, one router and a WLAN, so I think that 800 VA is sufficient.

Are there other hardware diagnostic tools that, for example, can run in the background and perform tests from time to time, warning me that something is likely to go wrong soon (like SMART for hard disks)?

By the way, a general question: can a high temperature cause the server to reboot?

Aug 2, 2009 10:16 AM in response to max_shawn

800VA is barely sufficient for one server, but certainly not two.

At 800VA, a UPS is able to sustain about 500 watts continuous.

Under full load, the XServe will pull about 400 watts. It may pull less than that under light load, but that's the number you should factor in when evaluating a UPS. And bear in mind, running that 800VA UPS with a 400 watt load will just about give the machine time to shut down cleanly in the event of a power outage. It certainly won't give you much run time. And that's if it's connected to the server and can signal a shutdown - it's not enough time if you have to run down the hall (or drive to the office) to shut down the machine cleanly.

Given that you're running two servers, your router and a WLAN base station (and presumably a switch, too, since it doesn't make sense to have the above equipment on a UPS if the switch they all use to talk through is down due to power), you need a UPS that's at least 2000VA, or preferably 3000VA.

Therefore I'm 99% certain that your problem is caused by the UPS draw being too high, causing brownouts and restarting your equipment.
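The arithmetic behind that estimate can be sketched as follows. The 800 VA, 500 watt and 400 watt figures are the ones quoted above; the draws for the router, WLAN base station and switch are hypothetical placeholders, since the thread gives no numbers for them.

```python
# Figures quoted in the post above.
UPS_VA = 800
UPS_WATTS = 500            # "about 500 watts continuous"
XSERVE_FULL_LOAD_W = 400   # per server, under full load

# ASSUMED draws for the small devices; not from the thread.
ASSUMED_SMALL_DEVICES_W = {"router": 20, "WLAN base station": 10, "switch": 30}


def watts_to_va(watts: float, power_factor: float) -> float:
    """Convert a load in watts to the VA rating needed to carry it."""
    return watts / power_factor


# Ratio implied by "800VA sustains about 500 watts".
power_factor = UPS_WATTS / UPS_VA

total_watts = 2 * XSERVE_FULL_LOAD_W + sum(ASSUMED_SMALL_DEVICES_W.values())
required_va = watts_to_va(total_watts, power_factor)

print(f"Implied power factor: {power_factor}")
print(f"Full-load draw: {total_watts} W, needs at least {required_va:.0f} VA")
print(f"800 VA unit sufficient: {required_va <= UPS_VA}")
```

With these numbers the full-load draw alone is well beyond what an 800VA unit can sustain, and that is before leaving any margin for run time on battery.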

Aug 2, 2009 11:10 AM in response to Camelot

When looking at a LIPS (a less-interruptible power supply; they've never truly been "uninterruptible", after all), there are two factors to consider: wattage, and reserve power. These are how much power can be supplied continuously, and how long that power can be supplied, respectively.

The local standard-issue LIPS is an APC Smart-UPS 1500 series. That unit runs an Intel Xserve server box and its few associated (core) giblets nicely; with an LCD display, USB hub, network switch.

With a USB connection to the host, this APC unit is automatically recognized by default by Mac OS X Server 10.5 Leopard Server.

Typical run-time during an outage with the 1500 VA unit is about five to maybe fifteen minutes before the shutdown is triggered, depending on how hard the box is running. (Yes, the upper bound on the run-time is a little murky here. It's (also) definitely subject to the age of the battery in the LIPS.) Typical power outages at the sites where these are deployed are shorter than this range.

With the local preference for (unnecessary?) redundancy, it's typical to run parallel LIPS; one per box. That consistency also means fewer (different) spares need be stocked. Though with the Intel Xserve with dual power supplies configured, it's also feasible to run a pair of larger LIPS and to split them across some number of servers to gain some economy.

Aug 3, 2009 12:21 PM in response to MrHoffman

Do you know if APC's SmartUPS 1500 series has a self-monitoring system? How reliable is this?

In fact, I already had my eye on this UPS, because it's within our budget, its batteries can be replaced by the user, and it comes from a reputable company.

Our current one, which really seems to have been the cause of all the trouble over the last few weeks, was a Belkin Universal UPS, which was not able to warn of upcoming issues... so in the end the UPS caused more problems than it solved.

Aug 3, 2009 5:14 PM in response to max_shawn

Do you know if APC's SmartUPS 1500 series has a self-monitoring system? How reliable is this?

The SmartUPS 1500 series of rackmount units are adequate to power one XServe for at least one hour off of the battery. Usually they will run at 25% load for a moderately busy server. They use a standard size 12v sealed lead-acid battery that can be replaced without buying a new battery pack.

To manage these units you should buy one of the network management cards, which have configurable alerting, testing, and status reporting through the built-in web interface. The units are automatically recognized by the 'Energy Saver' System Prefpane, where you can configure the server to shut down under a limited set of parameters that it will read from the UPS via the USB port. The units run about $500.00US and the network cards are around $250.00US.

The units are very reliable, depending on the environment that they are in. If you don't have them in a temperature-controlled environment, the battery pack can be subject to damage if its internal temperature exceeds 120°F. Temperature is also a way to know if the batteries are starting to fail: the internal battery temperature starts to rise and stays 10°F to 20°F above ambient.
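That rule of thumb can be written down as a small check. This is only a sketch: the function name and status strings are made up for illustration, the readings are assumed to be in °F and collected by whatever means you have (for example, copied from the management card's status page), and "maintains" is interpreted here as every reading in the window being at least 10°F above ambient.

```python
MAX_BATTERY_TEMP_F = 120.0  # damage threshold mentioned above
WARN_DELTA_F = 10.0         # sustained rise above ambient that hints at wear


def battery_status(battery_temps_f, ambient_temps_f):
    """Classify a window of paired battery/ambient readings (hypothetical helper)."""
    if not battery_temps_f or len(battery_temps_f) != len(ambient_temps_f):
        raise ValueError("need equally long, non-empty lists of readings")
    # Any reading above the damage threshold is the most urgent condition.
    if max(battery_temps_f) > MAX_BATTERY_TEMP_F:
        return "overheated"
    # A rise that holds across the whole window suggests a failing battery.
    deltas = [b - a for b, a in zip(battery_temps_f, ambient_temps_f)]
    if all(d >= WARN_DELTA_F for d in deltas):
        return "possible battery wear"
    return "ok"


print(battery_status([85, 86, 88], [72, 72, 72]))
print(battery_status([75, 76, 75], [72, 72, 72]))
```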

Aug 3, 2009 5:31 PM in response to max_shawn

Do you know if APC's SmartUPS 1500 series has a self-monitoring system?

Yes; it does.

You can either use the USB connection (which is what is in use locally), or you can use the add-on NIC.

How reliable is this?

I've installed a number of these units for customers, swapped batteries on various of the units, and not had particular issues with the devices.

Some related power-monitoring information: http://labs.hoffmanlabs.com/node/1045
