17 Replies Latest reply: Mar 9, 2007 3:52 PM by DaddyPaycheck
Wes Plate Level 4 (2,845 points)
We added an Xserve before the end of the year (2006), and we added an Xserve RAID to it just this last week.

During the period before the RAID was installed, the Xserve was used lightly as a file server, mostly just enough for me to move some files around the office and start to get to know OS X Server.

With the RAID attached we've started to try using the system as we intend, which is to have a large common file repository where common files can be shared out to the several clients on the GigE network.

So far the results have been disappointing: the Xserve has shut down and rebooted itself a few times during periods of active use. I searched this forum and found a suggestion to run memtest, and I have that tool running right now on the server. The system has 2GB of RAM installed.

Is RAM a good place to start with crashes that seem to be related to network load? How else can I troubleshoot?

I've not yet called Apple; I hoped to get more info before waiting in the phone queue.

MacBook Pro, Dual Core 2.16, Mac OS X (10.4.8), Keyboard protected with leather
  • Roger Smith3 Level 6 (13,475 points)
    If the machine is panicking and rebooting, what does /Library/Logs/panic.log say?

    Roger
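
A minimal sketch of that check, assuming a Python 3 interpreter is available on the server. The helper name `find_panic_logs` is made up for illustration; only the /Library/Logs location comes from the post above. It lists anything in the log folder whose name contains "panic", so a differently named panic file would not be missed.

```python
import os


def find_panic_logs(log_dir="/Library/Logs"):
    """Return sorted paths of entries in log_dir whose name contains 'panic'.

    Hypothetical helper: the default folder is the one named in the post.
    Returns an empty list if the folder is missing or holds no match.
    """
    if not os.path.isdir(log_dir):
        return []
    return sorted(
        os.path.join(log_dir, name)
        for name in os.listdir(log_dir)
        if "panic" in name.lower()
    )
```

Calling `find_panic_logs()` right after an unexpected reboot and getting an empty list would suggest the machine did not record a kernel panic, which is itself a useful clue.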
  • Wes Plate Level 4 (2,845 points)
    I don't know if it is panicking and rebooting; I have just noticed the rebooting. But anyway, there is no panic.log file in /Library/Logs/.

    I ran memtest twice, once with 3 iterations and once with 5, and both times it completed without problems.
  • DaddyPaycheck Level 6 (16,035 points)
    Wes Plate-

    Something wrong there. My gut says hardware. When it shuts down is it graceful or pow and off?

    Did you enable any other services when you attached the RAID? Are you doing anything radically different?

    Odd that nothing is in the panic log. Do you have redundant power supplies? One of those heading south and no backup PS could cause an immediate shutdown.

    Memtest is good for rooting out errors, but you should run it at least 10 times or so just to be sure.

    Luck-

    -DaddyPaycheck
  • Wes Plate Level 4 (2,845 points)
    Wes Plate-

    Something wrong there. My gut says hardware. When it
    shuts down is it graceful or pow and off?


    I'm not sure of the exact answer. We were testing the migration to the Xserve and RAID when suddenly we noticed the network connection to the server was gone. I went to the server room; the display on the Xserve was off and the two RAIDed system drives were busy for a couple of minutes, then the system rebooted.


    Did you enable any other services when you attached
    the RAID? Are you doing anything radically
    different?


    I think the RAID is the only change. I've been reading up on how to do Open Directory, DHCP, DNS and that stuff, but I don't know enough yet to actually set them up.


    Odd that nothing is in the panic log. Do you have
    redundant power supplies? One of those heading south
    and no backup PS could cause an immediate shutdown.


    The Xserve does not have redundant power. Is that an upgrade I can do myself?


    Memtest is good for rooting out errors, but you
    should run it at least 10 times or so just to be
    sure.


    I just started a 10 times test.



    Later today I'm going to test with the RAID taken out of the configuration. We are trying to replace two FireWire drives connected to one of our systems with shares on the RAID so everyone can access the files, so I plan to connect those FireWire drives to the Xserve and then share them from there. Seems like a good test.
  • DaddyPaycheck Level 6 (16,035 points)
    Wes Plate-

    Any chance of idiot-interference (somebody hit the button?) on the power down? Is the server room on UPS? I would cover that before anything else.

    Any chance your AC supply side got overloaded? A RAID and an X is a bit of a load and your circuit should be rated for that load.

    I am fairly certain that you can purchase the power supplies separately. Plug and pray I would guess but I haven't done this myself.

    Go slow when connecting peripherals. Make sure everything is working correctly first before proceeding to the next step. Consider troubleshooting this problem first, find the cause, and then make sure things are stable before going to the next step.

    Luck-

    -DaddyPaycheck
  • Wes Plate Level 4 (2,845 points)
    Any chance of idiot-interference (somebody hit the
    button?) on the power down? Is the server room on
    UPS? I would cover that before anything else.


    UPS, yes. Chance of button-pressing? No.


    Any chance your AC supply side got overloaded? A RAID
    and an X is a bit of a load and your circuit should
    be rated for that load.


    We've got massive power (relatively speaking) running into our "server room". We just added a 30 amp circuit to connect the rack-mounted UPS we got for the RAID. Thinking back, I hadn't moved the RAID to the new UPS; it was connected to the two existing UPSs. Those aren't showing overload on their indicators, but they're at about 60%. When we next test the RAID it will certainly be on its own UPS.


    I am fairly certain that you can purchase the power
    supplies separately. Plug and pray I would guess but
    I haven't done this myself.


    I'll look into it.


    Go slow when connecting peripherals. Make sure
    everything is working correctly first before
    proceeding to the next step. Consider troubleshooting
    this problem first, find the cause, and then make
    sure things are stable before going to the next
    step.


    I've been trying to do just this, which is part of the reason the Xserve RAID came a couple of months after the Xserve. I should have tested the FireWire drives on the Xserve earlier; that was an oversight.


    I'll post more later today.
  • Wes Plate Level 4 (2,845 points)
    More data points, though I may have to get on the horn with Apple support. I am interested in any more thoughts people have, of course.

    This morning, as suggested, I ran memtest in a loop of 10. No problems reported.

    Later in the day we moved the two firewire drives we're trying to replace with the Xserve/Xserve RAID onto the Xserve, disconnected the RAID and shared the two Firewire drives on the network. The user of these drives mounted them and ran the test programs that use the files on the Firewire drives. Within a couple of minutes the Xserve was dead again. The network connection to the system was gone, and a few minutes after that the Xserve rebooted itself.

    Still no panic.log file.

    I found it fairly interesting that the crashing didn't seem to be RAID related. I mounted the RAID again and turned on sharing for the folders I need to access. However I cannot now get to the Xserve over the network!

    I cannot get from other machines to the Xserve, either via its .local Bonjour name or via its IP address. From the Xserve I cannot get out to computers on the LAN, and I cannot access web sites outside our network. I also cannot see the RAID via RAID Admin.

    This causes me to question the Ethernet controller. It seems dead now; maybe until now it was only almost dead?
  • DaddyPaycheck Level 6 (16,035 points)
    Wes Plate-

    Those data points certainly point towards the Ethernet circuitry.

    Luck-

    -DaddyPaycheck
  • Wes Plate Level 4 (2,845 points)
    I'm still going to call Apple, but I discovered something else just now-

    This morning I had switched the Xserve to a "Link Aggregate" Ethernet configuration, which I read about in my handy Administering OS X Server book. It was fine until this last reboot. So in my attempts to bring life back to the system I turned off the Link Aggregate port and turned Ethernet 1 back on, but it didn't work (as I posted). I had to delete the Link Aggregate configuration (not just disable it), and then network life returned to the system.

    But still, I don't think that our demands should be rebooting an Xserve.
  • Wes Plate Level 4 (2,845 points)
    I still haven't called Apple; I keep looking for clues.

    Latest: I have been able to reproduce the server crash on my own, so I've been letting it happen as often as I can to find more clues, and I think I have a good one...



    As this particular test script I have runs, the amount of wired memory grows, until it practically takes up everything. In fact the server crashed between 5 and 10 seconds after I took the above screenshot.

    So more RAM is needed, I'd say? Am I in danger of filling up 4GB just as easily as we're filling up 2GB?
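
A rough sketch for logging wired memory from Terminal while the test script runs, so the last reading before a crash is on record. It assumes Python 3.7 or later on the server and that `vm_stat` prints a "page size of N bytes" header and a "Pages wired down:" line; the function names `wired_megabytes` and `watch` are made up for illustration.

```python
import re
import subprocess
import time


def wired_megabytes(vm_stat_text):
    """Return wired memory in MB parsed from vm_stat output, or None.

    Assumes the output contains 'page size of N bytes' and
    'Pages wired down: M.' as described in the lead-in.
    """
    size = re.search(r"page size of (\d+) bytes", vm_stat_text)
    wired = re.search(r"Pages wired down:\s+(\d+)", vm_stat_text)
    if not (size and wired):
        return None
    return int(wired.group(1)) * int(size.group(1)) / (1024 * 1024)


def watch(interval_s=5):
    """Print a timestamped wired-memory reading every interval_s seconds."""
    while True:
        out = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout
        print(time.strftime("%H:%M:%S"), wired_megabytes(out), "MB wired", flush=True)
        time.sleep(interval_s)
```

Running `watch()` with its output redirected to a file on another machine's share (or simply over ssh) would keep the readings even if the server goes down mid-test.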
  • Tod Kuykendall Level 4 (2,270 points)
    Wow, pretty conclusive evidence of the problem. If a single app is using this much memory it should be obvious in the Activity Monitor above. Sort by memory usage and see what bubbles to the top.

    It will probably be Apple File Server or something opaque like kernel_task, but you never know. Grabbing samples may or may not help analysis. I think selecting the thread and "inspecting" is the best you can do with the regular tools. Apple's Dev tools have several interesting debugging tools, BigTop and Shark for example, that you can attach to virtually any running process to see where it is spending its time and resources. I've never tried attaching them to system processes, but they are technically no different from regular processes.

    Out of interest, how do you cause the problem? Is it just overloading it or is it something specific? I was thinking that you could hit the server and see if it recovers over time... any process that eventually swallows all memory always suggests memory leakage to me. If you could watch a thread always increment in memory every time you do X, that is pretty clear evidence of a leak situation. On PPC, AFP now seems to free memory way better than it ever did before the Intel versions came out. They clearly dug deeply into the guts of AFP for the Intel port and the PPC side gained some benefits, but perhaps there is still a missed "free" on the Intel side...

    =Tod
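
To make the "always increments" test concrete, here is a small sketch that takes a series of memory readings (in MB, taken at a fixed interval) and reports whether every reading is higher than the last, plus a crude straight-line estimate of how long until a given amount of RAM is full. Both function names are hypothetical, and the linear extrapolation is only an assumption; real growth may not be steady.

```python
def grows_every_sample(samples):
    """True if each reading is strictly larger than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


def seconds_until_full(samples, interval_s, capacity):
    """Straight-line estimate of seconds until `capacity` is reached.

    Uses only the first and last readings; returns None when there are
    fewer than two readings or the usage is not growing.
    """
    if len(samples) < 2:
        return None
    rate = (samples[-1] - samples[0]) / ((len(samples) - 1) * interval_s)
    if rate <= 0:
        return None
    return max(0.0, (capacity - samples[-1]) / rate)
```

With made-up readings such as `[500, 700, 900, 1100]` taken 10 seconds apart, comparing `seconds_until_full(..., 2048)` with `seconds_until_full(..., 4096)` shows the point: if the growth never levels off, doubling the RAM only delays the crash rather than preventing it.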
  • Wes Plate Level 4 (2,845 points)
    I had the same thought to see who is using up all my memory...



    ...Doesn't tell me much though. Certainly none of our programs are running on the Xserve. We have some utilities that we run on our own computers that search for and write files, and these programs don't have memory issues locally. But when we mount the shares on the Xserve and our programs search the shares for files, the Xserve quickly runs out of memory.
  • Tod Kuykendall Level 4 (2,270 points)
    Wes,

    Change the pop-up after the search field to either Active Processes or All Processes to get at the system level stuff and sort by memory. Clearly it's nothing you're running in your space that's triggering it.

    This is not guaranteed to reveal which process it is, because the culprit could be an underlying process, but memory is much easier to trace than CPU consumption, so I would be surprised if you can't find it.

    =Tod
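
The same "all processes, sorted by memory" view can be had from Terminal. This is a sketch only, assuming the output of `ps -axo pid,rss,comm` (a header line, then PID, resident size in KB, and command per row) has been captured as text; the function name `top_by_rss` is made up. Per-process resident size may not account for memory wired by the kernel itself, so an inconclusive result here does not rule out a leak.

```python
def top_by_rss(ps_text, count=5):
    """Parse `ps -axo pid,rss,comm` output.

    Returns up to `count` (rss_kb, pid, command) tuples, largest first.
    Rows that do not start with two integers (such as the header) are skipped.
    """
    rows = []
    for line in ps_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[1]), int(parts[0]), parts[2]))
    rows.sort(reverse=True)
    return rows[:count]
```

Capturing the `ps` output every few seconds while the clients run their file searches, and comparing the top entries between captures, would show whether any one process is the one growing.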
  • Wes Plate Level 4 (2,845 points)
    Thanks, didn't know about that pop-up.



    Armed with the confidence that I had finally figured out what was causing the reboots (running out of RAM) and what seemed to be causing the RAM consumption (network file searches from connected clients), I called Apple tech support and provided them a load of data. Fingers crossed for a fix; our server will remain dormant until it is fixed.