We have a brand new Mac Pro (12 cores, 64GB of RAM) running OS X Lion Server in a corporate environment. The server runs only the file sharing and software update services, and we have around 40 users who need to be connected over AFP at all times. This company runs 24/7, and we have an Xsan environment using an ATTO Celerity 8Gb 4-channel Fibre Channel card (84EN) along with a 6-port 10Gb Ethernet card. The Ethernet card is configured as a link aggregation bond using ports 1-4. The idea is that clients without fiber cards installed can still reach the SAN over Ethernet through this file server: they connect over AFP to the share, and the share is the SAN volume. It's a single mount point and everyone has read/write access.
The issue is that this machine keeps crashing (multiple times per day) and I cannot find any reason why. Syslog shows nothing of value, and I've called Apple Enterprise Support, who also brought nothing to the table.
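For anyone who wants to check the same thing on a similar box, here is a minimal sketch of how recent panic reports could be listed alongside syslog. It assumes the reports are written as `*.panic` files under `/Library/Logs/DiagnosticReports`; that path and the function name `recent_panic_reports` are assumptions for illustration, so adjust them if your system stores reports elsewhere.

```python
import glob
import os
import time

# Assumed location of kernel panic reports; change this if they live elsewhere.
REPORT_DIR = "/Library/Logs/DiagnosticReports"


def recent_panic_reports(report_dir=REPORT_DIR, max_age_hours=24, now=None):
    """Return (path, age_in_hours) for *.panic files modified within
    max_age_hours, newest first."""
    now = time.time() if now is None else now
    found = []
    for path in glob.glob(os.path.join(report_dir, "*.panic")):
        age_hours = (now - os.path.getmtime(path)) / 3600.0
        if age_hours <= max_age_hours:
            found.append((path, age_hours))
    # Smallest age first, i.e. the most recent report at the top.
    return sorted(found, key=lambda item: item[1])


if __name__ == "__main__":
    for path, age in recent_panic_reports():
        print("%5.1f h ago  %s" % (age, path))
```

If this prints nothing after a crash, the machine is probably hanging rather than panicking, which would at least narrow down where to look.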
We initially had both SMB and AFP file sharing enabled, but as soon as a Windows 7 client connected, the machine went down. So I disabled SMB via Terminal (sudo serveradmin stop smb) and deactivated it for the share point in the Server app. That at least lets the machine stay up for 4-6 hours before crashing again.
This is seemingly the simplest of file sharing setups, and I would have thought that this beast of a machine could handle far more than 40 clients without issue. I'm seeing high CPU usage, which Apple support told me was perfectly normal (around 60% on the kernel_task process and around 55% on the AppleFileServer process). It also appears to use all 64GB of memory: about 60GB shows as inactive, yet the machine is still paging in and out.
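To put numbers on the memory picture, a small sketch like the following can turn `vm_stat` output into GiB figures and expose the pagein/pageout counters. The sample text is made up to mirror the situation described (roughly 60GB inactive) and is not captured from this server; the helper names `parse_vm_stat` and `pages_to_gib` are illustrative.

```python
import re

# Hypothetical vm_stat output, shaped to resemble the situation described.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                         5231.
Pages active:                     427135.
Pages inactive:                 15728640.
Pages wired down:                 123456.
"Translation faults":           12345678.
Pageins:                          123456.
Pageouts:                          12345.
"""


def parse_vm_stat(text):
    """Parse vm_stat text into (page_size_bytes, {counter_name: value})."""
    size_match = re.search(r"page size of (\d+) bytes", text)
    # Fall back to 4096 if the header is missing; this is a guess, not a fact.
    page_size = int(size_match.group(1)) if size_match else 4096
    counters = {}
    for line in text.splitlines():
        # Counter lines look like 'Name:   1234.' (name optionally quoted).
        m = re.match(r'^"?([^":]+)"?:\s+(\d+)\.?\s*$', line)
        if m:
            counters[m.group(1).strip()] = int(m.group(2))
    return page_size, counters


def pages_to_gib(pages, page_size):
    """Convert a page count to GiB."""
    return pages * page_size / float(1024 ** 3)


page_size, counters = parse_vm_stat(SAMPLE)
print("inactive GiB:", pages_to_gib(counters["Pages inactive"], page_size))
print("pageins:", counters["Pageins"], "pageouts:", counters["Pageouts"])
```

Feeding it real `vm_stat` output twice, a few minutes apart, and comparing the `Pageouts` values would show whether the machine is actively paging out during that window or whether the counter is just a total accumulated since boot.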
Virtually all of the clients are running Lion (10.7.4), and the server itself is also on 10.7.4. There are a few Ethernet-connected clients running 10.6, along with two running 10.5. As I mentioned, I disabled SMB, so no Windows computers are connecting to this machine at this time (though it would be nice to get that functionality back if AFP can be stabilized).
None of this makes any sense to me and I'm hoping someone can shed some light on this issue. This company simply cannot be down, especially not multiple times per day. The only way to get things back up and running after a crash is to hard reboot the machine with the power button, since a normal restart or shutdown is not possible. Once the machine comes back up, everything works for a few more hours until it happens again.