NFS drops under heavy load

Question

Hi everyone,

Here's a new problem I've experienced recently and for which I can't find help anywhere.

I have a grid composed of one Mac Pro server and 8 iMac workstations, running the Sun Grid Engine middleware. The server shares a rather large folder via NFS with all nodes in the grid. On some occasions, when I launch a distributed computing task in the grid, every machine gets quite busy exchanging files and crunching numbers. This is especially true of the server, which has the additional tasks of scheduling grid jobs, running the NFS daemon and so on, to the point where its 8 GB of physical memory is almost completely filled. Then, all of a sudden and at apparently random times, the server stops responding to any command and everything crashes.

After a reboot, the system log on the server shows that NFS became unavailable. Here are some excerpts from it:

Nov 20 08:40:57 medics kernel[0]: nfsd send error 32
Nov 20 08:41:40 medics kernel[0]: nfsd send error 35
Nov 20 08:41:52 medics kernel[0]: nfs server medics.crulrg.net:/Volumes/2-MEDICS/MEDICS: not responding
Nov 20 08:41:52 medics KernelEventAgent[59]: tid 00000000 type 'nfs', mounted on '/Network/MEDICS', from 'medics.crulrg.net:/Volumes/2-MEDICS/MEDICS', not responding
...
Nov 20 09:13:26 medics kernel[0]: nfs_connect: socket error 60 for medics.crulrg.net:/sge6_2u3
Nov 20 09:13:26 medics kernel[0]: nfs_connect: socket connect aborted for medics.crulrg.net:/sge6_2u3
Nov 20 09:13:26 medics kernel[0]: nfs server medics.crulrg.net:/sge6_2u3: can not connect, error 60
...
Nov 20 10:04:14 medics /sbin/nfsd[56]: lost UUID for /, was 8E0FAE66-C739-3D34-97EA-159671C915CA, keeping old UUID
Nov 20 10:04:14 medics /sbin/nfsd[56]: exports:3: path contains non-directory or non-existent components: /Volumes/2-MEDICS/MEDICS
Nov 20 10:04:14 medics /sbin/nfsd[56]: exports:3: no usable directories in export entry
Nov 20 10:04:14 medics /sbin/nfsd[56]: exports:3: using fallback (marked offline): /Volumes/2-MEDICS
Nov 20 10:05:16 medics /sbin/nfsd[56]: exports:3: export entry OK (previous errors cleared)
...

(the same messages repeat indefinitely)
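
For anyone reading the excerpts above, the numeric codes can be decoded as a first step. The following is a minimal sketch, assuming the numbers in the nfsd and nfs_connect messages are standard Darwin/BSD errno values from sys/errno.h (the log itself does not say so). The table is hard-coded because the numbering differs on Linux, and the names DARWIN_ERRNO and annotate are only illustrative.

```python
import re

# Assumption: the codes in the log are Darwin/BSD errno values as defined in
# <sys/errno.h> on Mac OS X. They are hard-coded because other systems
# (e.g. Linux) assign different meanings to 35 and 60.
DARWIN_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    35: ("EAGAIN", "Resource temporarily unavailable"),
    60: ("ETIMEDOUT", "Operation timed out"),
}

# Matches "nfsd send error 32", "socket error 60", "can not connect, error 60".
ERROR_RE = re.compile(r"\berror (\d+)\b")


def annotate(line):
    """Return the log line with the errno name appended when it is recognised."""
    match = ERROR_RE.search(line)
    if not match:
        return line
    code = int(match.group(1))
    if code not in DARWIN_ERRNO:
        return line
    name, text = DARWIN_ERRNO[code]
    return f"{line}  [{name}: {text}]"


if __name__ == "__main__":
    sample = [
        "Nov 20 08:40:57 medics kernel[0]: nfsd send error 32",
        "Nov 20 08:41:40 medics kernel[0]: nfsd send error 35",
        "Nov 20 09:13:26 medics kernel[0]: nfs_connect: socket error 60 "
        "for medics.crulrg.net:/sge6_2u3",
    ]
    for entry in sample:
        print(annotate(entry))
```

Read this way, the excerpts show a broken pipe and a temporarily unavailable resource on the nfsd send path, followed by connection timeouts. That would be consistent with an overloaded server, but it does not by itself identify the root cause.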

While all this appears to point at NFS, I'm wondering if the problem could lie elsewhere, perhaps in a faulty disk or a memory leak. I've performed a disk check on /Volumes/2-MEDICS and everything appears OK. Additionally, no error occurs under light load, only when the server is very busy.

My questions:

- has anyone seen something like this before?
- how can I troubleshoot this further?
- if this is not the right place to ask such deep questions, do you know of a more appropriate forum/user group?

Mac OS X (10.5.8)

Posted on Nov 24, 2009 7:15 AM

Answer 1

LeBurt Author

Nov 30, 2009 8:59 AM in response to LeBurt

Here's a new symptom of the problem:

The last time our system crashed, I was able to peek at the NFS Settings tab of the Server Admin window. What we usually see there is something like:

Use 8 server threads

The figure 8 is in a user-changeable text box; the rest is static text in the interface window. But this time, what I got was this:

Use 99 server daemons

I don't know if this is any indication of what was really happening under the hood, but never before have I seen an interface change its display text like that. Threads and daemons are two very different things, are they not? This is the weirdest thing. I took a screenshot but unfortunately can't post it here.

Anyone seen this before?
