NFS drops under heavy load
Hi everyone,
Here's a new problem I've experienced recently and for which I can't find help anywhere.
I have a grid composed of one Mac Pro server and 8 iMac workstations, running the Sun Grid Engine middleware. The server shares a rather large folder via NFS with all nodes in the grid. On some occasions, when I launch a distributed computing task on the grid, every machine gets quite busy exchanging files and crunching numbers, especially the server, which has the additional tasks of scheduling grid jobs, running the NFS daemon and so on, to the point where its 8 GB of physical memory is pretty much full. Then, all of a sudden and at apparently random times, the server stops responding to any command and everything crashes.
After a reboot, the system log on the server shows that NFS became unavailable. Here are some excerpts from it:
Nov 20 08:40:57 medics kernel[0]: nfsd send error 32
Nov 20 08:41:40 medics kernel[0]: nfsd send error 35
Nov 20 08:41:52 medics kernel[0]: nfs server medics.crulrg.net:/Volumes/2-MEDICS/MEDICS: not responding
Nov 20 08:41:52 medics KernelEventAgent[59]: tid 00000000 type 'nfs', mounted on '/Network/MEDICS', from 'medics.crulrg.net:/Volumes/2-MEDICS/MEDICS', not responding
...
Nov 20 09:13:26 medics kernel[0]: nfs_connect: socket error 60 for medics.crulrg.net:/sge6_2u3
Nov 20 09:13:26 medics kernel[0]: nfs_connect: socket connect aborted for medics.crulrg.net:/sge6_2u3
Nov 20 09:13:26 medics kernel[0]: nfs server medics.crulrg.net:/sge6_2u3: can not connect, error 60
...
Nov 20 10:04:14 medics /sbin/nfsd[56]: lost UUID for /, was 8E0FAE66-C739-3D34-97EA-159671C915CA, keeping old UUID
Nov 20 10:04:14 medics /sbin/nfsd[56]: exports:3: path contains non-directory or non-existent components: /Volumes/2-MEDICS/MEDICS
Nov 20 10:04:14 medics /sbin/nfsd[56]: exports:3: no usable directories in export entry
Nov 20 10:04:14 medics /sbin/nfsd[56]: exports:3: using fallback (marked offline): /Volumes/2-MEDICS
Nov 20 10:05:16 medics /sbin/nfsd[56]: exports:3: export entry OK (previous errors cleared)
...
(and so on, repeating indefinitely)
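For reference, the numbers in the "error N" messages above look like standard BSD errno values. Here is a minimal sketch for decoding them, assuming they follow the Darwin errno table. The table is hard-coded because errno numbering differs between platforms, and the names DARWIN_ERRNO and decode_line are my own, not part of any system tool.

```python
import re

# Subset of errno values as defined in Darwin's <sys/errno.h>.
# Hard-coded on purpose: os.strerror() on a non-Mac machine would
# map some of these numbers to different errors.
DARWIN_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    35: ("EAGAIN", "Resource temporarily unavailable"),
    60: ("ETIMEDOUT", "Operation timed out"),
}

# Matches the "error 32" style suffix used in the kernel messages.
ERROR_RE = re.compile(r"\berror (\d+)\b")


def decode_line(line):
    """Return (code, name, description) for the first 'error N' in a
    log line, or None if the line carries no numeric error code."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    name, description = DARWIN_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return code, name, description
```

If that assumption holds, the nfsd send errors would be a broken pipe (32) and a temporarily unavailable resource (35), and the nfs_connect failures would be timeouts (60). That seems consistent with a server too busy to answer, but it does not say why.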
While all this appears to point at NFS, I'm wondering if the problem could lie elsewhere, perhaps a faulty disk or a memory leak. I've performed a disk check on /Volumes/2-MEDICS and everything appears OK. Additionally, no errors occur under light load, only when the server is very busy.
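To check whether the errors really line up with the busy periods, one option is to count NFS-related log messages per hour and compare that against when the grid jobs were running. This is only a rough sketch, assuming the syslog line format shown in the excerpts above. The function name nfs_events_per_hour is hypothetical, and the "contains nfs" filter is a crude heuristic of mine.

```python
import re
from collections import Counter

# Assumed syslog layout: "Mon DD HH:MM:SS host process[pid]: message".
# Group 1 is the "Mon DD HH" prefix, used as the hourly bucket key.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}):\d{2}:\d{2} \S+ (\S+?)(?:\[\d+\])?: (.*)$"
)


def nfs_events_per_hour(lines):
    """Count log lines per hour whose process name or message mentions nfs."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip "..." separators and anything not in syslog format
        hour, process, message = match.groups()
        if "nfs" in (process + " " + message).lower():
            counts[hour] += 1
    return counts
```

Feeding it the lines of the system log would give one count per hour. Hours with many NFS messages could then be compared against the grid job schedule.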
My questions:
- has anyone seen something like this before?
- how can I troubleshoot this further?
- if this is not the right place to ask such deep questions, do you know of a more appropriate forum/user group?
Mac OS X (10.5.8)