2 Replies Latest reply: Nov 17, 2011 9:49 AM by Bernie Case
matuzalem Level 1 (5 points)

Nasty NFS bug in 10.6.8 when running a PCP (Podcast Producer) server: Xgrid clients get NFS timeouts/disconnects, and jobs fail. Is anyone else experiencing this issue?

 

Jul 29 17:01:34 servername org.machx.snmp-data[46790]: Time to sleep 60 seconds

Jul 29 17:02:34 servername org.machx.snmp-data[46790]: done

Jul 29 17:02:41 servername KernelEventAgent[48]: tid 00000000 received event(s) VQ_NOTRESP (1)

Jul 29 17:02:41 servername KernelEventAgent[48]: tid 00000000 type 'nfs', mounted on '/Network/Servers/someserver.some.edu/Volumes/PCP2_Media/Podcast_Producer_Library', from 'someserver.some.edu:/Volumes/PCP2_Media/Podcast_Producer_Library', not responding

Jul 29 17:02:41 servername KernelEventAgent[48]: tid 00000000 found 1 filesystem(s) with problem(s)

Jul 29 17:02:58 servername KernelEventAgent[48]: tid 00000000 unmounting 1 filesystems

Jul 29 17:03:08 servername pcastaction[46809]: PodcastProducer::Actions::QTImport: FINISH

 

* Mac Pro running 10.6.8v2 with two Mac Pro Xgrid clients.

* When the clients access the NFS share, the server stops responding and all TCP connections die. After anywhere from a few seconds to a few minutes it responds again, but by then the jobs have failed.

* Not surprisingly, the NFS disconnects have caused clients to kernel panic.

* I reverted to 10.6.7 and the issue subsided.

 

Is anyone else having PCP/NFS/Xgrid issues with 10.6.8 Server?

 

Any insight would be greatly appreciated.

  • 1. Re: NFS bug in 10.6.8, PCP Server. xgrid clients, NFS timeouts/disconnects, jobs fail.
    Matthew Ziegele Level 2 (180 points)

    Just encountered this bug with 10.6.8. There seems to be no help out there, so I'm going to revert to 10.6.7.

     

     

     

     

     

    Nov 14 14:27:22 archive KernelEventAgent[66]: tid 00000000 type 'nfs', mounted on '/Volumes/isilon', from '192.168.10.24:/ifs/data/', not responding
    Nov 14 14:27:22 archive KernelEventAgent[66]: tid 00000000 found 1 filesystem(s) with problem(s)
    Nov 14 14:29:14 archive KernelEventAgent[66]: tid 00000000 received event(s) VQ_NOTRESP (1)

     

    Matthew Ziegele

  • 2. Re: NFS bug in 10.6.8, PCP Server. xgrid clients, NFS timeouts/disconnects, jobs fail.
    Bernie Case Level 1 (35 points)

    Just a thought, and it may not apply here, but I've seen timeouts between Macs and NFS servers (my experience is with Isilon) when a Mac attempts to delete a large file. The Mac times out waiting for the server to confirm that the file has been deleted. You can raise the NFS timeouts on the Mac to make a timeout less likely, or upgrade the Isilon to OneFS 6.5 or later, which returns the delete confirmation to the client much faster.

     

    The default timeouts on the Mac for an unresponsive NFS mount are controlled by a couple of options in /etc/nfs.conf. From the nfs.conf man page:

     

         nfs.client.initialdowndelay
                  When an NFS server is not responding, this option specifies how
                  long to wait (in seconds) before the initial notification is
                  posted.  The default is 12 seconds.

     

         nfs.client.nextdowndelay
                  When an NFS server is not responding, this option specifies how
                  long to wait (in seconds) between notifications.  The default is
                  30 seconds.

     

    So basically, if you increase the initialdowndelay value from its default of 12, you may avoid the timeout message. Alternatively, upgrading OneFS may be the better option, since it doesn't require going around to all of your Macs to make these config changes.
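
    If you do go the config route, here is a rough sketch in Python of how the change could be scripted across several Macs. It assumes /etc/nfs.conf uses one "option = value" setting per line with "#" for comments, which is how I read the man page, so check your own file first. The helper name set_nfs_option and the value 60 are just mine for illustration, not anything Apple ships or recommends.

```python
import re


def set_nfs_option(conf_text: str, key: str, value: int) -> str:
    """Return conf_text with `key = value` set.

    Replaces an existing (uncommented) entry for `key`, or appends a new
    line if none is found. Assumes the one-setting-per-line
    `option = value` layout; commented-out lines are left untouched.
    """
    # Match an active (not commented) line that assigns this key.
    pattern = re.compile(r"^\s*" + re.escape(key) + r"\s*=.*$")
    new_line = f"{key} = {value}"

    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if pattern.match(line):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Demonstration on an in-memory string; 60 is an arbitrary example value.
    sample = "nfs.client.nextdowndelay = 30\n"
    print(set_nfs_option(sample, "nfs.client.initialdowndelay", 60), end="")
```

    To apply it for real you would read /etc/nfs.conf, pass the text through the function, and write the result back as root. I'd keep a backup copy of the original file before doing that.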