pc_forever

Q: CLUSTERS... the bane of my life

I suppose I am lucky that the worst thing about my job is that I cannot (after six months) get clustering to work over our small managed Ethernet network...

I would be SO grateful to hear from one of you clever people about what I need to do to get things working.
I have followed the book and many examples on the net, and I am currently in a situation where I have a cluster called CLUSTERFC on our Xserve (running Leopard Server and Final Cut Server as well), which is happy to use 7 of its 8 cores to transcode work for me via Compressor from a client machine (edit1, edit2 and edit3) and via Final Cut Server (from its local Fibre Channel RAID). This is not a SAN environment; we use edit-in-place over Ethernet (very happily).

My problem is that although the client machines (edit1, edit2 and edit3) appear in the batch list, they just seem to sit there while the server does all the work. I know a lot of people say that it takes time for them to kick in, but they never do. In fact, when I check the logs it seems they have crashed with a cryptic error message like:

msg=' error: Shared storage client timed out while subscribing to "nfs://server.local/private/var/spool/qmaster/92806255-AF31747A/shared

I find this message bizarre because it looks as if the client machines cannot access the server's cluster share. I know this isn't true, because if I send a batch from Compressor to the cluster I can watch the /var/spool/qmaster folder grow in size as the file is copied into it. Surely this would not happen if there were no access?
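In case it helps, is something like this (run in Terminal on one of the edit machines) a sensible way to double-check that the clients really can see that NFS share, as opposed to the AFP share we use every day? The hostname and path are just lifted from the error message above, so substitute whatever showmount actually reports:

# list the NFS exports the server is advertising (as I understand it, Qmaster publishes its spool dir as one)
showmount -e server.local

# confirm the NFS/RPC services are reachable from this client at all
rpcinfo -p server.local

# try mounting the export by hand to a scratch folder, then unmount
mkdir -p /tmp/qmaster-test
sudo mount -t nfs server.local:/private/var/spool/qmaster /tmp/qmaster-test
ls /tmp/qmaster-test
sudo umount /tmp/qmaster-test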

During transcoding, if I check the clients I find that they are no longer providing services, and if I try to restart them from the prefs pane I get the following error:

Qmasterd not running
Unable to start services because qmasterd is not running. Please consult your user's manual for more information.

It is as if, in trying to be part of the cluster, the local machines all crash. When this happens I have to wait and wait; even a restart won't fix it. They just seem to start working again all of a sudden.
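When the prefs pane throws that error, would it make sense to look from Terminal at whether the daemon is even registered with launchd? This is roughly what I had in mind; the plist name in the commented-out line is a guess on my part, which is why I list the folder first rather than assume it:

# is the Qmaster daemon known to launchd on this client?
sudo launchctl list | grep -i qmaster

# find the actual LaunchDaemon plist name on this install
ls /Library/LaunchDaemons | grep -i qmaster

# if it is there but unloaded, something like this should load it again:
# sudo launchctl load -w /Library/LaunchDaemons/com.apple.qmaster.qmasterd.plist

# and check whether the process is actually alive
ps aux | grep -i '[q]masterd'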

I tried moving my cluster storage to a RAID drive, but it appears that if I use anything but the default /var/spool folder then Qmaster will not create either a cluster or even a QuickCluster.
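Is comparing ownership and permissions between the default spool location and the folder I pointed cluster storage at the right thing to check here? The RAID path below is just an example of where I put it, and I am not certain what the default is supposed to look like:

# the default location, taken from the path in the error message
ls -ld /private/var/spool/qmaster

# the folder I tried on the RAID, including ACLs
ls -lde /Volumes/RAID/QmasterStorage

# check that ownership is actually being honoured on the RAID volume
sudo vsdbutil -c /Volumes/RAID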

BTW, this is all from a fresh installation a few days ago (I totally killed our server trying to change permissions to get them all to share!). I've tried reinstalling and using that very clever app to strategically remove FC components (FC Remover by digitalrebellion).

I suppose my question is (other than "why does it hurt me so?"): how do I ensure that my server and three client nodes are accessing each other properly? Each day the client machines log into the server automatically via AFP, and I happily use FCServer and file sharing.

Should I be using some other way of connecting? I noticed the address was NFS://; should it not be AFP://? Am I mixing my connections up?

I need to be honest and say that we use two network interfaces on our network: a small tree link (direct connections from client to server, i.e. 8.8.8.8, 3.3.3.3, 92.92.92.92) and a standard Ethernet (in the range 192.168.254.250...). I'm loath to mention this because I suppose it creates an unusual variable that is specific to our system. However, I have tried each setup example I have found using both network interfaces (turning the other off as I do so) to create an isolated environment, and I really don't think my network config is the issue.
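One thing I could check along those lines: the error above refers to the server by its Bonjour name (server.local), so maybe it matters which interface that name resolves onto from the clients. Would something like this, run on each edit machine under each network setup, show whether the name and the interface line up?

# which address does the name from the error actually resolve to?
dscacheutil -q host -a name server.local

# which addresses does this client currently have?
ifconfig | grep "inet "

# and is the server reachable on that address?
ping -c 3 server.local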

Thanks for listening to my yap. Any ideas gratefully received.

All the best

Dan

Mac Pro / Xserve, Mac OS X (10.6.5), Xserve hardware (Leopard Server) with ActiveRAID

Posted on Jan 20, 2011 3:36 AM
