Q: Can't start volume
Yesterday, 2 or 3 client machines couldn't mount the Xsan volumes. The volumes were still mounted on the other 7 or 8 client machines and on the 2 controllers. Under Computers, ALL machines showed "unreachable or offline", "no visible LUNs", or "check fibre channel cables", even the machines that had the volumes mounted at the time.
To try to fix this, I shut down all the client machines, stopped the volume, shut down the metadata controllers, and powered off the Xserve RAIDs, Promise VTrak, fibre switches, and metadata network switch. After powering everything back on, I can't start the volume. Every LUN has an exclamation point next to it in the LUNs section of Xsan Admin. The LUNs do appear in Disk Utility. All client machines and controllers show the same errors as before. Forward and reverse DNS are functioning properly.
Before stopping the volume, this appeared thousands of times in the logs...
May 25 00:30:03 x1 fsm[2310]: Xsan FSS 'SanDrive[1]': [Node 82] Disk Stripe Group 1 is DOWN for this client. # disks 7 unitmap[1] 0xfffff partaccess 0x1
May 25 00:30:03 x1 fsm[2310]: Xsan FSS 'SanDrive[1]': [Node 82] Disk Stripe Group 2 is DOWN for this client. # disks 7 unitmap[2] 0xfffff partaccess 0x1
May 25 00:30:03 x1 fsm[2310]: Xsan FSS 'SanDrive[1]': [Node 82] Disk Stripe Group 3 is DOWN for this client. # disks 7 unitmap[5] 0xfffff partaccess 0x1
May 25 00:30:04 x1 fsm[2207]: Xsan FSS 'SanDrive2[0]': [Node 36] Disk Stripe Group 1 is DOWN for this client. # disks 3 unitmap[1] 0xfffff partaccess 0x1
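Since these messages repeat thousands of times, a tally per volume and stripe group makes it easier to see which groups are affected. The following is a minimal Python sketch, not part of Xsan: the function name `count_down_groups` and the regular expression are illustrative assumptions based only on the four log lines above, and other Xsan versions may format the message differently.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines only, e.g.:
#   ... Xsan FSS 'SanDrive[1]': [Node 82] Disk Stripe Group 1 is DOWN for this client. ...
DOWN_RE = re.compile(
    r"Xsan FSS '(?P<volume>[^'\[]+)\[\d+\]': \[Node (?P<node>\d+)\] "
    r"Disk Stripe Group (?P<group>\d+) is DOWN"
)


def count_down_groups(lines):
    """Count 'Disk Stripe Group N is DOWN' messages per (volume, node, group)."""
    counts = Counter()
    for line in lines:
        match = DOWN_RE.search(line)
        if match:
            key = (match["volume"], int(match["node"]), int(match["group"]))
            counts[key] += 1
    return counts


# The four lines quoted above, used here as sample input.
sample = [
    "May 25 00:30:03 x1 fsm[2310]: Xsan FSS 'SanDrive[1]': [Node 82] Disk Stripe Group 1 is DOWN for this client. # disks 7 unitmap[1] 0xfffff partaccess 0x1",
    "May 25 00:30:03 x1 fsm[2310]: Xsan FSS 'SanDrive[1]': [Node 82] Disk Stripe Group 2 is DOWN for this client. # disks 7 unitmap[2] 0xfffff partaccess 0x1",
    "May 25 00:30:03 x1 fsm[2310]: Xsan FSS 'SanDrive[1]': [Node 82] Disk Stripe Group 3 is DOWN for this client. # disks 7 unitmap[5] 0xfffff partaccess 0x1",
    "May 25 00:30:04 x1 fsm[2207]: Xsan FSS 'SanDrive2[0]': [Node 36] Disk Stripe Group 1 is DOWN for this client. # disks 3 unitmap[1] 0xfffff partaccess 0x1",
]

for (volume, node, group), n in sorted(count_down_groups(sample).items()):
    print(f"{volume} node {node} stripe group {group}: {n} message(s)")
```

To run it against a saved log instead of the sample, pass an open file object, for example `count_down_groups(open("system.log"))`, where `system.log` is a placeholder for wherever the log was exported.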
When I try to start the volume...
May 25 14:02:17 x2 servermgrd[108]: Got error -9806 for SSLHandshake remote address is 192.168.100.99:57024
May 25 14:02:17 x2 servermgrd[108]: Exception in threadListen: Socket: Connect failed
May 25 14:02:17 x2 fsmpm[171]: PortMapper: Starting FSS service 'SanDrive[0]' on x2.twcable.com.
May 25 14:02:17 x2 servermgrd[108]: Got error -9806 for SSLHandshake remote address is 192.168.100.99:59072
May 25 14:02:17 x2 servermgrd[108]: Exception in threadListen: Socket: Connect failed
May 25 14:02:17 x2 fsm[885]: Xsan FSS 'SanDrive[0]': Server could not find any Meta-Data devices!
May 25 14:02:56 x2 fsmpm[171]: PortMapper: Initiating activation vote for FSS 'SanDrive'.
May 25 14:03:26: --- last message repeated 34 times ---
May 25 14:03:26 x2 fsmpm[171]: PortMapper: Initiating activation vote for FSS 'SanDrive'.
May 25 14:03:36: --- last message repeated 9 times ---
May 25 14:03:36 x2 servermgrd[108]: xsan: [108/216E190] ERROR: activatevolume_byhost(SanDrive): Waited for activation but it never happened
May 25 14:03:36 x2 servermgrd[108]: xsan: [108/2111E0] ERROR: getfsm_processstats(SanDrive): Unable to find pid of fsm
May 25 14:03:37 x2 servermgrd[108]: xsan: [108/2111E0] ERROR: getfsm_processstats(SanDrive2): Unable to find pid of fsm
May 25 14:03:52 x2 servermgrd[108]: xsan: [108/2111E0] ERROR: getfsm_processstats(SanDrive): Unable to find pid of fsm
May 25 14:03:52 x2 servermgrd[108]: xsan: [108/2111E0] ERROR: getfsm_processstats(SanDrive2): Unable to find pid of fsm
May 25 14:04:52 x2 servermgrd[108]: xsan: [108/2111E0] ERROR: getfsm_processstats(SanDrive): Unable to find pid of fsm
May 25 14:04:52 x2 servermgrd[108]: xsan: [108/2111E0] ERROR: getfsm_processstats(SanDrive2): Unable to find pid of fsm
May 25 14:04:55 x2 servermgrd[108]: Got error -9806 for SSLHandshake remote address is 192.168.100.99:9409
May 25 14:04:55 x2 servermgrd[108]: Exception in threadListen: Socket: Connect failed
May 25 14:04:55 x2 fsmpm[171]: PortMapper: Starting FSS service 'SanDrive[0]' on x2.twcable.com.
May 25 14:04:55 x2 servermgrd[108]: Got error -9806 for SSLHandshake remote address is 192.168.100.99:11457
May 25 14:04:55 x2 servermgrd[108]: Exception in threadListen: Socket: Connect failed
May 25 14:04:55 x2 fsm[985]: Xsan FSS 'SanDrive[0]': Server could not find any Meta-Data devices!
May 25 14:05:09 x2 /System/Library/CoreServices/CCacheServer.app/Contents/MacOS/CCacheServer[749]: No valid tickets, timing out
May 25 14:05:34 x2 fsmpm[171]: PortMapper: Initiating activation vote for FSS 'SanDrive'.
May 25 14:06:04: --- last message repeated 34 times ---
May 25 14:06:04 x2 fsmpm[171]: PortMapper: Initiating activation vote for FSS 'SanDrive'.
May 25 14:06:14: --- last message repeated 9 times ---
May 25 14:06:14 x2 servermgrd[108]: xsan: [108/2068FD0] ERROR: activatevolume_byhost(SanDrive): Waited for activation but it never happened
May 25 14:06:14 x2 Xsan Admin[769]: ERROR: Error starting volume…: The operation couldn’t be completed. (SANTransactionErrorDomain error 100036.) (100036)
May 25 14:06:14 x2 servermgrd[108]: xsan: [108/2111E0] ERROR: getfsm_processstats(SanDrive): Unable to find pid of fsm
Info about the setup:
Controllers: 2 Xserves running 10.6.3 Server with Xsan 2.2.1
Clients: new Mac Pro towers running 10.6.3 with Xsan 2.2.1
2x QLogic SANbox 5200 fibre switches
2 volumes: one from the LUNs on 3 Xserve RAIDs, one from the LUNs on a Promise VTrak (going to add 3-4 more if we get this working)
The SAN was set up 3-4 weeks ago and had been running perfectly until Monday morning.
Any advice would be greatly appreciated!
Intel Xserve, Mac OS X (10.6.3)
Posted on May 25, 2010 8:41 PM