Q: Troubleshooting Xsan Volume Panic
I've been having a semi-regular Xsan panic and haven't been able to isolate the cause. I'm hoping someone here has seen something similar and can offer some tips on solving this.
Here's what the cvlog says when the panic happens (it's the same every time):
0209 07:47:13 0x10da87000 (*FATAL*) PANIC: /Library/Filesystems/Xsan/bin/fsm ASSERT failed "IPXATTRINODE(ip)" file fsm_xattr.c, line 736
0209 07:47:13.258787 0x11de39000 (Debug) timedfree_pending_inodethread: flushing journal.
0209 07:47:13.258805 0x11de39000 (Debug) timedfree_pending_inodethread: journal flush complete.
0209 07:47:13 0x10da87000 (*FATAL*) PANIC: aborting threads now.
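(I'm pulling these from the volume's cvlog on the active controller; on our MDCs that's the default Xsan 2.x location. To catch the FATAL lines live I just follow the log:)

# Follow the live cvlog for the ODHome volume and surface FATAL lines
# (default Xsan 2.x log path on the MDC; adjust the volume name as needed)
tail -f /Library/Filesystems/Xsan/data/ODHome/log/cvlog | grep -i fatal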
The primary MDC panics and fails over to the backup; the backup panics immediately and fails back to the primary, which panics again and stops the SAN.
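(To confirm which controller is hosting the FSM after each failover, I check with cvadmin:)

# List the FSM services; the output shows which MDC currently hosts ODHome
sudo /Library/Filesystems/Xsan/bin/cvadmin -e 'select'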
Here are the setup details:
• 4 Early 2008 Xserves (2 controllers, 2 clients), Xsan 2.2.1
• All 4 servers running 10.6.5 (same panic also occurred under 10.6.4)
• Clients share Open Directory home folders over AFP for users with portable home directories.
• 2 VTrak E610f enclosures: one hosts the data LUN, the other the metadata LUN, for a single Xsan volume
• QLogic SANbox 5602 Fibre Channel switch
I can't find any hardware problems on the VTraks or the SANbox.
After the SAN panics, I run cvfsck -j followed by cvfsck -wv (exact invocations below). The output doesn't show any problems to my (admittedly untrained) eye, but let me know if I'm missing something.
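Both are run as root on the primary MDC while the volume is stopped; cvfsck lives in the standard Xsan 2.x bin directory, and ODHome is the volume name:

# Replay the journal first, then do a verbose read-write check
sudo /Library/Filesystems/Xsan/bin/cvfsck -j ODHome
sudo /Library/Filesystems/Xsan/bin/cvfsck -wv ODHome

Here's what I get from the cvfsck -wv pass: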
BUILD INFO:
#!@$ Server Revision 3.5.0 Build 7443 Branch branches_35X (412.3)
#!@$ Built for Darwin 10.0 i386
#!@$ Created on Mon Dec 7 12:52:39 PST 2009
Created directory /tmp/cvfsck15061a for temporary files.
Attempting to acquire arbitration block... successful.
Creating MetadataAndJournal allocation check file.
Creating Homes allocation check file.
Recovering Journal Log.
Super Block information.
FS Created On : Wed Dec 22 07:22:21 2010
Inode Version : '2.5'
File System Status : Clean
Allocated Inodes : 1305600
Free Inodes : 26073
FL Blocks : 85
Next Inode Chunk : 0x32c55
Metadump Seqno : 0
Restore Journal Seqno : 0
Windows Security Indx Inode : 0x5
Windows Security Data Inode : 0x6
Quota Database Inode : 0x7
ID Database Inode : 0xb
Client Write Opens Inode : 0x8
Stripe Group MetadataAndJournal ( 0) 0x746a080 blocks.
Stripe Group Homes ( 1) 0xe8bd3c0 blocks.
Building Inode Index Database 1305600 (100%).
Verifying NT Security Descriptors
Found 697 NT Security Descriptors: all are good
Verifying Free List Extents.
Scanning inodes 1305600 (100%).
Sorting extent list for MetadataAndJournal pass 1/1
Updating bitmap for MetadataAndJournal extents 113322 ( 9%).
Sorting extent list for Homes pass 1/1
Updating bitmap for Homes extents 1257585 (100%).
Checking for dead inodes 1305600 (100%).
Checking directories 109996 (100%).
Scanning for orphaned inodes 1305600 (100%).
Verifying link & subdir counts 1305600 (100%).
Checking free list. 1305600 (100%).
Checking pending free list.
Checking Arbitration Control Block.
Checking MetadataAndJournal allocation bit maps (100%).
Checking Homes allocation bit maps (100%).
File system 'ODHome'. Blocks-244044736 free-215328011 Inodes-1305600 free-26073.
File System Check completed successfully.
However, if I try to restart the volume after running cvfsck, it just panics again. Shutting down the clients and rebooting the primary controller allows a normal startup, and the volume mounts and runs fine until the next panic. Everything in the cvlog between panics is labeled either debug or info.
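(By "restart the volume" I mean stopping and starting it with cvadmin from the primary MDC, i.e.:)

# Stop the ODHome volume, then start it again
sudo /Library/Filesystems/Xsan/bin/cvadmin -e 'stop ODHome'
sudo /Library/Filesystems/Xsan/bin/cvadmin -e 'start ODHome'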
I've also swapped the roles of the primary and backup controllers, but it made no difference.
Most, but not all, of the panics seem to happen in the morning when users are logging in, or at quitting time when the users' portable homes are syncing. I've tried to isolate the panic to a specific user's sync by temporarily disabling syncing for user groups (one group at a time), but I can't tie it to any single user. Nightly incremental backups and weekly full backups run without triggering a panic, and I can also rsync the volume's contents to another server without incident.
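Since the assert names IPXATTRINODE, I've also wondered whether a particular file's extended attributes are involved; portable home syncing over AFP touches a lot of xattrs. As a crude sweep (assuming the volume mounts at /Volumes/ODHome, and that your find supports the -xattr primary, as OS X's does), something like this should list the candidates:

# List every file on the volume carrying extended attributes
# (the /Volumes/ODHome mount point is an assumption; adjust to your setup)
find /Volumes/ODHome -xattr -print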
I'd appreciate any insight into the cause of the panic, or strategies to diagnose or prevent it.
Thanks!