ClintR

Q: Troubleshooting Xsan Volume Panic

I've been having a semi-regular Xsan panic and haven't been able to isolate the cause. I'm hoping someone here has seen something similar and can offer some tips on solving this.

Here's what the cvlog says when the panic happens (it's the same every time):

0209 07:47:13 0x10da87000 (*FATAL*) PANIC: /Library/Filesystems/Xsan/bin/fsm ASSERT failed "IPXATTRINODE(ip)" file fsm_xattr.c, line 736
0209 07:47:13.258787 0x11de39000 (Debug) timedfree_pending_inodethread: flushing journal.
0209 07:47:13.258805 0x11de39000 (Debug) timedfree_pending_inodethread: journal flush complete.
0209 07:47:13 0x10da87000 (*FATAL*) PANIC: aborting threads now.

The primary MDC panics and fails over to the backup. The backup panics immediately and fails back to the primary, which panics again, stopping the SAN.

Here are the setup details:
• 4 Early 2008 Xserves (2 controllers, 2 clients), Xsan 2.2.1
• All 4 servers running 10.6.5 (same panic also occurred under 10.6.4)
• Clients share Open Directory home folders over AFP for users with portable home directories.
• 2 VTrak E610f enclosures; one hosts the data LUN, the other the metadata LUN, for a single Xsan volume
• QLogic SANbox 5602 Fibre Channel switch

I can't find any hardware problems on the VTraks or the SANbox.

After the SAN panics, I run cvfsck -j followed by cvfsck -wv. The output doesn't show any problems to my (admittedly untrained) eye. Here's what I get from cvfsck -wv; let me know if I'm missing something:

BUILD INFO:
#!@$ Server Revision 3.5.0 Build 7443 Branch branches_35X (412.3)
#!@$ Built for Darwin 10.0 i386
#!@$ Created on Mon Dec 7 12:52:39 PST 2009
Created directory /tmp/cvfsck15061a for temporary files.

Attempting to acquire arbitration block... successful.

Creating MetadataAndJournal allocation check file.
Creating Homes allocation check file.

Recovering Journal Log.

Super Block information.
FS Created On : Wed Dec 22 07:22:21 2010
Inode Version : '2.5'
File System Status : Clean
Allocated Inodes : 1305600
Free Inodes : 26073
FL Blocks : 85
Next Inode Chunk : 0x32c55
Metadump Seqno : 0
Restore Journal Seqno : 0
Windows Security Indx Inode : 0x5
Windows Security Data Inode : 0x6
Quota Database Inode : 0x7
ID Database Inode : 0xb
Client Write Opens Inode : 0x8

Stripe Group MetadataAndJournal ( 0) 0x746a080 blocks.
Stripe Group Homes ( 1) 0xe8bd3c0 blocks.

Building Inode Index Database 1305600 (100%).

Verifying NT Security Descriptors
Found 697 NT Security Descriptors: all are good
Verifying Free List Extents.

Scanning inodes 1305600 (100%).

Sorting extent list for MetadataAndJournal pass 1/1
Updating bitmap for MetadataAndJournal extents 102400 ( 8%). Updating bitmap for MetadataAndJournal extents 112640 ( 8%). Updating bitmap for MetadataAndJournal extents 113322 ( 9%).
Sorting extent list for Homes pass 1/1
Updating bitmap for Homes extents 1257585 (100%).

Checking for dead inodes 1305600 (100%).

Checking directories 109996 (100%).

Scanning for orphaned inodes 1305600 (100%).

Verifying link & subdir counts 1305600 (100%).

Checking free list. 1305600 (100%).
Checking pending free list.

Checking Arbitration Control Block.

Checking MetadataAndJournal allocation bit maps (100%).
Checking Homes allocation bit maps (100%).

File system 'ODHome'. Blocks-244044736 free-215328011 Inodes-1305600 free-26073.

File System Check completed successfully.



However, if I try to restart the volume after running cvfsck, it immediately panics again. Shutting down the clients and rebooting the primary controller allows a normal startup, and the volume mounts and runs fine until the next panic. Everything in the cvlog between panics is labeled either Debug or Info.
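For reference, the check-and-restart cycle described above can be sketched as a small script. This is a hedged sketch, not a tested procedure: it assumes the volume is named ODHome (as shown in the cvfsck output) and that cvadmin and cvfsck are on the PATH on the MDC. The DRY_RUN guard (on by default) only prints each command, so the sequence can be reviewed before anything is executed.

```shell
#!/bin/sh
# Hedged sketch of the stop/check/restart sequence. Volume name (ODHome)
# is an assumption taken from the cvfsck output in this thread.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"    # print the command instead of executing it
  else
    "$@"           # execute for real when DRY_RUN=0
  fi
}

run cvadmin -e "stop ODHome"   # take the volume down so cvfsck has exclusive access
run cvfsck -j ODHome           # replay the journal first
run cvfsck -wv ODHome          # then a full read-write, verbose check
run cvadmin -e "start ODHome"  # restart the volume
```

Run with DRY_RUN=0 only once the clients are unmounted and you've reviewed the printed plan.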

I've swapped the roles of the primary and backup controllers, but it made no difference to the panic.

Most, but not all, of the panics seem to happen in the morning when users are logging in, or at quitting time when the users' portable homes are syncing. I've tried to isolate the panic to a specific user's sync by temporarily disabling syncing for user groups (one at a time), but can't tie it to any single user. Nightly incremental backups and weekly full backups run without causing a panic. I can also use rsync to mirror the volume's contents to another server without triggering one.
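One low-effort way to line the panics up against login and sync activity is to pull just the fatal entries out of the cvlog. A minimal sketch; the log path in the comment is the Xsan 2 default for a volume named ODHome and is an assumption, so adjust it for your setup:

```shell
#!/bin/sh
# Hedged sketch: extract the (*FATAL*) lines from one or more cvlog files
# so panic timestamps can be correlated with login/home-sync times.
panics() {
  grep -h '(\*FATAL\*)' "$@"
}

# Example (path assumed, Xsan 2 default layout):
#   panics /Library/Filesystems/Xsan/data/ODHome/log/cvlog*
```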

I'd appreciate any insight into the cause of the panic, or strategies to diagnose or prevent it.

Thanks!

Xserve, Mac OS X (10.6.5)

Posted on Feb 11, 2011 5:45 AM


  • Wil H (Level 1, 0 points), Feb 21, 2011 8:53 PM, in response to ClintR

    Hi there ClintR,

    I'm having an identical issue that appeared out of nowhere at 3pm yesterday. I've been working on finding a solution, but haven't found anything effective yet.

    Did you have any luck finding a resolution?

    Here is an excerpt of my logs; if anyone can offer further assistance I'd be very grateful.

    [0222 07:52:01] 0x7fff70355ca0 (Info) Branding Arbitration Block (attempt 1) votes 2.
    [0222 07:52:03.363744] 0x7fff70355ca0 (Debug) Cannot find fail over script [/Library/Filesystems/Xsan/bin/cvfail.xsan02.lifestyletrader.com.au] - looking for generic script.
    [0222 07:52:03] 0x7fff70355ca0 (Info) Launching fail over script ["/Library/Filesystems/Xsan/bin/cvfail" xsan02.lifestyletrader.com.au 59564 Homes]
    [0222 07:52:03.404814] 0x7fff70355ca0 (Debug) Starting journal log recovery.
    [0222 07:52:03.667002] 0x7fff70355ca0 (Debug) Completed journal log recovery.
    [0222 07:52:03.667148] 0x7fff70355ca0 (Debug) Inodeinit_postactivation: FsStatus 0x2525, Brl_ResyncState 1
    [0222 07:52:03] 0x11ef11000 (Info) FSM Alloc: Loading Stripe Group "MetadataAndJournal". 931.31 GB.
    [0222 07:52:03] 0x11f395000 (Info) FSM Alloc: Loading Stripe Group "Homes". 3.64 TB.
    [0222 07:52:03] 0x11ef11000 (Info) FSM Alloc: Stripe Group "MetadataAndJournal" active.
    [0222 07:52:04] 0x11f395000 (Info) FSM Alloc: free blocks 423988623 with 0 blocks currently reserved for client delayed buffers.Reserved blocks may change with client activity.
    [0222 07:52:04] 0x11f395000 (Info) FSM Alloc: Stripe Group "Homes" active.
    [0222 07:52:04] 0x7fff70355ca0 (Warning) Windows Security has been turned off in config file but clients have been requested to enforce ACLs. Windows Security remains in effect.
    [0222 07:52:04.135555] 0x7fff70355ca0 (Debug) FSUUID_init: found `FSUUID' xattr on root inode: fecc6178-3f8b-4e10-98bc-52a932539a15
    [0222 07:52:04] 0x7fff70355ca0 (Info) RPL_init: RPL upgrade required.
    [0222 07:52:04] 0x7fff70355ca0 (Info) RPL_Upgrade: Removing existing RPL attributes.
    [0222 07:52:07] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 20 IEL blocks
    [0222 07:52:15] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 40 IEL blocks
    [0222 07:52:28] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 60 IEL blocks
    [0222 07:52:45] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 80 IEL blocks
    [0222 07:53:08] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 100 IEL blocks
    [0222 07:53:36] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 120 IEL blocks
    [0222 07:54:10] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 140 IEL blocks
    [0222 07:54:53] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 160 IEL blocks
    [0222 07:55:50] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 180 IEL blocks
    [0222 07:57:02] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 200 IEL blocks
    [0222 07:58:34] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 220 IEL blocks
    [0222 08:00:25] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 240 IEL blocks
    [0222 08:02:43] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 260 IEL blocks
    [0222 08:05:17] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 280 IEL blocks
    [0222 08:08:13] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 300 IEL blocks
    [0222 08:11:22] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 320 IEL blocks
    [0222 08:14:47] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 340 IEL blocks
    [0222 08:18:27] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 360 IEL blocks
    [0222 08:22:22] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 380 IEL blocks
    [0222 08:26:32] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 400 IEL blocks
    [0222 08:30:56] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 420 IEL blocks
    [0222 08:35:36] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 440 IEL blocks
    [0222 08:40:30] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 460 IEL blocks
    [0222 08:45:26] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 480 IEL blocks
    [0222 08:50:50] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 500 IEL blocks
    [0222 08:56:23] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 520 IEL blocks
    [0222 09:02:07] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 540 IEL blocks
    [0222 09:08:10] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 560 IEL blocks
    [0222 09:14:22] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 580 IEL blocks
    [0222 09:20:55] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 600 IEL blocks
    [0222 09:27:42] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 620 IEL blocks
    [0222 09:34:44] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 640 IEL blocks
    [0222 09:41:36] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 660 IEL blocks
    [0222 09:49:06] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 680 IEL blocks
    [0222 09:56:44] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 700 IEL blocks
    [0222 10:04:44] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 720 IEL blocks
    [0222 10:12:54] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 740 IEL blocks
    [0222 10:21:22] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 760 IEL blocks
    [0222 10:30:03] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 780 IEL blocks
    [0222 10:38:59] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 800 IEL blocks
    [0222 10:48:10] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 820 IEL blocks
    [0222 10:57:40] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 840 IEL blocks
    [0222 11:07:20] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 860 IEL blocks
    [0222 11:17:15] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 880 IEL blocks
    [0222 11:27:19] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 900 IEL blocks
    [0222 11:37:32] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 920 IEL blocks
    [0222 11:48:11] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 940 IEL blocks
    [0222 11:59:01] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 960 IEL blocks
    [0222 12:10:02] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 980 IEL blocks
    [0222 12:21:17] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 1000 IEL blocks
    [0222 12:32:46] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 1020 IEL blocks
    [0222 12:44:23] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 1040 IEL blocks
    [0222 12:56:14] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 1060 IEL blocks
    [0222 13:07:15] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 1080 IEL blocks
    [0222 13:15:32] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 1100 IEL blocks
    [0222 13:25:30] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 1120 IEL blocks
    [0222 13:37:29] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 1140 IEL blocks
    [0222 13:50:16] 0x7fff70355ca0 (Info) RPL_Upgrade: removal completed 1160 IEL blocks
    [0222 14:02:47] 0x7fff70355ca0 (*FATAL*) PANIC: /Library/Filesystems/Xsan/bin/fsm ASSERT failed "IPXATTRINODE(ip)" file fsm_xattr.c, line 736
    [0222 14:02:47] 0x7fff70355ca0 (*FATAL*) PANIC: wait 3 secs for journal to flush
    [0222 14:02:50] 0x7fff70355ca0 (*FATAL*) PANIC: aborting threads now.
    Logger_thread: sleeps/44494 signals/0 flushes/65 writes/65 switches 0
    Logger_thread: logged/91 clean/91 toss/0 signalled/0 toss_message/0
    Logger_thread: waited/0 awakened/0
  • ClintR (Level 1, 10 points), Feb 22, 2011 6:05 AM, in response to Wil H

    Wil,
    Unfortunately, what I've turned up isn't good news. In discussing the issue with some knowledgeable folks, it appears there's a bug Apple hasn't fixed involving extended attributes on Xsan volumes. The only way around it is to destroy the volume and recreate it with extended attributes turned off. Since I was still able to mount and read from my volume, I bit the bullet over the weekend: backup, destroy, recreate (without extended attributes), and restore from backup. The volume has been up for nearly 48 hours now with no problems so far.

    If you can't recover from the RPL upgrade, you may have to resort to destroying the volume, recreating it, and restoring from your most recent backup. I had a similar situation last year, and that was the recommendation from Apple's Xsan support. If you want to keep looking for other solutions, check out the Troubleshooting forum at http://www.xsanity.com/; there's lots of good info and plenty of helpful folks there.