canon273

Q: failing over from MDC02 back to MDC01

Looking for some advice on an Xsan metadata controller problem.  Last week, we had a failover from our primary metadata controller (metasvr01) to our backup metadata controller (metasvr02).  So far, so good.

 

After some investigation, it looked like metasvr01 had locked up, so we rebooted it.  It appears to have come back up more or less normally, except that the cvadmin command no longer sees both metadata servers (like it used to).

 

Would it be "safe" to try a metadata server failover back to metasvr01?

 

Are there any potential problems / "gotchas" we should be aware of?
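For what it's worth, here is the failback we would expect to run once both controllers show up in cvadmin again (a sketch only, using the volume names from the output below; cvadmin's "fail" hands the named volume over to its standby FSM, so it assumes metasvr01's FSMs are registered as standbys first):

sudo cvadmin -e 'fail EditA'
sudo cvadmin -e 'fail EditB'

(If your cvadmin build lacks the -e one-shot flag, the same commands work at the interactive prompt.)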

 

 

Here's the output from metasvr01:

========================================================

metasvr01:~ metasvr01$ sudo cvadmin

Password:

Xsan Administrator

 

Enter command(s)

For command help, enter "help" or "?".

 

List FSS

 

File System Services (* indicates service is in control of FS):

1> EditB[0]         located on metasvr01.private:49248 (pid 140)

2> EditA[0]         located on metasvr01.private:49247 (pid 139)

 

No FSSs are active.

Select FSM "none"

 

 

 

Here's the output from metasvr02:

========================================================

metasvr02:~ metasvr02$ sudo cvadmin

Password:

Xsan Administrator

 

Enter command(s)

For command help, enter "help" or "?".

 

List FSS

 

File System Services (* indicates service is in control of FS):

1>*EditB[0]         located on metasvr02.private:49200 (pid 128)

2>*EditA[0]         located on metasvr02.private:49201 (pid 127)

 

Select FSM "none"

 

 

NOTE:

==========================================================

Prior to this incident, both metadata servers showed up in cvadmin (on both metasvr01 and metasvr02), with the asterisk correctly marking the active one.
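For reference, both controllers are supposed to read the same coordinator list from the fsnameservers file, so a mismatch there would explain the asymmetric cvadmin views.  A quick comparison (a sketch, assuming the default Xsan 1.x config path; run on each controller, and the two files should match line for line):

metasvr01:~ metasvr01$ cat /Library/Filesystems/Xsan/config/fsnameservers
metasvr02:~ metasvr02$ cat /Library/Filesystems/Xsan/config/fsnameservers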

 

 

 

Original Xsan / Xserve RAID systems.

Still running Xsan version 1.4.x.

 

Here are the details regarding metasvr01:

Mac OS X Server 10.5.5

Normally serving as the primary / active metadata controller.

 

 

Here are the details regarding metasvr02:

Mac OS X Server 10.5.5

Normally serving as the secondary / backup metadata controller, secondary / backup Open Directory server, and secondary / backup DNS server.

 

Here are the details regarding xsangw01:

Mac OS X Server 10.5.5

Normally serving as the primary / master Open Directory server and primary / master DNS server, and hosting SMB / AFP shares throughout the LAN.

 

 

================================================================================

 

One final set of notes regarding these servers.  Over the last couple of months, the servers have become increasingly problematic.  So far, we've lost the following capabilities on them:

- no more ARD access (generally)
- no more local keyboard, mouse, and monitor access (generally)
- no more Xsan GUI / server tools GUI access

================================================================================

Xserve, Mac OS X (10.6.8)

Posted on Nov 16, 2011 8:52 AM


receng, Level 1 (20 points), Nov 16, 2011 10:49 AM, in response to canon273

It is common, when only one of the Xsan controllers is restarted, for their configurations to get out of sync.

It looks like your metasvr01 thinks it is still hosting the volumes that were failed over.

What you can try is this: from metasvr02, demote metasvr01 to a client and then promote it back to a controller.  That forces a configuration rewrite and should bring metasvr01 back in sync.  If cvadmin then shows the correct configuration, it should be safe to fail back over to metasvr01, assuming you are confident that whatever issue caused the failover has been resolved.
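Once metasvr01 has been re-promoted, you can sanity-check it from the shell even with the GUI tools dead (a sketch; plain ps plus a second look at cvadmin, assuming only the standard Xsan process names fsmpm and fsm):

ps ax | grep fsm | grep -v grep   # expect fsmpm, plus one fsm per hosted volume
sudo cvadmin                      # the banner should again list EditA and EditB on both controllers, asterisks still on metasvr02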

     

A full restart of the Xsan would probably work too (shutdown order: clients, controllers, RAIDs, switches; power on in reverse order).
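If you go that route, it is worth stopping the volumes cleanly before shutting the controllers down (a sketch; cvadmin's stop halts the FSM for the named volume, so make sure clients have unmounted first):

sudo cvadmin -e 'stop EditA'
sudo cvadmin -e 'stop EditB'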

     

I would also recommend regular maintenance tasks, like checking the disks and permissions on the servers, and periodically stopping the volume(s) and running cvfsck.
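For the cvfsck pass, something along these lines (a sketch; -n keeps the check read-only, and the volume has to be stopped first — double-check the flags against the cvfsck man page on your version):

sudo cvadmin -e 'stop EditA'
sudo cvfsck -nv EditA   # read-only check, reports problems only
sudo cvfsck -wv EditA   # only if the read-only pass found damage
sudo cvadmin -e 'start EditA'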