MDC locking up. Xsan 3.1, OS X 10.9.4, MDCs on Mac mini

The Xsan configuration went well. The system works fine for about a week, then MDC1 usually freezes. It appears to fail over to MDC2, but Xsan Admin and cvadmin still list MDC1 as hosting the volume. We usually discover this when clients lose access to the volume. Getting back requires a hard shutdown of MDC1, and sometimes MDC2, plus a reboot of the storage. The log files are usually lost in the hard boot. Once restarted, everything works OK for about a week, then it freezes again.

Any thoughts?

thomasb

Have you made sure all Xsan volumes hosted by these two MDCs are mounted on the MDCs? This is required for things to work properly, especially failover.

What do you see in cvlog and nssdbg.out at the time of the freeze/failover? Please don't copy/paste the whole log here. Just post the relevant messages from around the time of the freeze/failover.

The volume is always mounted on both MDCs. I'll try to gather info from the logs. It usually requires a hard boot, so the logs get wiped. What is your recommendation on leaving Xsan Admin open all the time on one of the MDCs?
This is looking like a storage problem. We have been working with Promise, but are still getting frequent drive failures. Whenever a drive fails, or is replaced and starts a rebuild, the lockup happens.

AaronW
 

thomasb

Leaving Xsan Admin open shouldn't be a problem, but I would recommend closing it when not in use, just to make sure the data you see in the GUI is updated every time you use it.

cvlog and nssdbg.out do not get wiped on a reboot, so you should be able to find some valuable data there, to help troubleshoot this issue.

/Library/Logs/Xsan/data//log/cvlog*
/Library/Logs/Xsan/debug/nssdbg.out
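
If it helps to pull out only the messages around the freeze, here is a rough Python sketch that keeps the lines containing a given timestamp fragment. The function names are made up for this example, and the "[MMDD HH:MM:SS]" timestamp layout is an assumption; check what your own log lines look like and adjust the fragment to match.

```python
def lines_matching(lines, stamp):
    """Return the log lines that contain the timestamp fragment `stamp`."""
    return [line.rstrip("\n") for line in lines if stamp in line]


def scan_files(paths, stamp):
    """Apply lines_matching to each file and return (path, line) pairs."""
    hits = []
    for path in paths:
        # errors="replace" keeps a stray non-UTF-8 byte from aborting the scan
        with open(path, errors="replace") as handle:
            hits.extend((path, line) for line in lines_matching(handle, stamp))
    return hits
```

For example, scan_files(["/Library/Logs/Xsan/debug/nssdbg.out"], "[0812 14:2") would return everything logged between 14:20 and 14:29 on August 12, assuming that timestamp layout. Add the cvlog files for your volume to the list the same way.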

Failed drives in your Promise arrays should not cause problems for your MDCs, unless a whole LUN gets damaged, but then you're in deep trouble.

What kind of Fibre Channel switches are you using? If they're QLogic, have you double-checked that the target and initiator ports are configured correctly?

Also make sure that DNS is looking right on your MDCs, and that the date/time of both MDCs is in sync.
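
A quick way to sanity-check DNS is to resolve each MDC's hostname to an IP and then resolve that IP back to a name. Below is a minimal Python sketch of that round trip. The function name dns_round_trip and the example hostname are hypothetical; the lookups default to the standard socket calls but can be swapped out.

```python
import socket


def dns_round_trip(hostname,
                   forward=socket.gethostbyname,
                   reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, then resolve that IP back to a name.

    Returns (ip, reverse_name); a value is None when its lookup fails.
    """
    try:
        ip = forward(hostname)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return None, None
    try:
        reverse_name = reverse(ip)[0]  # first element is the primary host name
    except OSError:
        return ip, None
    return ip, reverse_name
```

Run dns_round_trip("mdc1.example.lan") (substitute your real MDC hostnames) on both MDCs. If the name that comes back differs from the one you asked for, or either value is None, DNS is worth a closer look.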

Are you able to ssh into your MDCs after a lockup?