XSAN 4 won't update SAN using configuration files

brunsdan3's picture
I apologize for the long thread but I've been dealing with XSan issues for weeks now and wanted to get it all off my chest. Hopefully you'll all still feel like helping a tired administrator out by reading the thread and giving any advice you can!

About three weeks ago, we decided to upgrade our Mac OS X Mountain Lion server to Mac OS X Yosemite. I'm the laboratory technician for a documentary production house at a University in northern California. We use the Adobe suite of products for our two editing workstations and weren't able to upgrade to CC 2014 due to the new OS requirements for the Mac.

At our production lab, we also run a Promise 32 TB RAID system using XSan 2.25 over Fibre Channel to our workstations (edit bays). This RAID system is also connected to our two servers which are running OS X Server Mountain Lion. One server named ALVA is supposed to be the Open Directory master, while the other server named ALVA2 is supposed to be the OD replica as well as the metadata controller, or MDC. Our XSan has been working fine for almost 2 years on this system.

We started off by upgrading our OS to Mavericks on each server. Then we updated the OS X Server application to the Mavericks version and the XSan was upgraded to XSan 3. This seemed to work perfectly. In fact, even before we upgraded our MDC to XSan 3, it was already able to mount the SAN volume named "Anthro". Interestingly, we were not able to get the Server software running on the MDC (which is also the OD replica) to successfully reconnect its OD replica. The other server, ALVA, was able to find all of the settings of the OD during the upgrade and ran the OD just fine; however, ALVA2 could not be reconnected as an Open Directory replica. We decided to ignore this odd behavior and went ahead and upgraded both server computers to OS X Yosemite. Upon starting up the new OS, we realized that neither server was seeing our SAN volume called "Anthro". This was a bit odd, but it got odder. We then upgraded each server to OS X Server Yosemite. Once again, the master Open Directory on ALVA upgraded and reconnected without an issue, while we found that we had to once again reconnect the Open Directory to the replica server on ALVA2. This time, everything worked beautifully. Unfortunately, things got even more odd at this point.

Next, we decided to start up the XSan on our previous metadata controller ALVA2. After clicking on the tab to start the service in the Server app, we were asked if we wanted to start a new SAN volume, connect to an existing SAN volume, or upgrade our XSan service to a previous configuration. This was all rather normal. We tried to do the third option, but after a second or two of the window showing a spinning progress bar, it suddenly stopped. Every time we chose the option to upgrade from a previous configuration, this would occur. In order to troubleshoot, we tried to turn on the XSan tab on the other server, ALVA. The same problem occurred.

My only thought right now is that when I opened up the Directory Administrator tool on the computer that had previously been the metadata controller, I noticed that a second LDAPv3 server was running called fcps.csuchico.edu. This was the FQDN that we gave the ALVA computer when it used to run Final Cut Server 4 years ago. In addition, there was also an LDAP server called 127.0.0.1 that I recognized as the standard for when Open Directory is originally turned on. I recognized that fcps.csuchico.edu is also the DNS address for our first server, ALVA. However, when I tried to connect to the fcps.csuchico.edu LDAP server using Directory Administrator, the service would give me an error 2100. I actually tried to delete the fcps.csuchico.edu LDAP server by running this Terminal command: sudo slapconfig -destroyldapserver. When I did, the LDAP server was still there in the Directory Administrator app. Also, my two edit stations which have Local Network accounts were only seeing the previous Open Directory users and not the ones which I had set up most recently. This was true even after I changed and rebound the two edit machines to the DNS of the Open Directory master computer, ALVA.

With all of this evidence in place, I'm led to believe that an Open Directory LDAP server had been previously set up on the first server, ALVA. Then, the first XSan installation (XSan 1.1) was likely tied to the Open Directory made by the Server application. At some point (I don't know when), the Open Directory master on ALVA got deleted or became corrupt leaving XSan still running without the ability to edit the LDAP server through the Server App or through Workgroup Administrator. However, the users were kept intact and allowed to login to the LDAP server on the client machines. With the upgrade to OS X Yosemite Server, which now hosts XSan users through the Open Directory inside the Server app, I believe that the corrupted LDAP server is not reachable or even recognized. As such, the RAID SAN volume "Anthro" can't mount on either the server or client machines. That's the best I've been able to come up with over the last three weeks. The problem is that I feel like I'm missing something in this whole process and every solution I've tried has failed. Also, my current OS X Server setup feels like it's about ready to collapse at any moment. I'd love to just erase the two servers, destroy the XSan volume, start a new Open Directory and XSan volume but at 36 TBs, we don't really have a drive that could back all of that data up. We have archived many of the projects on the SAN RAID drive to LTO-5 tape, but I don't trust restoring from tape at the moment.

Any help as to what I should check or what I should do would be appreciated! Thanks!

chandler's picture

Hi - I am far from the person that could help you, but one thing I can tell you is that Xsan 4 is set up to store all Xsan configuration in the OD directory, AND the master metadata controller must also be the OD master. I don't know if there are ways around this, but this is the way it's presented on setup. I noticed your primary MDC is your OD replica, so attempting to upgrade that setup through so many versions, combined with this fact, could be some of your issue?

Digital Artist Assist @ Pretty Damn Sweet

brunsdan3's picture

Thanks for the comment Chandler, I'll give that a try and report back to you as soon as I can on how it worked. I still think that there is some weirdness to be figured out with the fcps.csuchico.edu OD that is still popping up in LDAP, but this could be one way to fix my issue. I appreciate the help!

brunsdan3's picture

Update: I went ahead and moved the OD master over to ALVA2 which is the MDC (metadata controller) for the SAN "Anthro". I have not yet tried to start up the SAN using the previous configuration since I had to revert back to Mountain Lion in order to get the Xsan back up and running. Once I've upgraded to Yosemite again, I'll go ahead and try that option.

brianwells's picture

Make sure you have a backup of the /Library/Preferences/Xsan folder from before you started the migration. With this in hand you can always start over, which we had to do a few times during our upgrade to Xsan 4.
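For anyone who wants to script that backup, here is a minimal sketch in Python, assuming you simply want a timestamped tar.gz copy of the folder. The function names and the destination path in the usage note are made up for illustration; only /Library/Preferences/Xsan comes from this thread.

```python
import tarfile
import time
from pathlib import Path


def backup_config(source_dir, dest_dir):
    """Archive source_dir into a timestamped .tar.gz inside dest_dir.

    Returns the path of the archive that was written.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"{source} is not a directory")

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Timestamp in the file name so repeated runs never overwrite each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        # Store entries under the folder's own name, not its absolute path.
        tar.add(source, arcname=source.name)
    return archive


def list_archive(archive):
    """Return the sorted entry names in an archive, to confirm what was saved."""
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(tar.getnames())
```

On an MDC you would probably run it as root with something like backup_config("/Library/Preferences/Xsan", "/Volumes/Backup/xsan-config"), where the destination volume is hypothetical, and then check list_archive() on the result before touching anything.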

We also purchased a one-time support incident from AppleCare, which turned out to be well worth the expense.

brunsdan3's picture

Hi Brian,

We already have all of the Xsan preference files from /Library/Preferences/Xsan backed up. I guess we were a little paranoid before starting the migration :) As such, how would I go about using this to start over without losing any of the information on the 32 TB SAN? Would I need to create a new volume in the Xsan initial start up menu? I've always been a bit confused how the configuration files could help me bring back the SAN if it ever got destroyed...

JSamuel's picture

brunsdan3 wrote:
"Would I need to create a new volume in the Xsan initial start up menu?"

That would be a sure fire way of wiping out data on the data/metadata LUNs ;-)

brunsdan3 wrote:
"I've always been a bit confused how the configuration files could help me bring back the SAN if it ever got destroyed."

It depends on whether you intend to revert to the same Xsan/OS X version. If so, you can protect the LUNs, copy the config directory back, and restart.

Joel Samuel.
/thirtytwo - Consultancy & Direction
Proud sponsor of Xsanity.com

All contributions are my own personal opinions - not those of any entity I represent.

JSamuel's picture

(So I b0rked the formatting, but you get what I mean... probably.) :-O

brunsdan3's picture

Hi Joel,

Thanks for the information. I definitely won't start a new volume and then try to copy the old configuration files back to the appropriate location after the setup is complete. I had a suspicion that would result in total SAN failure.

That being said and to understand you correctly, are you saying that if I copied the configuration files and had the same OS version that I could effectively paste the configuration files back to the original location, restart, and then the SAN would be working properly again? Also, would this same process not work if you pasted the configuration files from one version of the OS to another newer version? Lastly, is this the only way to use the Xsan configuration files to recover a SAN?

I appreciate your help!

JSamuel's picture

It would fail pretty badly. If your LUNs on the storage nodes were read-only, you would just about be OK and could wipe the config folder and re-try.

If you have the same OS/Xsan version, and simply brought back the config (and the metadata/data LUNs on the storage nodes had not been changed) then yes, you could simply restart. The joys of previous Xsans basically being a bunch of files :)

It may work depending on the OS/Xsan version you are going to/from, but I would suspect you would encounter issues. If you're in that scenario where you need to do this, you're probably working late at night and just want to go back to last-known-good.

Depends on what has been lost. You can reconstruct a metadata LUN if you really wanted to, but there are only a few people in the world who would do so, so at quite a few points it's best to go back to backups.

brunsdan3's picture

Also, I should say that I already have both a DMG image of both server's main hard drives and a Time Machine backup. I'm planning on using either one to restore my previous OS version and Xsan configuration if need be. I'm keeping my fingers crossed though!

Lastly, I should also say that I can still see the LDAP that was previously used on Xsan 2 when we had that installed. Every time I open up the Directory Utility, I still see the fcps.csuchico.edu LDAP even though I can't connect to it (Error 2100) or see it in the Server app. I'm really suspicious that this LDAP is still being used to run my Xsan setup even though I'm not using that particular OD or LDAP in OS X Server. By not being able to connect to it though, I'm also wondering if I'm going to be hosed if I ever want to upgrade to Yosemite and OS X Server Yosemite which has Xsan built right in. Has anyone ever run into the Error 2100 before when working with LDAPs? I'd love to know how to get past it!

JSamuel's picture

Time Machine is horrible and ill-advised for servers. It's OK to have as a safety net, but as you're doing, don't rely on it or use it as the primary restore method. Carbon Copy Cloner (or a bit of rsync) is your friend.

IIRC, 2100 is an SSL issue. What do your logs say? You can force the bind/edit using CLI tools, but that's quite a lot to talk you through, so Google and man will be your friends when it comes to the dsconfig prefixed tools.

JSamuel's picture

In our client Xsan deployments on 10.10.x (well, hopefully 10.10.5 and something like SUIDGuard - or just offline, given root issues!) the MDC1 is OD master, DNS master, and primary Xsan controller. (It doesn't have to be the DNS master, but in most situations we've found it easier given the physical kit/network separation between the Xsan env and everything else.)

We have completed a few 10.6.8 to 10.10.x upgrades (direct, no stations...) for Xsan purposes, and it's behaved accordingly when working with separated LANs (a non-routing network for metadata traffic; coherent A/PTR etc). A main note is that the finalised Xsan will re-organise interfaces, so you will need to use 'cv' CLI tools (I forget the exact syntax) to ensure the Xsan traffic is being passed over the metadata network.

I started using deviceX (client or mdc) . division . company . xsan as the zone when I first started /thirtytwo so we still do that now (our pet subnet being 10.32.x.x - given our name) so as a result configurations look clean and quite easy to manipulate even if you have to handover directory and DNS to the wider org I.T.

As an aside, using OpenLDAP for the directory also works with Xsan4 in 10.10.x... Mostly. (So far.) But it does take a bit of hacking, as Xsan4 still must think it's a primary OD.

brunsdan3's picture

Hi Joel,

Thanks for the reply. Here's a little more information to fill in some gaps.

We do have a private network of 10.241.28.x and DNS set up by our I.T. department here at the school. We also have a public network, 132.241.3.x, with DNS set by Apple OS X Server (as far as I can tell). However, our client machine DNS has been set by the school's I.T. department.

That being said, and since you mentioned that the metadata for Xsan has to be set up on the private IP address and DNS, should we be setting up our OS X Server and Open Directory to be using the private IP (10.241.28.x) instead of the public IP address (132.241.28.)? I've always wondered if this was the right way to set up our Server app especially since the client machines don't have the same subnet for their public IP address (132.241.232.x).

Any help would be appreciated!

Thanks!

JSamuel's picture

Unless I'm missing something, 132.241.3.x is public IPv4 space, so I'm not sure how that works... In reality we're talking about two local networks. One of them has a gateway etc (so it's routing), which is where you get to the wider LAN resources and out to the webs. The other is non-conflicting, on a second NIC, and does not have a gateway (it could have DNS, ideally not), so OS X doesn't send any traffic down it apart from what is on that subnet (just the Xsan clients/MDCs).
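If you want to double-check which of the addresses in this thread sit in private space, Python's standard ipaddress module will tell you. This is only a minimal sketch: the host octets are made up for illustration, and only the network prefixes come from the posts above.

```python
import ipaddress


def address_space(addr):
    """Classify an IP address as 'private' or 'public'.

    is_private is true for the RFC 1918 ranges (10/8, 172.16/12, 192.168/16)
    and for some other non-global blocks such as loopback.
    """
    return "private" if ipaddress.ip_address(addr).is_private else "public"


# Host octets (.10, .11) are hypothetical; the prefixes are from the thread.
for addr in ("132.241.3.10", "10.241.28.11", "192.168.0.10"):
    print(addr, "->", address_space(addr))
```

The 132.241.3.x address comes back as public, which is the point being made above: it is routable address space even if the school treats it as an internal LAN.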

It does sound like a change is required.

For example, if the routing LAN (wider school subnet, has file shares on it, other stuff, has a gateway which eventually gets you out to the web etc) is 192.168.0.x/22 you would plug this into the first ethernet port (or on a single-port endpoint, a USB/Thunderbolt adapter). This would be your highest priority NIC in SysPref>Network, so OS X uses it as the default path for everything.

You would have a separate private LAN that is just IP/subnet (static IPs OR perpetual static allocation in the DHCP pool, but the DHCP service does not issue a gateway address in the DHCP response), of say, 10.32.0.x/24. This networking would ideally be physically diverse, and plug into the on-board NIC of the endpoints. Or, you could VLAN tag if you're in a larger physical network (but you never share the NICs, it's still a unique port). Just make sure I.T. understand latency is an issue and it should be prioritised in the switch config.

The wider LAN (192.168.0.x/22): the endpoint would use this to talk to central I.T.'s DNS services, and you could ask them to create records. Say, for example, your domain is dept.pretendco.com: you could ask them to set up mdc1.dept.pretendco.com and mdc1-meta.dept.pretendco.com, where mdc1.dept.pretendco.com is the MDC1's IP on the 192.168.0.x/22 network and mdc1-meta is 10.32.0.1 (for example). Get them to set up the PTRs too for good practise. Normal TTLs etc though :)
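To make the record request to I.T. concrete, here is a rough Python sketch that prints the matching A and PTR lines for one controller. The 192.168.0.10 address and the function name are assumptions for illustration; the host names, domain and 10.32.0.1 are the examples used above. The output is zone-file-style text to hand over, not something to load into a DNS server as-is.

```python
import ipaddress


def dns_records(hostname, domain, lan_ip, meta_ip):
    """Build A and PTR record lines for a host and its '-meta' twin."""
    lines = []
    for name, ip in ((hostname, lan_ip), (hostname + "-meta", meta_ip)):
        fqdn = f"{name}.{domain}."
        addr = ipaddress.ip_address(ip)
        # Forward record: name -> address.
        lines.append(f"{fqdn} IN A {addr}")
        # Reverse record: reverse_pointer yields e.g. 1.0.32.10.in-addr.arpa
        lines.append(f"{addr.reverse_pointer}. IN PTR {fqdn}")
    return lines


# 192.168.0.10 is a hypothetical LAN address for mdc1.
for line in dns_records("mdc1", "dept.pretendco.com", "192.168.0.10", "10.32.0.1"):
    print(line)
```

Keeping forward and reverse records generated from the same pair of values is one way to avoid the mismatched A/PTR entries that tend to cause directory and Xsan headaches.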

Your Xsan metadata network will then be as private (and low latency) as they come. No noise. Also disable IPv6 on all interfaces and so on. This then comes back to making sure Xsan is using the right network, which is easily done/checked (in cvadmin, it'll show all devices being named by the -meta DNS name)

Does the above make sense?

brunsdan3's picture

Hi Joel,

Thanks!

The way you've described the networking is exactly the way we have it now. A public network (managed through the school with a manual DHCP IP address and DNS that never changes) or 132.241.3.x, and a private network (which the school sets the DNS on but nothing else) which is set up manually with a static IP address, or 10.241.28.x. I'm not sure why we wouldn't put a gateway (or in Mac Preferences a router) on the NIC but that field is already blank so I won't be messing with it. As such, it seems that we are set up just the way you're talking about.

The problem with the OD still persists though. Even when I rebind my clients to the new OD, it seems to only see an older OD that no longer exists (as far as I can tell) and won't let me log in with the new users that I've created. If I could figure out this problem, or how to get past the 2100 error, that would be huge!

Dan

JSamuel's picture

You'll need to upload the logs from OD on the server for anyone to have the wildest chance of figuring this out :)

"I'm not sure why we wouldn't put a gateway (or in Mac Preferences a router) on the NIC but that field is already blank so I won't be messing with it."

Because a lack of a gateway indicates to OSX that it cannot route through it, so that NIC can only be used to talk directly to hosts on the same subnet. Perfect for Xsan and media content networks - i.e.: if that NIC was 10G and it wasn't Xsan, it was NAS.
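The gateway point can be illustrated with a deliberately simplified model in Python. This is not how OS X actually builds its routing table (real routing uses longest-prefix matching and more); it is a sketch that assumes the NICs are listed in service-order priority, and the interface names, addresses and gateway below are invented for the example.

```python
import ipaddress


def pick_interface(dest, interfaces):
    """Pick a NIC for dest from a priority-ordered list of interfaces.

    Each interface is a dict with 'name', 'cidr' (address/prefix) and
    'gateway' (a string, or None if the NIC has no router set).
    """
    dest_ip = ipaddress.ip_address(dest)

    # 1. A destination on a NIC's own subnet is reached directly.
    for nic in interfaces:
        if dest_ip in ipaddress.ip_interface(nic["cidr"]).network:
            return nic["name"], "direct"

    # 2. Anything else can only leave through a NIC that has a gateway.
    for nic in interfaces:
        if nic["gateway"]:
            return nic["name"], "via " + nic["gateway"]

    return None, "unreachable"


# Hypothetical setup: routed LAN first, gateway-less metadata LAN second.
NICS = [
    {"name": "lan", "cidr": "192.168.0.10/22", "gateway": "192.168.0.1"},
    {"name": "meta", "cidr": "10.32.0.11/24", "gateway": None},
]

print(pick_interface("10.32.0.1", NICS))  # metadata peer
print(pick_interface("8.8.8.8", NICS))    # off-subnet traffic
```

In this model the metadata NIC only ever carries traffic for hosts on its own subnet, and everything off-subnet is pushed out through the NIC with the router set.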

brunsdan3's picture

Thanks for explaining why you should not have a default gateway set for the private network on an Xsan. That makes perfect sense.

As for the logs, where would I find the logs for the OD? In the Console utility on the Mac? I'd love to upload them.

Thanks,

Dan

brunsdan3's picture

I finally found out what the problem was. First, I needed to install Mavericks and then Yosemite in order for the server to upgrade successfully and for Xsan's configuration files to update correctly. I had tried to upgrade right to Yosemite from Mountain Lion previously. This might have something to do with the different versions of Xsan between each OS upgrade, but I'm not sure.

When it comes to the Open Directory, I found that if I deleted the User's Home folder off of the server and the local edit station drives, the accounts would no longer show up in the user list when logging in to the client machines. Instead, once I bound to my new Open Directory server in the Users preference pane, I was able to log in by typing in both the username and password at the login screen. Once I had successfully logged in once with any given user, their name would then show up in the list of users every time I went to log in to the computer again. That seemed to solve it!

Hopefully that can help someone!

wrstuden's picture

brunsdan3 wrote:
"I finally found out what the problem was. First, I needed to install Mavericks and then Yosemite in order for the server to upgrade successfully and for Xsan's configuration files to update correctly. I had tried to upgrade right to Yosemite from Mountain Lion previously. This might have something to do with the different versions of Xsan between each OS upgrade, but I'm not sure."

I don't know how you fixed it, but you do not need to stop at Mavericks before going to Yosemite. You must have done some detail differently in the different updates, and that was what made it work.

You should be running the latest Software Updates of both Yosemite and Server. 10.10.3 and the associated Server.app update give a much better experience. You do need to activate the Xsan machine which is the OD Primary before any of the others. The older versions of Server.app didn't report an error well when you got this wrong.

Definitely save your /Library/Preferences/Xsan/ files as you can restore them if you get in a pinch. If you accidentally destroy your OD nodes, you can restore these files and re-activate. Note that this trick only works if the LDAP information for the Xsan config is missing.