I apologize for the long post, but I've been dealing with XSan issues for weeks now and wanted to get it all off my chest. Hopefully you'll still be willing to help a tired administrator by reading through it and offering any advice you can!
About three weeks ago, we decided to upgrade our Mac OS X Mountain Lion servers to OS X Yosemite. I'm the laboratory technician for a documentary production house at a university in northern California. We use the Adobe suite on our two editing workstations, and we weren't able to upgrade to CC 2014 because it requires a newer version of OS X than we were running.
At our production lab, we also run a Promise 32 TB RAID system that serves our workstations (edit bays) over Fibre Channel using XSan 2.25. The RAID is also connected to our two servers, which run OS X Server on Mountain Lion. One server, named ALVA, is supposed to be the Open Directory master, while the other, named ALVA2, is supposed to be the OD replica as well as the metadata controller (MDC). Our XSan had been working fine on this system for almost two years.
We started by upgrading each server to Mavericks. Then we updated the OS X Server application to the Mavericks version and upgraded XSan to XSan 3. This seemed to work perfectly. In fact, even before we upgraded our MDC to XSan 3, it was already able to mount the SAN volume named "Anthro". Oddly, though, we could not get the Server software on ALVA2 (our MDC and OD replica) to reconnect as an OD replica. ALVA found all of its Open Directory settings during the upgrade and ran the OD master just fine, but ALVA2 could not be reconnected as an Open Directory replica. We decided to ignore this and went ahead and upgraded both servers to OS X Yosemite. When they started up, neither server could see our SAN volume, "Anthro". That was odd, but it got stranger. We then upgraded each server to OS X Server for Yosemite. Once again, the Open Directory master on ALVA upgraded and reconnected without an issue, while we once again had to reconnect ALVA2 as the Open Directory replica. This time, everything worked beautifully. Unfortunately, that's when things got even stranger.
Next, we tried to start XSan on our previous metadata controller, ALVA2. After clicking the XSan tab to start the service in the Server app, we were asked whether we wanted to create a new SAN volume, connect to an existing SAN volume, or upgrade from a previous XSan configuration. So far, so normal. We chose the third option, but after the window showed a spinning progress bar for a second or two, the process stopped. This happened every time we chose to upgrade from a previous configuration. To troubleshoot, we tried enabling XSan on the other server, ALVA, and hit the same problem.
My only lead right now came when I opened the Directory Administrator tool on the computer that had previously been the metadata controller. I noticed that a second LDAPv3 server was configured there, called fcps.csuchico.edu. That is the FQDN we gave ALVA when it ran Final Cut Server four years ago, and it is still the DNS name of our first server, ALVA. There was also an LDAP server listed as 127.0.0.1, which I recognized as the standard entry created when Open Directory is first turned on. However, when I tried to connect to the fcps.csuchico.edu LDAP server using Directory Administrator, I got error 2100. I also tried to delete the fcps.csuchico.edu LDAP server by running this Terminal command: sudo slapconfig -destroyldapserver. Afterward, the LDAP server still appeared in Directory Administrator. In addition, our two edit stations, which use Local Network accounts, were only seeing the older Open Directory users and not the ones I had set up most recently. This remained true even after I changed the binding on both edit machines and rebound them to the Open Directory master, ALVA, by its DNS name.
Based on all of this, I believe an Open Directory LDAP server had previously been set up on the first server, ALVA, and that the first XSan installation (XSan 1.1) was likely tied to the Open Directory created by the Server application. At some point (I don't know when), the Open Directory master on ALVA was deleted or became corrupt, leaving XSan still running but with no way to edit the LDAP server through the Server app or Workgroup Administrator. The users, however, remained intact and could still log in to the LDAP server from the client machines. OS X Server on Yosemite now manages XSan users through the Open Directory in the Server app, so I suspect the corrupted LDAP server is no longer reachable, or not even recognized. As a result, the RAID SAN volume "Anthro" can't mount on either the servers or the client machines. That's the best explanation I've come up with over the last three weeks. The problem is that I feel I'm missing something in this whole process, and every solution I've tried has failed. My current OS X Server setup also feels like it could collapse at any moment. I'd love to just erase both servers, destroy the XSan volume, and start fresh with a new Open Directory and XSan volume, but with 36 TB of data, we don't have a drive that could back it all up. We have archived many of the projects on the SAN RAID to LTO-5 tape, but I don't trust restoring from tape right now.
Any help as to what I should check or what I should do would be appreciated! Thanks!