Very strange volume issue 2Gb vs 4Gb

tye1138's picture

Howdy, I've been working with Xsan since its very first rendition, doing build-outs and supporting the solution for clients in and around Los Angeles. I currently work for a post-production house here, and my IT staff and I have been puzzled by a recent problem that we'd like some clarification on.

We have an 800TB Xsan with 5 volumes. Clients and servers are all Intel, running Mac OS X 10.6.8 and Xsan 2.2.1. We have over 40 clients on the Xsan, most of which are using Final Cut Pro. We have mixed storage brands on the volumes, Promise and Active. We added Promise storage and expanded the volumes about 6 months ago. The solution has been bulletproof for years. We do minor maintenance like defragging and rebooting volumes every month, but everything has been working great.

In January of this year, one of our volumes was critically low on space. A huge restore and data copy was done overnight, and in the morning we found the volume to be sluggish. Of course, the solution was very easy: delete unused material. However, we noticed random glitches and dropped frames after we opened up the free space. Defragging seemed to help some files, but only made others worse. As a last resort, we re-copied some of the files, and many of those still don't work on this particular volume. Copy those same files to a different volume on the same SAN and they work fine.

It seems like we have a volume-level problem, but here is the funny part. We have 2 machines in the building with 2Gb fiber cards; everyone else is running 4Gb fiber. Those 2Gb fiber machines have no issues with the volume in question. In fact, they work fine doing everything!

We've got a great IT staff here and we're all stumped. We've checked the switches and the log files on every device and can't find anything. We've had an Xsan/Active expert log in and check things, and he couldn't find anything wrong. The clue to the problem lies in the 2Gb vs 4Gb fiber situation. The 2Gb fiber cards mask the issue because they're not grabbing data at the same speed as the 4Gb cards. The volume's speed is the issue, or is it? Why are some files fine and others not?

I appreciate any feedback!

thanks!

tye

abstractrude's picture

Tye

You kind of broke rule number one by letting the volume/pools fill up.

You then created what's called free space fragmentation by running snfsdefrag and/or deleting. Anyway, to fix this, blow out your volume and start from scratch, or copy off all the data, delete everything, and restore the data, leaving plenty of free space; 5-20% is safe with modern volume sizes. Remember, running snfsdefrag often is a bad idea. If you're doing things correctly you should rarely, if ever, need to run it, unless of course you are re-allocating data from one pool to another.

snfsdefrag can get you out of a jam, but it is not part of a normal maintenance cycle. As for the 2Gb vs 4Gb, that's weird. Maybe the read-ahead increase on the 4Gb cards causes it to hit more free space fragments, causing glitches and dropped frames you don't see with 2Gb.

I never do this on here, but I am in LA and can help for a fee. Let me know. Or I will try to help you on here for free! :)

And LMAO, who is your Xsan expert? It's pretty easy to tell this issue from the cvsummary. what a n00b. expert my ass. n00000bbbbbbbb

Quote:
The 2Gb fiber cards mask the issue because they're not grabbing data at the same speed as the 4Gb cards. The volume's speed is the issue, or is it? Why are some files fine and others not?
/quote

I think you are on the right track about grabbing more data at 4Gb.
As for why some files are good and some are not, it has to do with their fragmentation. When certain files were written, the file system had to search the volume for free space and allocated them across thousands of different places, increasing the time for fs requests. Anyway, this problem is moot; your volume is hosed in my opinion. Follow the steps in this post to fix it.

Run snfsdefrag -c on your bad files to see the number of extents. I bet they have a ton of extents.
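A minimal sketch of running that check over a batch of files, in Python. The only thing taken from this thread is the `snfsdefrag -c` invocation itself; the helper names are made up for illustration, the sketch assumes `snfsdefrag` is on the PATH of the client it runs on, and it prints the tool's output as-is rather than assuming any particular output format.

```python
import shutil
import subprocess
import sys
from typing import List


def extent_count_command(path: str) -> List[str]:
    """Build the extent-count invocation mentioned in the thread for one file."""
    return ["snfsdefrag", "-c", path]


def report_extents(paths: List[str]) -> int:
    """Run the extent count on each file and echo whatever the tool prints.

    Returns 0 if the tool was found, 1 if it is not on PATH.
    """
    if shutil.which("snfsdefrag") is None:
        # Assumption: the tool is on PATH; adjust if your install keeps it elsewhere.
        print("snfsdefrag not found on PATH", file=sys.stderr)
        return 1
    for path in paths:
        result = subprocess.run(
            extent_count_command(path), capture_output=True, text=True
        )
        # No parsing: the output format is not assumed here.
        print(result.stdout.strip() or result.stderr.strip())
    return 0
```

Calling `report_extents(sys.argv[1:])` from a small wrapper script would let you pass the suspect clips on the command line and compare their extent counts against files that play back fine.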

I have a rule of thumb for StorNext video volumes: if you need to run snfsdefrag to play out, you need to burn the volume to the ground. :roll: :roll: :roll: :arrow:

-Trevor Carlson
THUMBWAR

tye1138's picture

abstractrude wrote:

You kind of broke rule number one by letting the volume/pools fill up. /quote

Yep! It's hard when you have 40 people hitting the volume at the same time, writing, saving, updating and deleting files. This is the problem with single-volume architectures.

abstractrude wrote:

You then created what's called free space fragmentation by running snfsdefrag and/or deleting. Anyway, to fix this, blow out your volume and start from scratch, or copy off all the data, delete everything, and restore the data, leaving plenty of free space; 5-20% is safe with modern volume sizes. Remember, running snfsdefrag often is a bad idea. If you're doing things correctly you should rarely, if ever, need to run it, unless of course you are re-allocating data from one pool to another. /quote

It's even worse than that. We didn't know this until it was too late, but some editors were writing open-ended files to the drive. This is a big no-no for Xsan and it causes really bad fragmentation. We yelled at the guys who did it and it won't happen again, but the damage is done.

We ran snfsdefrag as a last resort because the fragmentation was so high. We deal with HUGE files, 40 - 100GB is average, so when the volume gets full, it has no clean space to put the data and fragmentation is the result.

abstractrude wrote:

And LMAO, who is your Xsan expert? It's pretty easy to tell this issue from the cvsummary. what a n00b. expert my ass. n00000bbbbbbbb /quote

LOL, it's been 8 years since I've had to fix a SAN issue... and I bet our other guys have similar trouble remembering commands they never use. Usually when I've had issues like this, I've copied the data off and blown the SAN away. Sadly, in this case we have two 400TB volumes and nowhere to put the data, so we're screwed.

abstractrude wrote:

Run snfsdefrag -c on your bad files to see the number of extents. I bet they have a ton of extents. /quote

Files are all clean, fragmentation is gone. We've also tried to duplicate bad files and replace them with no luck.

abstractrude wrote:

I have a rule of thumb for StorNext video volumes: if you need to run snfsdefrag to play out, you need to burn the volume to the ground. :roll: :roll: :roll: :arrow: /quote

Great information and it sucks when you get taught one thing, but the truth is entirely different!

I appreciate your help and if I do get into a bind, I'll shoot you a PM. We have some great people here and I believe you're completely right about the volume being hosed.

thanks again

tye

abstractrude's picture

Just to be clear, I wasn't saying you were doing anything wrong. But your consultant should have identified this issue pretty quickly. I'm sorry he did not.

The way I solve the problem of volumes filling up is simple. I write a script that watches volume usage. Once the volume hits a certain amount of used space, the script starts emailing everyone on the volume, especially producers and creative directors, first thing in the morning: "Warning, volume almost full," etc.
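A rough sketch of that kind of watcher in Python, not the actual script. Everything specific here is an assumption for illustration: the 20% free threshold (the top of the 5-20% range mentioned earlier in the thread), the sender address, and an SMTP relay on localhost. It also only sees what the OS reports for the mounted volume, so it will not notice a single storage pool filling up ahead of the rest.

```python
import shutil
import smtplib
from email.message import EmailMessage
from typing import List, Optional

THRESHOLD_PCT_FREE = 20.0  # assumption: warn when free space drops below 20%


def percent_free(total_bytes: int, free_bytes: int) -> float:
    """Free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def build_warning(
    volume_path: str,
    pct_free: float,
    recipients: List[str],
    threshold: float = THRESHOLD_PCT_FREE,
) -> Optional[EmailMessage]:
    """Return a warning email if the volume is below the threshold, else None."""
    if pct_free >= threshold:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"WARNING: {volume_path} is down to {pct_free:.1f}% free"
    msg["From"] = "san-monitor@example.com"  # hypothetical sender address
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        f"{volume_path} has {pct_free:.1f}% free space left "
        f"(threshold {threshold:.0f}%). Please archive or delete unused media."
    )
    return msg


def check_volume(
    volume_path: str, recipients: List[str], smtp_host: str = "localhost"
) -> bool:
    """Check one mounted volume and send a warning if needed. True if mail was sent."""
    usage = shutil.disk_usage(volume_path)
    msg = build_warning(
        volume_path, percent_free(usage.total, usage.free), recipients
    )
    if msg is None:
        return False
    with smtplib.SMTP(smtp_host) as smtp:  # assumption: a local mail relay exists
        smtp.send_message(msg)
    return True
```

Something like `check_volume("/Volumes/ExampleVolume", ["producers@example.com"])` run from cron or launchd each morning would give the behavior described above; both the mount path and the address are placeholders.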

Good luck!

-Trevor Carlson
THUMBWAR

tye1138's picture

abstractrude wrote:
Just to be clear, I wasn't saying you were doing anything wrong. But your consultant should have identified this issue pretty quickly. I'm sorry he did not. /quote

Yep, oh well.

abstractrude wrote:

The way I solve the problem of volumes filling up is simple. I write a script that watches volume usage. Once the volume hits a certain amount of used space, the script starts emailing everyone on the volume, especially producers and creative directors, first thing in the morning: "Warning, volume almost full," etc.

Good luck!/quote

We actually do have a system which warns everyone. Sadly, the editors work at night and sometimes they MUST get the work done. We're just going to implement a new policy that keeps more than enough free space on the volumes in the future.

I told my lead tech about our conversation today and he rolled his eyes, because deep down inside we both knew the solution was gonna be blowing the volume away.

For the time being, we're going to archive and delete a lot of data off the volume and then copy off the corrupt files, delete them and copy them back on again. Hopefully that will resolve some of the issues... It's the best thing we can do! :(

thanks again!

tye

abstractrude's picture

The only other thing you can do is move all the data to one storage pool and back. You need to reclaim that space as fresh.

-Trevor Carlson
THUMBWAR

tye1138's picture

abstractrude wrote:
The only other thing you can do is move all the data to one storage pool and back. You need to reclaim that space as fresh./quote

Basically re-formatting that volume.

Worst part is, now our other volume is critically full. :(

It's such a lose-lose situation right now, sucks!

thanks again for your help.

tye