xSAN 3.1 Very Slow to show Folder Listing/Contents

Growl

Hi all,

I wonder if anyone can shed any light on an odd problem I am experiencing. Our SAN is online and clean with no errors. Reads and writes are fast (testing at over 800 MB/s). Spec:

2 x MDCs
8 x Clients.

All - Xserve3,1
10.9.5 (13F1112 - not patched this year yet)
xSAN 3.1. Spotlight, ACLs, case insensitivity - enabled. Extended attributes - disabled.
Block size = 16k, allocation strategy = round robin.
The fsm process on the primary MDC runs at about 200% CPU. It spikes up to 450% and then drops back to 150% without much load.
I failed over at the end of last year as a ‘test’ and, bar the ‘spotlight’ antics, it was fine. The secondary MDC showed similar fsm values, so I have assumed this to be normal.
24GB RAM, local system 240GB SSDs.
Metadata runs through 2 x 12-port Cisco 500G switches (flat - no mgmnt ctl - latest iOS). Nothing clever bar a dual 1Gb trunk between them. MDCs + clients are split across them for semi-resilience.
Fibre is all 4 x 4Gb from each node into 2 x Brocade 4100 (32pt full config running FabOS v6.4.3g - not upgraded to 6.4.3h yet). Again - MDCs/clients are spread across them for full resilience.

Storage: 2 x ActiveRAID 32GB units. Total presented storage = 36TB on the main volume. MD = 2TB
35% full.

Content is made up of about 1.5m files of varying sizes, all re-shared from the clients, primarily via AFP (2 of them serve both AFP and SMB).

For the last 6 months we have had intermittent reports of it running ‘slowly’. By that I mean file copying and searching are still very fast - the issue is ‘folder lag’. Initially this seemed to affect only clients of the 10.9 reshares, not 10.11, but it is now being seen not only through the reshares but on the xsan clients themselves, so it is visibly present before re-sharing comes into it. It can happen either in the Finder or from within an app. Double click on a folder (or click the disclosure triangle) and it can take ages to display the contents. Sometimes the folder is empty, sometimes it holds hundreds of files - either way it still takes ages to think about it (anything from 4s to 1min). It is worse when under load. I have started to look at .DS_Store files, disabling icon previews etc. Not sure if I am working along the right lines…
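To take the Finder out of the picture, the lag could be timed directly on an xsan client. Below is a rough sketch only, not something I have run against this SAN. It assumes Python 3.6 or later is installed on the client (10.9.5 does not ship it), and the function names `time_listing` and `sample` are just ones made up for illustration. It separates the time taken to read the directory itself from the time taken to stat every entry, since either could be what the Finder is waiting on.

```python
import os
import time


def time_listing(path):
    """Return (entry_count, seconds_to_enumerate, seconds_to_stat_all) for one folder."""
    start = time.perf_counter()
    with os.scandir(path) as it:
        entries = list(it)  # directory read only: names, no per-file metadata
    listed = time.perf_counter()
    for entry in entries:
        try:
            entry.stat(follow_symlinks=False)  # one metadata lookup per entry
        except OSError:
            pass  # entry vanished or is unreadable; skip it
    done = time.perf_counter()
    return len(entries), listed - start, done - listed


def sample(path, repeats=5, pause=1.0):
    """Time the same folder several times to catch intermittent stalls."""
    results = []
    for _ in range(repeats):
        results.append(time_listing(path))
        time.sleep(pause)
    return results
```

Calling `sample()` on a folder that is known to lag (for example a path under /Volumes on the client) and printing the three values per run would show whether the delay is in the enumeration, in the per-file lookups, or in neither. If it is in neither, that would point back at the Finder or the app rather than the volume.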

I need to migrate another 20 high-bandwidth users onto this in about 4-6 weeks, and that is not possible until we resolve this issue.

I stop the SAN at least every 4 months and give it a full cvfsck (-j / -nv / -wv / -nw). Nothing has come up. Nothing in the logs points to anything wrong at all…

Frankly I’m stumped.

Has anyone seen this and/or got any ideas what might cause this behaviour? Have I made a schoolboy error somewhere that I am just not seeing?
(FYI - I built this in parallel to our old, now decommissioned SANs: 2.2.2 (Intel 10.6.8) and 2.1.1 (PPC 10.5.8). Both ran for many years without issues (bar Spotlight!). This replaced our 1.4.2 SAN.)

Massive thanks in advance for any assistance or pointers