Fai2

Q: Client Beachballs on OSXS

TL;DR: Server connections slow down to a crawl for clients, no apparent reason.

________________________________________________________________________


Server: Final model Intel Xserve, quad Xeon 2.26 GHz, 12GB RAM, running El Cap 10.11.6/Server 5.1.7 with all updates from factory internal SSD. Served files are on separate drive. Bare metal system install and reconfig a few months ago (server previously ran Snow Leopard). Our only enabled services for clients are File Sharing and Time Machine. OD is off since we don't use any services that require it.

 

Network: Wired cat 6 GigE through HP managed switch, plus wireless for mobile devices. About 60 total network connections at any time. Xserve has bonded ethernet link (i.e. double GigE) to switch. Internet connection via Comcast Business is theoretically 150/20 Mbps down/up but very variable in practice.

 

Clients: About 30 Macs, recent OS versions (except one with 10.6.8). Some connected via AFP, some SMB. About 8 core users mostly on 27" iMacs with lots of RAM launch Adobe CC files from the server over the network and work that way. Remaining users are typically connected to the server but not actively using it most of the time.

 

Problem: Periodic drastic slowdowns for core users. Normal activities such as saving or opening files over the network now take place at a glacial rate, accompanied by the dreaded spinning beachball. All core users are affected when this happens. Affected apps are Adobe CC (Photoshop, Illustrator, InDesign), but that's what these core users work with almost exclusively, so may not be significant. However these apps are said to be fairly chatty (phone home, autosave, etc.). Often but not always occurs toward the end of the day.

 

Observations: Restarting the server usually clears the problem...until the next time. Then sometimes we're OK for a few days, sometimes not. No apparent correlation with AFP or SMB connections. Server logs don't show anything that catches my eye, but they're cryptic enough anyway for mere mortals. Stats in Server.app show very light server loading for CPU (<20%), memory (<6GB), network (<25MB/s). Activity Monitor shows no runaway processes. The server drives' SMART status is good. Basically our Xserve hardware is in good shape and just loafing along most of the time.

 

At this point I'm desperate for any ideas, or suggestions for a methodology to arrive at a solution.

Xserve, OS X Server, El Cap

Posted on Aug 24, 2016 9:22 AM


  • by cdhw,Helpful

    cdhw cdhw Aug 25, 2016 11:04 AM in response to Fai2
    Level 4 (2,623 points)
    Servers Enterprise

    IME, you get these symptoms if the switch is dropping packets. AFP (I'm not sure about SMB) has a crucial ACK packet with a long (30s IIRC) timeout for a response. If your network loses that packet, or its response, the client just locks up for 30s at a time. Packet-sniffing (tcpdump or Wireshark) with appropriate filters would demonstrate if this was the problem.
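One way to look for the lock-ups described above is to export the packet timestamps for a single client/server conversation from a capture and search for long silences. The following is a minimal sketch, assuming the timestamps are already available as seconds in a plain list; the function name `find_stalls` and the `min_gap` parameter are hypothetical, and the 30 s default only mirrors the figure recalled above, not a verified protocol constant.

```python
from typing import List, Tuple


def find_stalls(timestamps: List[float], min_gap: float = 30.0) -> List[Tuple[float, float]]:
    """Return (start_time, gap_length) for every silence of at least min_gap seconds.

    timestamps: packet arrival times in seconds for ONE conversation
    (assumed to be exported from a capture tool beforehand).
    """
    ordered = sorted(timestamps)
    stalls = []
    # Compare each packet with the one that follows it.
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap >= min_gap:
            stalls.append((previous, gap))
    return stalls
```

A conversation that is merely idle will also show gaps, so any hit still has to be checked against the time a user actually reported a beachball.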

     

    If your users are treating the server volume like a local drive and opening the files 'in place' for editing then performance is likely to be disappointing. It's much better from many points of view to download to a local disk, do the work, then upload again.

     

    If you haven't already done so, make your menu clocks display the time in seconds and give each workstation a notepad and pen to make a note of the precise displayed time (HH:MM:SS) that beach balling occurs. You can then narrow down the section of logs you need to study and see whether all machines are affected at the same time.
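Once the notepads come back, the "same time on all machines?" question can be answered mechanically. This is a minimal sketch, assuming the notes have been typed in as (workstation, "HH:MM:SS") pairs for a single day; the workstation names, the function name `simultaneous_events` and the 60-second tolerance are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def simultaneous_events(reports, tolerance_s=60):
    """Group beachball reports that start within tolerance_s of each other.

    reports: list of (workstation, "HH:MM:SS") tuples from one day.
    Returns a list of groups (lists of workstation names) with 2+ members.
    """
    parsed = sorted((datetime.strptime(t, "%H:%M:%S"), ws) for ws, t in reports)
    groups = []
    for when, ws in parsed:
        # A report joins the current group if it is close to that group's first report.
        if groups and when - groups[-1][0][0] <= timedelta(seconds=tolerance_s):
            groups[-1].append((when, ws))
        else:
            groups.append([(when, ws)])
    return [[ws for _, ws in group] for group in groups if len(group) > 1]


notes = [("imac-1", "16:35:10"), ("imac-2", "16:35:40"), ("imac-3", "11:02:00")]
shared = simultaneous_events(notes)
```

Groups with several workstations point at the server or the network; isolated reports point at an individual client.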

     

    I fought a problem like this for months. It wasn't solved until the network team replaced an HP switch with a Juniper one.

     

    C.

  • by Fai2,

    Fai2 Fai2 Aug 25, 2016 7:51 AM in response to cdhw
    Level 1 (14 points)
    Servers Enterprise

    @cdhw: Thanks for your suggestions. Network issues are one of the things I'd considered, but lacked the tools (and expertise...) to explore. I've now downloaded Wireshark and its (150 page, gulp!) user manual and will install it this weekend.

     

    However our network isn't highly loaded. The server stats graph shows an eyeball average of 5 MB/s outgoing during the workday, with a handful of very brief spikes to 20 MB/s or more. Incoming volume is much lower. This doesn't seem like much stress for a GigE wired network. I also have jumbo frames enabled on the switch, which should reduce overhead. Since I had to reboot the server yesterday, this morning I also had a look at the network traffic in Activity Monitor. It's sent out about 80 GB over 7 work hours, which averages out to 3 MB/s, roughly consistent with the server's stats report. Incoming is about another 20% over that. SMB (my 8 core users) is about 80% of total traffic, AFP the rest. Much of the AFP traffic is probably Time Machine client backups to external drives at the server.
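The averaging above is easy to reproduce. This is a small sketch of the arithmetic only, assuming decimal units (1 GB = 1000 MB) and a single 1 Gbit/s link for the utilisation figure; the function names are hypothetical.

```python
def average_rate_mb_per_s(total_gb, hours):
    """Average throughput in MB/s for total_gb transferred over the given hours."""
    return total_gb * 1000 / (hours * 3600)


def link_utilisation(rate_mb_per_s, link_gbit_per_s=1.0):
    """Fraction of the link's raw capacity used at the given rate."""
    link_mb_per_s = link_gbit_per_s * 1000 / 8  # 1 Gbit/s is 125 MB/s
    return rate_mb_per_s / link_mb_per_s


rate = average_rate_mb_per_s(80, 7)   # about 3.17 MB/s
share = link_utilisation(rate)        # roughly 2.5% of one GigE link
```

An average this low says nothing about short bursts, so it supports the "not highly loaded" reading without ruling out brief congestion.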

     

    Re working directly off the server rather than copying locally, this is pretty much a workflow requirement for non-technical reasons. It's also standard practice in our industry. We have only 8 users working this way, and all the stats suggest the server is way under-stressed in every respect. In between the beachball events everything works smoothly this way.

     

    Replacing the switch certainly isn't out of the question, but it's not just spaghetti I want to throw at the wall either. IIRC we paid about $800 for the 48-port HP about 4 years ago. OTOH our 8 core users are billing out at well over $100 an hour, so there's that. It'll be interesting to see if I can glean anything from Wireshark.

  • by dwbrecovery,Helpful

    dwbrecovery dwbrecovery Aug 25, 2016 11:04 AM in response to Fai2
    Level 3 (520 points)
    Servers Enterprise

    Hi fai2,

    Check as well:

    - On the machines of the 8 core users, open Console.app and search for com.apple.backupd messages to confirm that Time Machine backups are completing.

    - On the Server, within Server.app, check under Logs -> AFP Access Log and AFP Error Log, and check Connected Users under the File Sharing service.

     

    There might be error messages on the machines and the Server in these areas that help to identify the issue.

     

    HTH

     

    Cheers, dwbrecovery

  • by Fai2,

    Fai2 Fai2 Aug 25, 2016 11:02 AM in response to dwbrecovery
    Level 1 (14 points)
    Servers Enterprise

    @dwbrecovery:

     

    Thanks for the suggestions. I'll check those out.

  • by lennert020,

    lennert020 lennert020 Sep 7, 2016 3:26 AM in response to Fai2
    Level 1 (4 points)
    Servers Enterprise

    Fai2,

     

    I'm having the exact same problem:

    • periodic drastic slowdown
    • all users affected
    • Adobe CC doesn't help (and although Adobe says not to use a server with InDesign it's not efficient to save files locally)
    • Turning File-Sharing on and off solves the problem
    • only using SMB but that is not relevant
    • stats show very light server load
    • Activity monitor flat

     

    Have started to install Wireshark on another machine and hoping to get something through there, but that is another learning curve. Other suspects are my HP switches (cdhw), AirPort Extreme base stations, and cabling in the building. Thinking about getting Meraki switches and base stations. Not happy that Apple turned off SNMP on the latest AirPort Extremes. Makes it impossible to see the throughput of those devices.

     

    Let me know if you have progressed since this posting.

     

    P.S. Am only running File Server

     

    Lennert

  • by Fai2,

    Fai2 Fai2 Sep 7, 2016 5:57 AM in response to lennert020
    Level 1 (14 points)
    Servers Enterprise

    Hi Lennert,

     

    Thanks for your post. I installed Wireshark about 10 days ago and have used it to check periodically. However we have lots of network traffic (although well within the network capacity), so analyzing it is a bit like trying to drink from a fire hose. Nonetheless, when reviewing the packet contents I see nothing that appears unusual. I'm inclined to think that this type of network issue is not involved.

     

    However information from another admin with the same problem might implicate SMB. They have found that killing the server's SMB process via Activity Monitor (it relaunches automatically shortly afterward) clears the problem for them, although then the clock starts counting again and the slowdown eventually reoccurs. In their case they also support Windows clients, so completely turning off SMB isn't an option.

     

    However, we have no Windows clients, so it is something I can try. This past weekend I disabled SMB for all the server's sharepoints and reconfigured the client Macs not to call for SMB connections. Today, all log-ins are normal and showing as AFP, with no beachballs so far. Since Monday was a public holiday here we have only one day of experience with this. If things keep working through the end of the week I might begin to feel some real confidence.

     

    A couple of things about completely disabling SMB. The first is that, while the server still shows up in the list of shares in a client Mac's Finder window sidebar, it won't successfully log on that way. No idea why. Instead we're directly calling the server's local IP address via the Connect to Server item in the Finder's Go menu. I've now set up aliases to the sharepoints on the client Macs so that our users don't have to think about this. The second thing is that even if reverting to AFP turns out to solve the immediate problem, we know that Apple intends to do away with AFP at some point, so there's still a longer term problem.

     

    There's also been a suggestion of setting up a cron script to periodically kill the SMB process and allow it to respawn and reset. If that works it might be a practical solution to continued use of SMB.
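A rough sketch of what such a scheduled job could look like follows. It is an illustration under assumptions, not a tested fix: it assumes a Python 3 interpreter is installed on the server, that the job runs as root, that the process is named `smbd`, and that the system relaunches it automatically as described above. The function names are hypothetical; `pgrep -x` is the standard exact-name process lookup.

```python
import os
import signal
import subprocess


def parse_pids(pgrep_output):
    """Turn pgrep's one-PID-per-line output into a list of integers."""
    return [int(token) for token in pgrep_output.split() if token.isdigit()]


def find_smbd_pids():
    """Ask pgrep for processes whose name is exactly 'smbd' (assumed process name)."""
    result = subprocess.run(
        ["pgrep", "-x", "smbd"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_pids(result.stdout)


def terminate(pids, kill=os.kill):
    """Send SIGTERM to each PID; return the PIDs that were actually signalled."""
    signalled = []
    for pid in pids:
        try:
            kill(pid, signal.SIGTERM)
            signalled.append(pid)
        except ProcessLookupError:
            # The process exited between the lookup and the signal.
            pass
    return signalled
```

A scheduled run would simply call `terminate(find_smbd_pids())`. Clients with open files will see their connections drop, so a quiet time of day is the safer slot for it.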

  • by lennert020,

    lennert020 lennert020 Sep 7, 2016 6:10 AM in response to Fai2
    Level 1 (4 points)
    Servers Enterprise

    Hi Fai2,

     

    Agree on Wireshark. It's a lot of new information I would have to learn how to analyze. Unfortunately for us we have a couple of Windows users so I can't turn off SMB either. And I agree with you that SMB is the way to go.

     

    With respect to Activity Monitor you mean smbd, correct?

     

    Good luck with your AFP experiment. Will see what happens if I reset smbd when it's getting out of control.

     

    Lennert

  • by lennert020,

    lennert020 lennert020 Sep 7, 2016 7:51 AM in response to lennert020
    Level 1 (4 points)
    Servers Enterprise

    Hi Fai2,

     

    Just happened to us. Force Quit smbd and after a couple of seconds everything was back without the need to turn File-Sharing on/off or restart.

     

    Below are the messages I found in Console around the time of the problem (±16:35):

    07/09/16 15:56:47,467 servermgr_smb[11680]: idle exit

    07/09/16 15:56:47,467 Server[538]: Dispatcher: servermgr_smb plugin disconnected

    07/09/16 16:35:24,271 servermgr_smb[12490]: validating connection from 501 : 100008

    07/09/16 16:35:24,271 servermgr_smb[12490]: Connected to the Auth Service

    07/09/16 16:35:24,275 servermgr_smb[12490]: Connected to the Notify Service

    07/09/16 16:35:56,308 smbd[588]: exiting on signal Terminated: 15

    07/09/16 16:35:57,504 smbd[12496]: reconnect_durable_handle:  did not find a match for guid

    07/09/16 16:35:57,506 smbd[12496]: reconnect_durable_handle:  did not find a match for guid

    07/09/16 16:35:58,113 smbd[12496]: reconnect_durable_handle:  did not find a match for guid

    07/09/16 16:35:58,139 smbd[12496]: reconnect_durable_handle:  did not find a match for guid

    07/09/16 16:35:58,139 smbd[12496]: reconnect_durable_handle:  did not find a match for guid

    07/09/16 16:36:13,321 smbd[12496]: BUG in libdispatch client: kevent[EVFILT_VNODE] delete: "Bad file descriptor" - 0x9
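Console output like the lines above can be narrowed to the minutes around an incident with a short script. This is a minimal sketch, assuming the exact line layout shown here (DD/MM/YY date, comma before the milliseconds, `process[pid]:` prefix); other locales or log sources will need a different pattern, and the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "07/09/16 16:35:56,308 smbd[588]: exiting on signal Terminated: 15"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) (\S+?)\[(\d+)\]: (.*)$"
)


def parse_line(line):
    """Return (timestamp, process, pid, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    stamp, millis, process, pid, message = match.groups()
    when = datetime.strptime(stamp, "%d/%m/%y %H:%M:%S")
    when += timedelta(milliseconds=int(millis))
    return when, process, int(pid), message


def around(lines, incident, minutes=5):
    """Keep only parsed entries within +/- minutes of the incident time."""
    window = timedelta(minutes=minutes)
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e and abs(e[0] - incident) <= window]


sample = [
    "07/09/16 15:56:47,467 servermgr_smb[11680]: idle exit",
    "07/09/16 16:35:56,308 smbd[588]: exiting on signal Terminated: 15",
    "07/09/16 16:35:57,504 smbd[12496]: reconnect_durable_handle:  did not find a match for guid",
]
hits = around(sample, datetime(2016, 9, 7, 16, 35))
```

Feeding it the times users wrote down for each beachball makes it easier to compare what the server logged across several incidents.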

     

    Lennert

  • by Fai2,

    Fai2 Fai2 Sep 7, 2016 10:53 AM in response to lennert020
    Level 1 (14 points)
    Servers Enterprise

    Well, my "turn off SMB and run AFP only" experiment was a miserable failure. Late this morning the slowdown struck again after only one-and-a-half days of operation. Toggling File Sharing services off and then on again seemed to have fixed it. Over lunch (we're on US Eastern Time) I also rebooted the server, since I was in the office and I could easily do it.

     

    This would seem to exonerate SMB, which had not been running, but leaves open the question of whether there's an AFP problem. However since AFP is necessary for Time Machine, turning it off really isn't practical.

     

    It seems as likely to me that the problem isn't with AFP or SMB specifically, but with some part of the file sharing process in general. Something builds up over a period of 1-3 days until a critical level is reached, at which point the client user experience falls off a cliff. The server itself seems not to be affected. My remote connections using ARD work fine, as does the UI at the server if I happen to be there. This means that while my users are going crazy with frustration, I can't see the problem myself — very frustrating.

     

    This weekend I'll probably turn SMB back on, since I now see no point in suspending it. Next time the slowdown happens I'll try killing the smbd (and AppleFileServer?) processes via Activity Monitor.

  • by dalenorman2005,

    dalenorman2005 dalenorman2005 Sep 7, 2016 1:18 PM in response to Fai2
    Level 1 (4 points)

    I'm watching this thread closely as I believe I have the identical issue with a very similar setup.  Kindly keep us posted as your troubleshooting continues.

     

    Currently, I've got no solution to propose, but am holding my breath in anticipation that macOS Sierra fixes it.  Perhaps a daring soul cares to install the beta on their production server??

  • by dwbrecovery,

    dwbrecovery dwbrecovery Sep 7, 2016 6:05 PM in response to Fai2
    Level 3 (520 points)
    Servers Enterprise

    Hi,

    Check this thread: "Re: Idle afp Connections".

    I have used these settings for a few months and found that they stabilise the Time Machine service.

     

    HTH

    Cheers, dwbrecovery

  • by lennert020,

    lennert020 lennert020 Sep 8, 2016 6:21 AM in response to Fai2
    Level 1 (4 points)
    Servers Enterprise

    Fai2,

     

    We should continue to look for similarities and differences between our setups.

     

    I have:

    • Mac Mini (Late 2014)
    • Promise Pegasus2 R8 connected by Thunderbolt2
    • HP Switches
    • Mac and Windows Clients
    • ±8 Airport Extreme Basestations

     

    You have:

    • Xserve 2009
    • Storage? Fibre Channel?

     

    We both use:

    • El Capitan Server 5.1.7
    • Adobe CC superusers
    • Cat5e & Cat6
    • No Open Directory
    • No binding

     

    Lennert

  • by Fai2,

    Fai2 Fai2 Sep 8, 2016 7:58 AM in response to lennert020
    Level 1 (14 points)
    Servers Enterprise

    Lennert,

     

    Good idea.

     

    We have:

    • Xserve 2009 (final) model, in service since end 2011

    • Boot drive: Original equipment Apple 128GB SSD

    • Data storage: External RAID 1 (mirrored) 2TB (2x2TB) via eSATA PCIe card

    • 48 port HP GigE managed switch

    • Wired ethernet cat 6

    • 3xAEBS for DHCP/NAT plus distributed WiFi for mobiles

    • Services are File sharing and network Time Machine only

    • No OD, no binding, no local email server, no web hosting

    • Typically about 25 users logged in

     

    Notes:

    - We switched to an external data drive when one member of the internal/removable drive-sled RAID pair started to fail at ~51K hours. These drives are no longer available.

    - I expect to take the Xserve out of service within a couple of years, probably switching to a redundant pair of Mac Minis (or whatever is available then)

    - Network Time Machine is flakey for us. I have a couple of users whose backups always work, but others that work once or twice and then say they need to replace the old backup and start over.

     

    Neil

  • by lennert020,

    lennert020 lennert020 Sep 12, 2016 2:01 AM in response to Fai2
    Level 1 (4 points)
    Servers Enterprise

    @Fai2 Also join MacAdmins on Slack: macadmins.slack.com.

     

    Lennert