Mavericks and Failed ARP causing network drops!

Question

Level 1

0 points

I have been racking my brain over why, on our corporate network, we started seeing dropped packets every 30-60 seconds after upgrading to Mavericks. Here is an example from a ping:

64 bytes from 10.11.12.13: icmp_seq=135 ttl=63 time=3.705 ms

64 bytes from 10.11.12.13: icmp_seq=136 ttl=63 time=3.473 ms

64 bytes from 10.11.12.13: icmp_seq=137 ttl=63 time=3.811 ms

64 bytes from 10.11.12.13: icmp_seq=138 ttl=63 time=4.110 ms

Request timeout for icmp_seq 139

Request timeout for icmp_seq 140

Request timeout for icmp_seq 141

Request timeout for icmp_seq 142

Request timeout for icmp_seq 143

64 bytes from 10.11.12.13: icmp_seq=144 ttl=63 time=5.417 ms

64 bytes from 10.11.12.13: icmp_seq=145 ttl=63 time=3.587 ms

64 bytes from 10.11.12.13: icmp_seq=146 ttl=63 time=3.744 ms

64 bytes from 10.11.12.13: icmp_seq=147 ttl=63 time=3.486 ms

64 bytes from 10.11.12.13: icmp_seq=148 ttl=63 time=3.466 ms
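
For anyone who wants to measure the drops rather than eyeball them, here is a rough Python sketch that finds runs of consecutive timeouts in saved ping output. It assumes the two line formats shown above; the function name `timeout_bursts` is made up for illustration.

```python
import re

# The two line shapes that appear in the ping output above.
REPLY = re.compile(r"icmp_seq=(\d+)")
TIMEOUT = re.compile(r"Request timeout for icmp_seq (\d+)")


def timeout_bursts(ping_output):
    """Return (first_seq, last_seq) for each run of consecutive timeouts."""
    bursts = []
    start = prev = None
    for line in ping_output.splitlines():
        match = TIMEOUT.search(line)
        if match:
            seq = int(match.group(1))
            if start is None:
                start = seq  # a new burst begins here
            prev = seq
        elif REPLY.search(line) and start is not None:
            bursts.append((start, prev))  # a reply ends the burst
            start = prev = None
    if start is not None:  # output ended in the middle of a burst
        bursts.append((start, prev))
    return bursts


sample = """\
64 bytes from 10.11.12.13: icmp_seq=138 ttl=63 time=4.110 ms
Request timeout for icmp_seq 139
Request timeout for icmp_seq 140
Request timeout for icmp_seq 141
Request timeout for icmp_seq 142
Request timeout for icmp_seq 143
64 bytes from 10.11.12.13: icmp_seq=144 ttl=63 time=5.417 ms
"""

for first, last in timeout_bursts(sample):
    print(f"lost icmp_seq {first}-{last} ({last - first + 1} packets)")
```

With ping's default one-second interval, the number of packets in a burst is roughly the outage length in seconds.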

I think I have found a strange ARP issue that is causing it. In our corporate environment, we run GLBP (Gateway Load Balancing Protocol) on Cisco gear. As such, the gateway address floats between two devices, which requires the MAC address to change. It looks something like this in the ARP table:

efl-ashvartsman:~ ashvartsman$ arp -a

? (10.224.165.1) at 0:7:b4:2:cb:2 on en0 ifscope [ethernet]

efl-ashvartsman:~ ashvartsman$ arp -a

? (10.224.165.1) at 0:7:b4:2:cb:1 on en0 ifscope [ethernet]

On my Mountain Lion machine, it does a broadcast ARP and gets a response with the new MAC address immediately.

258 26.783206000 Apple_78:29:dd Broadcast ARP 42 Who has 10.224.165.1? Tell 10.224.165.55

259 26.786929000 Cisco_e0:ff:40 Apple_78:29:dd ARP 60 10.224.165.1 is at 00:07:b4:02:cb:01

This happens seamlessly in the background, and no packet loss is observed. Mavericks, however, appears to be doing something completely different, and WRONG. It sends out 5 UNICAST requests to the MAC address it had before (ARP should always be broadcast!!!). These fail 5 times, and only then does it finally make a BROADCAST attempt, as shown below. This causes roughly a 5-second network outage on the machine.

394 67.052366000 Apple_b9:a6:b2 Cisco_02:cb:02 ARP 42 Who has 10.224.165.1? Tell 10.224.165.225

395 68.053450000 Apple_b9:a6:b2 Cisco_02:cb:02 ARP 42 Who has 10.224.165.1? Tell 10.224.165.225

396 69.053595000 Apple_b9:a6:b2 Cisco_02:cb:02 ARP 42 Who has 10.224.165.1? Tell 10.224.165.225

397 70.053893000 Apple_b9:a6:b2 Cisco_02:cb:02 ARP 42 Who has 10.224.165.1? Tell 10.224.165.225

398 71.054363000 Apple_b9:a6:b2 Cisco_02:cb:02 ARP 42 Who has 10.224.165.1? Tell 10.224.165.225

399 72.054466000 Apple_b9:a6:b2 Broadcast ARP 42 Who has 10.224.165.1? Tell 10.224.165.225

400 72.058079000 Cisco_e0:ff:40 Apple_b9:a6:b2 ARP 60 10.224.165.1 is at 00:07:b4:02:cb:01
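
To check other captures for the same pattern, here is a small Python sketch that counts the unicast ARP requests sent before the first broadcast request and measures how long that took. It assumes capture lines in the column order shown above (number, time, source, destination, protocol, length, info); the function name is made up for illustration.

```python
def unicast_probes_before_broadcast(capture_lines):
    """Return (unicast_count, seconds_until_broadcast), or None if no broadcast is seen."""
    count = 0
    first_time = None
    for line in capture_lines:
        fields = line.split()
        # Expected columns: No. Time Source Destination Protocol Length Info...
        if len(fields) < 8 or fields[4] != "ARP":
            continue
        if fields[6:8] != ["Who", "has"]:
            continue  # an ARP reply, not a request
        timestamp, destination = float(fields[1]), fields[3]
        if destination == "Broadcast":
            elapsed = timestamp - first_time if first_time is not None else 0.0
            return count, elapsed
        if first_time is None:
            first_time = timestamp  # first unicast probe in this run
        count += 1
    return None


capture = [
    "394 67.052366000 Apple_b9:a6:b2 Cisco_02:cb:02 ARP 42 Who has 10.224.165.1? Tell 10.224.165.225",
    "395 68.053450000 Apple_b9:a6:b2 Cisco_02:cb:02 ARP 42 Who has 10.224.165.1? Tell 10.224.165.225",
    "396 69.053595000 Apple_b9:a6:b2 Cisco_02:cb:02 ARP 42 Who has 10.224.165.1? Tell 10.224.165.225",
    "397 70.053893000 Apple_b9:a6:b2 Cisco_02:cb:02 ARP 42 Who has 10.224.165.1? Tell 10.224.165.225",
    "398 71.054363000 Apple_b9:a6:b2 Cisco_02:cb:02 ARP 42 Who has 10.224.165.1? Tell 10.224.165.225",
    "399 72.054466000 Apple_b9:a6:b2 Broadcast ARP 42 Who has 10.224.165.1? Tell 10.224.165.225",
    "400 72.058079000 Cisco_e0:ff:40 Apple_b9:a6:b2 ARP 60 10.224.165.1 is at 00:07:b4:02:cb:01",
]

result = unicast_probes_before_broadcast(capture)
if result is not None:
    probes, seconds = result
    print(f"{probes} unicast probes over {seconds:.3f} s before the broadcast")
```

On the capture above this counts the five unicast requests and about five seconds of delay, which lines up with the length of the ping outage.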

Here is the arp table during this period:

macsccmtest:~ administrator$ arp -a

? (10.224.165.1) at (incomplete) on en1 ifscope [ethernet]

? (10.224.165.220) at f0:b4:79:21:4c:ec on en1 ifscope [ethernet]

My hunch is that Apple did this to try to reduce bandwidth utilization on the network but it will cause BIG problems on corporate networks that use GLBP or any other protocol to provide redundancy across multiple devices!

Anyone else seeing this? Everyone in my office who has moved to Mavericks can replicate this behavior.

OS X Mavericks (10.9)

Posted on Oct 25, 2013 11:12 AM

Reply

Answer 1

Guido39

Level 1

0 points

Apr 7, 2014 10:39 AM in response to anfedoro

anfedoro:

Did you find a solution for your iPad issue with GLBP? We've been having the same issue and just discovered that GLBP was the cause. What's strange is that most of our iPads are working, but we have three that aren't. As soon as I shut down one of the VLAN interfaces, which basically shuts down load balancing, it works fine.

Reply

Answer 2

Apr 23, 2014 11:14 PM in response to jonschwenn

Same problem here

Before patch, MacBook Pro Retina 2014 (bought 1 month ago)

200kbps download

After patch

Download Speed: 6216 kbps (777 KB/sec transfer rate)

Upload Speed: 744 kbps (93 KB/sec transfer rate)

So https://github.com/MacMiniVault/Mac-Scripts/blob/master/unicastarp/unicastarp-README.md

Worked for me

Reply

Answer 3

nick1329

Level 1

0 points

Apr 28, 2014 11:42 AM in response to Akira Okumura

Same here. We have HP ProCurve 8212 switches and we are having the same issue. My machines are on 10.9.2 and this script didn't work on them. We are attacking this problem from both sides: we have a call in to our HP technical support people as well.

We are potentially looking to purchase 10 or so more of these servers, but not if this issue isn't resolved. Regardless of whose issue it is, I think all parties should work together to get it resolved, and hopefully we can do that.

Reply

Answer 4

nick1329

Level 1

0 points

Apr 30, 2014 1:27 PM in response to ashvartsman

I wanted to throw out an update on this.

One of our network guys had the bright idea of trying out a USB-to-Ethernet adapter, and we haven't had any timeouts for the last 2 hours. I'm going to keep monitoring it for the next day, but it seems it might be related to the NIC driver/port.

Reply

Answer 5

May 4, 2014 10:09 AM in response to nick1329

I thought I would add a few data points: this problem started happening to me when I upgraded a Mac Mini and a MacBook Air (latest models) to Mavericks. It happened immediately on both machines. I am just using them with a Time Warner Cable home connection. It happens when either machine is using either Ethernet or WiFi.

After upgrading to 10.9.2, the problem seemed to finally mostly go away, although it still happened occasionally. It seems to be happening a little more now as well. Before 10.9.2, changing the net.link.ether.inet.arp_unicast_lim setting helped significantly, but didn't eliminate the problem.
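
For anyone who hasn't tried that setting, here is a rough Python sketch of how it could be changed with the standard `sysctl` tool. The key name is the one quoted above; the value 0 is only an assumption about what turns the unicast probes off, so check what the workaround script linked earlier in the thread actually uses. The helper names are made up, and nothing is executed unless `dry_run=False`.

```python
import subprocess

# Key name as quoted in this thread.
SYSCTL_KEY = "net.link.ether.inet.arp_unicast_lim"


def read_cmd():
    """Command that prints the current value of the setting."""
    return ["sysctl", "-n", SYSCTL_KEY]


def write_cmd(limit):
    """Command that sets the unicast ARP limit (needs root, hence sudo)."""
    if limit < 0:
        raise ValueError("limit must be zero or positive")
    return ["sudo", "sysctl", "-w", f"{SYSCTL_KEY}={limit}"]


def apply_limit(limit, dry_run=True):
    """Return the command line; run it only when dry_run is False."""
    cmd = write_cmd(limit)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


# Assumption: 0 disables the unicast attempts. Verify before relying on it.
print(apply_limit(0))
```

A value set with `sysctl -w` does not survive a reboot, so it would need to be reapplied at startup to stick.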

Most interestingly, I run Windows 7 x64 in a Parallels virtual machine. When the problem happens and I run ping tests to www.google.com, I get dropped packets in OS X, but not in Windows. I recently ran identical ping tests simultaneously in each on the same machine, and only OS X dropped packets. So I would definitely say it is an OS X software issue.

A question: does anyone think other devices on the network (e.g. print server, phone SIP, etc.) might somehow cause some interaction problem?

Reply

Answer 6

May 6, 2014 6:55 AM in response to ashvartsman

Did you not read what the last update was for? It was mainly security patches for Bluetooth and networking. Connect to your router just as it requests and you will be fine.

Reply

Answer 7

May 10, 2014 6:29 PM in response to codythedog

What's most worrisome about this entire issue is that something as fundamental as networking, in an operating system of this maturity, should have been solved permanently at this point. That Apple has introduced a problem in a layer so fundamental to computing today indicates a much broader issue at work within Apple. Worse, that no one has actually addressed this problem after being informed indicates even wider product problems. You can't tell me there aren't a number of experienced TCP OS developers out there who could figure out the root cause of this issue within OS X and get it fixed within a few weeks.

Apple used to pride itself on its engineering polish, its innovation and its robust software releases. Since the Jobs era ended, that unfortunately all seems to have stopped. I really want to like Mavericks, as it does have some nice features, but having something as fundamental as networking broken doesn't really do it for me. There's just no reason for this to be a problem today.

Reply

Answer 8

May 28, 2014 10:26 AM in response to ashvartsman

Everyone else still experiencing this issue in 10.9.3 I assume? Because I still am.

Reply

Answer 9

May 28, 2014 1:50 PM in response to rossoneri91

Yes, we are at the university I work at. The "fix" doesn't fix it either. I did notice something really weird, though. I was trying to get a packet capture on one of the affected machines for the Apple Enterprise case I just opened, so I headed over to Terminal and ran:

tcpdump -ennqti en0 \( arp or icmp \)

This way I could capture just ARP and ICMP (I had an active ping from another station already in progress). Lo and behold, there were no dropped packets as soon as the packet capture started. I Ctrl+C'd the packet capture, as I thought it was a fluke, but as soon as I did, the packet loss returned. I started it back up and there were no dropped packets again. If someone else could try to replicate this, I would greatly appreciate it, as I have been able to replicate it on 2 machines (10.9 & 10.9.3).

Reply

Answer 10

nick1329

Level 1

0 points

May 28, 2014 2:02 PM in response to jwsullivan

This is the exact same thing that happened to me and exactly what I told enterprise support. I went ahead and sent the information on to support anyway, and they are reviewing it.

We have actually created a whole separate VLAN (same setup as the previous one, I was told) with 1 Apple server on that VLAN and 1 on the other VLAN, and the server on the new VLAN by itself hasn't dropped a ping (that I've seen) in over 5 days.

Our theory is that it's somehow related to traffic on the vlans and maybe a software/driver update on either the server or our switches will resolve the issue.

I'm keeping the case open with Apple to see if they can find the issue.

Reply

Answer 11

May 28, 2014 2:12 PM in response to rossoneri91

Yes, I'm still seeing it, very randomly (home user on Time Warner Cable connection). I noticed twice when I started an active ping test to www.google.com, it seemed to "unstuck" it (that's a technical term 😁). Usually pinging does not, though.

I had another interesting data point: I was uploading hundreds of files to a web server and was getting all kinds of failed transfers because of it. I was on my MacBook connected to a Thunderbolt Display, with a hard-wired Ethernet connection. I turned on WiFi on the MacBook, and as soon as it connected, the ARP problems instantly stopped and the transfers all started working. Note that the WiFi connection was still secondary to Ethernet as the primary network connection, so it really should not have changed anything. Another clue, perhaps?

Reply

Answer 12

Jun 16, 2014 1:41 PM in response to ashvartsman

After MANY months, I received a follow-up from Cisco. They're sticking with the current implementation, citing RFCs, and blaming clients. So this won't be fixed on the Cisco side of things. There's an internal bug ID if anyone needs to reference it. A snippet from the conversation follows:

I was in touch with the development team to look into this and based on that here is what the information I received.

- This issue has been raised by many engineers with BU for the behavior seen with Apple, Linux and even some wireless devices.

- The development team has looked into this in depth and also referred to the RFCs for the specification on how this should work.

- To track this issue, there was an internal bug CSCeg05955 was filed.

- After considering all the specifications and design standards, it was determined that Cisco would not be changing its behavior for the unicast ARPs for GLBP IPs as it is as per the standard and the workaround has to be implemented on the client side.

- The bug is internal, however I will provide the details from it so that you can have it.

Symptom: A client device may send unicast ARP requests to GLBP members, but doesn't get a response. The result may be an intermittent - perhaps periodic - inability of that client to transmit IP traffic.

Conditions: GLBP is being used in load balancing mode for clients' default gateway.

Workaround:

1. Disable GLBP load balancing.

2. Install static ARP entries on the GLBP members, mapping from the GLBP address to their MAC addresses.

More Info: Unicast ARP is increasingly utilized by various clients, especially wireless ones, to determine whether they have reconnected to the same subnet. See for example RFC 4436 for a use case. This bug will not be fixed by Cisco, as it is viewed as a client problem. Client devices should not invalidate their ARP entry before using the unicast ARP.

Reply

Answer 13

Dogcow-Moof

Level 8

36,557 points

Jun 16, 2014 11:51 PM in response to XenoPhage

The most interesting part of Cisco's response is:

This issue has been raised by many engineers with BU for the behavior seen with Apple, Linux and even some wireless devices.

which basically means "yeah, we know people are having issues, but we're right."

It would be nice to know whether it's just RFC 4436 or other RFCs that make Cisco believe this.

So, bottom line, it's apparently Cisco vs. the world on whether a host's ARP entry should be invalidated before a Unicast ARP to determine if it's still connected to the last network it was connected to.

Reply

Answer 14

Aug 11, 2014 2:06 PM in response to ashvartsman

Hey, just to let you know that we have been tracking this for about 9 months. A full write-up is on our blog here:

http://www.macstadium.com/blog/osx-10-9-mavericks-bugs/

A couple days ago, one of our subscribers who happens to also be an Apple employee put us directly in touch with the CoreOS developers at Apple in Cupertino, CA. This was really exciting news because we have been trying to get an escalation at Apple via the support and beta communities for a very, very long time with no traction.

Apple has since validated the issue via a lab we set up for them here at Macstadium.com, and they are now looking at a driver stack problem as the main issue. The biggest clue was that the Apple USB <> Ethernet adapter did not present the problem on the exact same host as the internal Gigabit Ethernet adapter, on the same network, at the same time.

They are moving very quickly, so we are very optimistic for a patch for 10.9.x as well as the solution being rolled into the next 10.10 beta release. We'll keep everybody posted in coming days.

Reply

Answer 15

Aug 16, 2014 9:20 PM in response to MacStadium

Wow. I'm just a little taken aback by this response. Apple doesn't have a testing lab internally? Apple had to rely upon a lab that MacStadium put together? I'm concerned. Specifically, I'm concerned that if Apple doesn't have a testing lab, how the heck are they testing releases? I realize you can't possibly build a lab configuration for every possible hardware config. But knowing that this issue exists, Apple should have been able to put a lab together to replicate the problem and determine a cause. That it took an outside third party 9 months to get Apple to notice is, again, worrisome.

I'm glad that Apple has finally acknowledged the issue and was able to replicate the problem thanks to MacStadium. And I'm glad to know a fix is on the way. That it required this level of third party intervention is troubling on so many levels.

Reply