ashvartsman

Q: Mavericks and Failed ARP causing network drops!

I have been racking my brain about why, on our corporate network, we started seeing dropped packets every 30-60 seconds after the Mavericks upgrade.  Here is an example from ping:

 

64 bytes from 10.11.12.13: icmp_seq=135 ttl=63 time=3.705 ms

64 bytes from 10.11.12.13: icmp_seq=136 ttl=63 time=3.473 ms

64 bytes from 10.11.12.13: icmp_seq=137 ttl=63 time=3.811 ms

64 bytes from 10.11.12.13: icmp_seq=138 ttl=63 time=4.110 ms

Request timeout for icmp_seq 139

Request timeout for icmp_seq 140

Request timeout for icmp_seq 141

Request timeout for icmp_seq 142

Request timeout for icmp_seq 143

64 bytes from 10.11.12.13: icmp_seq=144 ttl=63 time=5.417 ms

64 bytes from 10.11.12.13: icmp_seq=145 ttl=63 time=3.587 ms

64 bytes from 10.11.12.13: icmp_seq=146 ttl=63 time=3.744 ms

64 bytes from 10.11.12.13: icmp_seq=147 ttl=63 time=3.486 ms

64 bytes from 10.11.12.13: icmp_seq=148 ttl=63 time=3.466 ms
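
For anyone who wants to measure the drops rather than eyeball them, here is a minimal Python sketch that scans saved ping output for runs of consecutive timeouts. The function name `loss_bursts` and the `SAMPLE` text are just illustrative; it assumes the macOS ping wording shown above ("Request timeout for icmp_seq N").

```python
import re

# Patterns for the two kinds of lines in the macOS ping output quoted above.
REPLY = re.compile(r"icmp_seq=(\d+)")
TIMEOUT = re.compile(r"Request timeout for icmp_seq (\d+)")

def loss_bursts(ping_output):
    """Return a (first_seq, last_seq) pair for each run of consecutive timeouts."""
    bursts = []
    start = prev = None
    for line in ping_output.splitlines():
        m = TIMEOUT.search(line)
        if m:
            seq = int(m.group(1))
            if start is None:
                start = seq
            prev = seq
        elif REPLY.search(line) and start is not None:
            # A reply ends the current burst of timeouts.
            bursts.append((start, prev))
            start = prev = None
    if start is not None:
        bursts.append((start, prev))
    return bursts

# Abbreviated copy of the output above (hypothetical variable name).
SAMPLE = """\
64 bytes from 10.11.12.13: icmp_seq=138 ttl=63 time=4.110 ms
Request timeout for icmp_seq 139
Request timeout for icmp_seq 140
Request timeout for icmp_seq 141
Request timeout for icmp_seq 142
Request timeout for icmp_seq 143
64 bytes from 10.11.12.13: icmp_seq=144 ttl=63 time=5.417 ms
"""

for first, last in loss_bursts(SAMPLE):
    print(f"lost icmp_seq {first}-{last} ({last - first + 1} packets)")
```

With ping's default one-second interval, a burst of five lost packets lines up with an outage of about five seconds.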

 

 

I think I have found a strange ARP issue that is causing it.  In our corporate environment, we run GLBP (Gateway Load Balancing Protocol) on Cisco gear.  As such, the gateway address floats between two devices, requiring the MAC address to change.  It looks something like this in the ARP table:

 

efl-ashvartsman:~ ashvartsman$ arp -a

? (10.224.165.1) at 0:7:b4:2:cb:2 on en0 ifscope [ethernet]

efl-ashvartsman:~ ashvartsman$ arp -a

? (10.224.165.1) at 0:7:b4:2:cb:1 on en0 ifscope [ethernet]
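
If you want to watch for this flip automatically, here is a rough Python sketch that parses `arp -a` output in the format shown above and reports whether the gateway's MAC changed between two snapshots. The helper names (`normalize_mac`, `parse_arp`, `gateway_changed`) are my own, and it assumes the macOS output layout quoted here; entries shown as "(incomplete)" are simply skipped.

```python
import re

# Matches lines such as:
#   ? (10.224.165.1) at 0:7:b4:2:cb:2 on en0 ifscope [ethernet]
ARP_LINE = re.compile(r"\((?P<ip>[\d.]+)\) at (?P<mac>[0-9a-fA-F:]+) on (?P<iface>\S+)")

def normalize_mac(mac):
    """Zero-pad each octet so 0:7:b4:2:cb:2 becomes 00:07:b4:02:cb:02."""
    return ":".join(f"{int(octet, 16):02x}" for octet in mac.split(":"))

def parse_arp(output):
    """Map IP address -> normalized MAC for every resolved entry."""
    table = {}
    for line in output.splitlines():
        m = ARP_LINE.search(line)
        if m:  # "(incomplete)" entries do not match and are skipped
            table[m.group("ip")] = normalize_mac(m.group("mac"))
    return table

def gateway_changed(before, after, gateway_ip):
    """True if the gateway resolved to different MACs in the two snapshots."""
    old = parse_arp(before).get(gateway_ip)
    new = parse_arp(after).get(gateway_ip)
    return old is not None and new is not None and old != new

first = "? (10.224.165.1) at 0:7:b4:2:cb:2 on en0 ifscope [ethernet]"
second = "? (10.224.165.1) at 0:7:b4:2:cb:1 on en0 ifscope [ethernet]"
print(gateway_changed(first, second, "10.224.165.1"))
```

Normalizing the MAC also makes it easier to match the table against the zero-padded addresses that show up in packet captures.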

 

On my Mountain Lion machine, it sends a broadcast ARP request and gets a response with the new MAC address immediately.

 

258          26.783206000          Apple_78:29:dd          Broadcast          ARP          42          Who has 10.224.165.1?  Tell 10.224.165.55

259          26.786929000          Cisco_e0:ff:40          Apple_78:29:dd          ARP          60          10.224.165.1 is at 00:07:b4:02:cb:01

 

This happens seamlessly in the background and no packet loss is observed.  However, it looks like Mavericks is doing something completely different, and WRONG.  It sends out 5 UNICAST requests to the MAC address it had before (ARP should always be broadcast!!!).  These fail 5 times, and only then does it make a BROADCAST attempt, as shown below.  This causes roughly a 5-second network outage on the machine.

 

394          67.052366000          Apple_b9:a6:b2          Cisco_02:cb:02          ARP          42          Who has 10.224.165.1?  Tell 10.224.165.225

395          68.053450000          Apple_b9:a6:b2          Cisco_02:cb:02          ARP          42          Who has 10.224.165.1?  Tell 10.224.165.225

396          69.053595000          Apple_b9:a6:b2          Cisco_02:cb:02          ARP          42          Who has 10.224.165.1?  Tell 10.224.165.225

397          70.053893000          Apple_b9:a6:b2          Cisco_02:cb:02          ARP          42          Who has 10.224.165.1?  Tell 10.224.165.225

398          71.054363000          Apple_b9:a6:b2          Cisco_02:cb:02          ARP          42          Who has 10.224.165.1?  Tell 10.224.165.225

399          72.054466000          Apple_b9:a6:b2          Broadcast          ARP          42          Who has 10.224.165.1?  Tell 10.224.165.225

400          72.058079000          Cisco_e0:ff:40          Apple_b9:a6:b2          ARP          60          10.224.165.1 is at 00:07:b4:02:cb:01
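
To put numbers on the capture above, here is a small Python sketch that parses those exported rows, counts the unicast ARP requests sent before the broadcast, and measures the time from the first request to the reply. The column layout (No., Time, Source, Destination, Protocol, Length, Info) is assumed from the rows as pasted, and `parse_capture` and `summarize` are hypothetical helper names.

```python
# Rows copied from the capture above, whitespace-separated.
CAPTURE = """\
394  67.052366000  Apple_b9:a6:b2  Cisco_02:cb:02  ARP  42  Who has 10.224.165.1?  Tell 10.224.165.225
395  68.053450000  Apple_b9:a6:b2  Cisco_02:cb:02  ARP  42  Who has 10.224.165.1?  Tell 10.224.165.225
396  69.053595000  Apple_b9:a6:b2  Cisco_02:cb:02  ARP  42  Who has 10.224.165.1?  Tell 10.224.165.225
397  70.053893000  Apple_b9:a6:b2  Cisco_02:cb:02  ARP  42  Who has 10.224.165.1?  Tell 10.224.165.225
398  71.054363000  Apple_b9:a6:b2  Cisco_02:cb:02  ARP  42  Who has 10.224.165.1?  Tell 10.224.165.225
399  72.054466000  Apple_b9:a6:b2  Broadcast  ARP  42  Who has 10.224.165.1?  Tell 10.224.165.225
400  72.058079000  Cisco_e0:ff:40  Apple_b9:a6:b2  ARP  60  10.224.165.1 is at 00:07:b4:02:cb:01
"""

def parse_capture(text):
    """Split each row into its seven columns; the Info column keeps its spaces."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        no, time, src, dst, proto, length, info = line.split(None, 6)
        rows.append({"no": int(no), "time": float(time), "src": src,
                     "dst": dst, "proto": proto, "info": info})
    return rows

def summarize(rows):
    """Return (unicast request count, seconds from first request to the reply)."""
    requests = [r for r in rows if r["info"].startswith("Who has")]
    unicast = [r for r in requests if r["dst"] != "Broadcast"]
    reply = next(r for r in rows if " is at " in r["info"])
    return len(unicast), reply["time"] - requests[0]["time"]

count, gap = summarize(parse_capture(CAPTURE))
print(f"{count} unicast requests, {gap:.3f} s until the gateway answered")
```

On this capture that works out to five unicast requests and a gap of about five seconds, which matches the length of the ping outage.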

 

 

Here is the arp table during this period:

 

macsccmtest:~ administrator$ arp -a

? (10.224.165.1) at (incomplete) on en1 ifscope [ethernet]

? (10.224.165.220) at f0:b4:79:21:4c:ec on en1 ifscope [ethernet]

 

 

My hunch is that Apple did this to try to reduce bandwidth utilization on the network but it will cause BIG problems on corporate networks that use GLBP or any other protocol to provide redundancy across multiple devices!

 

Anyone else seeing this?  Everyone in my office who has moved to Mavericks can replicate this behavior.

OS X Mavericks (10.9)

Posted on Oct 25, 2013 11:12 AM


first Previous Page 4 of 5 last Next
  • by Peter-Erik,

    Level 1 (10 points)
    Mar 30, 2014 11:57 PM in response to Peter-Erik

    Duplicating the Ethernet port didn't solve the problem. Back to the drawing board.

  • by Guido39,

    Level 1 (0 points)
    Apr 7, 2014 10:39 AM in response to anfedoro

    anfedoro:

     

    Did you find a solution for your iPad issue with GLBP? We've been having the same issue and just discovered that GLBP was the cause. What's strange is that most of our iPads are working but we have three that aren't. As soon as I shut down one of the vlan interfaces, which basically shuts down load balancing, it works fine.

  • by TRIBALHOST,

    Level 1 (0 points)
    Apr 23, 2014 11:14 PM in response to jonschwenn

    Same problem here

     

    Before patch, MacBook Pro Retina 2014 (bought 1 month ago):

    200 kbps download

     

    After patch

     

    Download Speed: 6216 kbps (777 KB/sec transfer rate)

    Upload Speed: 744 kbps (93 KB/sec transfer rate)

     

    So https://github.com/MacMiniVault/Mac-Scripts/blob/master/unicastarp/unicastarp-README.md

     

    Worked for me

  • by nick1329,

    Level 1 (0 points)
    Apr 28, 2014 11:42 AM in response to Akira Okumura

    Same here. We have HP 8212 ProCurve switches and we are having the same issue. My machines are on 10.9.2 and this script didn't work on them. We are attacking this problem from both sides: we have a call in to our HP technical support people as well.

     

    We are potentially looking to purchase 10 or so more of these servers, but not if this issue isn't resolved. Regardless of whose issue it is, I think all parties should work together to get it resolved, and hopefully we can do that.

  • by nick1329,

    Level 1 (0 points)
    Apr 30, 2014 1:27 PM in response to ashvartsman

    I wanted to throw out an update on this.

     

    One of our network guys had the bright idea of trying out a USB-to-Ethernet adapter, and we haven't had any timeouts for the last 2 hours. I'm going to keep monitoring it for the next day, but it seems it might be related to the NIC driver/port.

  • by codythedog,

    Level 1 (0 points)
    May 4, 2014 10:09 AM in response to nick1329

    I thought I would add a few data points: this problem started happening to me when I upgraded a Mac Mini and a MacBook Air (latest models) to Mavericks. It happened immediately on both machines. I am just using them with a Time Warner Cable home connection. It happens when either machine is using either Ethernet or WiFi.

     

    After upgrading to 10.9.2 the problem seemed to finally mostly go away, although it still happened occasionally. It seems to be happening a little more now as well. Before 10.9.2, changing the net.link.ether.inet.arp_unicast_lim setting helped significantly, but didn't fully alleviate the problem.
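
For reference, here is a hedged Python sketch of how that setting could be inspected and changed from a script. Only the key name net.link.ether.inet.arp_unicast_lim comes from this thread; the value 0 in the usage note is an assumption about what the commonly shared workaround script sets, and the helper names are hypothetical. It only builds the command lines; `current_limit` actually runs `sysctl` and is meant for a Mac where the key exists.

```python
import subprocess

# Key name as mentioned in this thread.
SYSCTL_KEY = "net.link.ether.inet.arp_unicast_lim"

def read_command():
    """Command that prints just the current value of the key."""
    return ["sysctl", "-n", SYSCTL_KEY]

def write_command(value):
    """Command that sets the key; needs root, hence sudo."""
    return ["sudo", "sysctl", "-w", f"{SYSCTL_KEY}={int(value)}"]

def current_limit():
    """Run sysctl and return the current limit as an int (macOS only)."""
    out = subprocess.run(read_command(), capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

# Show what would be run, without executing anything.
print(" ".join(read_command()))
print(" ".join(write_command(0)))
```

A change made this way does not survive a reboot, so anyone relying on it would also need to reapply it at startup.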

     

    Most interestingly, I run Windows 7 x64 in a Parallels virtual machine. When the problem happens and I run ping tests to www.google.com, I get dropped packets in OS X, but not in Windows. I recently ran identical ping tests simultaneously in each on the same machine and only OS X dropped packets. So I would definitely say it is an OS X software issue.

     

    A question: does anyone think other devices on the network (e.g. print server, phone SIP, etc.) might somehow cause some interaction problem?

  • by Samanthahubacek,

    Level 1 (0 points)
    May 6, 2014 6:55 AM in response to ashvartsman

    Did you not read what the last update was for?? Mainly security patches for Bluetooth and networking. Connect to your router just as it requests and you will be fine.

  • by commorancy,

    Level 1 (0 points)
    May 10, 2014 6:29 PM in response to codythedog

    What's most worrisome about this entire issue is that something as fundamental as networking within an operating system of this maturity should have been solved permanently at this point. That Apple has introduced a problem in a layer so fundamental to computing today indicates a much broader issue at work within Apple. Worse, that no one has actually addressed this problem after having been informed indicates even wider product problems. You can't tell me there aren't a number of experienced TCP OS developers out there who could figure out the root cause of this issue within OS X and get it fixed within a few weeks.

     

    Apple used to pride itself on its engineering polish, its innovation and its robust software releases. Since the Jobs era ended, that unfortunately all seems to have stopped. I really want to like Mavericks, as it does have some nice features, but having something as fundamental as networking broken doesn't really do it for me. There's just no reason for this to be a problem today.

  • by rossoneri91,

    Level 1 (0 points)
    May 28, 2014 10:26 AM in response to ashvartsman

    Everyone else still experiencing this issue in 10.9.3 I assume? Because I still am.

  • by jwsullivan,

    Level 1 (0 points)
    May 28, 2014 1:50 PM in response to rossoneri91

    Yes, we are, at the university where I work.  The "fix" doesn't fix it either.  I did notice something really weird, though.  I was trying to get a packet capture on one of the affected machines for the Apple Enterprise case I just opened, so I headed over to Terminal and ran:

     

    tcpdump -ennqti en0 \( arp or icmp \)

     

    This way I could capture just ARP and ICMP (I had an active ping from another station already in progress).  Lo and behold, no dropped packets as soon as the packet capture started.  I Ctrl+C'd the packet capture as I thought it was a fluke, but as soon as I did, the packet loss returned.  Started it back up and no dropped packets.  If someone else could try to replicate this I would greatly appreciate it, as I have been able to replicate it on 2 machines (10.9 & 10.9.3).

  • by nick1329,

    Level 1 (0 points)
    May 28, 2014 2:02 PM in response to jwsullivan

    This is the exact same thing that happened to me and exactly what I told enterprise support. I went ahead and sent the information on to support anyway, and they are reviewing it.

     

    We have actually created a whole separate VLAN (same setup as the previous one, I was told) with 1 Apple server on that VLAN and 1 on the other VLAN, and the server on the new VLAN by itself hasn't dropped a ping (that I've seen) in over 5 days.

     

    Our theory is that it's somehow related to traffic on the VLANs, and that a software/driver update on either the server or our switches may resolve the issue.

     

    I'm keeping the case open with Apple to see if they can find the issue.

  • by codythedog,

    Level 1 (0 points)
    May 28, 2014 2:12 PM in response to rossoneri91

    Yes, I'm still seeing it, very randomly (home user on a Time Warner Cable connection). I noticed twice that when I started an active ping test to www.google.com, it seemed to "unstuck" it (that's a technical term). Usually pinging does not, though.

     

    Had another interesting data point: I was uploading hundreds of files to a web server, and was getting all kinds of failed transfers because of it. I was on my MacBook connected to a Thunderbolt Display, with a hardwired Ethernet connection. I turned on WiFi on the MacBook, and as soon as it connected, the ARP problems instantly stopped and the transfers all started working. Note that the WiFi connection was still secondary to the Ethernet as the primary network connection, so it really should not have changed anything. Another clue perhaps?

  • by XenoPhage,

    Level 1 (0 points)
    Jun 16, 2014 1:41 PM in response to ashvartsman

    After MANY months, I received a follow-up from Cisco.  They're sticking with the current implementation, citing RFCs, and blaming clients.  So this won't be fixed on the Cisco side of things.  There's an internal bug ID if anyone needs to reference it.  A snippet from the conversation follows:

     

    I was in touch with the development team to look into this and based on
    that here is what the information I received.
    
    
    - This issue has been raised by many engineers with BU for the behavior
    seen with Apple, Linux and even some wireless devices.
    
    - The development team has looked into this in depth and also referred to
    the RFCs for the specification on how this should work.
    
    - To track this issue, there was an internal bug CSCeg05955 was filed.
    
    - After considering all the specifications and design standards, it was
    determined that Cisco would not be changing its behavior for the unicast
    ARPs for GLBP IPs as it is as per the standard and the workaround has to
    be implemented on the client side.
    
    - The bug is internal, however I will provide the details from it so that
    you can have it. 
    
    Symptom: A client device may send unicast ARP requests to GLBP
    members, but doesn't get a response.  The result may be an intermittent -
    perhaps periodic - inability of that client to transmit IP traffic.
    
    Conditions: GLBP is being used in load balancing mode for clients'
    default gateway.
    
    Workaround:
    1. Disable GLBP load balancing.
    2. Install static ARP entries on the GLBP members, mapping from the GLBP
    address to their MAC addresses.
    
    More Info: Unicast ARP is increasingly utilized by various clients,
    especially wireless ones, to determine whether they have reconnected to
    the same subnet. See for example RFC 4436 for a use case.
    
    This bug will not be fixed by Cisco, as it is viewed as a client problem.
    Client devices should not invalidate their ARP entry before using the
    unicast ARP.
  • by William Kucharski,

    Level 6 (15,118 points)
    Mac OS X
    Jun 16, 2014 11:51 PM in response to XenoPhage

    The most interesting part of Cisco's response is:

     

    This issue has been raised by many engineers with BU for the behavior
    seen with Apple, Linux and even some wireless devices.
    

     

    which basically means "yeah, we know people are having issues, but we're right."

     

    It would be nice to know whether it's just RFC 4436 or other RFCs that make Cisco believe this.

     

    So, bottom line, it's apparently Cisco vs. the world on whether a host's ARP entry should be invalidated before a Unicast ARP to determine if it's still connected to the last network it was connected to.

  • by MacStadium,

    Level 1 (0 points)
    Aug 11, 2014 2:06 PM in response to ashvartsman

    Hey, just to let you know that we have been tracking this for about 9 months. A full write-up is on our blog here:

     

    http://www.macstadium.com/blog/osx-10-9-mavericks-bugs/

     

    A couple days ago, one of our subscribers who happens to also be an Apple employee put us directly in touch with the CoreOS developers at Apple in Cupertino, CA.   This was really exciting news because we have been trying to get an escalation at Apple via the support and beta communities for a very, very long time with no traction.

     

    Apple has since validated the issue via a lab we set up for them here at MacStadium.com, and they are now looking at a driver stack problem as the main issue.   The biggest clue was the Apple USB <> Ethernet adapter not presenting the problem on the exact same host as the internal Gigabit Ethernet adapter, on the same network, at the same time.

     

    They are moving very quickly, so we are very optimistic for a patch for 10.9.x as well as the solution being rolled into the next 10.10 beta release.  We'll keep everybody posted in coming days.
