Microsoft Failover Cluster node not sending out Gratuitous ARP requests after a failover

This was a particularly odd issue which I had never experienced before, so I thought it was worth blogging about.

During a normal MS Failover Cluster failover operation, the node claiming the cluster roles sends out a GARP request to notify the networking infrastructure of the MAC address change. The Layer 3 switch / router then updates the MAC address in its ARP table and packets are forwarded to the node which claimed the cluster roles. Recently I found myself troubleshooting an MS Failover Cluster deployment which wasn’t behaving quite in this manner.
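To make the mechanism concrete, here is a minimal Python sketch of what a gratuitous ARP request looks like on the wire. It is not what Windows runs internally; the function names and the example MAC and IP address are purely illustrative assumptions. The point is that the Ethernet frame is broadcast and that the sender IP and target IP inside the ARP payload are both the address being announced.

```python
import socket
import struct

BROADCAST_MAC = bytes.fromhex("ffffffffffff")
ZERO_MAC = bytes(6)


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or dash separated) to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", "").replace("-", ""))


def build_garp_request(sender_mac: str, ip: str) -> bytes:
    """Build a 42-byte Ethernet frame carrying a gratuitous ARP request.

    Hypothetical helper for illustration only.
    """
    src = mac_to_bytes(sender_mac)
    ip_bytes = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    eth_header = BROADCAST_MAC + src + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), sizes 6 and 4, opcode request (1).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender and target IP are identical; the target MAC is all zeros.
    arp_body = src + ip_bytes + ZERO_MAC + ip_bytes
    return eth_header + arp_header + arp_body


if __name__ == "__main__":
    # Example values are assumptions (192.0.2.0/24 is a documentation range).
    frame = build_garp_request("00:15:5d:00:00:01", "192.0.2.10")
    print(frame.hex())
```

The sketch only builds the bytes. Actually putting the frame on the wire needs a raw socket or a packet library and administrative rights, which is outside the scope of this post.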

Some background info:

  • For the sake of this blog post let’s call the 2 nodes A and B.
  • The nodes are running Server 2016, SQL 2012 and Microsoft Failover Cluster services.
  • Each node has 2 NICs, one for the client and management network, and one for the heartbeat network.
  • The cluster consists of 3 network resources: a cluster IP address and 2 SQL instance addresses which float between the 2 nodes depending on which one is active.
  • All 3 IP addresses are in the same VLAN.
  • I ran a continuous ping to all 3 IP addresses during the failover tests.

When node A was active and I failed the cluster roles over to node B, the failover completed successfully, but the 3 cluster IP addresses stopped responding to ping and did not respond again for an hour or so. If I failed back to node A straight away, the pings started responding again.

Looking at the ARP table on the Layer 3 switch I realized that the MAC address associated with the cluster IP addresses wasn’t changing to the MAC address of node B – which is what we would expect as a result of the failover operation.

Thinking this was a network issue, I asked the network team to make sure GARP requests were allowed on the relevant VLANs. They confirmed this was already the case on all downstream devices.

Not being able to resolve the issue, I decided to carry out a packet capture on the node claiming the cluster roles.  Looking at the capture results I realized no GARP requests were being generated by the node – which was unusual!!!
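When going through a capture like this, the thing to look for is an ARP packet whose sender IP and target IP are the same address. As a rough illustration, the Python sketch below applies that check to a raw Ethernet frame. The function name is hypothetical, it assumes an untagged frame (no 802.1Q VLAN header), and it is a simplification rather than a reproduction of how any particular capture tool classifies packets.

```python
import struct


def is_gratuitous_arp(frame: bytes) -> bool:
    """Return True if an untagged Ethernet frame looks like a gratuitous ARP.

    Hypothetical helper: treats any IPv4-over-Ethernet ARP request or reply
    whose sender IP equals its target IP as gratuitous.
    """
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP payload
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != 0x0806:  # not ARP
        return False
    htype, ptype, hlen, plen, opcode = struct.unpack("!HHBBH", frame[14:22])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return False
    sender_ip = frame[28:32]
    target_ip = frame[38:42]
    return opcode in (1, 2) and sender_ip == target_ip
```

In my case a check like this would have matched nothing at all, because the node simply wasn’t generating the packets.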

After some googling, I came across the following blog post:

https://icookservers.blog/2016/07/19/windows-2012-r2-cluster-wont-send-gratuitous-arp-garp-packets-by-default/

It appears there is a registry entry in Windows which controls whether GARP requests are sent out when a failover occurs. By default this entry doesn’t exist in Server 2012 R2. Thinking this might be the case with Server 2016 as well, I looked at the registry of node B. The registry entry was there but it was set to 0, i.e. don’t send GARP!!! The blog post suggested setting this value to 3, i.e. send GARP three times, so I set the value to 3 and then rebooted the node. Once the node was accessible again, I carried out another failover test and voila!!! I only experienced a single ping drop this time before all 3 cluster IP addresses were accessible again.

The registry entry that needs to be applied:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
REG_DWORD > ArpRetryCount
Value is between 0-3 (use value 3)

0 – don’t send garp
1 – send garp once only
2 – send garp twice
3 – send garp three times (Default Value – actually not present on Windows 2012 R2)
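If you would rather script the change than click through regedit, the standard `reg` command line tool can query and set the value. The small Python sketch below only builds the command strings, with the 0-3 range from the list above enforced; the function names are my own, and you would still need to run the resulting command from an elevated prompt on each node and reboot, as I did.

```python
KEY_PATH = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
VALUE_NAME = "ArpRetryCount"


def reg_query_command() -> str:
    """Command line that shows the current ArpRetryCount value, if present."""
    return f'reg query "{KEY_PATH}" /v {VALUE_NAME}'


def reg_add_command(count: int = 3) -> str:
    """Command line that sets ArpRetryCount to a value between 0 and 3.

    Hypothetical helper: it only builds the string, it does not touch the registry.
    """
    if not isinstance(count, int) or not 0 <= count <= 3:
        raise ValueError("ArpRetryCount must be an integer between 0 and 3")
    return f'reg add "{KEY_PATH}" /v {VALUE_NAME} /t REG_DWORD /d {count} /f'


if __name__ == "__main__":
    print(reg_query_command())
    print(reg_add_command(3))
```

Building the string separately from running it makes it easy to review exactly what will be applied before touching a production node.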

NOTE – I’m not convinced that this registry fix is required for every cluster deployment, as other cluster deployments which lack this registry entry don’t exhibit the same behavior. Perhaps the network infrastructure plays a role in creating this issue and the registry entry is required to force the GARP requests. I really don’t know.

https://wiki.wireshark.org/Gratuitous_ARP

https://support.microsoft.com/en-us/help/244331/mac-address-changes-for-virtual-server-during-a-failover-with-clusteri

4 thoughts on “Microsoft Failover Cluster node not sending out Gratuitous ARP requests after a failover”

  1. Tapan

    Thanks Cengiz for sharing this. You saved me after three days of struggling, and surprisingly MS have no clue about this. Much appreciated.

  2. Ross

    Have you by chance got a packet capture of what you were seeing? I’ve got a similar issue where I have a guest file-server cluster on a Hyper-V cluster. My packet capture using port mirroring seems to show a GARP being sent, however it has the wrong op code which the switch vendor are claiming their switching won’t accept. Additionally I can see that the destination MAC address on it shows as 00:00:00:00:00:00 instead of FF:FF:FF:FF:FF:FF.

    However in Wireshark, it reports it as GARP when using the arp.isgratuitous == 1 filter.

    Address Resolution Protocol (request/gratuitous ARP)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1) <- this should be 2 to indicate a reply
    [Is gratuitous: True]
    Sender MAC address: Microsof_03:34:07 (00:15:5d:03:34:07)
    Sender IP address: x.x.x.x (cluster IP)
    Target MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00) <- this should be ff:ff:ff:ff:ff:ff
    Target IP address: x.x.x.x (cluster IP)

    The switch vendor have come back advising that the Opcode should be 2 indicating a response, which to be fair does seem correct. So I am a bit unsure whether this is a switch issue or something else cluster wise. I have tried the ArpRetry registry key and rebooted the nodes, and removed the switch embedded teaming and neither made any difference. The MAC address table on the switch is out of date which is the cause of the issue. Everything within the same layer 2 segment is completely fine though, updates instantly.

    1. Sinan Korkmaz

      Hi Ross,
      In fact, the switch vendor does not seem to be correct to me.
      The packet you describe is defined as an “ARP announcement” by rfc3927.
      By the way, note that the all-zero address is NOT the destination address of the packet. The Layer-2 destination address of the packet is still all-ones (ff:ff:ff:ff:ff:ff). The all-zero address is the TARGET MAC address, which is inside the ARP protocol data. As per the rfc, the packet can very well be a request, with the target set to all zeros (nothing wrong here). Basically, the host is announcing that the “Sender IP address” now resides on the “Sender MAC address” given in the ARP data.
      In any GARP situation, all hosts on the local network should update their ARP caches with the new address provided. Any layer-3 switch is also a host in those terms, so the switch SHOULD (should is the term the rfc uses, not must, by the way) update its ARP cache information accordingly.

      Hope this helps
      Sinan

  3. jasiulaka

    By default in Windows 2012 R2, the ARP probe is sent three times but the ARP announcement only once.
    When ArpRetryCount is set to 0, none of them are sent.
