I guess most people working with NSX are aware of the overlay tunnels: they need to be there to have networks connected to your workloads. After some host reboots to change power settings, we didn't see the tunnels coming back online. At first you might think no workloads had been migrated to the hosts yet, but unfortunately that was not the case! The hosts looked connected, the transport nodes reported green, and yet there was no sign of any tunnel sessions. For many environments, this is the first visible symptom of a deeper issue within the NSX control plane.

Tunnels form the foundation of NSX overlay networking. They enable East-West traffic between transport nodes, support distributed firewall rules, and ensure vMotion can move workloads across the environment seamlessly. When tunnels vanish, communication between virtual networks effectively stops — even if the underlying infrastructure still appears healthy.

Let’s take a closer look at what caused this in our case.

The Technical Cause

In certain NSX versions, a JDK bug (JDK-8330017) causes the internal thread pool (ForkJoinPool) to stop executing tasks. This happens when a counter called the Release Count (RC) overflows. Once that occurs, key NSX Manager services stop working — even though the UI may still show them as up.

This issue affects several critical components:

  • Controller Service – impacts network provisioning, firewall rule publishing, and vMotion

  • Upgrade Coordinator – causes upgrade tasks to fail or consume excessive memory

  • Corfu Service – disrupts state storage and data synchronization between managers

The problem develops slowly, often surfacing during configuration changes, upgrades, or when the system runs low on memory. By the time tunnels disappear, the affected services have already stopped processing new tasks.
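
If you suspect you are hitting this, the NSX Manager CLI gives a quick read on the affected services. A minimal check, assuming admin CLI access on a manager node (the exact service list varies a bit per release):

get services
get service controller

Keep in mind that a service can show as running here while its thread pool is stuck, so a growing task backlog or sluggish API responses is often the more reliable warning sign.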

How to Resolve It

Broadcom has addressed this bug in NSX 4.2.1.4, 4.2.2.0, and 9.0.1.0 and later releases. The official guidance (see Broadcom KB 396719) outlines a rolling reboot procedure to recover the affected services before upgrading to a fixed build.
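
Before planning the upgrade, it is worth confirming exactly which build you are on; the manager CLI reports it directly:

get version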

Rolling reboot procedure:

  1. Reboot the first NSX Manager.
  2. Verify cluster health:

get cluster status

  3. Wait until all services report up on all nodes.

  4. Repeat for the next NSX Manager until all three have been rebooted.

This clears the stuck thread pools and restores normal operation temporarily.
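
Per manager node, one pass of the cycle can be driven entirely from the appliance CLI. A rough sketch, assuming SSH access as admin (the reboot command asks for confirmation before the node goes down):

reboot

Then, once the node is back online, check from any manager:

get cluster status
get services

Only move on to the next manager when every cluster group reports up again, otherwise you risk losing quorum on the three-node cluster.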

When Tunnels Are Still Missing

In some cases, the NSX Manager services recover, but the transport node services remain stuck, leaving the tunnel list empty even after the reboot cycle.

To restore tunnel connectivity on affected hosts, restart the local NSX agents:

/etc/init.d/nsx-opsagent restart
/etc/init.d/nsx-proxy restart
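
Before waiting on the tunnel list, it is worth checking that both agents actually came back. Assuming the init scripts support the usual status action, as ESXi init scripts generally do:

/etc/init.d/nsx-opsagent status
/etc/init.d/nsx-proxy status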

After a few minutes, the tunnels should reappear under Networking → Tunnels in the NSX UI or via the CLI:

get logical-tunnels
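
If the list stays empty, a low-level sanity check is to ping a remote TEP from the host over the overlay network stack. A generic check, assuming the TEP vmkernel interfaces sit on the default vxlan netstack and an overlay MTU of at least 1600 (replace the placeholder with a real remote TEP address):

vmkping ++netstack=vxlan -d -s 1572 <remote-TEP-IP>

A successful don't-fragment ping of that size rules out underlay MTU and reachability problems, which points the investigation back at the NSX services themselves.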

Recommended Actions

  • Upgrade NSX to one of the fixed versions as soon as possible; remember that the reboot procedure only mitigates the problem temporarily!

  • Restart transport node services if tunnels remain missing after manager recovery.

  • Watch for early warning signs, such as delayed rule publishing, failed vMotions, or controller task backlog.
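
On that last point, even basic monitoring helps. A rough polling sketch from a jump host over SSH (hypothetical manager hostnames, and it assumes the NSX CLI on your version accepts a single command over an SSH session):

for mgr in nsx-mgr-01 nsx-mgr-02 nsx-mgr-03; do
  ssh admin@$mgr "get cluster status" | grep -qi degraded && echo "ALERT: $mgr reports a degraded cluster group"
done

Hook the output into whatever alerting you already have; the point is simply to notice a degraded group before the tunnels start disappearing.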

Final Thoughts

When tunnels disappear, it’s rarely a cosmetic issue — it’s a sign that core NSX services have silently stopped working. The underlying JDK bug exposes how sensitive distributed network control can be to service execution failures.

Until you move to a fixed NSX release, the best approach is a disciplined reboot schedule, active monitoring, and quick service restarts when needed. That combination keeps your overlay stable — and ensures those tunnels stay visible where they belong.
