High Control-Plane CPU Usage / SNMPd Software Loop During LibreNMS Port Table Polling

Tuesday - last edited Tuesday
Tags: #SNMP
Model: SG2008P   TL-SX3206HPP  
Hardware Version: V1
Firmware Version: 1.20.23

Affected Hardware:

  • SG2008P v3.20 (Firmware: 3.20.24)

  • SX3206HPP v1.20 (Firmware: 1.20.23)

Environment Context:

The network consists of multiple Omada devices (ER8411 v1.0, SX3206HPP v1.20, SG2008P v3.20, and multiple EAPs) monitored via a dedicated Network Management System (LibreNMS) polling over SNMP.

Issue Description:

When enabling SNMP at the Site level via the Omada Controller, the Control Plane CPU utilization on the JetStream switches suddenly spikes and locks into a high plateau (ranging from 45% to over 80% depending on the SNMP version used). This utilization behaves like a software serialization loop or thread backlog inside the switch management daemon (snmpd / bcm_mgmt).

Crucially, there is no data-plane performance loss and no packet drops; the hardware switching ASIC behaves perfectly, but the control-plane processor remains pinned indefinitely. Power cycling the switches provides only transient relief: the CPU pinning recurs as soon as SNMP polls reach the switch again.

Diagnostic Methodology & Scientific Isolation Tests:

To isolate whether this behavior was a network broadcast/multicast storm, standard crypto overhead, or a management plane bug, I executed the following methodical testing protocol:

Test 1: SNMPv3 Overhead Check

  • Configuration: SNMPv3 enabled with MD5/SHA Auth and DES/AES Encryption.

  • Result: CPU immediately spiked to a flat-red 50% plateau. Some cryptographic overhead is expected on low-power MIPS/ARM management CPUs lacking hardware crypto acceleration, but the plateau persisted long after initialization.

Test 2: Fallback to SNMPv2c

  • Configuration: Migrated the entire SDN site configuration down to cleartext SNMPv2c to eliminate the cryptographic computation bottleneck.

  • Result: The issue worsened. The CPU rose from the 50% plateau to a sustained 80%+ utilization. Paradoxically, this suggests that removing the authentication choke point allowed the NMS to query tables much faster, overwhelming the internal management daemon queue.

Test 3: Isolating External Traffic vs. Internal Firmware Loops

  • Configuration: I completely disabled SNMP at the Omada Site level for 3 minutes.

  • Result: Switch CPU immediately dropped and idled at a perfect 3%.

  • Configuration: I then re-enabled SNMPv2c, but assigned a completely random, unmapped community string that no device on my network knew.

  • Result: The switch continued to idle perfectly low at 3–5% with SNMP active. This conclusively proved that the underlying switch firmware does not suffer from an unprovoked internal memory leak or looping condition when idling; it only triggers when actively being queried.

Test 4: Component-by-Component Module Granular Isolation

Using the random community string from Test 3, I configured safety constraints in the NMS (Max Repeaters = 1, Max OIDs = 5) to throttle request grouping down to a "sip through a straw". I then selectively allowed LibreNMS to poll individual system tables:

  • Active Tables: processors, mempools, sensors, system.

  • Result: The switch processed these cleanly, idling comfortably between 7% and 11% CPU.

  • Trigger Table: The moment the ports module was enabled (which requests standard ifTable and 64-bit high-capacity ifXTable OIDs like ifHCInOctets/ifHCOutOctets), the management plane was immediately flooded and pinned high, even with Max Repeaters = 1 and Max OIDs = 5.
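
For context on why the throttled settings may not reduce total work, here is a back-of-the-envelope sketch of how many requests a full ifTable + ifXTable walk could take. The model is an assumption on my part (each request carries up to Max OIDs column OIDs and returns up to Max Repeaters rows per column), and I do not know exactly which columns LibreNMS requests. The column counts (22 for ifTable, 19 for ifXTable) are from the IF-MIB as I understand it, and the interface count of 10 is purely illustrative.

```python
from math import ceil


def estimate_requests(rows: int, columns: int, max_oids: int, max_repeaters: int) -> int:
    """Rough number of SNMP requests needed to walk a whole table.

    Assumed model: every request asks for up to `max_oids` columns and gets
    back up to `max_repeaters` rows for each of them. The final
    end-of-table request is ignored.
    """
    column_batches = ceil(columns / max_oids)
    row_batches = ceil(rows / max_repeaters)
    return column_batches * row_batches


# Illustrative inputs only (not measured on the switch).
COLUMNS = 22 + 19   # ifTable + ifXTable columns, assumed all polled
INTERFACES = 10     # hypothetical interface count

throttled = estimate_requests(INTERFACES, COLUMNS, max_oids=5, max_repeaters=1)
bulk = estimate_requests(INTERFACES, COLUMNS, max_oids=5, max_repeaters=10)
print(throttled, bulk)  # 90 9
```

Under these assumptions, Max Repeaters = 1 makes each request smaller but multiplies the number of requests per poll cycle, so if the cost is per request or per counter read, throttling alone would not be expected to help much.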

Technical Conclusion:

The systemic bottleneck lies squarely within how the JetStream firmware architecture handles the serialization of the Interface Counter Tables (ifTable / ifXTable) via standard SNMP daemons.

When an external manager requests standard counter tables, the underlying JetStream operating system kernel drops into an inefficient, high-overhead synchronous loop to read the raw switching hardware registers and translate them into SNMP UDP frames. Because this process lacks proper queue throttling or non-blocking buffer separation inside the firmware, even serialized single-request queries (Max Repeaters = 1) cause the switch's internal management process (snmpd / bcm_mgmt) to hit a synchronization bottleneck and spin out—pinning both entry-level hardware like the SG2008P and higher-end aggregation gear like the SX3206HPP.
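
To check this independently of LibreNMS, the walk can be timed per table with the Net-SNMP `snmpbulkwalk` tool. The sketch below is only a rough reproduction aid: it assumes `snmpbulkwalk` is installed on the test machine, and the function names, host address and community string are placeholders of mine, not anything from the switch or the controller.

```python
import subprocess
import time

# Standard OIDs for the tables discussed above.
SYSTEM = "1.3.6.1.2.1.1"             # SNMPv2-MIB::system
IF_TABLE = "1.3.6.1.2.1.2.2"         # IF-MIB::ifTable
IF_X_TABLE = "1.3.6.1.2.1.31.1.1"    # IF-MIB::ifXTable


def build_walk_cmd(host: str, community: str, oid: str, max_repeaters: int = 1) -> list:
    """Build an snmpbulkwalk (SNMPv2c) command line with numeric OID output."""
    return [
        "snmpbulkwalk", "-v", "2c", "-c", community,
        f"-Cr{max_repeaters}",   # max-repetitions per GetBulk request
        "-On", host, oid,
    ]


def time_walk(host: str, community: str, oid: str, max_repeaters: int = 1):
    """Run one walk and return (elapsed_seconds, number_of_values_returned)."""
    start = time.monotonic()
    result = subprocess.run(
        build_walk_cmd(host, community, oid, max_repeaters),
        capture_output=True, text=True, check=False,
    )
    elapsed = time.monotonic() - start
    return elapsed, len(result.stdout.splitlines())


def compare_tables(host: str, community: str) -> None:
    """Time a cheap subtree against the interface tables for comparison."""
    for name, oid in (("system", SYSTEM), ("ifTable", IF_TABLE), ("ifXTable", IF_X_TABLE)):
        elapsed, values = time_walk(host, community, oid)
        print(f"{name}: {values} values in {elapsed:.2f}s")
```

Calling `compare_tables("192.0.2.10", "example-community")` (both values hypothetical) while watching the switch CPU should show whether the interface tables alone are enough to trigger the plateau.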

Request to TP-Link Engineering Team:

Could the development team investigate how ifTable/ifXTable values are cached or processed by the management plane daemon in upcoming firmware cycles? A mechanism to protect the control plane from processing-intensive bulk counter walks, or an optimization of how the kernel handles standard GetBulk / 64-bit counter serialization, is badly needed to make Omada switches truly compatible with standard enterprise NMS environments.
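
To illustrate the kind of caching I have in mind, here is a minimal sketch of a time-based snapshot cache in front of an expensive counter read. This is not how the JetStream firmware works (I have no visibility into it); the class name, the `read_hardware` callback and the 5-second TTL are all hypothetical, chosen only to show the idea.

```python
import time
from typing import Callable, Dict, Optional


class CachedCounterTable:
    """Serve interface counters from a snapshot refreshed at most once per TTL."""

    def __init__(
        self,
        read_hardware: Callable[[], Dict[int, int]],  # hypothetical expensive read
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._read_hardware = read_hardware
        self._ttl = ttl_seconds
        self._clock = clock
        self._snapshot: Optional[Dict[int, int]] = None
        self._taken_at = 0.0
        self.hardware_reads = 0  # how often the expensive path actually ran

    def get(self, if_index: int) -> int:
        now = self._clock()
        # Refresh only if there is no snapshot yet or the snapshot has expired.
        if self._snapshot is None or now - self._taken_at >= self._ttl:
            self._snapshot = self._read_hardware()
            self._taken_at = now
            self.hardware_reads += 1
        return self._snapshot[if_index]


# Demo with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
table = CachedCounterTable(lambda: {1: 1000, 2: 2000}, ttl_seconds=5.0,
                           clock=lambda: fake_now[0])
for _ in range(100):          # 200 lookups within one TTL window
    table.get(1)
    table.get(2)
fake_now[0] = 6.0             # advance past the TTL
table.get(1)
print(table.hardware_reads)   # 2
```

With something along these lines, a full walk by the NMS would cost one hardware read per TTL window instead of one per requested counter, at the price of counters being up to one TTL stale.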
