RootNode changes so frequently that consensus doesn't seem to occur with topology discovery
Hello, I am running 4 APs and 9 switches with mbentley's great Docker image (latest version) on a Raspberry Pi 4B (4GB model) with USB flash storage. The controller is not a bottleneck here, as it is faster than the OC200 and OC300.
Switches are configured with different priorities (core switches are lowest and most reliable) using STP, with loopback detection disabled. The edge switch acts as a bridge because the next hop is a FreeBSD-based firewall whose OS does not support bridging. This is to enable LACP between the bridge switch and the firewall, in hopes of some load balancing / aggregation for the tagged trunks from separate switches downstream. Untagged trunks go to one interface and the rest of the interfaces are in LACP with the firewall. Currently, only a single path is connected until I've optimized. Sample logs:
1-01-2023 23:47:59.537 WARN [monitor-topology-pool-0] [] c.t.s.o.m.t.p.s.t.TopologyTask(): RootNode has changed, rediscovering topology, omadacId=f4bd606c8bece330b24823e5294a3b04, siteId=6541c6fdaf78c275fd1d13a2
11-01-2023 23:48:59.540 WARN [monitor-topology-pool-0] [] c.t.s.o.m.t.p.s.t.TopologyTask(): RootNode has changed, rediscovering topology, omadacId=f4bd606c8bece330b24823e5294a3b04, siteId=6541c6fdaf78c275fd1d13a2
Happy to share more details including diagrams if a ticket is to be created (assuming I am not chasing an unsupported topology).
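To quantify how often the controller logs this entry, a minimal Python sketch along these lines could pull the timestamps out of server.log and compute the gap between consecutive entries. The regular expression is based only on the two sample lines above; the month-day-year timestamp order and the function names are assumptions, not anything documented by Omada.

```python
import re
from datetime import datetime

# Pattern modelled on the sample WARN lines above (assumed to be representative).
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) WARN .*"
    r"RootNode has changed, rediscovering topology, "
    r"omadacId=(?P<omadac>\w+), siteId=(?P<site>\w+)"
)

# Assumption: timestamps are month-day-year; swap to "%d-%m-%Y" if not.
TS_FORMAT = "%m-%d-%Y %H:%M:%S.%f"


def root_change_times(lines):
    """Return {siteId: [datetime, ...]} for every RootNode-change entry."""
    events = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # other log entries, or lines with a truncated timestamp
        stamp = datetime.strptime(match.group("ts"), TS_FORMAT)
        events.setdefault(match.group("site"), []).append(stamp)
    return events


def intervals_seconds(times):
    """Seconds between consecutive events, after sorting by time."""
    ordered = sorted(times)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
```

Passing the lines of server.log to `root_change_times` and then each site's list to `intervals_seconds` would show whether the entry recurs on a fixed schedule (the two samples above are roughly one minute apart) or irregularly.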
Hi @seanvan
Thanks for posting in our business forum.
I don't quite follow your point, and I don't see how the terms you mentioned relate to each other.
What is the actual issue with your network? I mean the symptoms you are experiencing.
Would love to see a diagram to illustrate your issue.
This looks like a log exported to an external log system rather than the log shown in the controller itself.
What are the models you have?
Update: "RootNode has changed, rediscovering topology.."
I consulted with the senior engineer and this is a log indicating that your root node has been changed. You might not have the Omada router. That's why it occurs.
@Clive_A Apologies for the delay.
Those logs are directly from the controller, they are the server.log file. There are a number of issues that are linked to this:
- The topology view in the GUI is almost always blank, or when it does show, it is incomplete and/or inaccurate in its layout.
- The network is under very low load, but the CPU utilization is rather high on the core switches, which can make the network (especially wifi) unusable.
- The logs are flooded with this entry and sadly there is not much detail beyond it to confirm what the issue is (is there a way to increase the verbosity of the logging? I do not see one in the GUI, but thankfully I have a software controller, so perhaps by editing the startup script for the controller?)
I completely disagree with your network engineer based on the above and the "rootnode" in the log. It sounds like a loop in the network or a layer 2 consensus issue with loop prevention failing (i.e. Root Node is changing, rediscovering topology). Besides, it would be a terrible business model to cause performance and network convergence issues when we use a 3rd-party router (it is layer 2, not layer 3, so a 3rd-party router shouldn't matter). I have done the following so far:
- Scaled down the network to as little as two switches + my firewall and ensured there are no redundant cables that could cause a loop. I have tried replacing cables (all cables are relatively new already). The network performed a bit better, which is expected since convergence is faster on a smaller network, but the constant Root Node changing still occurs.
- I eliminated the untagged switchport to the firewall and instead just have a single LACP trunk to the firewall with my VLANs and a "dummy" VLAN, as the Omada software forces you to have an untagged network for some rather odd reason. Eliminating the untagged traffic, especially the default "LAN / VLAN 1", from the trunk (this is the default when you create a network or profile for some reason) had quite a positive impact on performance, but did not resolve the constant topology change issue. FYI: per 802.1Q standards, you are not supposed to mix untagged and tagged traffic in the same VLAN trunk.
- When scaling up the network, I have tried RSTP (no loopback), STP (no loopback) and loopback detection, and none of them resolves the topology change issue. STP and RSTP are both working as expected and I do see in the GUI that redundant links are getting blocked, except for one scenario which is quite odd: if you have two separate paths that are each an LACP/LAGG, STP/RSTP does not block and you end up with a loop (I knew it was impossible, but I was silently hoping that Omada was doing MSTP or MLAG to use both links at the same time). It does block when you have one link as an LACP/LAGG and the second as a single connection (either configured as an LACP/LAGG with its other link not connected, or as standalone).
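The two-LAG scenario in the last bullet can be reasoned about by collapsing each LAG into one logical link and checking which logical links close a loop. The sketch below is only an illustration of that idea using union-find; the switch names and LAG ids are hypothetical, and it says nothing about how the Omada switches actually implement STP over LAGs.

```python
def find_redundant_links(links):
    """links: iterable of (switch_a, switch_b, lag_id or None).

    Member ports that share a LAG id between the same pair of switches are
    treated as one logical link. Returns the logical links that close a
    loop, i.e. the ones spanning tree would be expected to block.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Step 1: collapse LAG member ports into a single logical link.
    logical, seen = [], set()
    for a, b, lag in links:
        if lag is not None:
            key = (frozenset((a, b)), lag)
            if key in seen:
                continue
            seen.add(key)
        logical.append((a, b, lag))

    # Step 2: any logical link joining two already-connected switches is redundant.
    redundant = []
    for a, b, lag in logical:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            redundant.append((a, b, lag))
        else:
            parent[root_a] = root_b
    return redundant


# Hypothetical example: two separate 2-port LAGs between the same switches.
two_lags = [
    ("sw-a", "sw-b", "lag1"), ("sw-a", "sw-b", "lag1"),
    ("sw-a", "sw-b", "lag2"), ("sw-a", "sw-b", "lag2"),
]
print(find_redundant_links(two_lags))
```

In this model the second LAG is reported as redundant, which matches the expectation that one of the two LAG paths should end up blocked rather than both forwarding.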
My Visio diagrams have a bit too much detail to share on a public forum; I will try to reduce it when I have some time. I did manage to get a screenshot, as the GUI is actually listing all the switches and APs, but the layout is not accurate. Bridge01 is set with the lowest priority for RSTP, followed by dist01, which is correct, but you can see it doesn't know how core01 and dist02 are connected, even though they are also very close in priority and in number of hops (dist02 is directly connected to bridge01, but blocked by RSTP, and core01 is connected to both dist01 and dist02). There are more inaccuracies further along, which makes sense considering the issues going on. I would be surprised if the software was drawing an accurate topology on its own at this point.
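For reference, standard STP/RSTP root election picks the bridge with the lowest bridge ID, comparing the configured priority first and the MAC address as tie-breaker. A minimal sketch of that comparison follows; the switch names come from the description above, but the priority values and MAC addresses are made up for illustration and are not my real settings.

```python
def bridge_id(priority, mac):
    """Comparable bridge ID: priority first, then the MAC as an integer."""
    mac_int = int(mac.replace(":", "").replace("-", ""), 16)
    return (priority, mac_int)


def elect_root(bridges):
    """bridges: {name: (priority, mac)}. The lowest bridge ID wins."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# Hypothetical priorities (multiples of 4096) and MAC addresses.
bridges = {
    "bridge01": (4096, "00:00:5e:00:53:01"),
    "dist01": (8192, "00:00:5e:00:53:02"),
    "core01": (12288, "00:00:5e:00:53:04"),
    "dist02": (12288, "00:00:5e:00:53:03"),
}

print(elect_root(bridges))  # bridge01 has the lowest priority value

# Without bridge01 and dist01, the tie on priority is broken by the lower MAC.
remaining = {k: v for k, v in bridges.items() if k in ("core01", "dist02")}
print(elect_root(remaining))
```

With stable priorities and no flapping links this election is deterministic, so a root that keeps changing at layer 2 would point at lost BPDUs or a link going up and down rather than at the priority configuration.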
Models are the following:
Hi @seanvan
Thanks for posting in our business forum.
After checking with the senior engineer, I can confirm several points:
1. That log comes from the running log. Confirmed again.
2. That log should not affect anything about STP/RSTP. It is used to generate the Map > Topology view.
3. You should set a switch as the core switch, which can help stabilize the root node.
1. Happy to share my server.log file for your engineer. The error message is continually logged every few mins.
2. Thanks for confirming that RootNode is Omada specific. It's an industry term used with RSTP/STP, so easily confused (and further confusing as the error still occurred with STP disabled).
3. I did this, but it does not help.
I actually did a lot more troubleshooting and it appears that there are a few issues at play.
- One of my switches was consistently at 90+% CPU, and even after a factory reset the CPU usage remained consistently high. I moved every single connected device to other switches and performed a factory reset again, which fixed this issue. The other switches did not see an increase in CPU load over the few days I have been monitoring, so it must have been an anomaly.
- I tested my network with only a single switch and then started adding additional switches. With just core01, dist01 and ap04, there is no error message about RootNode changing, the map in the Omada GUI appears and the network performs fine. As soon as I add any other switch, the error message is logged along with an additional message about the device in question (MAC address) not being found in the topology, the map disappears and the network slowly degrades in performance.
It appears as though there might be some corruption in my MongoDB, such that devices are not being added/removed properly. I have decided to rebuild a new controller from scratch and manually create everything (no migration) to ensure a clean DB and see if that resolves the issue. Unless you have any other suggestions based on the 2 updates above?
Thanks!