Member
Posts: 238
Registered: ‎04-09-2013
Kudos: 89
Solutions: 6
Accepted Solution

[ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

[ Edited ]

Hello,

 

TL;DR: Has anyone else experienced the same issue? What data should I gather and bring to you UBNT guys the next time it occurs?

 

Context:

hardware: ERL 3

version: v1.10.6

setup: zone-based firewall, vlans, hw offloading enabled, no dpi, no netflow

 

My trusty ERL3 has been running great for years (well, like most I had to replace the USB flash storage and the PSU at some point), update after update.

BUT (like in any good story), this is now the second time that, after more than one month of uptime, the network gets sloppy, bandwidth drops aaaand, yeah, you've seen the trailer: I have all the symptoms of disabled hardware offloading.

I confirmed it when logging into the ERL via SSH and testing bandwidth:

  • SSH gets mostly unresponsive while the ERL is busy routing
  • top shows the CPU cores are hogged by "si" (software interrupts), which typically happens when offloading is disabled and the ERL's bandwidth is CPU-limited.

So I rebooted the ERL and hardware offloading was back on:

  • SSH responsive even under load
  • top showing the CPU isn't sweating when testing bandwidth
  • Network bandwidth no longer limited by the ERL

 

So, my question is: obviously I'll try to catch the console logs (and post my whole sanitized config) when it happens again, but is there any additional data I could gather that would be helpful?

 

Cheers,

see you in one month ;)

 

--

edit: added config file to post


Accepted Solutions
Ubiquiti Employee
Posts: 1,228
Registered: ‎07-20-2015
Kudos: 1444
Solutions: 81

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

[ Edited ]

> Hardware Offloading "breaking" after >30days uptime

After reviewing the source code I was finally able to find the root cause of this issue. The bug was introduced in v1.10.2 on all Cavium-based routers and causes offloading to forward packets via the slowpath after ~30 days of uptime (2^28 = 268,435,456 timer ticks, to be precise).

 

We will provide a fix in the v2.0.1 and v1.10.9 firmwares.

 

Update: actually this bug was present even in pre-1.10.2 firmwares, but it would take 25 months of uptime to trigger it on an ER-Lite (or 12 months on an ER8, or 5 months on an ER-Infinity).
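
For reference, these trigger times follow directly from the 28-bit counter width and the per-model tick rates (detailed in my follow-up post below); back-of-the-envelope:

    2^28 ticks / 100 ticks/sec ≈  2,684,355 sec ≈  31 days               (all models, v1.10.2+)
    2^28 ticks /   4 ticks/sec = 67,108,864 sec ≈ 776 days ≈ 25 months   (ER-Lite, pre-1.10.2)

The 12 and 5 month figures imply pre-1.10.2 rates of roughly 8.5 ticks/sec on the ER8 and 20 ticks/sec on the ER-Infinity.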

View solution in original post

Ubiquiti Employee
Posts: 1,228
Registered: ‎07-20-2015
Kudos: 1444
Solutions: 81

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

[ Edited ]

> I'd be curious to hear more details of what caused this if possible.

  1. Offloading has a 28-bit wide time counter that wraps around when it reaches the value 268435455, and it stops processing new flows after this event (bug! - see the sketch below)
  2. Before v1.10.2 the time counter grew at a different rate on each CPU. For instance, on the ER-Lite it grew by 4 units/sec, which meant it would take 25 months to wrap around
  3. In v1.10.2 we synchronized the time counter with "linux jiffies" to get a constant growth rate on all CPUs; since then the time counter grows by 100 units/sec, which means it wraps around in about 1 month on all ER models.
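
To illustrate the failure mode, here is a minimal standalone C sketch. This is not the actual module source: the names, the TTL value (loosely based on new_flow_interval from the stats output in this thread) and the expiry logic are my simplification of the idea. Once a timestamp truncated to 28 bits wraps back past the stamps stored in existing flow entries, a plain 32-bit subtraction no longer measures elapsed time, and every age check returns garbage until reboot:

    /* Illustrative sketch only -- not the actual UBNT source. At HZ=100,
     * a 28-bit tick counter wraps after 2^28 ticks, i.e. ~31 days. */
    #include <stdint.h>
    #include <stdbool.h>

    #define TICK_MASK    ((1u << 28) - 1)  /* counter wraps at 268435455 */
    #define NEW_FLOW_TTL 1200u             /* ticks (~12 s at HZ=100) */

    struct flow { uint32_t stamp; };       /* 28-bit creation time */

    /* Buggy age check: the subtraction happens in full 32-bit arithmetic,
     * so once `now` has wrapped back below f->stamp the difference becomes
     * huge and a flow created one second ago already looks ancient. */
    static bool flow_expired_buggy(const struct flow *f, uint32_t now)
    {
        return now - f->stamp > NEW_FLOW_TTL;
    }

    /* Fixed age check: masking the difference back into 28 bits makes the
     * arithmetic modulo 2^28, so it survives the wraparound. */
    static bool flow_expired_fixed(const struct flow *f, uint32_t now)
    {
        return ((now - f->stamp) & TICK_MASK) > NEW_FLOW_TTL;
    }

As a sanity check, elgo's statistics output further down in this thread is consistent with exactly one wrap: at ~41 days of uptime, timer_ticks reads 85960018, which is roughly 41 days' worth of ticks at 100/sec minus 2^28.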

View solution in original post


All Replies
Senior Member
Posts: 3,234
Registered: ‎08-06-2015
Kudos: 1383
Solutions: 186

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

When this happens, what does 'show ubnt offload' show?  That shows the actual running state.

 

Shortly after booting, run 'show ubnt offload statistics' to ensure that stats collection is enabled.  (Displaying the stats also ensures the collection is enabled)

 

Then, when the problem recurs, run that again a few times with some time between each iteration. How long to wait depends on activity: if many new connections are constantly being created through your router then a short timeframe will suffice (ten or fifteen minutes?), but otherwise a longer period may be warranted.  All of the metrics could be relevant, but in particular you might look for a steadily increasing count for 'ipv4_create_flow_not_found_replaced_non_expired' and/or 'ipv6_create_flow_not_found_replaced_non_expired' (see the sketch below for what these counters likely mean).
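
As an aside, here is my rough mental model of what those creation counters correspond to -- a hypothetical C sketch, not the actual module source; the direct-mapped layout and all names are assumptions on my part:

    /* Hypothetical flow-creation path matching the counter names in
     * 'show ubnt offload statistics' (illustrative only).  Assumes each
     * flow hashes to a single bucket and inserting evicts the occupant. */
    #include <stdint.h>

    #define FLOW_TTL 1200u                   /* ticks; age limit for entries */

    struct flow  { uint64_t key; uint32_t stamp; uint8_t valid; };
    struct stats {
        uint64_t create_found, create_found_replaced;
        uint64_t create_not_found;
        uint64_t create_not_found_replaced_expired;
        uint64_t create_not_found_replaced_non_expired;
    } st;

    static int expired(const struct flow *f, uint32_t now)
    {
        return now - f->stamp > FLOW_TTL;    /* age check in ticks */
    }

    static void create_flow(struct flow *table, uint32_t nbuckets,
                            uint64_t key, uint32_t now)
    {
        struct flow *b = &table[key % nbuckets];

        if (b->valid && b->key == key) {
            st.create_found++;               /* flow already cached... */
            st.create_found_replaced++;      /* ...refresh it in place */
        } else {
            st.create_not_found++;           /* empty bucket or another flow */
            if (b->valid) {
                if (expired(b, now))         /* evicting a dead flow: fine */
                    st.create_not_found_replaced_expired++;
                else                         /* evicting a live flow: table
                                              * pressure or broken expiry */
                    st.create_not_found_replaced_non_expired++;
            }
        }
        b->key = key;                        /* install the new entry */
        b->stamp = now;
        b->valid = 1;
    }

A steadily climbing *_replaced_non_expired counter would mean live flows are constantly being evicted, which fits the "offloading enabled but traffic not staying offloaded" symptom.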

 

It is possible that offloading remains enabled but that traffic simply isn't actually being offloaded.  There is always some traffic (such as new connections) that is handled by the kernel but it looks like something may be happening to cause most/all traffic to be handled by the kernel.

 

When this occurs, what does 'sudo conntrack -C' show? Similarly for 'sudo conntrack -L'.  Do you have any conntrack tunables in your configuration?

 

 

Member
Posts: 238
Registered: ‎04-09-2013
Kudos: 89
Solutions: 6

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

[ Edited ]

Hello waterside,

 


@waterside wrote:

When this happens, what does 'show ubnt offload' show?  That shows the actual running state.

 


That's the first thing I checked, but it showed that forwarding & vlan hardware offload were enabled, as they should be.

 


@waterside wrote:

It is possible that offloading remains enabled but that traffic simply isn't actually being offloaded.  There is always some traffic (such as new connections) that is handled by the kernel but it looks like something may be happening to cause most/all traffic to be handled by the kernel.


That's what I'm thinking too, and if that's the case it's really tricky to put a finger on it. The statistics are a good idea to prove it; I'll collect some.


@waterside wrote:
Do you have any conntrack tunables in your configuration?

I had some, but I removed them a couple of months ago (IIRC the changelogs mentioned that there are now default tunings per ER model). I think I now have the default values:

OLD:

    conntrack {
        expect-table-size 4096
        hash-size 4096
        modules {
            gre {
                disable
            }
            h323 {
                disable
            }
            pptp {
                disable
            }
            sip {
                disable
            }
        }
        table-size 32768
        tcp {
            half-open-connections 512
            loose disable
            max-retrans 3
        }
    }

NOW:

    conntrack {
        expect-table-size 2048
        hash-size 32768
        modules {
            gre {
                disable
            }
            h323 {
                disable
            }
            pptp {
                disable
            }
            sip {
                disable
            }
        }
        table-size 262144
        tcp {
            half-open-connections 512
            loose disable
            max-retrans 3
        }
    }


 

 

Thank you for all the other suggestions, I'll have a look at them.

 

--

edit: added config file to first post

Emerging Member
Posts: 62
Registered: ‎04-28-2014
Kudos: 26
Solutions: 1

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

Can't add any logs as I rebooted to apply 1.10.7, but I also dropped offloading on 1.10.6 after a few weeks.  I only noticed, when applying 1.10.7, because of the high CPU and low RAM usage (I was running a ridiculously large offload table size).

New Member
Posts: 3
Registered: ‎07-05-2015
Kudos: 1

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

This is an issue I've also been having, at least two other times apart from today; I just rebooted again recently, before checking whether others were having the same issue. Next time it occurs I can post logs. I do recall that the last time the offload broke, 'show ubnt offload' was showing everything as expected.

Member
Posts: 238
Registered: ‎04-09-2013
Kudos: 89
Solutions: 6

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

[ Edited ]

 

Just happened again, nothing on the physical console of the ERL3, and nothing related in the logs either.

 

# uptime
 21:47:10 up 41 days, 57 min,  1 user,  load average: 0.05, 0.08, 0.20

 

$ show ubnt offload

IP offload module   : loaded
IPv4
  forwarding: enabled
  vlan      : enabled
  pppoe     : disabled
  gre       : disabled
IPv6
  forwarding: disabled
  vlan      : disabled
  pppoe     : disabled

IPSec offload module: loaded

Traffic Analysis    :
  export    : disabled
  dpi       : disabled
    version       : 1.422

 

 

 

 Statistics
========================

RX packets:                377105338    bytes:          291027437770
TX packets:                291516854    bytes:          248383058912
Bypass packets:             85588484    bytes:           47076842134
Bad L4 checksum:                2756    bytes:                137428

Protocol        RX packets      RX bytes                TX packets      TX bytes

ipv4            221066935      260001877450           331354221      289868131796
ipv6            0                 0                   0                 0
pppoe           0                 0                   0                 0
vlan            156038403       31025560320            45748361        5591631822

 Forwarding cache size (IPv4)
=============================

table_size (buckets)                  8192
table size (bytes)                    1048576
flows_max (bytes)                     4915200

 Flow cache table size (IPv6)
=============================

table_size (buckets)                  8192
table size (bytes)                    1048576
flows_max (bytes)                     2883584

 Flow timers
=============================

cycles                                1772097149978909
clock_rate                            500000000
HZ                                    100
timer_ticks                           85960018
new_flow_interval (timer_ticks)       1200
old_flow_interval (timer_ticks)       400

 Low-level IPv4 flow dynamics
=============================

ipv4_flow_found                       302433400
    ipv4_flow_found_expired           1805844
    ipv4_flow_found_old_random_bypass 167342
    ipv4_flow_found_action_bypass     0

ipv4_flow_not_found                   74669182

 IPv4 flow creation dynamics
=============================

ipv4_create_flow_found                            1885949
ipv4_create_flow_found_replaced                   1885463
ipv4_create_flow_not_found                        72367248
ipv4_create_flow_not_found_replaced_expired       478704
ipv4_create_flow_not_found_replaced_non_expired   11787

 Low-level IPv6 flow dynamics
=============================

ipv6_flow_found                       0
    ipv6_flow_found_expired           0
    ipv6_flow_found_old_random_bypass 0
    ipv6_flow_found_action_bypass     0

ipv6_flow_not_found                   0

 IPv6 flow creation dynamics
=============================

ipv6_create_flow_found                            0
ipv6_create_flow_found_replaced                   0
ipv6_create_flow_not_found                        0
ipv6_create_flow_not_found_replaced_expired       0
ipv6_create_flow_not_found_replaced_non_expired   0

 Flow cache flushes
=============================

ipv4_flushes                          4
ipv6_flushes                          0
$ sudo conntrack -C
78

 

Very little activity currently, and according to the metrics @waterside indicated, the ipv4_create_flow_not_found* counters are way too high.

I ran statistics again a couple of speedtest runs later:

 IPv4 flow creation dynamics
=============================

ipv4_create_flow_found                            1886707
ipv4_create_flow_found_replaced                   1886219
ipv4_create_flow_not_found                        73845515
ipv4_create_flow_not_found_replaced_expired       478896
ipv4_create_flow_not_found_replaced_non_expired   11788
# show system conntrack
 expect-table-size 2048
 hash-size 32768
 modules {
     gre {
         disable
     }
     h323 {
         disable
     }
     pptp {
         disable
     }
     sip {
         disable
     }
 }
 table-size 262144
 tcp {
     half-open-connections 512
     loose disable
     max-retrans 3
 }

Not sure why the statistics report a table_size of 8192 when the conntrack configuration (intended to be at default values) is way higher. [edit: conntrack table-size != ipv4 offload table-size]

 

 

Sooooo, any suggestions?

 

 

 

On a side note, I'm linking this thread to another one where people seem to have a possibly similar issue (but appearing way sooner than 30 days of uptime).

Veteran Member
Posts: 7,602
Registered: ‎03-24-2016
Kudos: 1977
Solutions: 871

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

Maybe the table size setting requires a reboot, or it is set to a value higher than this platform can handle.

Does setting it to 16k or 32k increase the value shown in offload statistics?

Member
Posts: 238
Registered: ‎04-09-2013
Kudos: 89
Solutions: 6

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

[ Edited ]

@16again wrote:

Maybe the table size setting requires a reboot, or it is set to a value higher than this platform can handle.

Does setting it to 16k or 32k increase the value shown in offload statistics?



Got the answer here (hidden by a spoiler tag): conntrack table-size != ipv4 offload table-size

Emerging Member
Posts: 62
Registered: ‎04-28-2014
Kudos: 26
Solutions: 1

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

[ Edited ]

Mine's fallen off again after about 30 days.  CPU is up, and VLAN throughput is shown correctly on the dashboard (indicating offloading is not working... sadly).

 

Any info I should collect before rebooting?

 

edit: I have 2 ER Lite3s and both appear to have fallen off offload.  Only one has VLANs; the other just has high CPU and bypass packets.

 

I have not yet tried changing any of the offload parameters, as this seems to reset the offload engine???

 

edit2:  I have increased the size of the offload table before; the first time it dropped off, the ipv4 table was set to 131072.  I thought this ridiculous size might be an issue, and dropped it to 32768.  conntrack -C has never shown more than 12k tracked flows under my highest loads; increasing the offload table was just a "because I can".

Member
Posts: 238
Registered: ‎04-09-2013
Kudos: 89
Solutions: 6

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime


@gritech wrote:

Mine's fallen off again after about 30 days.  CPU is up, and VLAN throughput is shown correctly on the dashboard (indicating offloading is not working... sadly).

 

Any info I should collect before rebooting?

 

edit: I have 2 ER Lite3s and both appear to have fallen off offload.  Only one has VLANs; the other just has high CPU and bypass packets.

 

I have not yet tried changing any of the offload parameters, as this seems to reset the offload engine???


Erf. Nice to have confirmation that it happens on rather simple setups too, though.

Try clicking the "escalate" button on the first post of this thread, to get more attention from the UBNT tech guys on this matter.

 

On the "reset the offload engine" side: I think I tried the "disable/commit/enable/commit" steps on the ipv4 forward+vlans features the first time it occured, without any luck. Next time I'll try to mess with additionnal offload parameters to provoke a "reset", but I'm not sure they are applied at run time (if they are kernel modules parameters, they may only be applied at boot when the modules are loaded).

Senior Member
Posts: 3,234
Registered: ‎08-06-2015
Kudos: 1383
Solutions: 186

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime


@gritech wrote:

 

edit2:  I have increased the size of the offload table before; the first time it dropped off, the ipv4 table was set to 131072.  I thought this ridiculous size might be an issue, and dropped it to 32768.  conntrack -C has never shown more than 12k tracked flows under my highest loads; increasing the offload table was just a "because I can".


I'd wonder about the workload - that just might be the nature of the traffic in your environment, where a very large number of short-lived sessions are constantly being created.  That would impose a different type of load on the router than a large number of persistent sessions.  It perhaps seems similar to @elgo's case too, and might help narrow down a possible cause for the symptoms.

 

Yes, the offloading and the firewall (conntrack) have their own independent notions of flows but observing the statistics on both can be helpful.

 

Emerging Member
Posts: 62
Registered: ‎04-28-2014
Kudos: 26
Solutions: 1

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

The VLAN throughput showing up correctly in the dashboard is definitive to me -- it only happens when offload is not active.

Emerging Member
Posts: 62
Registered: ‎04-28-2014
Kudos: 26
Solutions: 1

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

[ Edited ]

"Resetting" the offload engine by changing the table size (router A: deleted the 32768 entry to return to default) and the flow lifetime (router B: 15 seconds back to blank, the default) instantly fixed the issue.  I did change both settings on both routers, but in opposite order.

 

CPU went from >60% down to <20%, and VLAN traffic throughput stopped showing on the dashboard after the change.

 

This was with ~200 Mbps traffic from 150 users.

 

Will leave at defaults and see if this holds, but it takes a month or so.

 

edit1:

 

Looking at the offload stats, I see 2 table flushes (1 from each parameter change/commit?), but not a complete reset.  Speculation: when the offload module was modified to allow more flows, the mechanism that removes old entries stopped working correctly, and "orphaned" entries build up over time.

Member
Posts: 724
Registered: ‎09-13-2018
Kudos: 137
Solutions: 48

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime


@gritech wrote:

Looking at the offload stats, I see 2 table flushes (1 from each parameter change/commit?), but not a complete reset.  Speculation: when the offload module was modified to allow more flows, the mechanism that removes old entries stopped working correctly, and "orphaned" entries build up over time.


 

Interesting observation and hypothesis.  

 

Did free memory change?  Next time you repeat this "reset", you might want to look at free mem before and after.

Member
Posts: 238
Registered: ‎04-09-2013
Kudos: 89
Solutions: 6

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

[ Edited ]

+1, very valuable tests indeed.

Next step would be to get the UBNT gentlemen to reproduce it in their lab: @UBNT-afomins @UBNT-sandisn can you have a look at this, please?

As for my workload, I've had very low traffic and very few simultaneous flows for a couple of months now (my I2P router is down). That would explain why I have to wait nearly a month of uptime while other users see the issue after a couple of days (if the issue is a borked flow aging/expiration mechanism). I guess I can try lowering the offload table size to artificially reproduce it faster and run the same tests as @gritech.

Emerging Member
Posts: 62
Registered: ‎04-28-2014
Kudos: 26
Solutions: 1

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

 

BuckeyeNet: Free memory increased, but I also shrunk the table back to default.

 

With the large tables, memory use increases somewhat right away, then grows to a limit (64% RAM used per the dashboard with table size 262144).  When I had this table size, I changed the flow lifetime and the memory usage immediately dropped, then grew again.

 

The release notes for the table size mention a static memory usage based on the table size -- presumably the "table size (bytes)" figure (the immediate increase) -- but the offload stats also show "flows_max (bytes)", which seems to grow to a limit over time as the table is used.  I have not watched how long it takes to grow, but it had grown back by the next time I logged in a couple of days later.
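
Some back-of-the-envelope math on elgo's stats output above (my own extrapolation, assuming the per-bucket numbers scale linearly with the configured table size):

    table size (bytes) / table_size (buckets) = 1048576 / 8192 = 128 bytes/bucket (static)
    flows_max (bytes)  / table_size (buckets) = 4915200 / 8192 = 600 bytes/bucket (grown limit)

So a 262144-bucket table would be ~32 MB static plus up to ~150 MB of flow entries as flows_max grows, which is in the right ballpark for the RAM growth I saw.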

 

Just to clarify after the above: offloading did not break as the memory usage grew and hit its limit.  The two times I have observed the breakage were after about a month of uptime.

 

Occurrence 1:  Noticed during an upgrade; see my previous post in this thread.

 

Occurrence 2:  I saw this thread and had been logging in every couple of days to check.  Observed at 35 days uptime, but I hadn't checked in a few days, so I can't say exactly when it happened.

 

Possibly related to an issue jms33 is seeing with 2.0.0 b1?  I know that's a new kernel & module, but if it's due to offload module modifications, it could be related.

Member
Posts: 238
Registered: ‎04-09-2013
Kudos: 89
Solutions: 6

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

[ Edited ]

@gritech wrote:

Possibly related to an issue jms33 is seeing with 2.0.0 b1?  I know that's a new kernel & module, but if it's due to offload module modifications, it could be related.


Linking his issue for reference, @jms33's posts in the beta forum: (1) and (2)

Ubiquiti Employee
Posts: 545
Registered: ‎01-06-2017
Kudos: 192
Solutions: 20

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

@elgo

 

We are aware of it and are looking into it. Thanks!

Emerging Member
Posts: 46
Registered: ‎08-27-2016
Kudos: 11
Solutions: 1

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

I just wanted to chime in and confirm that this happens for me too. It seems to be an issue in one of the recent versions.

 

I did collect some logs via the "download support file" feature, if any of those are of use. Had I seen this thread before, I would have checked whether there was anything in particular you guys wanted me to collect. :/

New Member
Posts: 1
Registered: ‎12-02-2018

Re: [ERL3 - Offloading] Hardware Offloading "breaking" after >30days uptime

This has been happening for me as well for the past few months. Usually it's at exactly 30 days of uptime (today it was 1 month 11 hours). On a 1 gig up/down connection I notice it when downloads run at about 25 MB/sec instead of ~100 MB/sec; when I log into the router it shows 100% CPU usage. I just increased the ipv4 table size to 16k and downloads were immediately back to normal speed. Rebooting also solves the issue.
