11-28-2013 11:55 AM - edited 11-28-2013 10:07 PM
Got our first ERL this week - nice little box, decent performance, nice CLI, ok GUI (had some browser compat issues).
Yesterday, swapped it in to replace an old Cisco 3660 that happened to be using just three 10/100 interfaces. (Perfect match!)
It ran great until about 10 AM this morning, and then kernel panicked and rebooted...
It's running 1.3.0 with basic firewall rules, SNAT, and a VLAN configured, and has IPv4 forwarding and VLAN offload loaded and enabled (show ubnt offload shows both enabled).
There was a serial console attached when it crashed, so panic/reboot log is attached, as is a (sanitized) copy of the config.
11-28-2013 10:06 PM
11-29-2013 09:04 AM
If I remember correctly there was a previous report of a similar crash (skb_under_panic), but it only happened once and was not reproducible in that case. Yes, please do let us know if offload may be the issue. Another thing you could try is the new alpha release (currently v1.4.0alpha2), which has a new kernel (3.4), to see if the behavior is different. Thanks for reporting the issue!
11-29-2013 01:30 PM - edited 11-29-2013 01:31 PM
Well, with offload disabled, it's been up for over 14 hours so far... vs 3 crashes in 30 hours before.
I did put in a support request last night about this, asking for access to 1.4.0 so I can test it - haven't heard back yet.
Looking further into it, I suspect the problem is caused by the 1.3.0 kernel not including the patch (related to ARP resolution) listed here: http://lkml.org/lkml/2012/10/6/10 (it is included in 3.4, so 1.4.0 will probably take care of this). I'm not sure how that interacts with the Cavium offload module.
This may be why we can reproduce it more often than others - we have a moderately large number of devices running on networks directly connected to the ERL. The default settings for /proc/sys/net/ipv4/neigh/default/gc_thresh[1,2,3] are 128/512/1024 - most small networks will never exceed gc_thresh1, so their ARP cache will never be garbage-collected, and this code path will rarely (if ever) be touched.
We are always running above gc_thresh1, and frequently over gc_thresh2 - so the ARP cache is being frequently pruned and re-populated, providing a much greater chance of hitting the retry path in neigh_resolve_output()/neigh_connected_output() that appears to cause the skb_under_panic crash.
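For anyone who wants to check whether their own network sits in the same range, here is a rough Python sketch that compares the ARP cache size against the three thresholds. The function names are made up for illustration, the row count from /proc/net/arp is only an approximation of the kernel's neighbour table size, and the wording of each band is my reading of the gc_thresh behaviour, not something taken from the EdgeOS docs.

```python
from pathlib import Path

# Defaults quoted above for gc_thresh1 / gc_thresh2 / gc_thresh3.
DEFAULT_THRESHOLDS = (128, 512, 1024)
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def count_arp_entries(arp_table_text):
    """Count data rows in text formatted like /proc/net/arp (header skipped)."""
    rows = arp_table_text.splitlines()[1:]
    return sum(1 for row in rows if row.strip())


def classify(entries, thresholds=DEFAULT_THRESHOLDS):
    """Say which threshold band a given ARP cache size falls into."""
    t1, t2, t3 = thresholds
    if entries < t1:
        return "below gc_thresh1: cache is not garbage-collected"
    if entries < t2:
        return "gc_thresh1 reached: periodic garbage collection may prune entries"
    if entries < t3:
        return "gc_thresh2 reached: soft limit exceeded, pruning gets aggressive"
    return "gc_thresh3 reached: hard limit, new entries can be refused"


def report_live():
    """Read the live values on a Linux box (hypothetical helper, not called here)."""
    thresholds = tuple(
        int((NEIGH_DIR / f"gc_thresh{n}").read_text()) for n in (1, 2, 3)
    )
    entries = count_arp_entries(Path("/proc/net/arp").read_text())
    return f"{entries} ARP entries, thresholds {thresholds}: {classify(entries, thresholds)}"
```

Calling report_live() from a shell on the router (assuming Python is available there, which I have not checked on the ERL) prints a one-line summary. On a small network it should land in the first band, while ours would usually show the second or third.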