Member
Posts: 157
Registered: 11-28-2013
Kudos: 121
Solutions: 7

Crash/reboot - skb_under_panic

Got our first ERL this week - nice little box, decent performance, nice CLI, OK GUI (had some browser compatibility issues).

Yesterday, we swapped it in to replace an old Cisco 3660 that happened to be using just three 10/100 interfaces. (Perfect match!)

It ran great until about 10 AM this morning, and then the kernel panicked and the box rebooted...

Any ideas?

It's running 1.3.0, is running basic firewall rules, is doing SNAT, has a VLAN configured, and has IPv4 forwarding/VLAN offload loaded and enabled (show ubnt offload shows both enabled).

There was a serial console attached when it crashed, so panic/reboot log is attached, as is a (sanitized) copy of the config.

Member

Re: Crash/reboot - skb_under_panic

...and two more times today, with almost-identical crashes (attached).

I've disabled offloading completely now, as that seems like the most probable culprit - we'll see how tomorrow goes.

Previous Employee
Posts: 13,551
Registered: 06-10-2011
Kudos: 5429
Solutions: 1656
Contributions: 2

Re: Crash/reboot - skb_under_panic

If I remember correctly, there was a previous report of a similar crash (skb_under_panic), but it only happened once and was not reproducible in that case. Yes, please do let us know whether offload turns out to be the issue. Another thing you could try is to test the new alpha release (currently v1.4.0alpha2), which has a new kernel (3.4), and see if the behavior is different. Thanks for reporting the issue!

Member

Re: Crash/reboot - skb_under_panic

Well, with offload disabled, it's been up for over 14 hours so far, vs. 3 crashes in 30 hours before.

I did put in a support request last night about this, requesting access to 1.4.0 to test this - haven't heard back yet.

Looking further into it, I suspect the problem is that the 2.6.32.13 kernel does not include the patch (related to ARP resolution) listed here: http://lkml.org/lkml/2012/10/6/10 (it is included in 3.4, so 1.4.0 will probably take care of this). I'm not sure how that interacts with the Cavium offload module.

This is perhaps related to why we can reproduce it more frequently: we have a moderately large number of devices running on networks directly connected to the ERL. The default settings for /proc/sys/net/ipv4/neigh/default/gc_thresh1, gc_thresh2 and gc_thresh3 are 128, 512 and 1024. Most small networks will never exceed gc_thresh1, so their ARP cache will never be garbage-collected, and this code path will rarely (if ever) be touched.

We are always running above gc_thresh1, and frequently over gc_thresh2 - so the ARP cache is being frequently pruned and re-populated, providing a much greater chance of hitting the retry path in neigh_resolve_output()/neigh_connected_output() that appears to cause the skb_under_panic crash.
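For anyone who wants to check where their own box sits relative to those thresholds, here is a rough sketch in Python 3. It assumes the standard Linux procfs files are readable and uses the row count of /proc/net/arp as an approximation of the neighbour table size (the kernel's internal entry count can differ). The function names are made up for illustration, and the stock 1.3.0 image may not ship Python 3, so this is probably easier to try on another Linux machine first.

```python
from pathlib import Path

# Standard Linux procfs locations; the thresholds path is the one quoted above.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")


def read_thresholds(neigh_dir=NEIGH_DIR):
    """Return (gc_thresh1, gc_thresh2, gc_thresh3) as integers."""
    return tuple(
        int((neigh_dir / f"gc_thresh{i}").read_text().strip()) for i in (1, 2, 3)
    )


def count_arp_entries(arp_text):
    """Count rows in /proc/net/arp-style text; the first line is a header."""
    rows = [line for line in arp_text.splitlines() if line.strip()]
    return max(len(rows) - 1, 0)


def classify(entries, thresholds):
    """Place an entry count relative to the three thresholds."""
    t1, t2, t3 = thresholds
    if entries < t1:
        return "below gc_thresh1"
    if entries < t2:
        return "between gc_thresh1 and gc_thresh2"
    if entries < t3:
        return "between gc_thresh2 and gc_thresh3"
    return "at or above gc_thresh3"


if __name__ == "__main__":
    try:
        thresholds = read_thresholds()
        entries = count_arp_entries(ARP_TABLE.read_text())
    except (OSError, ValueError) as exc:
        print(f"could not read neighbour table info: {exc}")
    else:
        print(f"ARP entries: {entries}, thresholds: {thresholds}")
        print(classify(entries, thresholds))
```

Running it a few times over a busy period should show whether the count hovers above gc_thresh1 or gc_thresh2, which is the situation described above.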

 

Previous Employee

Re: Crash/reboot - skb_under_panic

Yeah, you should have access to the beta forum now, so maybe you could give the current 1.4.0 alpha a try. Thanks for providing the link and looking into the details!
