12-15-2014 10:45 AM
Over the last week or so one of our EdgeRouters (firmware version 1.6.0) has displayed some odd behaviour. The router becomes unresponsive to all forms of access requests (ssh or https) on all interfaces. This is preceded by snmp showing a jump to 100% CPU utilization.
The first occurrence was early on the 8th of December. We had a field technician in the area, and he rebooted the router after we lost access. During that event the CPU jumped to 100%, then the router stopped responding to snmp requests.
The second time no one was around to perform a reboot, so the router stayed in this state for a more extended period. During this event the device continued to respond to snmp requests, and I had an active ssh session that I had left open from earlier in the week.
I had a chance to grab some diagnostics before it froze completely.
"ps aux" didn't show anything consuming that much CPU time, nor did top.
Then I attempted to log in via https. I could access the webpage, but could not log in. At that point I attempted to reboot (sudo reboot) from the existing ssh session, at which point the shell locked up and became unresponsive. The router ceased to respond to snmp and did not reboot, and I can no longer load the https login page.
Weirdly, the routing functions have continued unhindered during both occurrences: the router passes traffic, routes between interfaces, answers DHCP requests, and responds to ICMP without any trouble.
It was upgraded to 1.6.0 quite a while ago and hasn't had any issues from then until now. We have 2 other EdgeRouters deployed with similar configurations without any trouble.
I can add the config once I get someone out to the area to power cycle the router so I can get back in.
Has anyone seen anything like this, or have any thoughts on what might be going on here?
12-16-2014 11:03 AM
If the CPU goes up to 100%, it would help to know which process is using the CPU. Is that information available, or do the SSH sessions freeze before you can check, for example?
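If nothing stands out in ps or top, it may also be worth checking where the kernel itself accounts the CPU time. Below is a rough sketch, not anything specific to EdgeOS: it reads the aggregate "cpu" line from /proc/stat twice and prints the share of time spent in each category (user, system, iowait, irq, softirq and so on). The function names are made up for illustration, and it assumes either that a Python interpreter is available on the router or that the /proc/stat output can be copied to a machine that has one.

```python
import time

# Field order of the aggregate "cpu" line in Linux /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return a dict of tick counters taken from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Older kernels may report fewer fields; zip stops at the shorter one.
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_breakdown(before, after):
    """Percentage of elapsed CPU time per category between two snapshots."""
    deltas = {name: after.get(name, 0) - before.get(name, 0) for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots are identical or out of order")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample(interval=5.0, path="/proc/stat"):
    """Take two snapshots 'interval' seconds apart and return the breakdown."""
    with open(path) as handle:
        before = parse_cpu_line(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = parse_cpu_line(handle.read())
    return cpu_breakdown(before, after)


if __name__ == "__main__":
    for name, percent in sorted(sample().items(), key=lambda item: -item[1]):
        print("%-8s %5.1f%%" % (name, percent))
```

A high iowait or softirq share combined with low user and system time would suggest the load is not coming from any single process, although that is only a guess until the numbers are in.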
12-16-2014 12:44 PM
The ps aux and top screenshots were taken from the device before the ssh session locked up. I couldn't find a process consuming the CPU time snmp was indicating.
When the field tech eventually got out there to power cycle the router, it did not come back up. We removed it, brought it back to the office, and tried to reset it, to no avail. I haven't had a chance to see if I can get in via the console port yet, but at the moment it appears to be bricked.