06-13-2013 11:18 AM
I'm just wondering if there is a watchdog feature on the router? It didn't jump out at me in the configuration but I could have missed it.
Asking because I was playing around with my EdgeRouter Lite and something happened and it just died. Had to do a power reset to bring it back up. Only issue was that I wasn't there and had to have my friend go to my house to power cycle it.
I was hoping that if it crashed like that it could detect it and force a reboot.
06-13-2013 12:39 PM
The CPU does have a hardware watchdog that works with the kernel, so I guess the question is how exactly the router "died"? Did it stop responding on the serial console?
Not sure about the serial console, but it no longer responded to pings. I was trying to set up L2TP/IPsec at the time, so I'm not sure if I made some crazy config changes that killed it. It didn't stop responding immediately after I committed; it seemed to be a minute or two later. I guess that will teach me to use the commit-confirm command next time.
06-13-2013 01:01 PM
Yeah, the hardware watchdog is only triggered if the kernel dies and does not "poke" the watchdog for some time. If it's something like a connectivity issue then, like you said, "commit-confirm" or some other solution may work, e.g., a simple script that restarts the router when ping stops working (though that may be dangerous of course).
06-21-2013 06:39 PM - edited 06-21-2013 06:40 PM
I've written a network watchdog script for our routers since we had some issues in the past where a reboot would fix it (back in the old Vyatta days).
It's very basic but does what we need it to. We have a cronjob set to run it every 5 minutes, and it is set up to prevent reboot loops.
#!/bin/bash
rchk=`cat /rebootchk`
if [ $rchk = 0 ] ; then
    sudo /bin/ping -c2 <EXTERNAL_IP> > /dev/null 2>&1
    if [ $? -ne 0 ] ; then
        sudo /bin/ping -c2 <IP_FROM_WAN> > /dev/null 2>&1
        if [ $? -ne 0 ] ; then
            sudo /bin/ping -c2 <IP_FROM_LAN> > /dev/null 2>&1
            if [ $? -ne 0 ] ; then
                echo "1" > /rebootchk
                echo "All pings failed @ `date`" >> /var/log/rchk.log
                sudo /sbin/reboot
            else
                echo "EXTERNAL and WAN didn't ping @ `date`" >> /var/log/rchk.log
            fi
        else
            echo "EXTERNAL didn't ping @ `date`" >> /var/log/rchk.log
        fi
    else
        echo "Good @ `date`" >> /var/log/rchk.log
    fi
else
    echo "No action" >> /var/log/rchk.log
fi
We replaced <EXTERNAL_IP> with google.com, <IP_FROM_WAN> is our datacenter's gateway IP, and <IP_FROM_LAN> is our backup router's IP. As long as it can ping any of those 3 it will not reboot. If it does reboot, we have the following line in /etc/rc.local so it executes on boot:
echo 1 > /rebootchk
Here are the cronjobs we run:
*/5 * * * * /root/rchk.sh
*/30 * * * * echo 0 > /rebootchk
I forget if we needed to install anything to get this to work but if you get errors it'll probably tell you what's missing.
Gotta love linux!
06-22-2013 02:50 PM - edited 06-22-2013 03:04 PM
I would use logger instead of redirection, like "logger NO PING". This way the messages end up in the same place as any other logs, which is especially useful if you log to a remote location too.
Also, I usually increase the ping wait time to something like -W90 to prevent a reboot on temporary glitches, or even ping several external addresses to make sure we didn't just lose connectivity to a specific subnet and that something really is wrong.
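The two suggestions above (logging through logger, and probing several addresses with a longer wait) could be combined roughly as follows. This is only a sketch: the function names `probe_ping` and `any_reachable` and the `rchk` syslog tag are made up for illustration, and `-W90` is simply the wait value mentioned in the post.

```shell
#!/bin/bash
# Sketch only: probe_ping, any_reachable and the "rchk" tag are
# hypothetical names, not part of the script posted earlier.

# Default probe: two pings, waiting up to 90 seconds for each reply.
probe_ping() {
    ping -c2 -W90 "$1" > /dev/null 2>&1
}

# any_reachable PROBE HOST...
# Returns 0 as soon as one host answers the probe, 1 if none do.
any_reachable() {
    local probe="$1"
    shift
    local host
    for host in "$@"; do
        if "$probe" "$host"; then
            return 0
        fi
        # Report the failure to syslog instead of a private log file.
        logger -t rchk "no reply from $host"
    done
    return 1
}
```

It would be called with whatever addresses make sense locally, for example `any_reachable probe_ping <EXTERNAL_IP> <IP_FROM_WAN> <IP_FROM_LAN> || sudo /sbin/reboot`. Passing the probe as an argument keeps the loop easy to try out with a dummy command before letting it reboot anything.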
06-28-2013 09:47 PM
Actually it's not built as a module (CONFIG_CAVIUM_OCTEON_WATCHDOG=y). By default the kernel pokes the watchdog using interrupts (see /proc/interrupts), and the device file (which we don't create by default) can be used to poke the watchdog from userspace among other things.
06-28-2013 10:57 PM
The watchdog device is generic (misc device with minor 130), not specific to octeon_wdt. You might also want to check out the "watchdog" package in Debian, which already provides similar functionality.
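For anyone who wants to poke the watchdog from userspace by hand, a minimal sketch might look like the one below. It assumes that keeping the device open and writing to it is enough to feed the timer, and that once userspace has the device open the hardware will reset the box when the writes stop; the thread does not confirm the second point for this driver. `feed_watchdog` and its arguments are hypothetical names.

```shell
#!/bin/bash
# Sketch only: feed_watchdog is a hypothetical helper.
# Usage: feed_watchdog DEVICE CHECK INTERVAL
#   DEVICE    watchdog device file, e.g. /dev/watchdog
#   CHECK     command or function that returns 0 while the system is healthy
#   INTERVAL  seconds to sleep between writes
feed_watchdog() {
    local dev="$1" check="$2" interval="$3"
    # Refuse to start if the device file is missing or not writable.
    [ -w "$dev" ] || return 1
    # Keep the device open on file descriptor 3 for the whole loop.
    exec 3> "$dev"
    while "$check"; do
        # Any write counts as a poke.
        printf '.' >&3
        sleep "$interval"
    done
    # The check failed: stop writing but leave the descriptor open, so
    # (by assumption) the hardware timer is allowed to run out.
    return 0
}
```

Since the device file is not created by default, it would first have to be made as a character device with the misc major number 10 and minor 130, e.g. `[ -c /dev/watchdog ] || sudo mknod /dev/watchdog c 10 130`. The interval has to be shorter than the watchdog timeout, whatever that is on the hardware in question.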
06-28-2013 11:12 PM
Good to know. Just found out watchdogd even creates it if it doesn't yet exist, nice.
Does the kernel reset the timer unconditionally though? I.e., will watchdogd be able to keep the timer from being reset by the kernel if userspace tests fail?
06-29-2013 07:33 AM
Can everything I need be found in any watchdog driver such as octeon_wdt, or is it split into a hardware-specific part and generic watchdog infrastructure?
That sort of depends on what "everything you need" is? The normal operation is just file open/write/close, but if you need ioctl then the support will vary of course.
06-29-2013 09:31 AM
Oh, I just meant: is the watchdog driver self-contained, or does octeon_wdt contain only the functions needed to, e.g., reset the hardware timer on a specific platform, while things like the /dev/watchdog operations live in some common file? I've read it and found out that it's self-contained, thanks.
I started working on a watchdog daemon CLI wrapper, by the way. Here's the first draft, not yet functional: https://github.com/SO3Group/vyatta-watchdog I hope it will work as expected soon.
06-30-2013 09:16 AM
So I managed to make the watchdog package functional.
You can download the package from http://baturin.org/files/vyatta/packages/vyatta-watchdog_1.3_all.deb or review the source at the link in the post above. The commit and release history is a total mess, sorry for that. I should have gotten more sleep before doing it.
I added some documentation to the README.md so you can see it right at the github link.
I verified it basically works, but I didn't test it thoroughly, so back up your config first and beware of bugs.