Member
Posts: 118
Registered: ‎05-07-2014
Kudos: 40
Solutions: 6

NMI watchdog

Around 19:27 today, my ERL seems to have crashed and rebooted.  On my serial console, I found this:

erl@erl:~$
*** NMI Watchdog interrupt on Core 0x00 ***
        $0      0x0000000000000000      at      0x0000000010108ce1
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0x0000000000000000
        a2      0x000000000000216d      a3      0x0000000000000000
        a4      0x0000000000000000      a5      0x80000004101fcbd8
        a6      0x0000000000000001      a7      0x0000000000000000
        t0      0x0000000000000400      t1      0x000000000001c000
        t2      0x000000000000000c      t3      0x8000000418edc000
        s0      0x0000000000000000      s1      0xffffffffc0010988
        s2      0xffffffff805a0000      s3      0x0000000000000001
        s4      0x0000000000000000      s5      0xffffffffc0010000
        s6      0x0000000000000005      s7      0xffffffffc0010000
        t8      0x00000000000002c0      t9      0xffffffff801a2b80
        k0      0x0000000000000000      k1      0x0000000000000001
        gp      0x800000041cf84000      sp      0x800000041cf87c10
        s8      0xffffffffc0017c30      ra      0xffffffffc00181c0
        err_epc 0xffffffff804b5efc      epc     0xffffffffc007a660
        status  0x0000000010588ce4      cause   0x0000000040808c08
        sum0    0x000400f100008000      en0     0x0100400500008000
*** Chip soft reset soon ***
 
Looking for valid bootloader image....
Jumping to start of image at address 0xbfc80000
 
U-Boot 1.1.1 (UBNT Build ID: 4567941-g15e9b5d) (Build time: Jun  4 2013 - 14:52:06)

 followed by a normal reboot.  It had been running without problems for several weeks, and still seems to be running OK.  So does anyone know what this means?  Is it an indicator of a hardware problem, or a software one?

SuperUser
Posts: 19,082
Registered: ‎09-17-2013
Kudos: 4791
Solutions: 1344

Re: NMI watchdog

The NMI Watchdog is a mechanism that raises a non-maskable interrupt (i.e. one that software cannot mask or ignore) when the system appears to have locked up, and logs the register state when that happens.

 

Since you've got the output there, perhaps @UBNT-stig or @UBNT-ancheng can have a look and/or get it to the right people to see what caused it.

 

This link has some more info on the NMI Watchdog process.

Previous Employee
Posts: 13,551
Registered: ‎06-10-2011
Kudos: 5411
Solutions: 1657
Contributions: 2

Re: NMI watchdog

There have been a couple of reports of a similar issue, but even in those cases it seems to happen very rarely, and we have not found any way to trigger it even after working with those community members (providing a debug kernel, etc.). So first we still need to find a way to trigger the issue reliably, and then we can look into it.

Member
Posts: 118
Registered: ‎05-07-2014
Kudos: 40
Solutions: 6

Re: NMI watchdog

Since all that happens is a reboot, and I have only seen it once, from my point of view it is not a bad problem and does not need urgent fixing.  Please let me know if there is anything I can do to help track it down; otherwise I will just wait and see whether it ever happens again.

Member
Posts: 118
Registered: ‎05-07-2014
Kudos: 40
Solutions: 6

Re: NMI watchdog

I have just had another NMI watchdog:

 

*** NMI Watchdog interrupt on Core 0x01 ***
        $0      0x0000000000000000      at      0x0000000010108ce1
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0xffffffffffffffff
        a2      0x000000000000e039      a3      0x0000000000000000
        a4      0x0000000000000000      a5      0x800000041c0bf470
        a6      0x0000000000000001      a7      0x0000000000000000
        t0      0x0000000000000400      t1      0x000000000001c000
        t2      0x000000000000000c      t3      0x800000041c0ac000
        s0      0x0000000000000000      s1      0xffffffffc0010988
        s2      0xffffffff805a0000      s3      0x0000000000000001
        s4      0x0000000000000000      s5      0xffffffffc0010000
        s6      0x0000000000000005      s7      0xffffffffc0010000
        t8      0x00000000006d54f0      t9      0xffffffff801a2b80
        k0      0x0000000000000000      k1      0x0000000000000001
        gp      0x800000041c0c0000      sp      0x800000041c0c3c10
        s8      0xffffffffc0017c30      ra      0xffffffffc00181c0
        err_epc 0xffffffff804b5f0c      epc     0xffffffffc007a660
        status  0x0000000010588ce4      cause   0x0000000040808808
        sum0    0x000000f000008000      en0     0x0000000100000000
*** Chip soft reset soon ***

Looking for valid bootloader image....
Jumping to start of image at address 0xbfc80000

I happened to be working on my Windows PC at the time, which generates most of the traffic to and from the Internet, so I can say that traffic was heavy, with both inbound and outbound bandwidth pretty much saturated.  But that normally happens for several hours every day, so it is nothing new.

The one new thing I can think of that may be triggering the NMI watchdogs is that I have very recently added port forwarding / DNAT rules and an SNAT masquerade rule.  I have just signed up for a UFB optical fibre Internet connection now that it is available on my street, which will involve the ERL becoming the border router that makes my Internet connection.  At the moment, the ERL is behind a Cisco 877 that makes an ADSL 2+ connection, and the 877 has been doing all the NAT work.  Since signing up for fibre, I have started adding to the ERL config all the things the 877 has been doing that will be needed on fibre, starting with NAT.  It is only since I added those rules that I have seen the NMI watchdogs happening.

Member
Posts: 113
Registered: ‎03-15-2014
Kudos: 63
Solutions: 4

Re: NMI watchdog

This is happening to me as well.  I don't have a console cable to log the console, but in another thread the reboot "issue" seems to be the NMI Watchdog as well.  I've noticed the "issue" ever since I purchased the router (ER-PoE), with 1.4.0, 1.4.1, 1.5.0, and 1.6.0.  See the link below for a similar thread in which @UBNT-ancheng also participated; @oranenj's post there includes the log with the NMI Watchdog error.

 

https://community.ubnt.com/t5/EdgeMAX/Suddenly-Restart/m-p/928158#M36640

 

 

Member
Posts: 113
Registered: ‎03-15-2014
Kudos: 63
Solutions: 4

Re: NMI watchdog

@UBNT-ancheng, is there a way to disable the NMI Watchdog?

Member
Posts: 113
Registered: ‎03-15-2014
Kudos: 63
Solutions: 4

Re: NMI watchdog

To better understand the NMI Watchdog, I found this on another website:

 

The NMI watchdog monitors system interrupts and initiates a reboot if the system 
appears to have hung.  On a normal system hundreds of device and timer interrupts 
are received per second. If there are no interrupts in a 5 second interval, the NMI 
watchdog assumes that the system has hung and initiates a system reboot.

 

That said, it seems that some process in the router hangs and the NMI Watchdog, doing its job, reboots the router.  As far as I can tell, the problem is not within the NMI Watchdog itself but with a process that for some reason hangs or stops, after which the NMI Watchdog goes ahead and reboots the unit.  In another thread I found this log, which looks fairly complete and may help the UBNT folks identify which process is stopping and triggering the reboot.  Alternatively, maybe the watchdog timers are set too low, making it think a process has hung when that is not the case.

 

https://community.ubnt.com/ubnt/attachments/ubnt/EdgeMAX/49734/1/consolelog.txt

 

What do you think, @UBNT-ancheng?  Does this make sense to you?

Member
Posts: 118
Registered: ‎05-07-2014
Kudos: 40
Solutions: 6

Re: NMI watchdog


mcmpr wrote:

This is happening to me as well.  I don't have a console cable to log the console, but in another thread the reboot "issue" seems to be the NMI Watchdog as well.  I've noticed the "issue" ever since I purchased the router (ER-PoE), with 1.4.0, 1.4.1, 1.5.0, and 1.6.0.  See the link below for a similar thread in which @UBNT-ancheng also participated; @oranenj's post there includes the log with the NMI Watchdog error.

 

https://community.ubnt.com/t5/EdgeMAX/Suddenly-Restart/m-p/928158#M36640

 

 


My NMI watchdog is not the same as the one on that thread.  I only get the NMI watchdog happening, not the "skbuff: skb_under_panic" bits.

Previous Employee
Posts: 13,551
Registered: ‎06-10-2011
Kudos: 5411
Solutions: 1657
Contributions: 2

Re: NMI watchdog

Yes, the NMI triggering is basically the "result", i.e., something hangs, causing the kernel to stop "poking" the watchdog; that triggers the NMI after some time, and the system reboots. Also, fe31nz is correct that the two cases are most likely not the same, given the different output (or lack thereof) before the NMI message.
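
As a rough illustration of the "poking" described above, here is a minimal Python sketch of a watchdog timer. It is a toy model, not the Octeon hardware or the kernel driver: the function name `first_expiry`, the lists of poke times, and the 5-second timeout (taken from the description quoted earlier in this thread) are all assumptions made for the example.

```python
def first_expiry(poke_times, timeout_s, end_s):
    """Return the time at which a toy watchdog would fire, or None.

    poke_times: times (seconds) at which the kernel "pokes" the watchdog.
    timeout_s:  how long the watchdog waits without a poke before firing.
    end_s:      time at which the simulation stops.
    The watchdog is assumed to be armed at time 0.
    """
    last_poke = 0.0
    for t in sorted(poke_times):
        if t > end_s:
            break
        # A gap at least as long as the timeout means the watchdog fired
        # before this poke arrived.
        if t - last_poke >= timeout_s:
            return last_poke + timeout_s
        last_poke = t
    # No further pokes: the watchdog fires if enough time remains.
    if end_s - last_poke >= timeout_s:
        return last_poke + timeout_s
    return None


# Hypothetical scenarios: one poke per second for a full minute,
# versus pokes that stop after 20 seconds (a simulated hang).
healthy = first_expiry(range(1, 61), timeout_s=5, end_s=60)
hung = first_expiry(range(1, 21), timeout_s=5, end_s=60)
print(healthy)  # None
print(hung)     # 25
```

In this model the watchdog itself never causes the problem; it only fires once the pokes stop, which matches the point that the NMI is the result rather than the cause.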

Emerging Member
Posts: 67
Registered: ‎01-14-2015
Kudos: 12

Re: NMI watchdog

I am a two week owner of a PoE and this has happened to me twice now. Firmware 1.6.0 (received with 1.2.x but upgraded immediately).

 

The first time I was logged in via SSH and I saw it happening but was not able to preserve the log. Since then I've hooked up a serial console cable. It happened again at what appears to be exactly 10:00AM this morning (which may be total coincidence). The two were about a week apart. Here is the complete log from this last one:

*** NMI Watchdog interrupt on Core 0x00 ***
        $0      0x0000000000000000      at      0x0000000000000001
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0x0000000000000015
        a2      0x0000000000002175      a3      0x8000000001678248
        a4      0x8000000081139000      a5      0x0000000000000010
        a6      0x0000000000000000      a7      0x0000000000000000
        t0      0xffffffffffffffff      t1      0x0000000000000000
        t2      0x0000000000000000      t3      0x0000000073c78000
        s0      0xffffffffc000da30      s1      0xffffffffc0010000
        s2      0xffffffffc00151c0      s3      0x800000041d72e000
        s4      0xfffffffffffffff4      s5      0x0000000000001000
        s6      0x0000000073c68000      s7      0xffffffffc000d998
        t8      0x0000000000080000      t9      0x0000000076e591a4
        k0      0x0000000072b06930      k1      0x0f00000010712a87
        gp      0x800000041d054000      sp      0x800000041d057d50
        s8      0xffffffffc00151c0      ra      0xffffffffc00173bc
        err_epc 0xffffffff804b5f04      epc     0xffffffffc0017378
        status  0x0000000010588ce4      cause   0x0000000040808c08
        sum0    0x000000f000008000      en0     0x0100400500008000
*** Chip soft reset soon ***
 
*** NMI Watchdog interrupt on Core 0x01 ***
        $0      0x0000000000000000      at      0x0000000010108ce1
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0x0000000000000000
        a2      0x0000000000002174      a3      0x0000000000000000
        a4      0x0000000000000000      a5      0x8000000001786200
        a6      0x0000000000000000      a7      0x0000000000000000
        t0      0x0000000000000001      t1      0x000000000001c000
        t2      0x000000000000000c      t3      0x800000041c0ac000
        s0      0x0000000000000000      s1      0xffffffffc0010918
        s2      0xffffffff805a0000      s3      0x0000000000000001
        s4      0x0000000000000000      s5      0xffffffffc0010000
        s6      0x0000000000000006      s7      0xffffffffc0010000
        t8      0x0000000000000002      t9      0xffffffff8014e1b8
        k0      0x0000000000000000      k1      0x0000000000000000
        gp      0x800000041cd78000      sp      0x800000041cd7bc10
        s8      0xffffffffc0017c30      ra      0xffffffffc00181c0
        err_epc 0xffffffff804b5f00      epc     0xffffffff8014c7a0
        status  0x0000000010588ce4      cause   0x0000000040808800
        sum0    0x000000f000008000      en0     0x0000000100000000
*** Chip soft reset soon ***
 
Looking for valid bootloader image....
Jumping to start of image at address 0xbfc80000
 
 
U-Boot 1.1.1 (UBNT Build ID: 4567941-g15e9b5d) (Build time: Jun  4 2013 - 14:51:00)
 
BIST check passed.
UBNT_E100 r1:1, r2:22, f:8/135, serial #: 24A43C3CC416
Core clock: 500 MHz, DDR clock: 266 MHz (532 Mhz data rate)
DRAM:  512 MB
Clearing DRAM....... done
Flash:  8 MB
Net:   octeth0, octeth1, octeth2
 
USB:   (port 0) scanning bus for devices... 1 USB Devices found
       scanning bus for storage devices...
  Device 0: Vendor:          Prod.: USB DISK 2.0     Rev: PMAP
            Type: Removable Hard Disk
            Capacity: 3824.0 MB = 3.7 GB (7831552 x 512)
 0 
reading vmlinux.64
............................
 
5567368 bytes read
argv[2]: coremask=0x3
argv[3]: root=/dev/sda2
argv[4]: rootdelay=15
argv[5]: rw
argv[6]: rootsqimg=squashfs.img
argv[7]: rootsqwdir=w
argv[8]: mtdparts=phys_mapped_flash:512k(boot0),512k(boot1),64k@1024k(eeprom)
ELF file is 64 bit
Allocating memory for ELF segment: addr: 0xffffffff80100000 (adjusted to: 0x100000), size 0x69dfd0
Allocated memory for ELF segment: addr: 0xffffffff80100000, size 0x69dfd0
Processing PHDR 0
  Loading 54dd80 bytes at ffffffff80100000
  Clearing 150250 bytes at ffffffff8064dd80
## Loading Linux kernel with entry point: 0xffffffff804aeb00 ...
Bootloader: Done loading app on coremask: 0x3
Linux version 3.10.20-UBNT (root@ubnt-builder2) (gcc version 4.7.0 (Cavium Inc. Version: SDK_3_1_0_p2 build 34) ) #1 SMP Thu Oct 16 16:29:39 PDT 2014
CVMSEG size: 2 cache lines (256 bytes)
Cavium Inc. SDK-3.1
bootconsole [early0] enabled
CPU revision is: 000d0601 (Cavium Octeon+)
Checking for the multiply/shift bug... no.
Checking for the daddiu bug... no.
Determined physical RAM map:
 memory: 0000000007800000 @ 0000000000800000 (usable)
 memory: 0000000007c00000 @ 0000000008200000 (usable)
 memory: 000000000fc00000 @ 0000000410000000 (usable)
 memory: 000000000050b000 @ 0000000000100000 (usable)
 memory: 0000000000045000 @ 000000000060b000 (usable after init)
Wasting 14336 bytes for tracking 256 unused pages
software IO TLB [mem 0x01707000-0x01747000] (0MB) mapped at [8000000001707000-8000000001746fff]
Zone ranges:
  DMA32    [mem 0x00100000-0xefffffff]
  Normal   [mem 0xf0000000-0x41fbfffff]
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x00100000-0x0064ffff]
  node   0: [mem 0x00800000-0x07ffffff]
  node   0: [mem 0x08200000-0x0fdfffff]
  node   0: [mem 0x410000000-0x41fbfffff]
Primary instruction cache 32kB, virtually tagged, 4 way, 64 sets, linesize 128 bytes.
Primary data cache 16kB, 64-way, 2 sets, linesize 128 bytes.
Secondary unified cache 128kB, 8-way, 128 sets, linesize 128 bytes.
PERCPU: Embedded 10 pages/cpu @8000000001784000 s11648 r8192 d21120 u40960
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 126581
Kernel command line:  bootoctlinux $loadaddr coremask=0x3 root=/dev/sda2 rootdelay=15 rw rootsqimg=squashfs.img rootsqwdir=w mtdparts=phys_mapped_flash:512k(boot0),512k(boot1),64k@1024k(eeprom) console=ttyS0,115200
PID hash table entries: 2048 (order: 2, 16384 bytes)
Dentry cache hash table entries: 65536 (order: 7, 524288 bytes)
Inode-cache hash table entries: 32768 (order: 6, 262144 bytes)
Memory: 499340k/513344k available (3810k kernel code, 14004k reserved, 1352k data, 276k init, 0k highmem)
Hierarchical RCU implementation.
NR_IRQS:255
Calibrating delay loop (skipped) preset value.. 1000.00 BogoMIPS (lpj=5000000)
pid_max: default: 32768 minimum: 501
Security Framework initialized
Mount-cache hash table entries: 256
Checking for the daddi bug... no.
SMP: Booting CPU01 (CoreId  1)...
CPU revision is: 000d0601 (Cavium Octeon+)
Brought up 2 CPUs
NET: Registered protocol family 16
bio: create slab <bio-0> at 0
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
Switching to clocksource OCTEON_CVMCOUNT
NET: Registered protocol family 2
TCP established hash table entries: 4096 (order: 4, 65536 bytes)
TCP bind hash table entries: 4096 (order: 4, 65536 bytes)
TCP: Hash tables configured (established 4096 bind 4096)
TCP: reno registered
UDP hash table entries: 256 (order: 1, 8192 bytes)
UDP-Lite hash table entries: 256 (order: 1, 8192 bytes)
NET: Registered protocol family 1
octeon_pci_console: Console not created.
/proc/octeon_perf: Octeon performance counter interface loaded
HugeTLB registered 2 MB page size, pre-allocated 0 pages
squashfs: version 4.0 (2009/01/31) Phillip Lougher
Registering unionfs 2.5.13 (for 3.10.34)
msgmni has been set to 975
io scheduler noop registered
io scheduler cfq registered (default)
Serial: 8250/16550 driver, 6 ports, IRQ sharing disabled
1180000000800.serial: ttyS0 at MMIO 0x1180000000800 (irq = 34) is a OCTEON
console [ttyS0] enabled, bootconsole disabled
console [ttyS0] enabled, bootconsole disabled
1180000000c00.serial: ttyS1 at MMIO 0x1180000000c00 (irq = 35) is a OCTEON
loop: module loaded
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
OcteonUSB 16f0010000000.usbc: Octeon Host Controller
OcteonUSB 16f0010000000.usbc: new USB bus registered, assigned bus number 1
OcteonUSB 16f0010000000.usbc: irq 56, io mem 0x00000000
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 1 port detected
OcteonUSB: Registered HCD for port 0 on irq 56
usbcore: registered new interface driver usb-storage
octeon_wdt: Initial granularity 5 Sec
TCP: cubic registered
NET: Registered protocol family 17
NET: Registered protocol family 15
L2 lock: TLB refill 256 bytes
L2 lock: General exception 128 bytes
L2 lock: low-level interrupt 128 bytes
L2 lock: interrupt 640 bytes
L2 lock: memcpy 1152 bytes
Bootbus flash: Setting flash for 8MB flash at 0x1f400000
phys_mapped_flash: Found 1 x16 devices at 0x0 in 8-bit bank. Manufacturer ID 0x0000c2 Chip ID 0x0000c9
Amd/Fujitsu Extended Query Table at 0x0040
  Amd/Fujitsu Extended Query version 1.1.
phys_mapped_flash: Swapping erase regions for top-boot CFI table.
number of CFI chips: 1
3 cmdlinepart partitions found on MTD device phys_mapped_flash
Creating 3 MTD partitions on "phys_mapped_flash":
0x000000000000-0x000000080000 : "boot0"
0x000000080000-0x000000100000 : "boot1"
0x000000100000-0x000000110000 : "eeprom"
Waiting 15sec before mounting root device...
usb 1-1: new high-speed USB device number 2 using OcteonUSB
usb-storage 1-1:1.0: USB Mass Storage device detected
scsi0 : usb-storage 1-1:1.0
scsi 0:0:0:0: Direct-Access              USB DISK 2.0     PMAP PQ: 0 ANSI: 6
sd 0:0:0:0: [sda] 7831552 512-byte logical blocks: (4.00 GB/3.73 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] No Caching mode page found
sd 0:0:0:0: [sda] Assuming drive cache: write through
sd 0:0:0:0: [sda] No Caching mode page found
sd 0:0:0:0: [sda] Assuming drive cache: write through
 sda: sda1 sda2
sd 0:0:0:0: [sda] No Caching mode page found
sd 0:0:0:0: [sda] Assuming drive cache: write through
sd 0:0:0:0: [sda] Attached SCSI removable disk
kjournald starting.  Commit interval 3 seconds
EXT3-fs (sda2): warning: maximal mount count reached, running e2fsck is recommended
EXT3-fs (sda2): using internal journal
EXT3-fs (sda2): recovery complete
EXT3-fs (sda2): mounted filesystem with journal data mode
VFS: Mounted root (unionfs filesystem) on device 0:11.
Freeing unused kernel memory: 276K (ffffffff8060b000 - ffffffff80650000)
Algorithmics/MIPS FPU Emulator v1.5
INIT: version 2.88 booting
INIT: Entering runlevel: 2
[ ok ] Starting routing daemon: rib.
[ ok ] Starting EdgeOS router: migrate rl-system configure.
 
Welcome to EdgeOS star-edge ttyS0

 

Previous Employee
Posts: 10,504
Registered: ‎06-09-2011
Kudos: 3072
Solutions: 945
Contributions: 16

Re: NMI watchdog

@dolfs Are you using PoE?  If so, can you run "show interfaces ethernet poe"?

EdgeMAX Router Software Development
Emerging Member
Posts: 67
Registered: ‎01-14-2015
Kudos: 12

Re: NMI watchdog

No, not using it.

eth0 - WAN (Cable) with /29 static IP subnet from ISP

eth1 - Unused, reserved for WAN2

eth2 - VLANS switch0.1, switch0.25 and switch0.100

eth3 - Configured on switch, but unused

eth4 - Configured on switch, but unused

New Member
Posts: 2
Registered: ‎05-24-2014

Re: NMI watchdog

I'm also using a PoE with v1.6.0 that is randomly restarting. On average I'd say it happens once or twice a week. I never experienced this problem with earlier firmware versions.

 

There's another thread in the Beta forum that suggests this may be related to VLAN offloading.

 

*** NMI Watchdog interrupt on Core 0x00 ***
        $0      0x0000000000000000      at      0x0000000050108ce1
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0x000000000000339d
        a2      0x000000000000339e      a3      0x0000000000000000
        a4      0x0000000000000000      a5      0x8000000001790200
        a6      0x0000000000000000      a7      0x0000000000000000
        t0      0x0000000000000001      t1      0x000000000001c000
        t2      0x000000000000000c      t3      0xffffffff805a0000
        s0      0x0000000000000000      s1      0xffffffffc00108a8
        s2      0xffffffff805a0000      s3      0x0000000000000001
        s4      0x0000000000000000      s5      0xffffffffc0010000
        s6      0x0000000000000007      s7      0xffffffffc0010000
        t8      0x00000000006f4c84      t9      0xffffffff8014e1b8
        k0      0x0000000000000000      k1      0x0000000000000010
        gp      0x800000041c084000      sp      0x800000041c087c10
        s8      0xffffffffc0017c30      ra      0xffffffffc00181c0
        err_epc 0xffffffff804b5f10      epc     0xffffffffc00181ac
        status  0x0000000050588ce4      cause   0x0000000040808c00
        sum0    0x000400f000008000      en0     0x0100400500008000
*** Chip soft reset soon ***

*** NMI Watchdog interrupt on Core 0x01 ***
        $0      0x0000000000000000      at      0x0000000000000001
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0x000000000000001a
        a2      0x000000000000339f      a3      0x80000000012b8d08
        a4      0x8000000081143000      a5      0x0000000000000010
        a6      0x0000000000000000      a7      0x0000000000000000
        t0      0xffffffffffffffff      t1      0x0000000000000000
        t2      0x0000000000000000      t3      0x0000000073e53000
        s0      0xffffffffc000da30      s1      0xffffffffc0010000
        s2      0xffffffffc00151c0      s3      0x800000000c516000
        s4      0xfffffffffffffff4      s5      0x0000000000001000
        s6      0x0000000073e43000      s7      0xffffffffc000d998
        t8      0x0000000000080000      t9      0x00000000770341a4
        k0      0x00000000734feaf0      k1      0x800000041ce33fe0
        gp      0x800000041ce30000      sp      0x800000041ce33d50
        s8      0xffffffffc00151c0      ra      0xffffffffc00173bc
        err_epc 0xffffffff804b5f04      epc     0x000000007700d914
        status  0x0000000010588ce4      cause   0x0000000040808820
        sum0    0x000400f000008000      en0     0x0000000100000000
*** Chip soft reset soon ***

 

Emerging Member
Posts: 67
Registered: ‎01-14-2015
Kudos: 12

Re: NMI watchdog

It just happened again at 00:09AM PST, about 14 hours after the previous one. I should mention that earlier this evening I had been messing with the configuration and, luckily for me, I had decided I liked my changes and performed a "save" just beforehand.

 

So while I did not lose my changes, maybe that had something to do with it. One possibility is that I had just configured an IPsec site-to-site VPN where the other site is actually down (vacation home). Is it possible that the VPN code, when it spends too long trying to get an SA established, creates an out-of-memory condition, or that some process ends up hanging in kernel space? I say this because this morning I had one other site-to-site VPN configured, and while the other side there was up, it was not configured quite correctly and thus would never establish the SA during IKE, but instead was logging errors. On the other hand, that config had been sitting there for 3 days already.

 

Just a thought. Below is only the start of my console output, capturing the registers at the time of the reset; the rest of the output is a plain old boot log and should be identical to the previous one (omitted to save space).

*** NMI Watchdog interrupt on Core 0x01 ***
        $0      0x0000000000000000      at      0x0000000010108ce1
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0x0000000000000000
        a2      0x0000000000001b60      a3      0x0000000000000000
        a4      0x0000000000000000      a5      0x800000041c2f7470
        a6      0x0000000000000001      a7      0x0000000000000000
        t0      0x0000000000000400      t1      0x000000000001c000
        t2      0x000000000000000c      t3      0x800000041c0ac000
        s0      0x0000000000000000      s1      0xffffffffc0010918
        s2      0xffffffff805a0000      s3      0x0000000000000001
        s4      0x0000000000000000      s5      0xffffffffc0010000
        s6      0x0000000000000006      s7      0xffffffffc0010000
        t8      0x00000000006d5694      t9      0xffffffff801a2b80
        k0      0x0000000000000000      k1      0x0000000000000001
        gp      0x800000041c218000      sp      0x800000041c21bc10
        s8      0xffffffffc0017c30      ra      0xffffffffc00181c0
        err_epc 0xffffffff804b5efc      epc     0xffffffffc007a660
        status  0x0000000010588ce4      cause   0x0000000040808c08
        sum0    0x000000f100000000      en0     0x0000000100000000
*** Chip soft reset soon ***

 

I really would like some resolution on this because this restart took some 4-5 minutes (longest time waiting was on the line below) and if this happens during waking hours, or better yet while I am on the phone (VOIP), or Skype for work, I don't look particularly good!

 

I'm happy to fiddle with things, install loggers, change so that system logs are saved to permanent storage, whatever.

 

[....] Starting EdgeOS router: migrate rl-system configure

 

Emerging Member
Posts: 67
Registered: ‎01-14-2015
Kudos: 12

Re: NMI watchdog

And, again, 10:10AM PST, 10 hours after previous occurrence. Starting to get real annoying...

 

From the log I can see the system went down close to 10:03 and finally came back 10:10. A 7 minute restart.

 

*** NMI Watchdog interrupt on Core 0x00 ***
        $0      0x0000000000000000      at      0xffffffff80770000
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0x000000000000001a
        a2      0x0000000000008279      a3      0x0000000000000000
        a4      0x0000000000000000      a5      0x800000041c033658
        a6      0x800000041c033660      a7      0x0000000000000400
        t0      0x800000041c087fe0      t1      0x0000000000008c00
        t2      0xffffffff801bbc58      t3      0xffffffff805a0000
        s0      0x0000000000000000      s1      0xffffffffc00108a8
        s2      0xffffffff805a0000      s3      0x0000000000000001
        s4      0x0000000000000000      s5      0xffffffffc0010000
        s6      0x0000000000000007      s7      0xffffffffc0010000
        t8      0x00000000004111f0      t9      0xffffffff8014e1b8
        k0      0x0000000000000000      k1      0x0000000000000001
        gp      0x800000041c084000      sp      0x800000041c087c10
        s8      0xffffffffc0017c30      ra      0xffffffffc00181c0
        err_epc 0xffffffff804b5f04      epc     0xffffffffc007a660
        status  0x0000000050588ce4      cause   0x0000000040808c08
        sum0    0x000000f000008000      en0     0x0100400500008000
*** Chip soft reset soon ***
 
*** NMI Watchdog interrupt on Core 0x01 ***
        $0      0x0000000000000000      at      0x0000000010108ce1
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0x0000000000000000
        a2      0x0000000000008278      a3      0x0000000000000000
        a4      0x0000000000000000      a5      0x800000041c0bf470
        a6      0x0000000000000001      a7      0x0000000000000000
        t0      0x0000000000000400      t1      0x000000000001c000
        t2      0x000000000000000c      t3      0x800000041c0ac000
        s0      0x0000000000000000      s1      0xffffffffc0010918
        s2      0xffffffff805a0000      s3      0x0000000000000001
        s4      0x0000000000000000      s5      0xffffffffc0010000
        s6      0x0000000000000006      s7      0xffffffffc0010000
        t8      0x000000000042657c      t9      0xffffffff801a2b80
        k0      0x0000000000000000      k1      0x0000000000000000
        gp      0x800000041c0c0000      sp      0x800000041c0c3c10
        s8      0xffffffffc0017c30      ra      0xffffffffc00181c0
        err_epc 0xffffffff804b5efc      epc     0xffffffff8014c7a0
        status  0x0000000010588ce4      cause   0x0000000040808800
        sum0    0x000000f000008000      en0     0x0000000100000000
*** Chip soft reset soon ***

 

Emerging Member
Posts: 67
Registered: ‎01-14-2015
Kudos: 12

Re: NMI watchdog

Another one, but with some additional info

 

I decided to leave the console open at all times, running a set of commands once a minute:

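# Log a separator, timestamp, process summary, and memory usage once a minute
while true
do
	echo -----
	date
	show system processes summary   # EdgeOS operational command
	show system memory              # EdgeOS operational command
	sleep 60
done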

 I then got another watchdog event this morning:

Thu Jan 22 08:11:44 PST 2015
 08:11:44 up 10:19,  2 users,  load average: 0.08, 0.05, 0.06
             total       used       free     shared    buffers     cached
Mem:        499616     207616     292000          0      27092     115820
Swap:            0          0          0
Total:      499616     207616     292000
 
*** NMI Watchdog interrupt on Core 0x00 ***
        $0      0x0000000000000000      at      0x0000000000000001
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0x0000000000000001
        a2      0x000000000000df7e      a3      0x8000000001681628
        a4      0x8000000081139000      a5      0x0000000000000010
        a6      0x0000000000000000      a7      0x0000000000000000
        t0      0xffffffffffffffff      t1      0x0000000000000000
        t2      0x0000000000000000      t3      0x0000000073ca3000
        s0      0xffffffffc000da30      s1      0xffffffffc0010000
        s2      0xffffffffc00151c0      s3      0x800000041d9d2000
        s4      0xfffffffffffffff4      s5      0x0000000000001000
        s6      0x0000000073c93000      s7      0xffffffffc000d998
        t8      0x0000000000080000      t9      0x00000000776841a4
        k0      0x0000000076cb1a48      k1      0x800000041cefbfe0
        gp      0x800000041cef8000      sp      0x800000041cefbd50
        s8      0xffffffffc00151c0      ra      0xffffffffc00173bc
        err_epc 0xffffffff804b5f04      epc     0x000000007765d914
        status  0x0000000010588ce4      cause   0x0000000040808c20
        sum0    0x000000f000008000      en0     0x0100400500008000
*** Chip soft reset soon ***
 
*** NMI Watchdog interrupt on Core 0x01 ***
        $0      0x0000000000000000      at      0x0000000010108ce1
        v0      0x0000000000000001      v1      0x0000000000010000
        a0      0xffffffffc0010ae0      a1      0x000000000000df7c
        a2      0x000000000000df7d      a3      0x0000000000000000
        a4      0x0000000000000000      a5      0x800000041c0bf470
        a6      0x0000000000000001      a7      0x0000000000000000
        t0      0x0000000000000400      t1      0x000000000001c000
        t2      0x000000000000000c      t3      0x800000041c0ac000
        s0      0x0000000000000000      s1      0xffffffffc0010918
        s2      0xffffffff805a0000      s3      0x0000000000000001
        s4      0x0000000000000000      s5      0xffffffffc0010000
        s6      0x0000000000000006      s7      0xffffffffc0010000
        t8      0x000000000045d11c      t9      0xffffffff801a2b80
        k0      0x0000000000000000      k1      0x0000000000000001
        gp      0x800000041c0c0000      sp      0x800000041c0c3c10
        s8      0xffffffffc0017c30      ra      0xffffffffc00181c0
        err_epc 0xffffffff804b5f10      epc     0xffffffffc007a660
        status  0x0000000010588ce4      cause   0x0000000040808808
        sum0    0x000000f000008000      en0     0x0000000100000000
*** Chip soft reset soon ***

 

What is interesting here is that the last reported free memory was 292000. I had rebooted around 1AM. I last looked at the router around 1:30AM, when free memory was 319120. I did a few things, one of which was starting the web UI, ran it for a few minutes, and then logged out and closed the window around 321AM. Free memory was 311112 at that time.

 

From there on, free memory was on a steady decline, losing about 34 per minute (on average). This seems to indicate some kind of process is/was slowly eating memory. I have no clue why things should go truly sour with 292000 still free, and that may even be unrelated, but something appears to have been leaking memory.
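
For anyone wanting to check a trend like this from a saved console capture, the following is a minimal Python sketch, assuming the capture has the same layout as the once-a-minute output shown above (a `date` line followed later by a `Mem:` line). The names `parse_samples` and `free_change_per_minute` are made up for this example, and the sample text is just the single block posted above, so it demonstrates the parsing only, not a measured rate.

```python
import re
from datetime import datetime

# One block copied from the console capture posted above.
SAMPLE = """\
Thu Jan 22 08:11:44 PST 2015
 08:11:44 up 10:19,  2 users,  load average: 0.08, 0.05, 0.06
             total       used       free     shared    buffers     cached
Mem:        499616     207616     292000          0      27092     115820
Swap:            0          0          0
Total:      499616     207616     292000
"""

# Matches `date` output such as "Thu Jan 22 08:11:44 PST 2015".
DATE_RE = re.compile(r"^\w{3} (\w{3}) +(\d+) (\d\d:\d\d:\d\d) \w+ (\d{4})$")
# Captures the total, used, and free columns of the "Mem:" line.
MEM_RE = re.compile(r"^Mem:\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_samples(text):
    """Return a list of (timestamp, free_memory) tuples from a capture."""
    samples = []
    stamp = None
    for raw in text.splitlines():
        line = raw.strip()
        m = DATE_RE.match(line)
        if m:
            month, day, clock, year = m.groups()
            stamp = datetime.strptime(
                f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
            )
            continue
        m = MEM_RE.match(line)
        if m and stamp is not None:
            samples.append((stamp, int(m.group(3))))
            stamp = None
    return samples


def free_change_per_minute(samples):
    """Average change in free memory per minute, first to last sample."""
    (t0, f0), (t1, f1) = samples[0], samples[-1]
    minutes = (t1 - t0).total_seconds() / 60
    if minutes <= 0:
        raise ValueError("need two samples taken at different times")
    return (f1 - f0) / minutes


samples = parse_samples(SAMPLE)
print(samples)
```

With a full capture, `free_change_per_minute(parse_samples(text))` gives a negative number when free memory is shrinking. It only compares the first and last samples, so a short-lived spike in between would not show up.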

 

I am wondering if the WebSocket that the UI opens is not properly closed and is implicated?

 

I am going to try not using the UI for a few days and see if I get another crash. If I don't, I'll use it again and see if I get a crash within the day.

Emerging Member
Posts: 99
Registered: ‎01-19-2012
Kudos: 20
Solutions: 1

Re: NMI watchdog

I'm experiencing the same issue on 2 different Edgemax-POE routers running v1.6.0.

One has crashed 4 times since installation in 2015-02; the other has crashed once.

 

The router that crashed 4 times is not using the switch0 function.

The other router is using switch0.

 

Router config:

 - POE-out: off

 - ipv4 forwarding offload: enable

 - ipv4 vlan offload: enable

 

 

From the crash dumps, it appears both devices are being crashed by the same bug

https://docs.google.com/spreadsheets/d/1hH0S270ho9MngeeTL4GCrczYlIu57CNyC_dyW1TChtw/edit?usp=sharing

crash dump.png

 

Attached are config files, version info, and full logs from both devices.
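
A comparison of crash dumps like the one mentioned above could be scripted roughly as follows. This is only a sketch: the function name `parse_dumps` and the regular expressions are assumptions written for this example, and the input is an excerpt (header plus two register lines) from two dumps posted earlier in this thread, not the full logs.

```python
import re
from collections import Counter

# Excerpts from two NMI watchdog dumps posted earlier in this thread.
DUMPS = """\
*** NMI Watchdog interrupt on Core 0x00 ***
        s8      0xffffffffc0017c30      ra      0xffffffffc00181c0
        err_epc 0xffffffff804b5efc      epc     0xffffffffc007a660
*** Chip soft reset soon ***
*** NMI Watchdog interrupt on Core 0x01 ***
        s8      0xffffffffc0017c30      ra      0xffffffffc00181c0
        err_epc 0xffffffff804b5f0c      epc     0xffffffffc007a660
*** Chip soft reset soon ***
"""

HEADER_RE = re.compile(r"NMI Watchdog interrupt on Core (0x[0-9a-fA-F]+)")
# A register name followed by a 64-bit value printed as 16 hex digits.
REG_RE = re.compile(r"(\$?\w+)\s+(0x[0-9a-fA-F]{16})")


def parse_dumps(text):
    """Return one dict per dump, mapping register names to integers."""
    dumps = []
    current = None
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            current = {"core": int(header.group(1), 16)}
            dumps.append(current)
        elif current is not None:
            for name, value in REG_RE.findall(line):
                current[name] = int(value, 16)
    return dumps


dumps = parse_dumps(DUMPS)
# Count how often each epc value appears across the dumps.
epc_counts = Counter(hex(d["epc"]) for d in dumps if "epc" in d)
for epc, count in epc_counts.items():
    print(epc, count)
```

Grouping by `epc` (and optionally `ra`) is one simple way to see whether several dumps stopped at the same address. It does not by itself prove they share a root cause.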

 

Previous Employee
Posts: 10,504
Registered: ‎06-09-2011
Kudos: 3072
Solutions: 945
Contributions: 16

Re: NMI watchdog

We believe the NMI crash has been fixed in v1.7.0 (currently in alpha/beta testing).

EdgeMAX Router Software Development
Previous Employee
Posts: 13,551
Registered: ‎06-10-2011
Kudos: 5411
Solutions: 1657
Contributions: 2

Re: NMI watchdog

If you haven't seen this (or related discussions) already, there is a potential fix for this in the latest alpha version (currently available in the beta forum), so you could give it a try if interested.
