Emerging Member
Posts: 49
Registered: ‎09-05-2013
Kudos: 38
Solutions: 2

switches / aps not reconnecting to controller after ~16 hours of WAN downtime

I have a handful of sites running against a cloud controller on DigitalOcean.

 

One of the sites experienced a fiber cut yesterday and dropped offline, from about 2pm to 7am the following day.

 

When connectivity was restored, the gateway immediately reconnected to the cloud controller, but none of the switches and APs did.

 

The site is "up", i.e. the switches and APs are passing traffic and all client devices are working; they just aren't properly checking in with the controller.

 

All the switches are on firmware version 3.9.27.8537.  The cloud controller is version 5.7.23.

 

I VPN'd into the site and SSH'd to one of the switches to investigate, and I found that the logs were full of this:

May 17 15:44:46 sw-301a user.err syslog: ace_reporter.prepare_ethernet_array(): failed to fill ifstat for eth0!, rc: -1
May 17 15:44:46 sw-301a user.err syslog: libubnt_webrtc.get_all_sdp_sessions(): Socket error
May 17 15:44:46 sw-301a user.err syslog: ace_reporter.reporter_fail(): Unknown[11] (http://my-cloud-controller.tld:8080/inform)
May 17 15:44:46 sw-301a user.err syslog: ace_reporter.reporter_fail(): initial contact failed #2334, url=http://my-cloud-controller.tld:8080/inform, rc=11
May 17 15:45:01 sw-301a user.err syslog: ace_reporter.prepare_ethernet_array(): failed to fill ifstat for eth0!, rc: -1
May 17 15:45:01 sw-301a user.err syslog: libubnt_webrtc.get_all_sdp_sessions(): Socket error
May 17 15:45:01 sw-301a user.err syslog: ace_reporter.reporter_fail(): Unknown[11] (http://my-cloud-controller.tld:8080/inform)
May 17 15:45:01 sw-301a user.err syslog: ace_reporter.reporter_fail(): initial contact failed #2335, url=http://my-cloud-controller.tld:8080/inform, rc=11
May 17 15:45:16 sw-301a user.err syslog: ace_reporter.prepare_ethernet_array(): failed to fill ifstat for eth0!, rc: -1
May 17 15:45:16 sw-301a user.err syslog: libubnt_webrtc.get_all_sdp_sessions(): Socket error
May 17 15:45:16 sw-301a user.err syslog: ace_reporter.reporter_fail(): Unknown[11] (http://my-cloud-controller.tld:8080/inform)
May 17 15:45:16 sw-301a user.err syslog: ace_reporter.reporter_fail(): initial contact failed #2336, url=http://my-cloud-controller.tld:8080/inform, rc=11
May 17 15:45:31 sw-301a user.err syslog: ace_reporter.prepare_ethernet_array(): failed to fill ifstat for eth0!, rc: -1
May 17 15:45:31 sw-301a user.err syslog: libubnt_webrtc.get_all_sdp_sessions(): Socket error
May 17 15:45:32 sw-301a user.err syslog: ace_reporter.reporter_fail(): Unknown[11] (http://my-cloud-controller.tld:8080/inform)
May 17 15:45:32 sw-301a user.err syslog: ace_reporter.reporter_fail(): initial contact failed #2337, url=http://my-cloud-controller.tld:8080/inform, rc=11
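For anyone comparing their own logs against these, here is a minimal Python sketch that pulls the attempt counter and retry spacing out of the "initial contact failed" lines. The regular expression and the function names are my own assumptions based only on the lines pasted above, not anything documented by the firmware, and the year 2000 is a placeholder because syslog does not record one.

```python
import re
from datetime import datetime

# Pattern inferred from the log lines in this post (an assumption, not a documented format).
FAIL_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ .*"
    r"initial contact failed #(?P<attempt>\d+), url=(?P<url>\S+), rc=(?P<rc>\d+)"
)

def parse_failures(lines):
    """Return (timestamp, attempt, url, rc) for each 'initial contact failed' line."""
    failures = []
    for line in lines:
        m = FAIL_RE.search(line)
        if m is None:
            continue  # not a failed-contact line
        # syslog omits the year; 2000 is a placeholder so strptime gets a full date.
        stamp = datetime.strptime("2000 " + m["stamp"], "%Y %b %d %H:%M:%S")
        failures.append((stamp, int(m["attempt"]), m["url"], int(m["rc"])))
    return failures

def retry_intervals(failures):
    """Seconds between consecutive failed attempts."""
    return [int((b[0] - a[0]).total_seconds()) for a, b in zip(failures, failures[1:])]

PREFIX = "sw-301a user.err syslog: ace_reporter.reporter_fail(): initial contact failed"
URL = "http://my-cloud-controller.tld:8080/inform"
sample = [
    f"May 17 15:44:46 {PREFIX} #2334, url={URL}, rc=11",
    f"May 17 15:45:01 {PREFIX} #2335, url={URL}, rc=11",
    f"May 17 15:45:16 {PREFIX} #2336, url={URL}, rc=11",
    f"May 17 15:45:32 {PREFIX} #2337, url={URL}, rc=11",
]

failures = parse_failures(sample)
print([attempt for _, attempt, _, _ in failures])  # attempt counters
print(retry_intervals(failures))                   # spacing in seconds
```

On the four lines above this gives attempts 2334 through 2337 spaced roughly 15 seconds apart, so the device keeps retrying the inform URL on a steady cycle and fails with rc=11 every time.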

 

Output of the info command.  This is odd: the hostname is wrong, the IP address field is empty, and the reported version is wrong (the firmware is actually 3.9.27.8537).

sw-301a-US.v3.9.27# info

Model:       USW-8P-60
Version:     6.0.123
MAC Address: f0:9f:c2:18:44:6b
IP Address:  
Hostname:    ubnt
Uptime:      1721677 seconds

Status:      Unknown[11] (http://my-cloud-controller.tld:8080/inform)
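To make the mismatch explicit, here is a small Python sketch that parses the info output and compares it against what the device should report. The helper names are hypothetical, and the expected values are assumptions taken from elsewhere in this post: the hostname from the syslog lines, the firmware version stated above, and the address from the ifconfig output further down.

```python
def parse_info(text):
    """Turn 'Key: value' lines from the info command into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def find_mismatches(fields, expected):
    """Return {key: (reported, expected)} for every field that differs."""
    return {
        key: (fields.get(key, ""), want)
        for key, want in expected.items()
        if fields.get(key, "") != want
    }

INFO_OUTPUT = """\
Model:       USW-8P-60
Version:     6.0.123
MAC Address: f0:9f:c2:18:44:6b
IP Address:
Hostname:    ubnt
Uptime:      1721677 seconds

Status:      Unknown[11] (http://my-cloud-controller.tld:8080/inform)
"""

# Expected values gathered from the rest of this post (assumptions, see above).
EXPECTED = {
    "Version": "3.9.27.8537",
    "IP Address": "10.48.1.1",
    "Hostname": "sw-301a",
}

mismatches = find_mismatches(parse_info(INFO_OUTPUT), EXPECTED)
for key, (got, want) in mismatches.items():
    print(f"{key}: reported {got!r}, expected {want!r}")
```

All three fields come back as mismatches, which matches what I see by eye.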

 

The switch has full internet connectivity and seems to be up and running.  It can reach the cloud controller just fine via ping:

sw-301a-US.v3.9.27# ping my-cloud-controller.tld
PING my-cloud-controller.tld (my.cloud.controller.ip): 56 data bytes
64 bytes from my.cloud.controller.ip: seq=0 ttl=57 time=3.408 ms
64 bytes from my.cloud.controller.ip: seq=1 ttl=57 time=3.403 ms
64 bytes from my.cloud.controller.ip: seq=2 ttl=57 time=3.330 ms
64 bytes from my.cloud.controller.ip: seq=3 ttl=57 time=3.326 ms
64 bytes from my.cloud.controller.ip: seq=4 ttl=57 time=3.259 ms
64 bytes from my.cloud.controller.ip: seq=5 ttl=57 time=3.263 ms
64 bytes from my.cloud.controller.ip: seq=6 ttl=57 time=3.350 ms
64 bytes from my.cloud.controller.ip: seq=7 ttl=57 time=3.460 ms
64 bytes from my.cloud.controller.ip: seq=8 ttl=57 time=3.482 ms
64 bytes from my.cloud.controller.ip: seq=9 ttl=57 time=3.530 ms
64 bytes from my.cloud.controller.ip: seq=10 ttl=57 time=3.963 ms
^C
--- my-cloud-controller.tld ping statistics ---
11 packets transmitted, 11 packets received, 0% packet loss
round-trip min/avg/max = 3.259/3.434/3.963 ms
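One caveat: ping only exercises ICMP, while the inform URL uses TCP port 8080, so a successful ping does not by itself show that the inform port is reachable. Below is a minimal Python sketch of a TCP check. It is an assumption that you would run it from a machine on the same LAN as the switch (I am not assuming Python exists on the switch itself), and the function name is my own. The demonstration at the bottom uses a throwaway local listener so it runs without touching any real controller.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False

# Local demonstration: listen on an ephemeral port, probe it, close it, probe again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

open_result = tcp_port_open("127.0.0.1", demo_port)    # listener is up
listener.close()
closed_result = tcp_port_open("127.0.0.1", demo_port)  # listener is gone

print(open_result, closed_result)
```

Against the controller in this thread the call would be tcp_port_open("my-cloud-controller.tld", 8080). It only confirms the TCP handshake; it says nothing about whether the inform request itself would succeed.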

I don't see anything out of the ordinary running here:

sw-301a-US.v3.9.27# top -b -n1
Mem: 161724K used, 94640K free, 0K shrd, 0K buff, 49896K cached
CPU:   4% usr   0% sys   0% nic  45% idle   0% io   0% irq  50% sirq
Load average: 1.97 1.81 1.69 1/147 1439
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
 1439   682 username R     1444   1%   4% top -b -n1
  770   737 username S     140m  56%   0% switchdrvr boot
  984     1 username S     9892   4%   0% /bin/mcad
  986     1 username S     9076   4%   0% /bin/utermd
  771   737 username S     8020   3%   0% syncdb
  980     1 username S     7356   3%   0% /bin/uplink-monitor
  985     1 username S     7232   3%   0% /bin/mca-monitor
  978     1 username S     7228   3%   0% /bin/reset-handler
  737     1 username S     1496   1%   0% /bin/procmgr
  682   681 username S     1480   1%   0% -sh
    1     0 username S     1448   1%   0% init
  976     1 username S     1448   1%   0% init
  979     1 username S     1444   1%   0% /usr/bin/syslogd -n -O /var/log/messages -l 7 -s 200 -b 0
  977     1 username S     1440   1%   0% /usr/bin/klogd -c 8 -n
  681   982 username S     1172   0%   0% /bin/dropbear -F -d /var/run/dropbear_dss_host_key -r /var/run/dropbear_rsa_host_key -p eth0:22
  982     1 username S     1124   0%   0% /bin/dropbear -F -d /var/run/dropbear_dss_host_key -r /var/run/dropbear_rsa_host_key -p eth0:22
  981     1 username S      936   0%   0% /sbin/ntpclient -i 86400 -n -s -c 0 -l -h 0.ubnt.pool.ntp.org
  395     1 username S      636   0%   0% /sbin/hotplug2 --persistent --set-rules-file /usr/etc/hotplug2.rules
  294     2 username SW       0   0%   0% [spi1]
  328     2 username SW       0   0%   0% [mtdblock6]
    3     2 username SW       0   0%   0% [ksoftirqd/0]
  117     2 username SW       0   0%   0% [kworker/0:1]
  125     2 username SW       0   0%   0% [kswapd0]
    6     2 username SW       0   0%   0% [kworker/u:0]
    8     2 username SW       0   0%   0% [migration/0]
    5     2 username SW<      0   0%   0% [kworker/0:0H]
   10     2 username SW       0   0%   0% [kdevtmpfs]
    7     2 username SW<      0   0%   0% [kworker/u:0H]
   93     2 username SW       0   0%   0% [bdi-default]
    9     2 username SW<      0   0%   0% [khelper]
  291     2 username SW<      0   0%   0% [krfcommd]
   11     2 username SW       0   0%   0% [kworker/u:1]
  168     2 username SW       0   0%   0% [fsnotify_mark]
  303     2 username SW       0   0%   0% [mtdblock1]
  308     2 username SW       0   0%   0% [mtdblock2]
  313     2 username SW       0   0%   0% [mtdblock3]
  298     2 username SW       0   0%   0% [mtdblock0]
  323     2 username SW       0   0%   0% [mtdblock5]
   95     2 username SW<      0   0%   0% [kblockd]
  334     2 username SW<      0   0%   0% [deferwq]
  318     2 username SW       0   0%   0% [mtdblock4]
    2     0 username SW       0   0%   0% [kthreadd]
  174     2 username SW<      0   0%   0% [crypto]
    4     2 username SW       0   0%   0% [kworker/0:0]

I don't see anything weird in the process list:

sw-301a-US.v3.9.27# ps w
  PID USER       VSZ STAT COMMAND
    1 username  1448 S    init
    2 username     0 SW   [kthreadd]
    3 username     0 SW   [ksoftirqd/0]
    4 username     0 SW   [kworker/0:0]
    5 username     0 SW<  [kworker/0:0H]
    6 username     0 SW   [kworker/u:0]
    7 username     0 SW<  [kworker/u:0H]
    8 username     0 SW   [migration/0]
    9 username     0 SW<  [khelper]
   10 username     0 SW   [kdevtmpfs]
   11 username     0 SW   [kworker/u:1]
   93 username     0 SW   [bdi-default]
   95 username     0 SW<  [kblockd]
  117 username     0 SW   [kworker/0:1]
  125 username     0 SW   [kswapd0]
  168 username     0 SW   [fsnotify_mark]
  174 username     0 SW<  [crypto]
  291 username     0 SW<  [krfcommd]
  294 username     0 SW   [spi1]
  298 username     0 SW   [mtdblock0]
  303 username     0 SW   [mtdblock1]
  308 username     0 SW   [mtdblock2]
  313 username     0 SW   [mtdblock3]
  318 username     0 SW   [mtdblock4]
  323 username     0 SW   [mtdblock5]
  328 username     0 SW   [mtdblock6]
  334 username     0 SW<  [deferwq]
  395 username   636 S    /sbin/hotplug2 --persistent --set-rules-file /usr/etc/hotplug2.rules
  681 username  1172 S    /bin/dropbear -F -d /var/run/dropbear_dss_host_key -r /var/run/dropbear_rsa_host_key -p eth0:22
  682 username  1480 S    -sh
  737 username  1496 S    /bin/procmgr
  770 username  140m S    switchdrvr boot
  771 username  8020 S    syncdb
  976 username  1448 S    init
  977 username  1440 S    /usr/bin/klogd -c 8 -n
  978 username  7228 S    /bin/reset-handler
  979 username  1444 S    /usr/bin/syslogd -n -O /var/log/messages -l 7 -s 200 -b 0
  980 username  7356 S    /bin/uplink-monitor
  981 username   936 S    /sbin/ntpclient -i 86400 -n -s -c 0 -l -h 0.ubnt.pool.ntp.org
  982 username  1124 S    /bin/dropbear -F -d /var/run/dropbear_dss_host_key -r /var/run/dropbear_rsa_host_key -p eth0:22
  984 username  9896 S    /bin/mcad
  985 username  7232 S    /bin/mca-monitor
  986 username  9076 S    /bin/utermd
 1770 username  1164 S    /bin/dropbear -F -d /var/run/dropbear_dss_host_key -r /var/run/dropbear_rsa_host_key -p eth0:22
 1771 username  1448 S    -sh
 1798 username  1444 R    ps w

 

Network configuration looks fine:

sw-301a-US.v3.9.27# uptime
 08:59:01 up 19 days, 21:26,  load average: 1.44, 1.70, 1.65
sw-301a-US.v3.9.27# ifconfig
eth0      Link encap:Ethernet  HWaddr F0:9F:C2:18:44:6B  
          inet addr:10.48.1.1  Bcast:10.48.7.255  Mask:255.255.248.0
          inet6 addr: fe80::f29f:c2ff:fe18:446b/64 Scope:Link
          UP BROADCAST RUNNING PROMISC ALLMULTI MULTICAST  MTU:1500  Metric:1
          RX packets:385309 errors:0 dropped:0 overruns:0 frame:0
          TX packets:525793 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500 
          RX bytes:37387512 (35.6 MiB)  TX bytes:141582784 (135.0 MiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:30 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:5113 (4.9 KiB)  TX bytes:5113 (4.9 KiB)

Unfortunately, I can't keep the site in this state for further investigation; I'm going to reboot all the devices to get them working again.  I hope this is enough information to go on, but if not, it should be possible to recreate this in a lab, since it happened to all four devices at the site (two US-8-60W switches and two UAP-AC-PRO access points).

 

Emerging Member
Posts: 49
Registered: ‎09-05-2013
Kudos: 38
Solutions: 2

Re: switches / aps not reconnecting to controller after ~16 hours of WAN downtime

ping

Emerging Member
Posts: 50
Registered: ‎11-02-2015
Kudos: 12
Solutions: 1

Re: switches / aps not reconnecting to controller after ~16 hours of WAN downtime

I experienced the exact same thing.  My WAN was down for ~14 hours and the same thing happened.

 

I was running 3.9.27.8537, the exact same version as you.  I have since upgraded to 3.9.54.9373.

 

I was running into this problem on the following model:

Model:       USW-24P-250
Version:     6.0.123

 

Senior Member
Posts: 23,735
Registered: ‎08-04-2017
Kudos: 4494
Solutions: 1167

Re: switches / aps not reconnecting to controller after ~16 hours of WAN downtime

Hello @anotherbhav,

 

Could you SSH into the device and share the output of the info command?

info

 

 

Regards,

Glenn R.

Cloud Hosted Controllers | Glenn R. | UniFi Installation/Easy Update Scripts | UniFi-Video Installation Scripts | UniFi-VoIP Installation Scripts
USG-XG-8 • USG-4-PRO • USG
US-XG-16 • US-48-500W • US-24-POE-250W 2x • US-16-POE-150W 3x • US-24 • US-8-150W • US-8
UAP XG • UAP-SHD • UAP-HD • UAP-NanoHD 2x • UAP-AC-PRO 2x • UAP-AC-LITE • UAP-AC-IW • UAP-AC-M • UAP-AC-M-PRO 2x
UAS-XG • UCK-G2-PLUS • UCK-G2 • UCK
Emerging Member
Posts: 50
Registered: ‎11-02-2015
Kudos: 12
Solutions: 1

Re: switches / aps not reconnecting to controller after ~16 hours of WAN downtime

I have since upgraded, so I can't get any MORE info than what I collected in my output, but here you go.

 

- I could remotely SSH into the device.

- I could ping the controller from switch01

- I tried to perform a set-inform to the same name, but that did nothing

 

 

switch01-US.v3.9.27# info

Model:       USW-24P-250
Version:     6.0.123
MAC Address: f0:9f:c2:XX:XX:XX
IP Address:
Hostname:    ubnt
Uptime:      12683006 seconds

Status:      Unknown[11] (http://unifi:8080/inform)

 

switch01-US.v3.9.27# tail -F /var/log/messages
Nov 26 18:51:31 switch01 user.err syslog: ace_reporter.reporter_fail(): Unknown[11] (http://unifi.mydomain.com.com/)
Nov 26 18:51:31 switch01 user.err syslog: ace_reporter.reporter_fail(): initial contact failed #19, url=http://unifi.mydomain.com.com/, rc=11
Nov 26 18:51:46 switch01 user.err syslog: ace_reporter.prepare_ethernet_array(): failed to fill ifstat for eth0!, rc: -1
Nov 26 18:51:46 switch01 user.err syslog: libubnt_webrtc.get_all_sdp_sessions(): Socket error
Nov 26 18:51:47 switch01 user.err syslog: ace_reporter.reporter_fail(): Unknown[11] (http://unifi.mydomain.com.com/)
Nov 26 18:51:47 switch01 user.err syslog: ace_reporter.reporter_fail(): initial contact failed #20, url=http://unifi.mydomain.com.com/, rc=11
Nov 26 18:52:02 switch01 user.err syslog: ace_reporter.prepare_ethernet_array(): failed to fill ifstat for eth0!, rc: -1
Nov 26 18:52:02 switch01 user.err syslog: libubnt_webrtc.get_all_sdp_sessions(): Socket error
Nov 26 18:52:03 switch01 user.err syslog: ace_reporter.reporter_fail(): Unknown[11] (http://unifi.mydomain.com.com/)
Nov 26 18:52:03 switch01 user.err syslog: ace_reporter.reporter_fail(): initial contact failed #21, url=http://unifi.mydomain.com.com/, rc=11
Nov 26 18:52:18 switch01 user.err syslog: ace_reporter.prepare_ethernet_array(): failed to fill ifstat for eth0!, rc: -1
Nov 26 18:52:18 switch01 user.err syslog: libubnt_webrtc.get_all_sdp_sessions(): Socket error
Nov 26 18:52:19 switch01 user.err syslog: ace_reporter.reporter_fail(): Unknown[11] (http://unifi.mydomain.com.com/)
Nov 26 18:52:19 switch01 user.err syslog: ace_reporter.reporter_fail(): initial contact failed #22, url=http://unifi.mydomain.com.com/, rc=11
^
Senior Member
Posts: 23,735
Registered: ‎08-04-2017
Kudos: 4494
Solutions: 1167

Re: switches / aps not reconnecting to controller after ~16 hours of WAN downtime

Hello @anotherbhav,

 

Try the IPv4 address of the controller.
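As a rough illustration of that suggestion, this Python sketch resolves a controller name to an IPv4 address and builds an inform URL in the same http://host:8080/inform form seen earlier in the thread. The function name is hypothetical, port 8080 is taken from the logs above, and the demo uses 127.0.0.1 only so it runs without DNS.

```python
import socket

def ipv4_inform_url(hostname, port=8080):
    """Resolve hostname to its first IPv4 address and build an inform URL from it."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    address = infos[0][4][0]  # sockaddr is (ip, port) for AF_INET
    return f"http://{address}:{port}/inform"

url = ipv4_inform_url("127.0.0.1")
print(url)
```

With a real controller hostname in place of 127.0.0.1, the resulting URL is what you would hand to set-inform on the device, which takes name resolution on the device out of the picture.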

 

 

Regards,

Glenn R.

Emerging Member
Posts: 50
Registered: ‎11-02-2015
Kudos: 12
Solutions: 1

Re: switches / aps not reconnecting to controller after ~16 hours of WAN downtime

Unfortunately, it's too late for that.  I did try setting the FQDN (unifi.mydomain.com) and the short name (unifi), but neither worked.  I could successfully resolve those hostnames on the device, so I didn't bother with the IP.

 

In the end, the site was experiencing some issues after the WAN outage, so I needed to get all the devices back online to troubleshoot, and the easiest way was to reboot and then upgrade.

 

Thanks for the suggestions.

 

Bhav

Senior Member
Posts: 23,735
Registered: ‎08-04-2017
Kudos: 4494
Solutions: 1167

Re: switches / aps not reconnecting to controller after ~16 hours of WAN downtime

Hello @anotherbhav,

 

So everything was working after a reboot?

 

 

Regards,

Glenn R.

Emerging Member
Posts: 50
Registered: ‎11-02-2015
Kudos: 12
Solutions: 1

Re: switches / aps not reconnecting to controller after ~16 hours of WAN downtime

Yup, everything worked after a reboot.  I still went ahead with an upgrade because I was running the exact same version as the original poster, and I figured I might hit the same bug again; it may only exist in this version of SW firmware.

 

 
