11-17-2018 11:36 AM
I monitor a group of 15 sites with only one AP each (10 UAP Outdoor+, 2 AC Lite and 3 AC Pro), combined with EdgeRouter Lite3 and DSL uplinks. All on 3.9.54 originally.
Really minimal setups, but there are no options at the moment to improve that.
The UAP Outdoor+ in particular were rock solid over the past two years. Rarely a hostapd hung, but I was always
able to restart remotely.
Since summer, and I am not sure exactly when but definitely after 3.9.27, problems have come up.
At first they stayed somewhat under my radar because folks onsite were restarting the APs manually when
they had access to the hardware. So I could not always determine whether it was just a local glitch (a power failure?) or something else.
I am no longer sure which versions we ran between 3.9.27 and 3.9.54, or whether 4.9.42 was also deployed. If
that one was pulled (I could not find it in the archive), then it was at least not causing severe trouble for us (at least not more
than what 3.9.54 does now).
But after some time I got a bigger picture and looked more closely at such events. The UniFi controller told me about
disconnections, but I was still able to ping, and pretty often I could still connect with ssh, but then the login hung and I never got a prompt. A few times I was able to get through and look at the dmesg output. Regrettably we have no remote syslog for the APs, but I could set it up if needed.
dmesg told me that the OOM killer had become active due to memory pressure. The victim was sometimes a hostapd, I think also mcad, and maybe others, as my statistics are probably not very complete. I suppose it is not always the biggest process that gets killed.
So depending on which process was killed there was some relief, but sometimes still not enough to log in, or maybe memory had already grown again before I looked. Depending on the victim and possibly automatic recovery mechanisms, the event either went unnoticed, showed up as a disconnection in the controller, or left the AP with 0 users from then on. I think there were no panics or automatic restarts.
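To keep better statistics on which processes get killed, saved dmesg output could be scanned with something like the following. This is only a minimal sketch: it assumes the kernel prints lines of the form "Killed process <pid> (<name>)", which many Linux kernels do, but the wording on the AP firmware may differ. The sample lines are invented for illustration and are not taken from the affected APs.

```python
import re

# Assumed message shape: "Killed process <pid> (<name>)".
# The exact wording on the AP firmware may differ.
KILLED = re.compile(r"killed process (\d+) \(([^)]+)\)", re.IGNORECASE)

def oom_victims(dmesg_text):
    """Return (pid, process_name) for each OOM kill found in dmesg output."""
    return [(int(m.group(1)), m.group(2)) for m in KILLED.finditer(dmesg_text)]

# Invented sample lines, not real output from the APs.
sample = """\
[412345.1] Out of memory: Kill process 812 (hostapd) score 41 or sacrifice child
[412345.2] Killed process 812 (hostapd) total-vm:5120kB, anon-rss:1800kB, file-rss:300kB
[498765.7] Killed process 640 (mcad) total-vm:3900kB, anon-rss:1500kB, file-rss:200kB
"""

print(oom_victims(sample))  # [(812, 'hostapd'), (640, 'mcad')]
```

Collected over a few weeks, such a list would show whether hostapd really is the usual victim or whether other processes are hit just as often.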
Only the 64 MB Outdoor+ APs were affected. The 128 MB AC Lite and Pro never ran into memory problems.
I have to add that, besides the AP operation as such, the controller activity causes intermittent variations in memory consumption, just as our Observium SNMP monitoring does when gathering information on the devices.
And if an AP is already tight on memory, then such external events (maybe several at once) probably push it over the edge.
The only SSID used is an open Wifi - so no radius or WPA2 or anything.
I have now paid close attention for about 6 weeks, trying to find out how they behave:
- What is the typical working set of the devices, the UAPO+ as well as the others, and how does it develop?
- Do we have a leak or just increased consumption? I suppose daily operation with many people passing by, coming
and going, may increase the working set gradually, and each of the site environments is vastly different in that regard.
- I wondered whether 3.9.54 may just be more "generous" and let more folks in (from a bigger distance maybe), so that the
working set got larger.
- I turned off the meshing, which was not used anyway, and got rid of these extra hostapd processes. Not sure why it was turned
on initially. I find the controller option about uplink monitoring not really self-explanatory. I wondered about the extra meshing device options on this controller, while I had none in my private controller, and wondered what the difference was...
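For tracking the working set independently of the controller, a memory percentage can be derived from /proc/meminfo on the AP. The sketch below is an assumption on my side: I do not know how the controller computes the percentage it displays, and here buffers and page cache are counted as available. The sample values are made up to resemble a 64 MB device.

```python
def used_percent(meminfo_text):
    """Percent of RAM in use, not counting buffers and page cache as used."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # values are given in kB
    total = fields["MemTotal"]
    # Assumption: buffers and cache can be reclaimed, so treat them as free.
    free = fields["MemFree"] + fields.get("Buffers", 0) + fields.get("Cached", 0)
    return 100.0 * (total - free) / total

# Made-up sample in /proc/meminfo format, not read from a real AP.
sample = """\
MemTotal:          61440 kB
MemFree:            6144 kB
Buffers:            2048 kB
Cached:             4096 kB
"""

print(round(used_percent(sample), 1))  # 80.0
```

Logging this value every few minutes, together with the client count, would help separate a real leak from load-dependent growth.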
Then I started downgrading single APs to 3.9.27 and immediately these APs were stable again.
They typically start at 78% memory use or even less, and then with 3.9.54 heavily loaded devices get to 88% after
two days at most and are then due for a quick restart, to avoid problems at times when they get less attention. On rare occasions I could not restart them remotely, and that was to be avoided on weekends.
The consumption does not stop growing at 88%.
Now even the loaded ones on 3.9.27 are only at 83% after a week. So they do fine.
Just minutes ago one raised my blood pressure a bit by being at 90%, wtf?? But I can calm down: it is back to 83%, and apparently I just happened to read the value during intense external activity.
One very idle AP on 3.9.54, with fewer than 20 people around and hardly any passers-by, is still at 84% after 50 days. I am curious whether it will ever stop growing. That may back my leak theory.
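One way to tell steady growth from a plateau is to fit a straight line through periodic readings and look at the slope. The sketch below uses an ordinary least-squares fit. The 78% and 88% readings two days apart match what I see on busy 3.9.54 devices, but the midpoint reading and the 95% limit are hypothetical values chosen only for illustration.

```python
def fit_line(samples):
    """Least-squares fit of (day, percent_used) samples; returns (slope, intercept)."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def days_until(samples, limit):
    """Days after the last sample until the fitted line reaches `limit`.

    Returns None when the fitted usage is not growing.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    return (limit - intercept) / slope - samples[-1][0]

# Day 0 and day 2 follow the post; the day 1 value is an illustrative guess.
busy = [(0, 78.0), (1, 83.0), (2, 88.0)]
slope, _ = fit_line(busy)
print(slope)                   # 5.0 percentage points per day
print(days_until(busy, 95.0))  # roughly 1.4 days; 95 is a hypothetical limit
```

A slope that stays positive over weeks on an idle AP would point towards a leak, while a slope that flattens out suggests a larger but bounded working set.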
On the other hand, the AC Pro devices at a larger site with many people (the 128 MB devices were all left on 3.9.54) do not seem to grow past a point which may represent their working set (~65% on average).
So after this long list of observations I wonder whether anybody involved can tell me what was changed, or whether someone is interested in further information or even wants access to the devices. At the very least consumption changed, if a leak was not introduced outright. That makes these devices pretty unstable, and the busy ones would have to be restarted every other day (which is not really an issue in this particular case). But it is still not acceptable.
I am surprised that this is apparently not a widespread issue, but maybe these devices are simply not in use much anymore (which I can perfectly understand, I would not have ordered them even back then). But if the devices are still supported, or at least get security fixes, then such behavior should not be introduced. Maybe memory-consuming features were added that would better be dropped.
a week ago
Apparently nobody is really interested. I am still observing this, and the Outdoor+ units which are pretty loaded
got a lot more stable when I downgraded them to 3.9.27.
So I do not suspect a real leak, but rather memory consumption that grows in proportion to load and now no longer fits into the 64 MB.
I will just move 1 or 2 of them to the current firmware to see what happens.