Distant Curve - A Story About Coming Unstuck...

by ‎02-24-2018 03:41 PM - edited ‎02-24-2018 08:55 PM

Distant Curve Remote Area Communications prides itself on considered and redundant design for remote areas and challenging environments. Having said that, we've learnt to expect the unexpected, from cockatoos eating GPS units to wallabies jumping on our solar panels...

We had one such event today... The attached graph shows the temperature inside 2 of our sites in the central Northern Territory (30km apart). There are a couple of unusual things about this graph. The blue line (at a site called 'Yakulla') shows a dramatic drop in temperature just after sunrise - that's the fans activating just before sunrise and pulling in cooler air, as they're meant to. The orange line (Kankawalla) shows no such drop - in fact it just continued to climb until reaching a peak of nearly 50 degrees Celsius at about lunchtime. It looked to us like all 5 fans had stopped working - unusual in itself because they're on two separate circuits for redundancy and we use expensive fans with ceramic bearings for long life.

Logging into the cameras this morning about 7am we confirmed the same - we could not hear the fans. With the equipment inside the boxes rapidly looking like it might reach 70 degrees Celsius today, we knew we needed to take some action. To fly out there is incredibly expensive - around $3000 return. But what else could we do? All of the gear inside each control box is industrial rated (we tend to use Rockwell, Allen Bradley, Hirschmann etc) but nonetheless a total ventilation failure is something that is not only unusual, but potentially problematic. So time for some analysis - what was the cause? Stuck fan control relays? A problem with the controller? Hot temperatures causing a false trigger of the polyswitch fuses?

We custom designed the controllers for each site - we call these site controllers the 'curveIQ'. They're an embedded expert system that controls all the subsystems at each site. They're based around an Atmel processor and they're at the heart of what makes our systems so reliable. They do things like control the charging of batteries very accurately, monitor state of charge, autonomously interrogate devices and troubleshoot in the event of a link failure, and fix and communicate problems with redundancy subsystems.

Amongst many other things they control the fans based upon temperature, but we do have the ability to switch the fans on and off remotely - with about a 2 minute time delay between 'executing the instruction' and the fans following the instruction. So we tried power cycling the fans a few times... Using the camera, we could hear the fan control switches clicking when we executed these commands - but still, the fans wouldn't start - so we were down to 1 of 3 causes - a thermal effect on our fuses (self resetting fuses), poor contact on our relays or perhaps a progressive failure of our fans.

So what could we do? If it was a polyswitch problem, it would probably self resolve in a week or so when the temps get lower - but it seemed really unlikely this was the cause..

One huge benefit of the fact we designed and wrote the code for CurveIQ is that we can get it to do novel stuff... so after a bit of thinking, we wrote an extra function in C++ and uploaded the new firmware through an encrypted tunnel 3000km into the desert, into the Atmel chip of the controller. The code looked something like this:

 

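#include <avr/wdt.h>  // wdt_reset()

// Rapidly toggle the fan relay outputs to try to shake stuck fans loose.
void rattleRelays()
{
  for (int i = 0; i <= 200; i++) {  // roughly 200 on/off cycles, about 40 seconds in total
    wdt_reset();                    // kick the watchdog so this long loop does not cause a reset
    PORTC = B11111100;              // drive the two lowest PORTC bits (presumably the two fan relay outputs) low
    delay(100);
    PORTC = B11111111;              // drive them high again
    wdt_reset();
    delay(100);
  }
}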

The basic idea (or 'hope') of the code was to rapidly open and close the electronic switches (relays) on both sets of fans about 5 times a second for 40 seconds, to 'rattle' the fans and hopefully get them turning again. After uploading, we told the new program to execute. Straight away we could hear the rattling of the relays through the cameras, a sound a bit like a woodpecker, and gradually, after about 30 seconds, we started to hear the familiar 'hum' of the fans kicking in.
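As a quick sanity check on those numbers, here is a small host-side sketch in plain C++ (not the firmware itself). The struct and function names are made up for illustration. It just works out how long a toggle loop runs and how many open/close cycles per second it produces for a given iteration count and delay.

```cpp
#include <cstdint>

// Hypothetical helper, not part of CurveIQ: summarises the timing of a
// toggle loop that holds the relays in one state for half_period_ms and
// then in the other state for half_period_ms, repeated `cycles` times.
struct RattleTiming {
    std::uint32_t total_ms;     // total run time of the loop in milliseconds
    double cycles_per_second;   // complete open/close cycles per second
};

RattleTiming rattle_timing(std::uint32_t cycles, std::uint32_t half_period_ms) {
    const std::uint32_t period_ms = 2 * half_period_ms;  // one full open + close
    return RattleTiming{cycles * period_ms, 1000.0 / period_ms};
}
```

With 200 iterations and a 100 ms delay per state, this gives 40000 ms in total at 5 cycles per second, which matches the figures quoted for the rattle.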

If you have a look at the graph you'll see the temps dropped substantially after 12pm (when we executed 'the rattle') - and they continue to drop. It was a success.

We're now fairly certain that the problem was due to the recent dust storms in the area. It seems that the very fine desert dust (like talcum powder) had got through the filters and been ingested into the fans' ceramic bearings, which have very fine tolerances, and had clogged them. With the benefit of hindsight, the power consumption of the box suggests the fans had probably been failing one at a time over the last few days until, this morning, all 5 had failed.

We thought we were doing the right thing by using open ceramic bearings for long life, but it seems to me that in future we'll be using more traditional grease-encapsulated, sealed sleeve bearing fans. Managing these remote sites sometimes feels a bit like managing something on Mars - but we've got enough redundancy in each system that we can address almost any problem. At no time during this event did our client lose connectivity, so for that reason I consider this unexpected failure to be a success. We've got a trip planned out there in March and the weather is due to cool down over the next 2 weeks or so, so we'll replace these fans when we're out there instead of being forced to rush out - and in the meantime we'll just leave the fans running 24/7 - momentum usually keeps a dicky fan going.

 

Kankawalla North Facing - Feb 24 '18 01_14_06 PM.jpg

PR1 - Looking East - Jan 20 '18 07_07_37 AM.jpg

 

20151123_121305.jpg

 

 

Graph.jpg

 console.jpg

 

 

Location: Tennant Creek NT, Australia
Products: airfiber-5x (28)
Description: A series of radio repeaters to take high speed internet to an industrial site in the Northern Territory

Comments
by
on ‎02-24-2018 05:45 PM

AWESOME Story , Thank you for sharing!!! 

by
on ‎02-25-2018 08:22 AM

@doc_karl You guyz are so cool !!!

"desert dust had entered through the filters". Why not try other filters then ? Or can't you just add grease to the ceramic bearings ?

by
‎02-25-2018 09:55 AM - edited ‎02-25-2018 09:56 AM

Desert dust (from dry lake beds and clay deposits especially) gets into the low and sub-micron range. Very tough to filter that without a significant amount of fan power.

by Ubiquiti Employee
on ‎02-25-2018 11:52 AM

@doc_karl thanks for the story on the radios deployed, definitely appreciated that you share!

by
‎02-25-2018 04:46 PM - edited ‎02-25-2018 04:49 PM

Thanks guys - they're still going strong. 

 

@jjonsson @MimCom - dust ingress is one of those funny things. Given that we aim to visit these sites no more than once a year, and given the highlighted deficiencies of the cabinet filters, I'm almost beginning to consider doing away with the fine filters altogether and using a coarse filter that's basically there only to protect against ingress of vertebrate and invertebrate pests (e.g. ants and snakes). My reasoning here is that the gear inside the boxes (industrial ethernet switches, control gear, power conversion gear) is all designed to withstand a dusty environment anyway. A lot of the 'received knowledge' about the need for air filtration probably harks back to the days of spinning platter hard drives, when a dust particle had the capacity to bring down servers because the width of the dust was greater than the distance between the platter and the head. In these environments, airflow is king - so MimCom makes a good point about the potential for filters to get clogged. There's about nothing I could do remotely if the filters got clogged, and the effect would be as bad or worse. I think I just need to find fans that can handle any dust a bit better.

 

Having said that, I'm working on an alternate hypothesis about what may have happened. We had about 2 weeks running where night time temps never went below about 100 Fahrenheit - it was a long heatwave. Polyswitches have a heat derating curve. Looking at the specs for these polyswitches, there should be plenty of overhead to handle the startup pulse of the fans each morning, but perhaps in this extreme case, not. That still wouldn't really explain how the fan rattle got them spinning, but it's interesting to consider. We'll find out when we get out there.

 

by Ubiquiti Employee
on ‎02-26-2018 08:52 AM

Very cool, always nice to see a challenging deployment like this done right.

by
on ‎03-01-2018 02:17 AM

rattleRelays() !!!!

Pure bloody genius!   

 

The fact you have managed to get these links up and working at all, let alone for over a year, is very impressive. Fixing a set of fans with a remote rattle routine?  **Words fail. Tips cap**

by
on ‎03-03-2018 08:12 AM

Congratulations, and thanks so much for sharing your issues with us, from Ecuador, Latin America. God bless you!

 

I hope you go on talking about your experiences and solutions!

 

regards ! from Ecuador!

 

by
on ‎03-03-2018 10:55 AM

Your company's design skills are impressive. For the sake of a redundant ventilation system, have you considered using both the ceramic bearing fans and the sealed lubricated fans, using only one system at a time? Since the ceramics have potentially better life span if protected from desert dust, use those as primary, switching over to the lubricated fans when dust storms or other conditions indicate the need to protect the ceramics. Without the ceramics running during heavy dust conditions, you would probably reduce the chances of clogging the fine particulate filters as well.

 

Just a thought.

by
on ‎03-03-2018 12:15 PM

Very nicely done - reminds me of some of the things deep space satellite and Mars Rover folks have had to do over the years...

 

Anything that's moving/mechanical is a problem in certain environments. If it was indeed the bearings, about all you can do is use sealed ones instead. Perhaps adding something like felt donut pads to the motor shafts to keep the dust out, or maybe look into fans used by the oil and gas industry - they see conditions like you do a lot. The cabinets that MetroLink/Ricochet used here in the US for their outdoor equipment included Peltier Effect heat pumps between an outer and inner sealed chamber, with inches of rigid insulation around the equipment chamber to make sure no dust got onto the equipment - I know that won't work in your case due to power considerations. Might just have to accept replacing the fans every year to prevent them seizing - sometimes you just have to declare something a consumable part and go on down the road...

Jim

by
on ‎03-04-2018 06:44 AM

Always enjoy reading your stories, it's great to see how you overcome challenges that just aren't a consideration for most WISPs. Please update us once you get a chance to visit the site.

by
‎03-05-2018 04:35 PM - edited ‎03-07-2018 01:32 AM

Thanks everybody for your kind words and suggestions. All appreciated and many will be actioned.

 

Serious weather coming through our Central Desert site - we'll see how we go over the next 5 days...

 

 

 

us_model-en-294-0_modez_2018030512_96_1533_157.png

PR1 - Looking East - Mar 07 '18 07_23_23 PM.jpg

 

 

by
on ‎03-07-2018 07:43 PM

Fantastic story. You had your Apollo 13 moment there!

by
on ‎03-10-2018 08:56 AM

This reminds me of a setup I have done in the past for a client a mere 400KM away. We decided to do without cooling in the main equipment enclosure since all the gear was industrial and rated for higher temperatures and the local desert dust was full of copper, lead and other conductive metals which had and has caused problems with unsealed equipment at that site. The lack of cooling did kill a few batteries though.

 

I'm planning another remote setup for a client on a farm in NSW, only 250KM away, and thanks to your experience I am now considering adding more in the way of the ability to remotely manage cooling and power for the gear. I assume your Atmel controller is connected to a device running a Linux operating system for control and programming? I was going to use an Arduino type board with ethernet to handle my voltage monitoring, power switching and fan control, but now I might just use a regular board connected to a Linux single board computer like a Pi, so I can upload new code remotely.

 

I will take your fan issue into consideration with my own fan selection since I don't want to cook any batteries this time.

by
‎03-14-2018 07:02 PM - edited ‎03-14-2018 07:08 PM

@JacklGuru Good luck.

 

With regards to the architecture of our SCADA system, no, in this case this is a pure real time embedded system using the Atmel processor itself. I have training in embedded systems, and the design and development of this system took several years. It's a highly specialised system specifically designed for controlling radio sites.
All the functions of the system are fully contained within that processor and it simply does push / pull to an external server - Push to update readings, pull to take commands in the rare event that we want it to modify any of the 28 default characteristics that control the manner in which the box responds to stimuli. We've also implemented a custom bootloader that allows us to upload firmware updates over the wire in the event that that becomes necessary.

by
on ‎05-08-2018 05:13 PM

Just to follow up on this, did you find out what caused the initial failure of the fans? Was it the fine dust getting into the fans?

by
‎10-22-2018 12:56 AM - edited ‎10-22-2018 12:57 AM

Just as a follow up to this, in June I went out and visited (and expanded) this particular project.

 

To answer @SpankyMK's question - the root cause ended up being two-fold... yes, some of the fans had ingested dust - but I had managed to get them going using rattleRelays() and set them to stay on permanently between when this article was written and when I went out and visited the sites.

 

When I got out there, I found that there was a second problem - the relay contacts were worn out. There was one register that I was unable to check remotely that I was able to check when I had the CurveIQ units on the bench - that was the watchdog timer register.

 

On almost all embedded systems, programmers are able to set an interrupt driven watchdog timer. What this does is perform a hard reset of the embedded system by pulling a reset pin low. This is mostly to guard against cases where code execution stops for some reason - due to an error, memory leak etc. In my case, when I checked the register that counts these resets, I found that it had been triggered many thousands of times. In my system, when the watchdog timer resets the device, all relays that are not already closed go to a closed state - by design - such that if the curveIQ device itself fails, all fans and all radios are switched on, so that the device itself does not become a single point of failure.
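To make the fail-safe behaviour concrete, here is a minimal host-side model in plain C++. It is only a sketch: the `Controller` struct, the 8-relay layout and the function name are assumptions for illustration, not the closed-source CurveIQ code. It shows a reset counter being bumped and every relay being forced closed, and it reports how many relays physically changed state, since those are the ones that wear.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Hypothetical model of the fail-safe described above.
struct Controller {
    std::bitset<8> relays_closed;        // bit set = relay closed (load powered)
    std::uint32_t watchdog_resets = 0;   // counter that can be read back later

    // Called when the watchdog expires: count the event and close every relay.
    // Returns how many relays actually moved (were open, are now closed).
    std::size_t on_watchdog_reset() {
        ++watchdog_resets;
        const std::size_t moved = relays_closed.size() - relays_closed.count();
        relays_closed.set();  // fail-safe state: everything switched on
        return moved;
    }
};
```

Relays that were already closed (the radios, say) are untouched by a reset in this model; only relays that were open at the time get cycled.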

 

We'd tested these devices for some years and had never had spurious watchdog resets before - the code was seen to be very solid. But when I looked deeply into it, I recalled that there was a time early on in this particular project when there had been a 3 day outage due to a misconfiguration at the client's end. Basically I'd told them not to use VLAN 1 at all, as that was the VLAN that my Hirschmann switches use, but they had in fact not respected that and had 'error shutdown' their router port in certain circumstances when they saw TCNs coming in from my gear on VLAN 1. After they got back from the weekend and cycled the port, they realised the problem, turned off STP on that port and there were no dramas from then on - we've never had an unplanned outage in our part of the network at all.

During that outage, when the users at the downstream side of the network saw they no longer had network connectivity, the first thing they did was power cycle my equipment. What this then did was reset the DNS entries in my EdgeRouter - as there was no longer an internet connection, the router couldn't get a DNS record for a URL that they check at power-on.

The ethernet library I use in my embedded devices was meant to be non-blocking, but we found that this particular circumstance was one time when it became blocking. Basically, when it had a connection to the router but the router did not reply to a DNS request immediately, the code blocked waiting for a response from the EdgeRouters, which was taking longer than 8 seconds - and thus the watchdog timer triggered.
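For anyone wondering what 'truly non-blocking' looks like in practice, here is a rough sketch in plain C++. The class, the state names and the `reply_ready` flag are all hypothetical and not taken from the real ethernet library. The idea is that the main loop polls the lookup on each pass instead of waiting inside it, and gives up after a deadline that is shorter than the watchdog period.

```cpp
#include <cstdint>

enum class DnsState { Idle, Waiting, Resolved, TimedOut };

// Hypothetical non-blocking wrapper around a DNS lookup. The caller polls it
// from the main loop, so the loop keeps running and keeps kicking the watchdog.
class DnsPoller {
public:
    explicit DnsPoller(std::uint32_t timeout_ms) : timeout_ms_(timeout_ms) {}

    // Record the start time; the request itself would be sent here.
    void start(std::uint32_t now_ms) {
        started_ms_ = now_ms;
        state_ = DnsState::Waiting;
    }

    // reply_ready: whether the (assumed) network stack has a reply buffered.
    // Returns immediately in every case; never waits.
    DnsState poll(std::uint32_t now_ms, bool reply_ready) {
        if (state_ != DnsState::Waiting) return state_;
        if (reply_ready) {
            state_ = DnsState::Resolved;
        } else if (now_ms - started_ms_ >= timeout_ms_) {
            // Unsigned subtraction stays correct across millisecond counter wraparound.
            state_ = DnsState::TimedOut;
        }
        return state_;
    }

private:
    std::uint32_t timeout_ms_;
    std::uint32_t started_ms_ = 0;
    DnsState state_ = DnsState::Idle;
};
```

With an 8 second watchdog, a timeout of something like 5000 ms would leave headroom, though the right figure depends on what else the main loop is doing.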

At night, when the temperature was lower than the fan cutoff temp, this meant that with each reset cycle my fan relays were turning the fans on, then curveIQ was turning them off once it measured that the temperature was lower than the cutoff. Only then did the ethernet library bug kick in and the WDT trigger, thus closing all the relays (not an issue for the relays that were already closed, like those powering the radios, but an issue for the relays that were now open, like those controlling the fans). This started the whole process again.

What this meant was that over that 3 day period, we had tens of thousands of fan and fan relay cycles. These were electromechanical relays so the contacts had not liked that treatment. Dust was a problem in some cases, but worn out relays were the culprit in others.
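A back-of-envelope way to see how the count gets so large, as a plain C++ sketch. The function name and all three inputs are placeholders for illustration; in particular the loop time and the share of time spent below the fan cutoff are assumptions, not measurements from the sites.

```cpp
#include <cstdint>

// Hypothetical estimate, not a measurement from the sites.
// outage_hours:           how long the reset loop persisted
// fraction_below_cutoff:  share of that time the box was cooler than the fan cutoff
// seconds_per_reset_loop: boot + fans switched off + blocked DNS lookup + watchdog expiry
std::uint64_t estimated_fan_relay_cycles(double outage_hours,
                                         double fraction_below_cutoff,
                                         double seconds_per_reset_loop) {
    if (outage_hours <= 0.0 || fraction_below_cutoff <= 0.0 ||
        seconds_per_reset_loop <= 0.0) {
        return 0;  // nothing sensible to estimate
    }
    const double seconds_cycling = outage_hours * 3600.0 * fraction_below_cutoff;
    // Each pass through the loop opens the fan relay once and closes it once.
    return static_cast<std::uint64_t>(seconds_cycling / seconds_per_reset_loop);
}
```

For example, a 72 hour outage with half the time below the cutoff and an assumed 10 second loop comes out at 12960 cycles, and 25920 if the box stayed below the cutoff the whole time.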

 

We would have preferred the customer didn't power cycle our equipment - if the router hadn't been rebooted after the upstream comms failure, the problem wouldn't have occurred - but it's great they were a bit naughty in a way, as it exposed a failure mode we wouldn't have otherwise discovered.

We had tested during development with the routers completely turned off or unplugged, but never with the routers plugged in and switched on but with a blank DNS table and no outside world connection to conduct a query. It was only then that the WDT was triggered.

 

To address this, we replaced the relay boards, rewrote the publicly available ethernet library to make it truly non-blocking, and for a bit of 'defense in depth' we also changed the behaviour of the fans so that they don't 'turn off' after curveIQ is started until after the first DNS request is properly answered and a flag is set. It was a fairly significant effort, but fixed the problem entirely.
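The 'defense in depth' rule boils down to a simple gate. A minimal sketch in plain C++ follows; the function and parameter names are made up for illustration and this is not the actual CurveIQ logic.

```cpp
// Hypothetical fan policy: after a (re)start the fans stay in the fail-safe
// "on" state until the first DNS request has been answered. Only then is the
// temperature cutoff allowed to switch them off.
bool fans_should_run(double box_temp_c, double cutoff_temp_c, bool first_dns_answered) {
    if (!first_dns_answered) {
        return true;  // network path not yet proven; do not touch the fan relays
    }
    return box_temp_c >= cutoff_temp_c;  // normal temperature-based control
}
```

The point of the gate is that a controller stuck in a reset loop never reaches the code that opens the fan relays, so a repeat of the fault would leave the relays sitting closed instead of cycling them.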

As for the fans, as noted, I've now replaced them with sealed bearing fans instead of the exxy ceramic bearing fans. They've proven better in that environment.

The added benefit is that the particular fans I now use have a PWM speed control wire (the 4 wire type instead of the 3 wire type), so I'm now able to use curveIQ to modulate fan speed to match ambient temps instead of just completely turning them off at night - and also do neat stuff like not turn them on until the internal temperature of the box exceeds the external temperature outside the box. Much smarter temperature control, with reduced power consumption as well.
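A rough idea of how that kind of control could be expressed, as a plain C++ sketch. The thresholds, the minimum duty value and the linear ramp are all assumptions for illustration; the real CurveIQ control curve isn't published.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical 8-bit PWM duty for a 4-wire fan (0 = off, 255 = full speed).
// start_c: box temperature at which the fans begin to run (assumed value)
// full_c:  box temperature at which the fans reach full speed (assumed value)
std::uint8_t fan_duty(double internal_c, double external_c, double start_c, double full_c) {
    const double kMinDuty = 64.0;  // assumed lowest duty at which the fan still spins

    // No point pulling in outside air that is as hot as or hotter than the box.
    if (internal_c <= external_c) return 0;
    if (internal_c < start_c) return 0;
    if (full_c <= start_c) return 255;  // degenerate range: treat as on/off control

    // Linear ramp from kMinDuty at start_c up to 255 at full_c.
    const double frac = std::min(1.0, std::max(0.0, (internal_c - start_c) / (full_c - start_c)));
    return static_cast<std::uint8_t>(kMinDuty + frac * (255.0 - kMinDuty) + 0.5);
}
```

With a start of 25 C and full speed at 45 C, a box at 35 C with 20 C outside would get a duty of 160 in this sketch, while a box at 30 C with 35 C outside would keep its fans off.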

 

So... like many things in engineering, there was no single cause - it was a combination, in this case, of three separate variables that only when lined up in a complex way led to the 'failure mode' exposing itself. The fix was uploaded to our other sites.

by
on ‎10-22-2018 01:01 AM

Thanks for the update. Great read.

by
on ‎10-22-2018 07:53 AM

Foolproofing, meet greater fools.  Nice work excavating that root cause.

by
on ‎10-22-2018 04:23 PM

thanks Gents

by
a week ago

Your energy balance graphs are very neat. Looks like you are using an emoncms.org type setup - just wondering what hardware you may be using behind these energy balance graphs?

by
a week ago

@rosspeel - the interface with emoncms is managed by the same embedded system (Atmel based controller) that controls our sites that I mentioned above - CurveIQ - it's our own closed source design.