ipsec site to site VPN fails - requires reboot
11-28-2017 07:09 PM
Router is behind 2 additional NATted gateways providing dual-wan access. The dual-wan load balancing is handled with simple source modify rules.
The symptom is that IPSEC site to site is unreliable. I discovered that while IP traffic NATted by the EdgeRouter reliably reaches its destination (verified with tcpdump at the destination), traffic originating from the EdgeRouter itself does not. The tcpdump on the ER looks proper whether NATted or not, but the traffic simply doesn't arrive at the destination.
ICMP traffic does reliably make it where it needs to go. `telnet VPNendpoint 500` from a host inside the EdgeRouter source NAT makes it where it needs to go. With `telnet VPNendpoint 500` from the EdgeRouter itself, though, nothing ever arrives at the destination. I don't think it's an MSS/MTU problem, as I've already lowered them quite substantially, well below the 1420 I had set on the prior configuration; in addition, tcpdump is indicating very small frame lengths, <200 bytes.
Rebooting the EdgeRouter fixes the problem for a while. The tunnel comes up and traffic moves smoothly for... hours. Eventually it fails, and it takes a reboot to make traffic flow again.
Again... NATted traffic is still making it to its destination. ICMP traffic moves fine to and from wherever. IP (UDP and TCP seemingly) traffic cannot get where I need it to go however, including sending to destinations unrelated to the VPN, if it's originating at the EdgeRouter's own IP interface.
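The split described above (forwarded traffic arrives, router-originated traffic does not) can be inspected from the EdgeRouter's shell. This is only a sketch, assuming shell access on the router; the peer address 207.2.81.244, the tables 1 and 2, and the eth2 uplink are taken from the config posted below, and the commands are the standard iproute2 and tcpdump tools rather than anything EdgeOS-specific.

```shell
# Which route, interface, and source address does the kernel choose
# for traffic that the router itself sends to the VPN peer?
ip route get 207.2.81.244

# Policy-routing rules and the per-table default routes used by WAN_LB
ip rule show
ip route show table 1
ip route show table 2

# Watch IKE packets to the peer leave on the accounting uplink
sudo tcpdump -ni eth2 host 207.2.81.244 and udp port 500
```

Comparing the interface and source address reported by `ip route get` with what the capture at the far end shows is one way to narrow down where the packets are being lost.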
I'm worried that it's actually a defective unit, in a remote location north of the Arctic Circle.
This is in /var/log/messages:
"I/O Error, both of real entry and whiteout found, resolv.conf, error -5"
.... a lot:
root@nvfy-edgerouter:/var/log# grep "I/O Error, both of real entry and whiteout found" /var/log/messages | wc -l
1011
pseudo-sanitized config:
firewall {
    all-ping enable
    broadcast-ping disable
    group {
        address-group Unwanted_WAN_Traffic {
            address 40.77.232.59
            address 198.251.90.71
            description ""
        }
        network-group LAN_All {
            description ""
            network 172.22.1.0/24
            network 172.22.19.0/24
            network 172.22.21.0/24
            network 172.22.22.0/24
            network 172.22.23.0/24
            network 172.22.24.0/24
            network 172.22.25.0/24
        }
        network-group source_route_1 {
            description "IPs that route through general Internet"
            network 172.22.21.0/24
            network 172.22.23.0/24
            network 172.22.24.0/24
            network 172.22.25.0/24
        }
        network-group source_route_2 {
            description "IPs that route through acct Internet"
            network 172.22.22.0/24
            network 172.22.1.0/24
            network 172.22.19.0/24
        }
    }
    ipv6-receive-redirects disable
    ipv6-src-route disable
    ip-src-route disable
    log-martians enable
    modify WAN_LB {
        rule 10 {
            action modify
            modify {
                table 1
            }
            source {
                group {
                    network-group source_route_1
                }
            }
        }
        rule 20 {
            action modify
            modify {
                table 2
            }
            source {
                group {
                    network-group source_route_2
                }
            }
        }
    }
    name GZGTG_LAN {
        default-action accept
        description ""
        rule 1 {
            action drop
            description "Block Unwanted Sites"
            destination {
                group {
                    network-group LAN_All
                }
            }
            log enable
            protocol all
            source {
                group {
                    address-group Unwanted_WAN_Traffic
                }
            }
            state {
                established enable
                invalid enable
                new enable
                related enable
            }
        }
    }
    options {
        mss-clamp {
            interface-type all
            mss 1372
        }
    }
    receive-redirects disable
    send-redirects enable
    source-validation disable
    syn-cookies enable
}
interfaces {
    ethernet eth0 {
        address 192.168.1.100/24
        duplex auto
        speed auto
    }
    ethernet eth1 {
        address 172.22.1.1/24
        duplex auto
        firewall {
            in {
                modify WAN_LB
            }
        }
        speed auto
        vif 19 {
            address 172.22.19.1/24
            description "Server VLAN"
            firewall {
                in {
                    modify WAN_LB
                }
            }
            mtu 1500
        }
        vif 21 {
            address 172.22.21.1/24
            description "GZGTG Users"
            firewall {
                in {
                    modify WAN_LB
                }
            }
            mtu 1500
        }
        vif 22 {
            address 172.22.22.1/24
            description Accounting
            firewall {
                in {
                    modify WAN_LB
                }
            }
            mtu 1500
        }
        vif 23 {
            address 172.22.23.1/24
            description "GZGTG Guests"
            firewall {
                in {
                    modify WAN_LB
                }
            }
            mtu 1500
        }
        vif 24 {
            address 172.22.24.1/24
            description VOIP
            firewall {
                in {
                    modify WAN_LB
                }
            }
            mtu 1500
        }
        vif 25 {
            address 172.22.25.1/24
            description "GZGTG Printers"
            firewall {
                in {
                    modify WAN_LB
                }
            }
            mtu 1500
        }
    }
    ethernet eth2 {
        address 192.168.2.100/24
        duplex auto
        mtu 1460
        speed auto
    }
    ethernet eth3 {
        duplex auto
        speed auto
    }
    ethernet eth4 {
        duplex auto
        speed auto
    }
    loopback lo {
    }
    switch switch0 {
        mtu 1500
    }
}
load-balance {
}
protocols {
    static {
        route 0.0.0.0/0 {
            next-hop 192.168.1.1 {
            }
            next-hop 192.168.2.1 {
            }
        }
        route 207.2.81.240/29 {
            next-hop 192.168.2.1 {
                description "All GSE LLC traffic through Acct Uplink"
            }
        }
        table 1 {
            route 0.0.0.0/0 {
                next-hop 192.168.1.1 {
                }
            }
        }
        table 2 {
            route 0.0.0.0/0 {
                next-hop 192.168.2.1 {
                }
            }
        }
    }
}
service {
    dhcp-server {
        disabled false
        hostfile-update disable
        shared-network-name Native_VLAN_pool {
            authoritative disable
            subnet 172.22.1.0/24 {
                default-router 172.22.1.1
                dns-server 8.8.8.8
                dns-server 8.8.4.4
                lease 86400
                static-mapping gzgtg-ups1 {
                    ip-address 172.22.1.6
                    mac-address 00:c0:b7:6a:47:14
                }
                static-mapping nvfy-vm4 {
                    ip-address 172.22.1.9
                    mac-address 00:24:e8:7f:1c:f8
                }
            }
        }
        shared-network-name Server {
            authoritative disable
            subnet 172.22.19.0/24 {
                default-router 172.22.19.1
                dns-server 172.22.1.7
                dns-server 172.22.19.48
                domain-name tribal.local
                lease 86400
                start 172.22.19.64 {
                    stop 172.22.19.64
                }
                static-mapping Brother-Env {
                    ip-address 172.22.19.44
                    mac-address 90:cd:b6:68:7e:2b
                }
                static-mapping Brother-Realty {
                    ip-address 172.22.19.43
                    mac-address 40:49:0f:a2:8f:30
                }
                static-mapping UniFi1 {
                    ip-address 172.22.19.6
                    mac-address 04:18:d6:6c:56:da
                }
                static-mapping UniFi2 {
                    ip-address 172.22.19.7
                    mac-address 04:18:d6:6c:5e:f1
                }
            }
        }
        shared-network-name VOIP_pool {
            authoritative disable
            subnet 172.22.24.0/24 {
                default-router 172.22.24.1
                dns-server 172.22.1.7
                dns-server 172.22.19.48
                domain-name tribal.local
                lease 86400
                start 172.22.24.128 {
                    stop 172.22.24.255
                }
                tftp-server-name 172.22.24.2
            }
        }
        shared-network-name accounting_pool {
            authoritative disable
            subnet 172.22.22.0/24 {
                default-router 172.22.22.1
                dns-server 172.22.1.7
                dns-server 172.22.19.4
                domain-name tribal.local
                lease 86400
                start 172.22.22.64 {
                    stop 172.22.22.127
                }
                static-mapping acct-printer {
                    ip-address 172.22.22.42
                    mac-address 40:b0:34:a4:dc:4a
                }
            }
        }
        shared-network-name gen_use_pool {
            authoritative disable
            subnet 172.22.21.0/24 {
                default-router 172.22.21.1
                dns-server 172.22.1.7
                dns-server 172.22.19.4
                domain-name tribal.local
                lease 86400
                start 172.22.21.128 {
                    stop 172.22.21.191
                }
                static-mapping nvfy-desktop16 {
                    ip-address 172.22.21.192
                    mac-address b0:83:fe:ba:97:eb
                }
            }
        }
        shared-network-name guest_pool {
            authoritative disable
            subnet 172.22.23.0/24 {
                default-router 172.22.23.1
                dns-server 172.22.1.7
                lease 86400
                start 172.22.23.64 {
                    stop 172.22.23.127
                }
            }
        }
        shared-network-name printers_pool {
            authoritative disable
            subnet 172.22.25.0/24 {
                default-router 172.22.25.1
                dns-server 172.22.1.7
                lease 86400
                static-mapping prn-housing-1 {
                    ip-address 172.22.25.64
                    mac-address 48:5a:b6:7e:7a:a5
                }
                static-mapping prn-realty-1 {
                    ip-address 172.22.25.65
                    mac-address 40:49:0f:a2:8f:30
                }
            }
        }
        use-dnsmasq disable
    }
    gui {
        http-port 80
        https-port 443
        older-ciphers enable
    }
    nat {
        rule 5001 {
            description Outbound_All_eth0
            log disable
            outbound-interface eth0
            protocol all
            source {
                group {
                    network-group LAN_All
                }
            }
            type masquerade
        }
        rule 5002 {
            description Outbound_All_eth2
            log disable
            outbound-interface eth2
            protocol all
            source {
                group {
                    network-group LAN_All
                }
            }
            type masquerade
        }
    }
    ssh {
        port 22
        protocol-version v2
    }
    unms {
        disable
    }
}
system {
    domain-name tribal.local
    host-name nvfy-edgerouter
    login {
        user jrdalrymple {
            authentication {
                encrypted-password $6$gZ6pymO7r4tfag55$LakDYi2Gmm2rnZ7BdkQKIbZ4.WQLKfK1CQJaE0UjAfsLOWkm/NbVUJnL9DtQ7FpC1dnKLZF6dRTZ910/QjCUK1
                plaintext-password ""
            }
            full-name "JR Dalrymple"
            level admin
        }
        user ubnt {
            authentication {
                encrypted-password $6$mFYayM/oosIR$eW7ztWThZMKN7tg5/0qdTErjHBr6NHKHSmywgH9gtxnryx9e/kbVRWF5C9owuIWwcTijwDRfeXRfGxV6PJVnd.
                plaintext-password ""
            }
            full-name Admin
            level admin
        }
    }
    name-server 172.22.1.7
    name-server 172.22.19.48
    ntp {
        server 0.ubnt.pool.ntp.org {
        }
        server 1.ubnt.pool.ntp.org {
        }
        server 2.ubnt.pool.ntp.org {
        }
        server 3.ubnt.pool.ntp.org {
        }
    }
    syslog {
        global {
            facility all {
                level notice
            }
            facility protocols {
                level debug
            }
        }
    }
    time-zone UTC
    traffic-analysis {
        dpi enable
        export enable
    }
}
traffic-control {
    smart-queue GZGTG-eth0 {
        download {
            ecn enable
            flows 1024
            fq-quantum 1514
            limit 10240
            rate 1024kbit
        }
        upload {
            ecn enable
            flows 1024
            fq-quantum 1514
            limit 10240
            rate 512kbit
        }
        wan-interface eth0
    }
    smart-queue GZGTG-eth2 {
        download {
            ecn enable
            flows 1024
            fq-quantum 1514
            limit 10240
            rate 1024kbit
        }
        upload {
            ecn enable
            flows 1024
            fq-quantum 1514
            limit 10240
            rate 512kbit
        }
        wan-interface eth2
    }
}
vpn {
    ipsec {
        auto-firewall-nat-exclude enable
        esp-group FOO0 {
            compression disable
            lifetime 3600
            mode tunnel
            pfs enable
            proposal 1 {
                encryption aes128
                hash sha1
            }
        }
        ike-group FOO0 {
            ikev2-reauth no
            key-exchange ikev1
            lifetime 3600
            mode main
            proposal 1 {
                dh-group 2
                encryption aes128
                hash sha1
            }
        }
        site-to-site {
            peer 207.2.81.244 {
                authentication {
                    mode pre-shared-secret
                    pre-shared-secret vowu74khx9F99h4IfUPT6ohoOsmw0II4XtGO7rosGzWpRC3WYlnzt3bTz2RdvvpW
                }
                connection-type initiate
                description "GSE LLC VPN"
                ike-group FOO0
                ikev2-reauth inherit
                local-address any
                tunnel 1 {
                    allow-nat-networks disable
                    allow-public-networks disable
                    esp-group FOO0
                    local {
                        prefix 172.22.22.0/24
                    }
                    remote {
                        prefix 172.16.104.0/24
                    }
                }
                tunnel 2 {
                    allow-nat-networks disable
                    allow-public-networks disable
                    esp-group FOO0
                    local {
                        prefix 172.22.19.0/24
                    }
                    remote {
                        prefix 172.16.104.0/24
                    }
                }
                tunnel 3 {
                    allow-nat-networks disable
                    allow-public-networks disable
                    esp-group FOO0
                    local {
                        prefix 172.22.1.7/32
                    }
                    remote {
                        prefix 172.16.104.0/24
                    }
                }
            }
        }
    }
}
Any advice appreciated.

Accepted Solutions
Re: ipsec site to site VPN fails - requires reboot
11-29-2017 08:52 PM
@16again wrote:
1) Rebooting the ER-X shuts down IPSEC traffic on ports 500 and 4500; if this takes long enough, the upstream router's NAT table is cleared.
Right problem, wrong cause. There is something still a bit baffling. Obviously this thing is a router on a stick. When I left it a week ago today, things were working great. Customers were all logging into their domain, sending print jobs to their printers, etc. This morning things seemed to tank massively.
Now... as I was following tutorials to put together the LB config, I definitely identified that the rules I put in place would prevent LAN to LAN traffic, but they didn't... things worked fine when I left last Wednesday. Today I found nothing working, as none of the clients could get to DNS. I don't have an explanation for how they were able to route to their DNS servers, printers, etc. from last Wednesday through last night, but in order to fix it I did have to put in a proper route modify to main.
So... my presumption is that at least some amount of traffic that didn't belong on the WAN uplinks was getting pitched out there, and without being NATted. The outcome: the NAT table on the crappy upstream routers fills up, eventually causing them to blow up. There was obviously some coincidence and some confusion, but realistically it's working now after that change (as is internal routing), so it's the best I can come up with. At the end of the day... it's working now - marking solved.
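The exact rule that was added is not shown in the thread. As a sketch of what a "route modify to main" usually looks like in EdgeOS policy routing, assuming the existing WAN_LB rule set and LAN_All group from the posted config, something like the following would keep LAN to LAN traffic in the main routing table. The rule number 5 and the description text are placeholders, chosen only so that the rule is evaluated before the existing rules 10 and 20.

```shell
configure

# Hypothetical rule 5: match anything destined for an internal network
# and look it up in the main table instead of table 1 or table 2
set firewall modify WAN_LB rule 5 description 'LAN to LAN stays in main table'
set firewall modify WAN_LB rule 5 destination group network-group LAN_All
set firewall modify WAN_LB rule 5 action modify
set firewall modify WAN_LB rule 5 modify table main

commit
save
exit
```

In the posted config, tables 1 and 2 each contain only a default route out a WAN uplink, so without a rule like this, traffic between the 172.22.x.0/24 VLANs that enters through WAN_LB has nowhere to go except an upstream gateway.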
All Replies
Re: ipsec site to site VPN fails - requires reboot
11-28-2017 10:34 PM
The "whiteout" message doesn't show up here in over 5 weeks of log files.
IPSEC and NAT can cause trouble if the external NAT device starts translating ports (so your source port 500 gets translated, confusing the remote end).
A keep-alive might prevent that from happening.
I also use tunnels behind NAT, and have had success starting an outer GRE tunnel and encrypting the packets inside the GRE tunnel with IPSEC. This way the remote device only sees GRE packets, and it can't mess up ports. And since IPSEC no longer sees NAT, you can use VTI.
Note this is different from normal GRE/IPSEC, which uses GRE for the inner tunnel and IPSEC for the outer.
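The reply does not say which keep-alive mechanism is meant. One possible reading is IKE dead peer detection, which EdgeOS exposes on the ike-group. A minimal sketch, assuming the ike-group name FOO0 from the posted config; the interval and timeout values are illustrative and not taken from the thread:

```shell
configure

# Probe the peer periodically and restart the tunnel if it stops answering
set vpn ipsec ike-group FOO0 dead-peer-detection action restart
set vpn ipsec ike-group FOO0 dead-peer-detection interval 30
set vpn ipsec ike-group FOO0 dead-peer-detection timeout 120

commit
save
exit
```

Whether this also keeps the upstream NAT mappings alive depends on how often the probes are actually sent and on the NAT devices' timeouts, so treat it as something to test rather than a known fix.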
Re: ipsec site to site VPN fails - requires reboot
11-29-2017 02:13 AM
@16again wrote:
IPSEC and NAT can cause trouble if the external NAT device starts translating ports (so your source port 500 gets translated, confusing the remote end).
A keep-alive might prevent that from happening.
I also use tunnels behind NAT, and have had success starting an outer GRE tunnel and encrypting the packets inside the GRE tunnel with IPSEC. This way the remote device only sees GRE packets, and it can't mess up ports. And since IPSEC no longer sees NAT, you can use VTI.
Note this is different from normal GRE/IPSEC, which uses GRE for the inner tunnel and IPSEC for the outer.
Not to say that you're incorrect by any means - I'm definitely looking for solutions and hate balking when I hear them, but...
1) So let's say port overload is causing issues at the public edge, why does rebooting ERX immediately fix the issue and for some hours?
2) I'm not sure what you mean by keepalive, but I kind of have one in place by the nature of the graph attached performing the up/down check and reporting back. I suppose I should also take this opportunity to highlight another issue, that is that I have to initiate from behind the ERX, a situation that didn't exist prior to the ERX's replacement of the previous router. If I try to initiate from the remote endpoint the ERX is definitely receiving isakmp phase 1, but doing nothing about it. Another problem for another day...
As mentioned, there was another router doing a very vanilla IPSEC behind this very same NAT for years without issue. I didn't change any configuration at the target or on the public router. The ERX installation is something of a feasibility study, since Soekris shuttered and in time I will have to replace all of them (2 of the other endpoints I support are also behind NAT). Right now that study is failing.
Re: ipsec site to site VPN fails - requires reboot
11-29-2017 02:31 AM
When I think about it even further... why would the ERX care anyway? As far as it's concerned (and as long as my MTU/MSS is configured properly) there is no PAT going on. If anyone would care it's my remote end, and it's working fine. If it weren't, I'd have hundreds of people calling me, not just 3.
See attached to understand what I mean.

Re: ipsec site to site VPN fails - requires reboot
11-29-2017 02:51 AM
There is this...
root@nvfy-edgerouter# ping google.com
ping: unknown host google.com
[edit]
root@nvfy-edgerouter# delete system name-server 172.22.1.7
[edit]
root@nvfy-edgerouter# set system name-server 172.22.19.4
[edit]
root@nvfy-edgerouter# commit
[ system name-server 172.22.1.7 ]
touch: /etc/resolv.conf: Input/output error
sed: can't read /etc/resolv.conf: Input/output error
[ system name-server 172.22.19.48 ]
touch: /etc/resolv.conf: Input/output error
sed: can't read /etc/resolv.conf: Input/output error
[ system name-server 172.22.19.4 ]
touch: /etc/resolv.conf: Input/output error
grep: /etc/resolv.conf: Input/output error
awk: cannot open /etc/resolv.conf (Input/output error)
head: /etc/resolv.conf: Input/output error
cat: can't open '/etc/resolv.conf': Input/output error
tail: can't open '/etc/resolv.conf': Input/output error
tail: no files
mv: can't stat '/etc/resolv.conf': Input/output error
[edit]
root@nvfy-edgerouter# ping google.com
ping: unknown host google.com
All 3 of the nameservers do indeed work.
I have no real immediate need for this box to be able to resolve anything, but what would the expected behavior be if it couldn't resolve things ... like say the default NTP servers? I noticed some time ago a consistent load average of right around 1.0...
Not sure how to fix this (the above, followed by a reboot, didn't).
Not sure whether it even matters.
Re: ipsec site to site VPN fails - requires reboot
11-29-2017 02:59 AM
1) Rebooting the ER-X shuts down IPSEC traffic on ports 500 and 4500; if this takes long enough, the upstream router's NAT table is cleared.
2) Your graph acts as a keep-alive for the data channel (UDP 4500), not for IKE phase 1.
In the screenshot shown, I see UDP connections coming from IP addresses not defined as a peer in your config. Maybe too many requests like these confuse the NAT router or the ER-X.
You could disable auto-firewall-nat-exclude and add WAN_LOCAL rules that only allow the configured peer for UDP 500/4500.
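A rough sketch of that suggestion in EdgeOS CLI form follows. The posted config has no WAN_LOCAL rule set, so the rule numbers, descriptions, default action, and the choice to attach it to both eth0 and eth2 are all assumptions; only the peer address 207.2.81.244 and the interface names come from the thread.

```shell
configure

# Assumption: turn off the automatic IPsec firewall/NAT handling
set vpn ipsec auto-firewall-nat-exclude disable

# Hypothetical WAN_LOCAL rule set for traffic addressed to the router itself
set firewall name WAN_LOCAL default-action drop
set firewall name WAN_LOCAL rule 10 action accept
set firewall name WAN_LOCAL rule 10 description 'Allow established/related'
set firewall name WAN_LOCAL rule 10 state established enable
set firewall name WAN_LOCAL rule 10 state related enable

# Accept IKE (UDP 500) and NAT-T (UDP 4500) only from the configured peer
set firewall name WAN_LOCAL rule 20 action accept
set firewall name WAN_LOCAL rule 20 description 'IKE and NAT-T from VPN peer only'
set firewall name WAN_LOCAL rule 20 protocol udp
set firewall name WAN_LOCAL rule 20 source address 207.2.81.244
set firewall name WAN_LOCAL rule 20 destination port 500,4500

# Apply to both uplinks in the local direction
set interfaces ethernet eth0 firewall local name WAN_LOCAL
set interfaces ethernet eth2 firewall local name WAN_LOCAL

commit
save
exit
```

With the automatic handling disabled, the tunnel traffic itself may also need explicit firewall and NAT exclusion rules, so this is best tried during a maintenance window with out-of-band access to the router.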