New Member
Posts: 9
Registered: ‎12-27-2015
Kudos: 1
Solutions: 2
Accepted Solution

ipsec site to site VPN fails - requires reboot

The router sits behind two additional NATted gateways that provide dual-WAN access. Load balancing across the two uplinks is handled with simple source-based modify rules.

 

The symptom is that IPSEC site to site is unreliable. I discovered that IP traffic NATted by the EdgeRouter reaches its destination reliably (verified with tcpdump at the destination), but traffic originating from the EdgeRouter itself does not. The tcpdump on the ER looks proper whether the traffic is NATted or not; the traffic simply never arrives at the destination.

 

ICMP traffic does reliably make it where it needs to go. `telnet VPNendpoint 500` from a host inside the EdgeRouter's source NAT also makes it where it needs to go. With `telnet VPNendpoint 500` from the EdgeRouter itself, though, nothing ever arrives at the destination. I don't think it's an MSS/MTU problem: I've already lowered both quite substantially, well below the 1420 I had set on the prior configuration, and tcpdump is showing very small frame lengths (<200 bytes).
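
One caveat on the telnet test: telnet opens a TCP connection, while IKE runs over UDP port 500 (and UDP 4500 once NAT traversal is in play), so a capture filtered on those UDP ports at both ends may be more telling. A minimal sketch, assuming the peer is reached through eth2 (as the static route in the config below suggests); the `ike_filter` helper name is made up for illustration:

```shell
#!/usr/bin/env bash
# Build a tcpdump capture filter for IKE (UDP 500) and NAT-T (UDP 4500)
# traffic to or from a single peer. ike_filter is a hypothetical helper.
ike_filter() {
    local peer="$1"
    printf 'host %s and udp and (port 500 or port 4500)\n' "$peer"
}

# Example (needs root on the router; eth2 is an assumption, see above):
#   sudo tcpdump -ni eth2 "$(ike_filter 207.2.81.244)"
```

Running the same filter on the remote endpoint at the same time shows whether the packets leave the router but die somewhere in between.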

 

Rebooting the EdgeRouter fixes the problem, for a while. The tunnel comes up and traffic moves smoothly for... hours. Eventually it fails, and it takes a reboot to make traffic flow again.

 

Again... NATted traffic is still making it to its destination. ICMP traffic moves fine to and from wherever. UDP and TCP traffic (seemingly both) that originates at the EdgeRouter's own IP interface, however, cannot get where I need it to go, and that includes destinations unrelated to the VPN.

 

I'm worried that it's actually a defective unit, in a remote location north of the Arctic Circle :(

 

This is in /var/log/messages:

 

"I/O Error, both of real entry and whiteout found, resolv.conf, error -5"

 

.... a lot:

 

root@nvfy-edgerouter:/var/log# grep "I/O Error, both of real entry and whiteout found" /var/log/messages | wc -l
1011
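
To see whether resolv.conf is the only file affected, the same messages can be grouped by file name. A rough sketch, assuming every line has the same layout as the one quoted above (file name right after "found, "); `whiteout_by_file` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Count "whiteout" I/O errors per affected file name.
# Reads log lines on stdin and prints one "<count> <name>" row per file,
# most frequent first. Lines without the message are ignored.
whiteout_by_file() {
    sed -n 's/.*both of real entry and whiteout found, \([^,]*\),.*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage on the router:
#   whiteout_by_file < /var/log/messages
```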

 

 

pseudo-sanitized config:

 

 firewall {
     all-ping enable
     broadcast-ping disable
     group {
         address-group Unwanted_WAN_Traffic {
             address 40.77.232.59
             address 198.251.90.71
             description ""
         }
         network-group LAN_All {
             description ""
             network 172.22.1.0/24
             network 172.22.19.0/24
             network 172.22.21.0/24
             network 172.22.22.0/24
             network 172.22.23.0/24
             network 172.22.24.0/24
             network 172.22.25.0/24
         }
         network-group source_route_1 {
             description "IPs that route through general Internet"
             network 172.22.21.0/24
             network 172.22.23.0/24
             network 172.22.24.0/24
             network 172.22.25.0/24
         }
         network-group source_route_2 {
             description "IPs that route through acct Internet"
             network 172.22.22.0/24
             network 172.22.1.0/24
             network 172.22.19.0/24
         }
     }
     ipv6-receive-redirects disable
     ipv6-src-route disable
     ip-src-route disable
     log-martians enable
     modify WAN_LB {
         rule 10 {
             action modify
             modify {
                 table 1
             }
             source {
                 group {
                     network-group source_route_1
                 }
             }
         }
         rule 20 {
             action modify
             modify {
                 table 2
             }
             source {
                 group {
                     network-group source_route_2
                 }
             }
         }
     }
     name GZGTG_LAN {
         default-action accept
         description ""
         rule 1 {
             action drop
             description "Block Unwanted Sites"
             destination {
                 group {
                     network-group LAN_All
                 }
             }
             log enable
             protocol all
             source {
                 group {
                     address-group Unwanted_WAN_Traffic
                 }
             }
             state {
                 established enable
                 invalid enable
                 new enable
                 related enable
             }
         }
     }
     options {
         mss-clamp {
             interface-type all
             mss 1372
         }
     }
     receive-redirects disable
     send-redirects enable
     source-validation disable
     syn-cookies enable
 }
 interfaces {
     ethernet eth0 {
         address 192.168.1.100/24
         duplex auto
         speed auto
     }
     ethernet eth1 {
         address 172.22.1.1/24
         duplex auto
         firewall {
             in {
                 modify WAN_LB
             }
         }
         speed auto
         vif 19 {
             address 172.22.19.1/24
             description "Server VLAN"
             firewall {
                 in {
                     modify WAN_LB
                 }
             }
             mtu 1500
         }
         vif 21 {
             address 172.22.21.1/24
             description "GZGTG Users"
             firewall {
                 in {
                     modify WAN_LB
                 }
             }
             mtu 1500
         }
         vif 22 {
             address 172.22.22.1/24
             description Accounting
             firewall {
                 in {
                     modify WAN_LB
                 }
             }
             mtu 1500
         }
         vif 23 {
             address 172.22.23.1/24
             description "GZGTG Guests"
             firewall {
                 in {
                     modify WAN_LB
                 }
             }
             mtu 1500
         }
         vif 24 {
             address 172.22.24.1/24
             description VOIP
             firewall {
                 in {
                     modify WAN_LB
                 }
             }
             mtu 1500
         }
         vif 25 {
             address 172.22.25.1/24
             description "GZGTG Printers"
             firewall {
                 in {
                     modify WAN_LB
                 }
             }
             mtu 1500
         }
     }
     ethernet eth2 {
         address 192.168.2.100/24
         duplex auto
         mtu 1460
         speed auto
     }
     ethernet eth3 {
         duplex auto
         speed auto
     }
     ethernet eth4 {
         duplex auto
         speed auto
     }
     loopback lo {
     }
     switch switch0 {
         mtu 1500
     }
 }
 load-balance {
 }
 protocols {
     static {
         route 0.0.0.0/0 {
             next-hop 192.168.1.1 {
             }
             next-hop 192.168.2.1 {
             }
         }
         route 207.2.81.240/29 {
             next-hop 192.168.2.1 {
                 description "All GSE LLC traffic through Acct Uplink"
             }
         }
         table 1 {
             route 0.0.0.0/0 {
                 next-hop 192.168.1.1 {
                 }
             }
         }
         table 2 {
             route 0.0.0.0/0 {
                 next-hop 192.168.2.1 {
                 }
             }
         }
     }
 }
 service {
     dhcp-server {
         disabled false
         hostfile-update disable
         shared-network-name Native_VLAN_pool {
             authoritative disable
             subnet 172.22.1.0/24 {
                 default-router 172.22.1.1
                 dns-server 8.8.8.8
                 dns-server 8.8.4.4
                 lease 86400
                 static-mapping gzgtg-ups1 {
                     ip-address 172.22.1.6
                     mac-address 00:c0:b7:6a:47:14
                 }
                 static-mapping nvfy-vm4 {
                     ip-address 172.22.1.9
                     mac-address 00:24:e8:7f:1c:f8
                 }
             }
         }
         shared-network-name Server {
             authoritative disable
             subnet 172.22.19.0/24 {
                 default-router 172.22.19.1
                 dns-server 172.22.1.7
                 dns-server 172.22.19.48
                 domain-name tribal.local
                 lease 86400
                 start 172.22.19.64 {
                     stop 172.22.19.64
                 }
                 static-mapping Brother-Env {
                     ip-address 172.22.19.44
                     mac-address 90:cd:b6:68:7e:2b
                 }
                 static-mapping Brother-Realty {
                     ip-address 172.22.19.43
                     mac-address 40:49:0f:a2:8f:30
                 }
                 static-mapping UniFi1 {
                     ip-address 172.22.19.6
                     mac-address 04:18:d6:6c:56:da
                 }
                 static-mapping UniFi2 {
                     ip-address 172.22.19.7
                     mac-address 04:18:d6:6c:5e:f1
                 }
             }
         }
         shared-network-name VOIP_pool {
             authoritative disable
             subnet 172.22.24.0/24 {
                 default-router 172.22.24.1
                 dns-server 172.22.1.7
                 dns-server 172.22.19.48
                 domain-name tribal.local
                 lease 86400
                 start 172.22.24.128 {
                     stop 172.22.24.255
                 }
                 tftp-server-name 172.22.24.2
             }
         }
         shared-network-name accounting_pool {
             authoritative disable
             subnet 172.22.22.0/24 {
                 default-router 172.22.22.1
                 dns-server 172.22.1.7
                 dns-server 172.22.19.4
                 domain-name tribal.local
                 lease 86400
                 start 172.22.22.64 {
                     stop 172.22.22.127
                 }
                 static-mapping acct-printer {
                     ip-address 172.22.22.42
                     mac-address 40:b0:34:a4:dc:4a
                 }
             }
         }
         shared-network-name gen_use_pool {
             authoritative disable
             subnet 172.22.21.0/24 {
                 default-router 172.22.21.1
                 dns-server 172.22.1.7
                 dns-server 172.22.19.4
                 domain-name tribal.local
                 lease 86400
                 start 172.22.21.128 {
                     stop 172.22.21.191
                 }
                 static-mapping nvfy-desktop16 {
                     ip-address 172.22.21.192
                     mac-address b0:83:fe:ba:97:eb
                 }
             }
         }
         shared-network-name guest_pool {
             authoritative disable
             subnet 172.22.23.0/24 {
                 default-router 172.22.23.1
                 dns-server 172.22.1.7
                 lease 86400
                 start 172.22.23.64 {
                     stop 172.22.23.127
                 }
             }
         }
         shared-network-name printers_pool {
             authoritative disable
             subnet 172.22.25.0/24 {
                 default-router 172.22.25.1
                 dns-server 172.22.1.7
                 lease 86400
                 static-mapping prn-housing-1 {
                     ip-address 172.22.25.64
                     mac-address 48:5a:b6:7e:7a:a5
                 }
                 static-mapping prn-realty-1 {
                     ip-address 172.22.25.65
                     mac-address 40:49:0f:a2:8f:30
                 }
             }
         }
         use-dnsmasq disable
     }
     gui {
         http-port 80
         https-port 443
         older-ciphers enable
     }
     nat {
         rule 5001 {
             description Outbound_All_eth0
             log disable
             outbound-interface eth0
             protocol all
             source {
                 group {
                     network-group LAN_All
                 }
             }
             type masquerade
         }
         rule 5002 {
             description Outbound_All_eth2
             log disable
             outbound-interface eth2
             protocol all
             source {
                 group {
                     network-group LAN_All
                 }
             }
             type masquerade
         }
     }
     ssh {
         port 22
         protocol-version v2
     }
     unms {
         disable
     }
 }
 system {
     domain-name tribal.local
     host-name nvfy-edgerouter
     login {
         user jrdalrymple {
             authentication {
                 encrypted-password $6$gZ6pymO7r4tfag55$LakDYi2Gmm2rnZ7BdkQKIbZ4.WQLKfK1CQJaE0UjAfsLOWkm/NbVUJnL9DtQ7FpC1dnKLZF6dRTZ910/QjCUK1
                 plaintext-password ""
             }
             full-name "JR Dalrymple"
             level admin
         }
         user ubnt {
             authentication {
                 encrypted-password $6$mFYayM/oosIR$eW7ztWThZMKN7tg5/0qdTErjHBr6NHKHSmywgH9gtxnryx9e/kbVRWF5C9owuIWwcTijwDRfeXRfGxV6PJVnd.
                 plaintext-password ""
             }
             full-name Admin
             level admin
         }
     }
     name-server 172.22.1.7
     name-server 172.22.19.48
     ntp {
         server 0.ubnt.pool.ntp.org {
         }
         server 1.ubnt.pool.ntp.org {
         }
         server 2.ubnt.pool.ntp.org {
         }
         server 3.ubnt.pool.ntp.org {
         }
     }
     syslog {
         global {
             facility all {
                 level notice
             }
             facility protocols {
                 level debug
             }
         }
     }
     time-zone UTC
     traffic-analysis {
         dpi enable
         export enable
     }
 }
 traffic-control {
     smart-queue GZGTG-eth0 {
         download {
             ecn enable
             flows 1024
             fq-quantum 1514
             limit 10240
             rate 1024kbit
         }
         upload {
             ecn enable
             flows 1024
             fq-quantum 1514
             limit 10240
             rate 512kbit
         }
         wan-interface eth0
     }
     smart-queue GZGTG-eth2 {
         download {
             ecn enable
             flows 1024
             fq-quantum 1514
             limit 10240
             rate 1024kbit
         }
         upload {
             ecn enable
             flows 1024
             fq-quantum 1514
             limit 10240
             rate 512kbit
         }
         wan-interface eth2
     }
 }
 vpn {
     ipsec {
         auto-firewall-nat-exclude enable
         esp-group FOO0 {
             compression disable
             lifetime 3600
             mode tunnel
             pfs enable
             proposal 1 {
                 encryption aes128
                 hash sha1
             }
         }
         ike-group FOO0 {
             ikev2-reauth no
             key-exchange ikev1
             lifetime 3600
             mode main
             proposal 1 {
                 dh-group 2
                 encryption aes128
                 hash sha1
             }
         }
         site-to-site {
             peer 207.2.81.244 {
                 authentication {
                     mode pre-shared-secret
                     pre-shared-secret vowu74khx9F99h4IfUPT6ohoOsmw0II4XtGO7rosGzWpRC3WYlnzt3bTz2RdvvpW
                 }
                 connection-type initiate
                 description "GSE LLC VPN"
                 ike-group FOO0
                 ikev2-reauth inherit
                 local-address any
                 tunnel 1 {
                     allow-nat-networks disable
                     allow-public-networks disable
                     esp-group FOO0
                     local {
                         prefix 172.22.22.0/24
                     }
                     remote {
                         prefix 172.16.104.0/24
                     }
                 }
                 tunnel 2 {
                     allow-nat-networks disable
                     allow-public-networks disable
                     esp-group FOO0
                     local {
                         prefix 172.22.19.0/24
                     }
                     remote {
                         prefix 172.16.104.0/24
                     }
                 }
                 tunnel 3 {
                     allow-nat-networks disable
                     allow-public-networks disable
                     esp-group FOO0
                     local {
                         prefix 172.22.1.7/32
                     }
                     remote {
                         prefix 172.16.104.0/24
                     }
                 }
             }
         }
     }
 }

 

Any advice appreciated.

image.png

Accepted Solutions
New Member

Re: ipsec site to site VPN fails - requires reboot


16again wrote:

1) Rebooting the ER-X shuts down IPSEC traffic on ports 500 and 4500. If this takes long enough, the upstream router's NAT table is cleared.


Right problem, wrong cause. There is something still a bit baffling. Obviously this thing is a router on a stick. When I left it a week ago today, things were working great. Customers were all logging into their domain, sending print jobs to their printers, etc. This morning things seemed to tank massively.

 

Now... as I was following tutorials to put together the LB config, I definitely identified that the rules I put in place would prevent LAN-to-LAN traffic, but they didn't... things worked fine when I left last Wednesday. Today I found nothing working, as none of the clients could get to DNS. I don't have an explanation for how they were able to route to their DNS servers, printers, etc. from last Wednesday through last night, but in order to fix it I did have to put in a proper route modify to main.

 

So... my presumption is that at least some amount of traffic that didn't belong on the WAN uplinks was getting pitched out there, and without being NATted. The outcome: it filled the NAT table on the crappy upstream routers, eventually causing them to blow up. There was obviously some coincidence and some confusion, but realistically it's working now after that change (as is internal routing), so it's the best I can come up with. At the end of the day... it's working now - marking solved.
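
For anyone landing here later, a "route modify to main" would look roughly like the following. This is a sketch rather than the exact rule that went in: the rule number 5 and the description are placeholders, and it assumes the rule has to be evaluated before rules 10 and 20 in WAN_LB so that traffic destined for LAN_All stays on the main routing table instead of being pushed to table 1 or 2.

```shell
configure
# Placeholder rule number: anything lower than the existing rule 10.
set firewall modify WAN_LB rule 5 description "Keep LAN-to-LAN traffic on main table"
set firewall modify WAN_LB rule 5 action modify
set firewall modify WAN_LB rule 5 modify table main
set firewall modify WAN_LB rule 5 destination group network-group LAN_All
commit
save
```

Tables 1 and 2 only contain a default route, so without a rule like this anything matched by rules 10 and 20 would be expected to follow that default route, inter-VLAN traffic included.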



All Replies
Veteran Member
Posts: 7,023
Registered: ‎03-24-2016
Kudos: 1819
Solutions: 802

Re: ipsec site to site VPN fails - requires reboot

The "whiteout" message doesn't show up here in over 5 weeks of log files.

 

IPSEC and NAT can give trouble if the external NAT device starts translating ports (so your source port 500 gets translated, confusing the remote).

 

Keep-alive might prevent that from happening.
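
One way to get periodic IKE traffic on EdgeOS is dead peer detection on the IKE group. Whether this is the keep-alive meant here is an assumption, and the interval and timeout values below are only illustrative:

```shell
configure
# FOO0 is the ike-group name from the posted config.
# Interval/timeout values are illustrative, not recommendations.
set vpn ipsec ike-group FOO0 dead-peer-detection action restart
set vpn ipsec ike-group FOO0 dead-peer-detection interval 30
set vpn ipsec ike-group FOO0 dead-peer-detection timeout 120
commit
save
```

With `action restart` the router also tries to renegotiate when the peer stops answering, which may help in the "needs a reboot" situation, but that is untested against this setup.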

 

I also use tunnels behind NAT, and am successful with starting an outer GRE tunnel and encrypting the packets inside the GRE tunnel with IPSEC. This way the remote device only sees GRE packets, and it can't mess up ports. And since IPSEC no longer sees NAT, you can use VTI.

 

Note this is different from normal GRE/IPSEC, which uses GRE for the inner tunnel and IPSEC for the outer.

New Member

Re: ipsec site to site VPN fails - requires reboot


16again wrote:

IPSEC and NAT can give trouble if the external NAT device starts translating ports (so your source port 500 gets translated, confusing the remote).

 

Keep-alive might prevent that from happening.

 

I also use tunnels behind NAT, and am successful with starting an outer GRE tunnel and encrypting the packets inside the GRE tunnel with IPSEC. This way the remote device only sees GRE packets, and it can't mess up ports. And since IPSEC no longer sees NAT, you can use VTI.

 

Note this is different from normal GRE/IPSEC, which uses GRE for the inner tunnel and IPSEC for the outer.


Not to say that you're incorrect by any means - I'm definitely looking for solutions and hate balking when I hear them, but...

 

1) So let's say port overload is causing issues at the public edge. Why does rebooting the ERX fix the issue immediately, and for some hours?

 

2) I'm not sure what you mean by keepalive, but I kind of have one in place, since the graph attached performs the up/down check and reports back. I suppose I should also take this opportunity to highlight another issue: I have to initiate from behind the ERX, a situation that didn't exist before the ERX replaced the previous router. If I try to initiate from the remote endpoint, the ERX is definitely receiving ISAKMP phase 1, but doing nothing about it. Another problem for another day...

 

 

As mentioned - there was another router doing a very vanilla IPSEC behind this very same NAT for years without issue. I didn't change any configuration at the target nor on the public router. The ERX installation is something of a feasibility study since Soekris shuttered, and in time I will have to replace all of them (2 other endpoints I support are also behind NAT). Right now that study is failing :)

 

New Member

Re: ipsec site to site VPN fails - requires reboot

When I think about it even further... why would the ERX care anyway? As far as it's concerned (and as long as my MTU/MSS is configured properly) there is no PAT going on. If anyone would care it's my remote end, and it's working fine. If it wasn't I'd have hundreds of people calling me, not just 3 :)

 

See attached to understand what I mean.

Untitled.png
New Member

Re: ipsec site to site VPN fails - requires reboot

There is this...

 

root@nvfy-edgerouter# ping google.com
ping: unknown host google.com
[edit]
root@nvfy-edgerouter# delete system name-server 172.22.1.7
[edit]
root@nvfy-edgerouter# set system name-server 172.22.19.4
[edit]
root@nvfy-edgerouter# commit
[ system name-server 172.22.1.7 ]
touch: /etc/resolv.conf: Input/output error
sed: can't read /etc/resolv.conf: Input/output error

[ system name-server 172.22.19.48 ]
touch: /etc/resolv.conf: Input/output error
sed: can't read /etc/resolv.conf: Input/output error

[ system name-server 172.22.19.4 ]
touch: /etc/resolv.conf: Input/output error
grep: /etc/resolv.conf: Input/output error
awk: cannot open /etc/resolv.conf (Input/output error)
head: /etc/resolv.conf: Input/output error
cat: can't open '/etc/resolv.conf': Input/output error
tail: can't open '/etc/resolv.conf': Input/output error
tail: no files
mv: can't stat '/etc/resolv.conf': Input/output error

[edit]
root@nvfy-edgerouter# ping google.com
ping: unknown host google.com

All 3 of the nameservers do indeed work.

 

I have no real immediate need for this box to be able to resolve anything, but what would the expected behavior be if it couldn't resolve things ... like say the default NTP servers? I noticed some time ago a consistent load average of right around 1.0... 

 

Not sure of how to fix this (the above with a reboot didn't)

Not sure of whether it even matters

 

Veteran Member

Re: ipsec site to site VPN fails - requires reboot

1) Rebooting the ER-X shuts down IPSEC traffic on ports 500 and 4500. If this takes long enough, the upstream router's NAT table is cleared.

 

2) Your graph provides a keep-alive for the data channel (UDP 4500), not for IKE phase 1.

 

In the screenshot shown, I see UDP connections coming from IP addresses not defined as a peer in your config. Maybe too many requests like these confuse the NAT router or the ER-X.

 

You could disable auto-firewall-nat-exclude and add WAN_LOCAL rules that only allow the configured peer on UDP 500/4500.
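
A rough sketch of what that could look like, assuming no WAN_LOCAL ruleset exists yet (the posted config has none) and that eth2 is the uplink carrying the peer route. Rule numbers and descriptions are placeholders:

```shell
configure
# Assumption: WAN_LOCAL is a new ruleset; rule numbers are placeholders.
set firewall name WAN_LOCAL default-action drop
set firewall name WAN_LOCAL rule 10 action accept
set firewall name WAN_LOCAL rule 10 description "Allow established/related"
set firewall name WAN_LOCAL rule 10 state established enable
set firewall name WAN_LOCAL rule 10 state related enable
set firewall name WAN_LOCAL rule 20 action accept
set firewall name WAN_LOCAL rule 20 description "IKE and NAT-T from configured peer only"
set firewall name WAN_LOCAL rule 20 protocol udp
set firewall name WAN_LOCAL rule 20 source address 207.2.81.244
set firewall name WAN_LOCAL rule 20 destination port 500,4500
# Bind to the uplink that carries the peer route (eth2 in the posted config).
set interfaces ethernet eth2 firewall local name WAN_LOCAL
set vpn ipsec auto-firewall-nat-exclude disable
commit
save
```

Two cautions: with auto-firewall-nat-exclude disabled, the NAT exclusion for the tunnel prefixes has to be configured by hand, and a default-action of drop on the local ruleset also blocks SSH and GUI access arriving on that interface unless it is explicitly allowed. Worth double-checking before committing on a box that is hard to reach.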

