Intermittent VPN tunnel fluctuations causing Remote Desktop Connection failures

We have users at our enterprise in India who take Remote Desktop connections to a server in the US through a VPN tunnel.

Issue: RDP sessions were observed to drop during a specific time window (a roughly one-hour period at transition time, i.e., when shifts change), and users could not establish RDP sessions reliably, which ultimately hinders business.

Resolution:

Troubleshooting Steps:
Step 1: Check logs at the firewall. If logging is enabled on the firewall, it should capture whatever is causing the issue. I checked the system logs and found nothing related to the VPN; only sshd entries were present, which appear because the SSH port is open in our scenario.


Jul 15 19:59:28  JUN-FW1-Cluster sshd[18352]: Received disconnect from 192.168.90.144: 11: disconnected by user
Jul 15 20:07:02  JUN-FW1-Cluster checklogin[18436]: warning: can't get client address: Bad file descriptor
Jul 15 20:07:02  JUN-FW1-Cluster checklogin[18436]: WEB_AUTH_FAIL: Unable to authenticate httpd client (username root)
Jul 15 20:07:29  JUN-FW1-Cluster checklogin[18440]: warning: can't get client address: Bad file descriptor
Jul 15 20:07:29  JUN-FW1-Cluster checklogin[18440]: WEB_AUTH_FAIL: Unable to authenticate httpd client (username root)

Only SSH requests to the firewall were found; nothing else appeared in the logs.
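For reference, those entries can be pulled straight from the Junos CLI (messages is the default system log file):

root@JUN-FW1-Cluster> show log messages | match "sshd|WEB_AUTH_FAIL"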

Step 2: Check KMD logs for SA-related VPN issues

Reviewed the kmd logs for SA-related errors -- nothing that lines up with the issue was found; the entries below all predate it.
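These entries can be pulled on the SRX with, for example:

root@JUN-FW1-Cluster> show log kmd | match KMD_INTERNAL_ERROR

The SA state itself can also be spot-checked with show security ike security-associations and show security ipsec security-associations.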


[Jun 30 04:35:16]KMD_INTERNAL_ERROR: iked_ifstate_eoc_handler: EOC msg received
[Jul  2 09:41:00]KMD_INTERNAL_ERROR: Error:File exists in adding SA config for tunnel id 131074 spi 0
[Jul  2 09:41:00]KMD_INTERNAL_ERROR: Error:File exists in adding SA config for tunnel id 131075 spi 0
[Jul  2 09:41:00]KMD_INTERNAL_ERROR: Error:File exists in adding SA config for tunnel id 131073 spi 0
[Jul  2 09:41:00]KMD_INTERNAL_ERROR: Error:File exists in adding SA config for tunnel id 2 spi 0
[Jul  2 09:41:00]KMD_INTERNAL_ERROR: Error:File exists in adding SA config for tunnel id 3 spi 0
[Jul  2 09:41:00]KMD_INTERNAL_ERROR: iked_ifstate_eoc_handler: EOC msg received
[Jul  9 16:31:38]KMD_INTERNAL_ERROR: iked_ui_event_handler: usp ipc connection for iked show CLI was SHUTDOWN due to error in receiving msg or age out of connection or flowd going down etc. Reconnect to pfe..

Note: Nothing has been generated after a specific date.

Step 3: Check the interface logs at the firewall for the interface status

The interface logs showed that st0.1 was going down intermittently, during the same interval in which the RDP drops occurred.

<30>1 2015-07-15T20:20:09.553Z JUN-FW1-Cluster mib2d 16243 SNMP_TRAP_LINK_UP [junos@2636.1.1.1.2.40 snmp-interface-index="592" admin-status="up(1)" operational-status="up(1)" interface-name="st0.1"]
<28>1 2015-07-15T20:23:44.882Z JUN-FW1-Cluster mib2d 16243 SNMP_TRAP_LINK_DOWN [junos@2636.1.1.1.2.40 snmp-interface-index="592" admin-status="up(1)" operational-status="down(2)" interface-name="st0.1"]
<30>1 2015-07-15T20:25:09.620Z JUN-FW1-Cluster mib2d 16243 SNMP_TRAP_LINK_UP [junos@2636.1.1.1.2.40 snmp-interface-index="592" admin-status="up(1)" operational-status="up(1)" interface-name="st0.1"]
<28>1 2015-07-15T20:27:05.065Z JUN-FW1-Cluster mib2d 16243 - - SNMP_TRAP_LINK_DOWN: ifIndex 592, ifAdminStatus up(1), ifOperStatus down(2), ifName st0.1
<30>1 2015-07-15T20:27:09.610Z JUN-FW1-Cluster mib2d 16243 SNMP_TRAP_LINK_UP [junos@2636.1.1.1.2.40 snmp-interface-index="592" admin-status="up(1)" operational-status="up(1)" interface-name="st0.1"]
<28>1 2015-07-15T20:28:45.157Z JUN-FW1-Cluster mib2d 16243 - - SNMP_TRAP_LINK_DOWN: ifIndex 592, ifAdminStatus up(1), ifOperStatus down(2), ifName st0.1
<30>1 2015-07-15T20:29:19.596Z JUN-FW1-Cluster mib2d 16243 SNMP_TRAP_LINK_UP [junos@2636.1.1.1.2.40 snmp-interface-index="592" admin-status="up(1)" operational-status="up(1)" interface-name="st0.1"]
<28>1 2015-07-15T20:30:15.248Z JUN-FW1-Cluster mib2d 16243 - - SNMP_TRAP_LINK_DOWN: ifIndex 592, ifAdminStatus up(1), ifOperStatus down(2), ifName st0.1
<30>1 2015-07-15T20:31:19.592Z JUN-FW1-Cluster mib2d 16243 SNMP_TRAP_LINK_UP [junos@2636.1.1.1.2.40 snmp-interface-index="592" admin-status="up(1)" operational-status="up(1)" interface-name="st0.1"]
<28>1 2015-07-15T20:40:25.698Z JUN-FW1-Cluster mib2d 16243 SNMP_TRAP_LINK_DOWN [junos@2636.1.1.1.2.40 snmp-interface-index="592" admin-status="up(1)" operational-status="down(2)" interface-name="st0.1"]
<30>1 2015-07-15T20:41:09.633Z JUN-FW1-Cluster mib2d 16243 SNMP_TRAP_LINK_UP [junos@2636.1.1.1.2.40 snmp-interface-index="592" admin-status="up(1)" operational-status="up(1)" interface-name="st0.1"]

NOTE: Interface st0.1 was flapping up and down during the issue time window, and the affected VPN tunnel is bound to st0.1, the same interface over which the users' RDP sessions run. The interface flaps therefore line up with the RDP disconnects.
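To confirm the flaps from the CLI, the tunnel interface state and its link-transition history can be checked with, for example:

root@JUN-FW1-Cluster> show interfaces st0.1 terse
root@JUN-FW1-Cluster> show log messages | match "SNMP_TRAP_LINK.*st0.1"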

Step 4: Check the bind-interface configuration -- it appears to be correct.


vpn new_office_srx {
    bind-interface st0.1;
    ike {
        gateway new_office_srx;
        ipsec-policy new_office;
    }
    establish-tunnels immediately;
}
Note: The tunnel configuration is also correct.
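For reference, the same binding expressed as set-style commands would look like this:

set security ipsec vpn new_office_srx bind-interface st0.1
set security ipsec vpn new_office_srx ike gateway new_office_srx
set security ipsec vpn new_office_srx ike ipsec-policy new_office
set security ipsec vpn new_office_srx establish-tunnels immediately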

Step 5: Check the VPN tunnel IPsec statistics

The tunnel below has encountered some replay errors. Replay errors can indicate that packets are arriving outside the anti-replay window, for example because an unwanted party is intercepting and replaying or modifying ESP packets, so the anti-replay feature should be enabled.
root@JUN-SRX650-357-FW1-Cluster> show security ipsec statistics
node0:
--------------------------------------------------------------------------
ESP Statistics:
  Encrypted bytes:       1387423576
  Decrypted bytes:       3194860299
  Encrypted packets:     1813841462
  Decrypted packets:     2346964747
AH Statistics:
  Input bytes:                    0
  Output bytes:                   0
  Input packets:                  0
  Output packets:                 0
Errors:
  AH authentication failures: 0, Replay errors: 176
  ESP authentication failures: 0, ESP decryption failures: 0
  Bad headers: 0, Bad trailers: 0
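To verify whether the replay errors are still incrementing during the issue window, the counters can be cleared and re-checked; a minimal sketch:

root@JUN-FW1-Cluster> clear security ipsec statistics
root@JUN-FW1-Cluster> show security ipsec statistics | match Replay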
Step 6: Check the TCP-MSS value; it should be set to 1350 for IPsec VPN traffic. Currently no value is defined:
root@JUN-SRX650-357-FW1-Cluster# edit security flow
{primary:node0}[edit security flow]
root@JUN-FW1-Cluster# show
traceoptions {
    file DebugTraffic;
    flag basic-datapath;
    packet-filter f1 {
        destination-prefix 192.168.90.225/32;
    }
    packet-filter f2 {
        source-prefix 192.168.90.225/32;
    }
}
tcp-mss {
    ipsec-vpn;
}
tcp-session {
    no-syn-check;
    no-syn-check-in-tunnel;
    no-sequence-check;
    tcp-initial-timeout 60;
}
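To apply the recommended value, the set-style commands would be along these lines (1350 leaves headroom for IPsec overhead on a 1500-byte path MTU):

{primary:node0}[edit]
root@JUN-FW1-Cluster# set security flow tcp-mss ipsec-vpn mss 1350
root@JUN-FW1-Cluster# commit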

Step 7: Check the dead peer detection (DPD) status on the firewalls at both ends.

gateway new_office_srx {
    ike-policy new_office;
    address 216.214.181.210;
    dead-peer-detection {
        always-send;
        interval 10;
        threshold 5;
    }
    external-interface reth1.0;
}
Dead peer detection operates at IKE Phase 1: it detects peers that have gone dead somewhere along the ISP path. The DPD probes travel over that same path, so when the path degrades they add to the latency of VPN packet transmission from one end to the far end.


Step 8: Enable trace options to examine the firewall's packet transmission on the affected tunnel, and troubleshoot the issue again during the same time window. Also, since this is the shift-transition time, user traffic increases and the load on the server rises as well; if the server distributes load on a priority basis, the issue might hit a particular VPN tunnel, which needs to be cross-checked at the server end. It is a possibility worth ruling out.
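The flow traceoptions shown under Step 6 already capture traffic to and from 192.168.90.225 into the DebugTraffic file, so the next occurrence can be reviewed with:

root@JUN-FW1-Cluster> show log DebugTraffic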

Step 9: Enable IKE debug traceoptions and check the DPD status in the kmd logs:
request security ike debug-enable local 14.14.6.42 remote 26.24.11.20

[Jul 17 21:58:42][14.14.6.42 <-> 26.24.11.20]  ike_encode_packet: Encrypting packet
[Jul 17 21:58:42][14.14.6.42 <-> 26.24.11.20]  ike_encode_packet: Final length = 92
[Jul 17 21:58:42][14.14.6.42 <-> 26.24.11.20]  ssh_ike_connect_notify: Sending notification to (null):500
[Jul 17 21:58:42][14.14.6.42 <-> 26.24.11.20]  ike_send_packet: Start, send SA = { 63484a11 71051fc4 - 624f7383 7f422822}, nego = 0, dst = 26.24.11.20:500,  routing table id = 0
[Jul 17 21:58:42][14.14.6.42 <-> 26.24.11.20]  ike_delete_negotiation: Start, SA = { 63484a11 71051fc4 - 624f7383 7f422822}, nego = 0
[Jul 17 21:58:42][14.14.6.42 <-> 26.24.11.20]  ike_free_negotiation_info: Start, nego = 0
[Jul 17 21:58:42][14.14.6.42 <-> 26.24.11.20]  ike_free_negotiation: Start, nego = 0
[Jul 17 21:58:42][14.14.6.42 <-> 26.24.11.20]  iked_pm_ike_info_done_callback: P1 SA 714369 (ref 2). pending req? 0, status: Error ok
[Jul 17 21:58:42][14.14.6.42 <-> 26.24.11.20]  ikev2_fallback_negotiation_free: Freeing fallback negotiation df3800
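Once enough has been captured, the IKE debug should be switched off again so it does not keep filling the logs:

root@JUN-SRX650-357-FW1-Cluster> request security ike debug-disable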

DPD is a method used by devices to verify the current existence and availability of IPsec peer devices. A device performs this verification by sending encrypted IKE Phase 1 notification payloads (R-U-THERE) to peers and waits for DPD acknowledgements (R-U-THERE-ACK).
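With the values configured in Step 7 (interval 10, threshold 5), the worst-case detection time works out roughly as follows; the exact retransmission behavior can vary by Junos release:

    interval 10;   /* send an R-U-THERE probe every 10 seconds */
    threshold 5;   /* declare the peer dead after 5 consecutive missed ACKs */

So DPD acknowledgements only need to be delayed for roughly 10 x 5 = 50 seconds of consecutive probes before the peer is declared dead and the SA is torn down.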


Due to dead peer detection, SA negotiations were being torn down with errors: the DPD probes were timing out and DPD failure detection was triggering because the R-U-THERE-ACK was not coming back from the peer in time. This caused the VPN tunnel to fluctuate for very short periods -- short enough that the reported tunnel status often did not even change.

The issue occurred only during peak hours, when a lot of new traffic was being initiated and new users were logging in to the remote server; the increased latency caused DPD to miss its acknowledgements, which ultimately dropped packets on the tunnel.

So, after disabling dead-peer-detection in the configuration, the issue was resolved, and Remote Desktop connections have remained stable during the same time window.
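For reference, removing DPD from the gateway shown in Step 7 looks like this (assuming the same gateway name):

{primary:node0}[edit]
root@JUN-FW1-Cluster# delete security ike gateway new_office_srx dead-peer-detection
root@JUN-FW1-Cluster# commit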


