Why VPN monitoring? Let's imagine a company that has a network of branch offices in most of the largest cities of the country. The offices are interconnected using both public networks (i.e. the Internet) and dedicated operator networks (i.e. MPLS). To provide secure communications, the company runs VPN tunnels on top of both types of links, and these tunnels carry the actual data. To make the connectivity more reliable, the company also runs multiple VPNs from each site: one main (over MPLS) and two backup (over the Internet); in case of a link failure, re-routing is handled by OSPF. As our experience shows, network failures and service degradations may occur even in a very reliable network such as this one. When the services are RDP- and VoIP-based (i.e. the remote office doesn't have any servers or PBXes of its own), a network failure brings the office to a dead stop.

In general, real-time services (RDP, VoIP, videoconferencing, etc.) are affected not only by link or network failures: packet loss and jitter are the main contributors to service degradation.
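
As an illustration only (this is not the code our monitoring uses), here is a minimal Python sketch of how both metrics could be estimated from a series of ping-style RTT samples; the sample data is made up:

```python
from statistics import mean

def loss_and_jitter(rtts_ms):
    """Estimate packet loss and jitter from a list of RTT samples.

    Each element is an RTT in milliseconds, or None for a probe that got no reply.
    Jitter is taken as the mean absolute difference between consecutive RTTs
    (a simplification of the RFC 3550 interarrival jitter estimator).
    """
    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    diffs = [abs(b - a) for a, b in zip(answered, answered[1:])]
    jitter_ms = mean(diffs) if diffs else 0.0
    return loss_pct, jitter_ms

# Made-up probe series: 10 probes, one lost, RTTs in milliseconds.
samples = [42, 45, 41, None, 80, 43, 44, 120, 46, 42]
print(loss_and_jitter(samples))   # (10.0, 29.75)
```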

As a result, the tasks of VPN monitoring can be defined as follows: notify the support operator (via alerts) and perform fast preliminary diagnostics of the problem. The problem is then forwarded to the network support group (together with the symptoms and the preliminary diagnostics) for further action: double-checking the symptoms, running deeper diagnostics, and then either contacting the network provider or solving the problem locally. With backup channels at hand and proper reaction times, this minimizes the probability of a complete disconnect of the remote office. The procedure also allows for QoS monitoring, providing the personnel at the remote locations with a comfortable level of service.

Let's consider a real-life example: a company with 20 regional offices, each connected to the central location by 3 links, about 60 channels altogether. On the 17th of March we started to monitor and log network failures lasting more than 30 minutes. During the remaining 15 days of March we logged 40 network faults; in 6 cases the network operators were contacted, and in 3 cases the users noticed the performance degradation. In April we logged 51 faults; in 22 cases the network operators were contacted, and in only 2 cases did the users notice the faults. During 20 days of May 16 faults were logged; the network operators were contacted in 8 cases, and there were no user-affecting cases. The next diagram shows the average number of faults per day for each month:

graph
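
The per-day averages behind the diagram follow directly from the figures quoted above; a quick sketch (assuming 15 observed days in March, the full 30 days in April, and 20 days in May):

```python
# Faults per observed day, computed from the figures quoted above.
months = {"March": (40, 15), "April": (51, 30), "May": (16, 20)}
for month, (faults, days) in months.items():
    print(f"{month}: {faults / days:.1f} faults/day")
# March: 2.7 faults/day, April: 1.7 faults/day, May: 0.8 faults/day
```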

As you can see, the overall number of faults has decreased since we started the monitoring and logging process. The difference between the overall number of faults and the number of faults that required the network operators to be contacted cannot be explained by the 24/7 monitoring process itself; rather, it looks like the network operators don't want to lose their customers once they realize that the QoS is being monitored :) On the other hand, it's just a theory.

How does it work? The operator gets the alerts via SMS and e-mail and also sees them in the SCOM console:

sms

email

alert_console

The time between the failure and the generated alert is configurable; in our case it is no longer than 4 minutes. Such a long delay between the event and the alert is used to avoid alarm "noise", as the number of short-term network outages is quite high.
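
SCOM implements this delay internally; purely as an illustration of the idea, here is a minimal sketch of such alert "dampening", where the probe interval and the probe/alert callbacks are assumptions:

```python
import time

ALERT_DELAY = 4 * 60      # seconds a failure must persist before alerting (from the text)
PROBE_INTERVAL = 30       # seconds between probes (assumed)

def watch(channel, probe, send_alert):
    """Raise a single alert only after `channel` has been down for ALERT_DELAY.

    `probe(channel)` should return True if the channel currently answers;
    `send_alert(text)` delivers the SMS/e-mail. Both are supplied by the caller.
    """
    down_since = None
    alerted = False
    while True:
        now = time.time()
        if probe(channel):
            down_since, alerted = None, False   # short outages never reach the operator
        else:
            down_since = down_since if down_since is not None else now
            if not alerted and now - down_since >= ALERT_DELAY:
                send_alert(f"{channel}: down for more than {ALERT_DELAY // 60} minutes")
                alerted = True
        time.sleep(PROBE_INTERVAL)
```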

As the alert contains the name of the monitored channel, the operator has to select the corresponding site in the SCOM console:

regions

Then a window with the graphs of the most critical parameters is opened:

region_view

The selected parameters and the order of grouping are the result of a long testing process. This particular grouping allows for the quickest possible "one-click" diagnosis of the problem.

Let's consider the screenshot in more detail:

  • Graphs 1 and 2 show the state of the router (CPU load, memory consumption, etc.); a quick glance tells whether the problem is caused by the router itself or by something else.
  • Graph 3 shows the RTT of the integral channel (i.e. the main and both backup channels together) between the remote location and the central office, as "seen" by the end user.
  • On graph 4, the positive values on the Y-axis show the number of lost packets over all of the channels of that region, while the negative Y values show the current channel state; in this particular case it is "6", which means "channel failure". Different types of channels are color-coded: the integral channel is red, MPLS is blue, and the two backup channels are green and yellow (see the sketch after this list).
  • Graph 5 shows the throughput of the VPNs; you can see the traffic migrating from the main MPLS link to the backup public VPN.
  • Finally, graph 6 shows the RTT of each of the channels to the site, except the integral one.
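
To make graph 4's encoding concrete, here is a minimal sketch of how a single plotted value could carry both pieces of information; the probe-result structure is an assumption, and only the state code 6 ("channel failure") is taken from the description above:

```python
from dataclasses import dataclass

# State code for the negative Y range of graph 4; 6 means "channel failure"
# (taken from the text), other codes are not shown here.
CHANNEL_FAILURE = 6

@dataclass
class ProbeResult:
    channel: str        # "integral", "mpls", "backup1", "backup2"
    reachable: bool     # did the probe get any replies at all?
    lost_packets: int   # packets lost during the probe interval

def plotted_value(result: ProbeResult) -> int:
    """Map a probe result to the value drawn on graph 4.

    Positive values: number of lost packets while the channel is up.
    Negative values: the channel state code (e.g. -6 means "channel failure").
    """
    if not result.reachable:
        return -CHANNEL_FAILURE
    return result.lost_packets

# Example: a failed MPLS channel is drawn at -6, a lossy backup channel at +3.
print(plotted_value(ProbeResult("mpls", False, 0)))     # -6
print(plotted_value(ProbeResult("backup1", True, 3)))   # 3
```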

Based on the priority (a failure of the main MPLS link and a failure of one of the backup VPNs obviously have different priorities), the operator waits a predefined amount of time in case the problem repairs itself and then forwards the problem to the network department. Failures that were observed for more than 30 minutes are logged and included in the monthly report together with the SCOM statistics:

stat

And the RND case wasn't a long one: it was forwarded to the network department, but before the engineer even had time to log into the router, the problem disappeared and the case was closed. Fortunately, this is what happens in most cases:

sms_closed

email_closed

scom_alert_closed

As far as packet loss is concerned, the mechanism is essentially the same, and of course everything can be tuned, tweaked, redefined, and customized.

P.S. And for the grand finale: the most interesting screenshots from real-life situations. They illustrate how our ISPs sometimes "provide". No comment:

igevsk_20022009

perm_100409_public_integral

And here somebody started to download a movie…

eburg_250309