This post is about the Cisco monitoring management pack designed by the Metrex Engineering team.

I’d like to start with how we got here, i.e. why we started writing our own Cisco management pack.

Some time ago we tried Zabbix, Nagios, OpManager and other systems for monitoring, but I’ll skip that part since we no longer use them.

SCOM had been installed for a while and was working well for Windows-based servers. Then we got the predictable idea of having a single monitoring system for servers, network equipment and UPSs, and we started digging into how to monitor Cisco devices with SCOM. In the catalogue on the Microsoft web site we found two third-party packages (designed by Microsoft partners and sold separately), from Quest Software and Jalasoft. I installed both and showed them to our company’s network department. In the end we agreed that they did not suit our needs, and rather than adjusting to an existing solution we decided to write our own management pack.

Let me describe in a few words the drawbacks of the third-party MPs we tested.

First of all, they work separately from SCOM. The installer sets up an executable on a server (sometimes a standalone one); this program polls the Cisco devices, collects the data and then exports it to SCOM for visualization. This means that all settings have to be configured outside SCOM, in the third-party software’s own interface, and that interface is far from intuitive.

It seems the MPs were designed by people who have never worked as network administrators. The packages collect a lot of data, but not the data that is really needed.

The core and nicest feature of SCOM, notifications, is poorly supported by these MPs. When notifications are properly configured natively in SCOM, there is no need to check the console regularly: you get every notification that matters by email or by SMS on a mobile phone. It is up to you; our network guys receive critical notifications by SMS and non-critical ones by email.

Customizing. The way an administrator runs a network depends heavily on the network quality and on the demands placed on the network. Leaving the default MP settings untouched leads either to an unacceptably large number of notifications or, the opposite, to one notification per month even though network users are experiencing quality issues. Of course every administrator working with an MP can change some thresholds, but in a real environment that is not always enough, and adjustments have to be made at a lower, more generic level.

Distributing the MPs. Monitoring the network of a real company with several branches is a complicated task (sometimes a real challenge), and the idea of deploying a monitoring system by clicking NEXT-NEXT-NEXT in an installation wizard looks unrealistic. A sensible approach combines an MP with an implementation methodology and experience, so it seems reasonable to distribute the MP as part of a monitoring solution, i.e. an implementation and customization project. However, MPs are usually sold for “out of the box” use.

Well, let’s get back to our management pack overview.

First of all we introduced three classes of Cisco devices that are not quite equal: Routers, Switches and Firewalls. From the monitoring point of view the most sophisticated class is Routers, the second is Switches and the last is Firewalls.

There is no need to dig deeply into the MP features here, so I’ll show the MP screenshots that seem interesting to me and add a few tables with details for those who might need them.

 

1. State view.

 

The following view is the most suitable for glancing over the current state of the monitored Cisco devices.

The overall list.

 

2. Diagrams.

 

To see the monitored components of a particular Cisco device and their current state, we defined this view. The following picture shows the monitored components of a Cisco 1841 router:

 

The poor thing has only one cooling fan, and even that one is broken.

 

Higher-class routers have better monitoring capabilities. Here is a diagram screenshot of a Cisco 2851 router. The entire diagram doesn’t fit well on a 1280×1024 screen, but it looks quite acceptable on my 19” monitor.

Diagram view for Cisco 2851

 

In this picture you can see the list of the most CPU-consuming processes.

List of the most CPU-consuming processes.

As you might notice, the router pays a price for being monitored: SNMP-related processes are at the top of the process list.

 

And finally the brilliant feature set: IP SLA monitoring. This set contains several monitors, so SCOM splits the instances into groups according to their current state.

As you can see, the data channel to Samara clearly has some issues.

3. Graphs.

It is great to be able to glance over the current state of the whole list of Cisco devices, but sometimes you need details. Not everything belongs on a graph: I doubt, for example, that the state of a power supply should be plotted. Many other figures, however, can (and should) be viewed as graphs.

The classic example: free memory %.

Temperature graphs of several Cisco routers.

Traffic on one of the interfaces (in, out and total).

For any object you can get many different graphs. As an example, there are 13 different graphs for UDP-Jitter IP SLA.

Lost packets %.

RTTMax and RTTMin.

If you would like to see the complete list of available graphs, take a look at the tables below. I’m not going to show all of them as pictures, just a few.

 

4. Alerts.

 

Every change in the state of an object opens or closes an alert. To see the currently open alerts (or the list of alerts raised within the last N days), open the following view.

Currently open alerts.

 

5. Notifications.

 

It is certainly a must to get notifications about important events by email or by SMS on a mobile phone. This feature is available, and the notifications seem informative enough.

 

High CPU load notification.

The notification shows that the CPU load was above the threshold (>75%) and that the actual value was 81%. It lists the top processes (yes, SNMP Engine) and also the share consumed by interrupts (84%), so it is easy to conclude what the source of the high CPU load is. The arithmetic you may have attempted looks inexact and a little strange: 2% + 84% = 81%. That is because Cisco always returns only an instantaneous value for the “Load Due to Interrupts” request (i.e. I/O operations).
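The threshold check behind such a notification can be sketched roughly as follows. This is only a Python illustration, not the management pack's actual code (which lives inside SCOM); the names `CpuSample` and `cpu_alert` are hypothetical, and the process figures in the usage note are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CpuSample:
    avg_1min: float        # 1-minute average CPU load, percent
    interrupt_now: float   # instantaneous "due to interrupts" load, percent
    top_processes: List[Tuple[str, float]]  # (process name, CPU percent)


def cpu_alert(sample: CpuSample, threshold: float = 75.0,
              top_n: int = 3) -> Optional[str]:
    """Return a notification text if the 1-minute average exceeds the threshold.

    Hypothetical helper: the averaged load and the instantaneous interrupt
    load are reported side by side and are never added together, because
    they are measured over different time spans.
    """
    if sample.avg_1min <= threshold:
        return None
    # Sort processes by CPU share, highest first, and keep the top few.
    top = sorted(sample.top_processes, key=lambda p: p[1], reverse=True)[:top_n]
    lines = [
        f"CPU load {sample.avg_1min:.0f}% is above the threshold ({threshold:.0f}%)",
        f"Load due to interrupts (instantaneous): {sample.interrupt_now:.0f}%",
    ]
    lines += [f"  {name}: {pct:.0f}%" for name, pct in top]
    return "\n".join(lines)
```

Called with `CpuSample(81.0, 84.0, [("SNMP ENGINE", 2.0)])`, the function returns a message that reports the 81% average and the 84% interrupt load as separate lines, which is why the two figures need not add up.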

 

Noticeable packet loss: the data channel shows reduced performance.

Here we have significant packet loss on a data channel. Over the 10 samples used for link quality evaluation, 13% of the packets were lost. Both the threshold value of 12% and the number of samples can be redefined.
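The evaluation over a window of samples might look roughly like the Python sketch below. It assumes that each sample is the packet loss percentage of one IP SLA estimation and that the samples are simply averaged; how the management pack actually aggregates them is not stated here, and the class name `LossMonitor` is hypothetical.

```python
from collections import deque
from typing import Deque


class LossMonitor:
    """Keeps the last `window` loss samples and compares their mean to a threshold."""

    def __init__(self, window: int = 10, threshold_pct: float = 12.0) -> None:
        self.window = window
        self.threshold_pct = threshold_pct
        # deque(maxlen=...) silently drops the oldest sample when full.
        self._samples: Deque[float] = deque(maxlen=window)

    def add(self, loss_pct: float) -> None:
        """Record the loss percentage of one estimation."""
        self._samples.append(loss_pct)

    def average_loss(self) -> float:
        """Mean loss over the stored samples (0.0 if there are none yet)."""
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)

    def is_degraded(self) -> bool:
        """True only when the window is full and the mean exceeds the threshold."""
        return (len(self._samples) == self.window
                and self.average_loss() > self.threshold_pct)
```

Waiting for a full window before alerting is a design choice of this sketch: it keeps a single bad sample right after start-up from raising an alert on its own.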

 

The channel to Novosibirsk is down.

We consider a channel down if the estimation could not be made (because of packet loss) 2 times in a row. Of course, the thresholds and values can be redefined.
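The "2 times in a row" rule amounts to a consecutive-failure counter. A minimal Python sketch, with the hypothetical name `ChannelState` and the state labels "UP" and "DOWN" chosen purely for illustration:

```python
class ChannelState:
    """Marks a channel DOWN after `max_failures` failed estimations in a row."""

    def __init__(self, max_failures: int = 2) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def record(self, estimation_ok: bool) -> str:
        """Feed the result of one estimation and return the resulting state."""
        if estimation_ok:
            # Any successful estimation resets the counter.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            return "DOWN"
        return "UP"
```

With the default of 2, a single failed estimation between two successful ones never changes the state, which is what keeps short outages from producing alerts.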

 

6. Health Explorer.

 

To find out why a Cisco device turned red in the state list, or why one of its modules is shining yellow on a diagram, we can dig into the email notifications sent earlier or look at the alerts list. However, sometimes it is more useful to open the Health Explorer.

State model.

As you can see, the Cisco device is yellow because its fan is now in the “shutdown” state (that is a Cisco term). Moreover, this fan is not totally broken, since it returns to the “OK” state from time to time.

 

7. Syslog.

 

And of course we couldn’t leave Syslog monitoring out. The events that are most important from our point of view raise alerts and email notifications:

A user has logged in via SSH.

8. Customizing.

Tracking the state of a data channel by looking through the object list is not always convenient. Sometimes it is better to build a customized view for specific needs.

How are the Cisco devices responsible for the channel to Yekaterinburg doing? Graphs related to a regional site.

9. Reports.

If you need a report on the reliability and availability of inter-site data channels (on a quarterly, hourly or any other basis), you might like the following one:

Channel state report. The red part means the channel was down, the yellow part means there was noticeable packet loss, and the green part means the channel was fine.
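The proportions shown in such a report can be derived from the time a channel spent in each state. The Python sketch below is only an illustration of that arithmetic; SCOM produces the real report, and the function name `state_shares` and the input format are assumptions.

```python
from typing import Dict, Iterable, Tuple


def state_shares(intervals: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Return the percentage of time spent in each state.

    `intervals` is a sequence of (state, duration_in_seconds) pairs, where
    state is "green" (channel fine), "yellow" (noticeable packet loss)
    or "red" (channel down).
    """
    totals = {"green": 0.0, "yellow": 0.0, "red": 0.0}
    for state, seconds in intervals:
        totals[state] += seconds
    whole = sum(totals.values())
    if whole == 0:
        # No data for the period: report zeros rather than dividing by zero.
        return totals
    return {state: 100.0 * spent / whole for state, spent in totals.items()}
```

For example, a channel that was fine for 90 minutes, lossy for 6 and down for 4 yields 90% green, 6% yellow and 4% red.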

The Tables.

Routers

Component | Graphs | Alerts/Notifications
Memory | Free memory %, Used memory %, Free memory in MBytes, Used memory in MBytes | Free memory % is getting lower than threshold
CPU | Average 30 seconds CPU load, average 1 minute CPU load, average 5 minutes CPU load, momentary values “Due to interrupts” | Average 1 minute CPU load exceeds the threshold value
Process | Average 30 seconds or 1 minute or 5 minute CPU usage by the process | Not available
Fan | Not available | Fan state is not OK
Temperature | Temperature, °C | Temperature exceeds the threshold predefined on Cisco device (not in SCOM)
Network Interface | Network traffic through the interface (in, out, total) | Not available
Power supply | Not available | Power supply state is not OK
IP SLA Jitter | rttMonLatestJitterOperRTTMin, rttMonLatestJitterOperRTTMax, rttMonLatestJitterOperPacketOutOfSequence, rttMonLatestJitterOperSense, rttMonLatestJitterOperAvgDSJ, rttMonLatestJitterOperAvgJitter, rttMonLatestJitterOperAvgSDJ, rttMonLatestJitterOperPacketLateArrival, rttMonLatestJitterOperPacketLossDS, rttMonLatestJitterOperPacketLossSD, rttMonLatestJitterOperPercentOfLostPackets | Number of lost packets during N estimations exceeds the threshold; number of consecutive unsuccessful estimations exceeds a threshold
IP SLA Echo | rttMonLatestRttOperCompletionTime, rttMonLatestRttOperSense | Not available
IP SLA UDP Echo | rttMonLatestRttOperCompletionTime, rttMonLatestRttOperSense | Not available
IP SLA PathEcho | rttMonLatestRttOperCompletionTime, rttMonLatestRttOperSense | Not available

 

Switches

Component | Graphs | Alerts/Notifications
Memory | Free memory %, Used memory %, Free memory in MBytes, Used memory in MBytes | Free memory % is getting lower than threshold
CPU | Average 30 seconds CPU load, average 1 minute CPU load, average 5 minutes CPU load, momentary values “Due to interrupts” | Average 1 minute CPU load exceeds the threshold value
Process | Average 30 seconds or 1 minute or 5 minute CPU usage by the process | Not available
Fan | Not available | Fan state is not OK
Temperature | Temperature, °C | Temperature exceeds the threshold predefined on Cisco device (not in SCOM)
Network Interface | Network traffic through the interface (in, out, total) | Not available
Power Unit | Not available | Power supply state is not OK

 

Firewalls

Component | Graphs | Alerts/Notifications
Memory | Free memory %, Used memory %, Free memory in MBytes, Used memory in MBytes | Free memory % is getting lower than threshold
CPU | Average 30 seconds CPU load, average 1 minute CPU load, average 5 minutes CPU load, momentary values “Due to interrupts” | Average 1 minute CPU load exceeds the threshold value

 

We have been working on the management pack for about a year. I like it, and it satisfies our network department. Essentially we managed to get rid of the disadvantages we found in the third-party MPs. The benefits of the package we created boil down to the following:

Our Cisco management pack doesn’t install additional standalone software and doesn’t need any to work. It is a typical management pack like any other, for example the MP for MS SQL written by Microsoft. It has to be properly imported into SCOM, and then you discover the Cisco devices like ordinary network devices. The management pack recognizes Cisco equipment, assigns the appropriate classes and discovers the embedded modules, and the monitoring show kicks off.

Our network team has been involved in the management pack development from the start and takes an active part in the project. This is why the MP monitors the things they are interested in rather than everything that can theoretically be monitored. I considered implementing various things I saw in the Jalasoft and Quest packs, but decided against it because the end users, the network guys, don’t need them.

Our management pack is designed for unreliable, unstable (“dirty”) data channels, which can cause a huge number of useless sequential alerts and a real headache for the administrator. Our Scandinavian and UK clients typically have stable, high-capacity channels within their countries. However, once we step away from MPLS networks and start using Internet VPN tunnels to join a company’s sites in different countries, we enter the world we had in mind when creating the pack: short outages of just a few seconds and a small percentage of packets lost all the time. We have been testing the package on Russian networks, so you can be sure that even when installed as is, without fine tuning, it won’t overwhelm you with a message storm; on the contrary, the notifications you get will be meaningful. It is still possible, though, to redefine the thresholds so that you are notified about every tiny event with dozens of event-related SMSs.

We keep extending the management pack functionality and expect to introduce the following features:

Package-specific reports. Currently we use the standard SCOM reports, but sometimes that is not enough.

Network diagram visualization. The idea is quite simple: you draw the network diagram in MS Visio and “connect” it to SCOM, and the diagram starts blinking green, yellow and red.

If you are interested in implementing this MP in your company and would like to evaluate it, please send an email request to scom-team@metrex.ru to get the trial version.