This article describes our management pack for System Center Operations Manager 2007 (SCOM), which monitors APC UPS devices.

Why did we start writing this management pack? Power supply monitoring becomes more urgent when you have to maintain distributed server rooms. In such an environment, monitoring a single UPS in one room with the native APC application tells you nothing about the behavior of the others.

So let’s take a look at existing UPS monitoring solutions. Since we had already deployed SCOM in our company, we did not look at solutions designed for other (non-SCOM) monitoring systems. Some of our UPSs have an embedded network management card; others do not and are connected to servers via USB or a serial cable. Of course, we wanted a universal monitoring tool that works regardless of the connection type. These are the solutions we tested:

  • Configure each UPS with its embedded software (if the UPS is intelligent enough) to send messages to a predefined list of recipients. It’s the simplest solution, but it has a lot of disadvantages:

1)      each UPS must have access to the email server;

2)      the UPS keeps generating the same notification at short intervals until you resolve the issue. For example, if the temperature exceeds the predefined value for about half an hour, you’ll get a batch of identical messages. However, when the temperature returns to normal you will NOT get the corresponding (closing) notification, as the embedded toolset has no such feature. This is critical from the monitoring point of view: if something happened in a telecom room during the night, the operator getting back to the desk wants to focus on the events that are still active;

3)      notifications (i.e. types of events, list of recipients and email server) have to be configured on each UPS separately. You also cannot customize the notification content, which is another issue. For instance, if the temperature exceeds the threshold, you’ll get a message saying that it’s above the threshold but without the actual temperature value;

4)      the embedded notification system is not flexible enough. You cannot configure it to send the calibration test outcome (which is meaningful only to the operator) to one list of recipients and the power failure notifications (which system and network administrators want to see) to another;

5)      it works only for those UPSs that are directly connected to the network and have an embedded monitoring card.

  • Use the native APC solution, PowerChute Business Edition. The Basic version looks interesting, as it can monitor and control up to five UPSs simultaneously. However, all the disadvantages of the embedded toolset (batches of messages, no closing notifications, inflexible alerting) remain. The major advantage of PowerChute is its ability to monitor UPSs connected via USB and RS232.
  • Power Management Pack for Operations Manager by Quest Software. In fact, this software works independently of SCOM and uses a connector to Operations Manager. Because of this, all settings must be made via a special utility and not via SCOM. In addition to having to learn third-party software, you lose the ability to use native SCOM tools like PowerShell. The installation process also seems complicated.
  • Miscellaneous free third-party management packs. All of these MPs rely on SNMP traps to get information. A trap is sent as a UDP packet and can be lost, and traps offer no way to monitor the performance of the UPS.

This is why we decided to write our own UPS monitoring management pack. Currently our MP works with APC UPSs; however, it will soon be able to work with other UPSs that are RFC 1628 compliant.

We had the following requirements for the management pack:

  • Monitor APC UPSs via native SCOM features, without external connectors to third-party products.
  • It must be convenient for the operator, i.e. a familiar interface and the usual state model functionality.
  • Rich customization capabilities: adjustable thresholds, polling intervals, etc. This is really important in a distributed enterprise environment, because each server room might have its own power requirements and you might want to customize monitoring settings for several UPSs in different locations.
  • Support for different connection types: Ethernet, COM and USB.
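
As a rough illustration of the customization requirement, the sketch below applies per-UPS overrides on top of default monitor settings. It is written in Python purely for illustration: the setting names, the values, the UPS names and the `effective_settings` helper are all hypothetical and are not taken from the management pack, where such adjustments would be made in SCOM itself.

```python
# Hypothetical default monitor settings; names and values are illustrative only.
DEFAULTS = {
    "polling_interval_s": 300,
    "max_output_load_pct": 80,
    "min_runtime_remaining_min": 15,
}

# Hypothetical per-UPS overrides, keyed by an invented UPS name.
OVERRIDES = {
    "ups-branch-office": {"min_runtime_remaining_min": 30},
    "ups-datacenter-1": {"polling_interval_s": 60, "max_output_load_pct": 70},
}


def effective_settings(ups_name, defaults=DEFAULTS, overrides=OVERRIDES):
    """Return a copy of the defaults with any overrides for this UPS applied."""
    settings = dict(defaults)  # copy, so the shared defaults are never mutated
    settings.update(overrides.get(ups_name, {}))
    return settings
```

A UPS without an entry in `OVERRIDES` simply gets the defaults, so only the rooms with special power requirements need to be listed.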

To give a better understanding of the package features, I’ll comment on the screenshots.

Overall list of devices

This list shows the current state of all monitored devices. The table view can show a large number of objects; at 1024×768 resolution, for example, 30 to 40 UPSs fit on one screen:

List of UPSs monitored by SCOM

Diagrams

Diagrams are a good way to find a problem by moving along the tree of objects. Here is a diagram of all UPSs connected to the monitoring system:

All devices in one diagram

If you’d like to see detailed information about an item, just open its properties window.

Properties of the UPS

The model of a particular UPS, its location (this field must be filled in on the UPS itself), IP address and other useful information are available.

Getting back to the diagram, you might notice that some devices are experiencing problems. To better understand the issue, we can drill down into the object:

A UPS with the environment monitoring card installed

This is a Smart-UPS 5000 with the management/monitoring card. It is now clear that the environmental controls look fine; however, the battery has an alarm.

The diagram view can be considered the main view for the monitoring system operator. All other views (a performance graph, a list of alerts and so on) can be opened from it via the context menu. The operator sees only the information available for the selected object. If you choose the “performance graphs” item for the UPS, you’ll see all graphs available for the UPS. If you do the same for the battery, you’ll see only the battery-related graphs.

Health Explorer

Having found a battery-related problem, we’d like to dig into the details. To do this we open the Health Explorer, or state model. This is what we see for the battery:

Health Explorer gives a list of current problems on the UPS

It is now obvious that the UPS remaining time under the current load is the issue: the remaining time is lower than the threshold. The actual value and the time when the threshold was violated are available. The history of events is available as well, which helps to locate the problem.
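
The state-based behavior described here can be sketched as a small two-state monitor. It raises a single "opened" event when the value drops below the threshold, stays silent while the problem persists, and raises a "closed" event on recovery, recording the value and time of each change. This is a Python illustration only; the `LowValueMonitor` class and the 15-minute threshold are hypothetical and do not come from the management pack or from SCOM.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class LowValueMonitor:
    """Two-state monitor: healthy while value >= threshold, critical below it."""
    threshold: float
    critical_since: Optional[datetime] = None  # when the threshold was violated
    last_value: Optional[float] = None         # most recent sample
    history: List[Tuple[datetime, str, float]] = field(default_factory=list)

    def update(self, value: float, now: datetime) -> Optional[str]:
        """Record a sample; return "opened" or "closed" on a state change, else None."""
        self.last_value = value
        breached = value < self.threshold
        if breached and self.critical_since is None:
            self.critical_since = now
            self.history.append((now, "opened", value))
            return "opened"
        if not breached and self.critical_since is not None:
            self.critical_since = None
            self.history.append((now, "closed", value))
            return "closed"
        return None  # state unchanged, so no duplicate notification


# Illustrative usage: remaining runtime in minutes, sampled every 5 minutes.
monitor = LowValueMonitor(threshold=15.0)
start = datetime(2010, 1, 1, 2, 0)
for i, minutes_left in enumerate([22.0, 12.0, 11.0, 18.0]):
    monitor.update(minutes_left, start + timedelta(minutes=5 * i))
```

Because notifications are tied to state changes rather than to individual samples, half an hour below the threshold produces one opening event and one closing event instead of a batch of identical messages.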

Alerts

The state model is great if you need to see the list of current events. However, to see events that are already closed, the Alerts view is more convenient. This view shows the events within a period for a single UPS or for all UPSs at a glance. Here you can see that within the period we had two events for a single UPS; the first one is closed, the second is still active:

This view is useful for seeing both active and closed events

Graphs

Graphs are another nice troubleshooting tool. Obviously, graphs exist only for values that can be measured. Here you can see the input voltage graph over a period of several hours:

Input Voltage Graph

There are two temperature graphs for a single UPS. The red one is for the internal temperature sensor; the orange one is for the external sensor attached to the front panel of the rack shelf near the UPS. The two graphs track each other, and the internal temperature is naturally higher:

Internal and External temperature graphs

And here is an interesting graph of the UPS remaining time (the purple one) and the current load (the yellow-green one):

On this graph you can see the correlation between the UPS remaining time and the current UPS load

Events

For events that become more important when they happen again and again, like “Incorrect password”, the Event View is more convenient:

Someone has been trying to guess the UPS password, and the monitoring system detected it

It is reasonable to notify the operator if the number of repeated events exceeds a predefined value within a particular period of time.
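
One way to implement such a rule is a sliding time window over event timestamps. The sketch below is a Python illustration only: the `RepeatedEventDetector` class, its parameters and the sample numbers are hypothetical and do not come from the management pack. It assumes timestamps are given in seconds and arrive in non-decreasing order.

```python
from collections import deque


class RepeatedEventDetector:
    """Signal when more than `limit` events occur within `window_s` seconds."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self._times = deque()  # timestamps of events still inside the window

    def record(self, timestamp):
        """Record one event; return True if the count in the window exceeds the limit."""
        self._times.append(timestamp)
        # Discard events older than the window, measured from the newest event.
        while timestamp - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) > self.limit


# Illustrative usage: alert on more than 3 wrong passwords within 60 seconds.
detector = RepeatedEventDetector(limit=3, window_s=60)
results = [detector.record(t) for t in (0, 10, 20, 30)]
```

In this usage only the fourth event crosses the limit, so a single mistyped password does not disturb the operator. The same counting would fit conditions of the form "N wrong password attempts within time m".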

Notifications

It is practically impossible to track all events and changes online via the SCOM console. This is why the email notification system is so convenient:

After numerous wrong attempts to enter the password, the operator gets the corresponding notification

Low UPS remaining runtime notification

Graphs, alerts and notifications

The following graphs are currently available:

  • Input Voltage, V
  • Input Frequency, Hz
  • Output Voltage, V
  • Output Frequency, Hz
  • Output Current, A
  • Output Load, %
  • Battery Capacity, %
  • Battery Current, A
  • Battery Voltage, V
  • Battery Time Remaining, min
  • Battery Temperature, °C
  • Probe Temperature, °C
  • Probe Humidity, %

The following alerts and notifications are currently available:

| Category | Object | Conditions |
|---|---|---|
| Availability | UPS Basic Status | The UPS has changed its status (for example, to “Hardware Failure Bypass”; 12 status values are possible) |
| Availability | UPS DC Fan | Fan outage |
| Availability | UPS has switched to battery backup power | The UPS has switched to battery backup power |
| Availability | UPS Link check | n responses from the UPS were lost within time m |
| Availability | Battery replace indicator | The UPS battery must be replaced |
| Availability | Test Calibration Results | The last UPS calibration returned errors or unknown results |
| Availability | Test Diagnostic Results | The last UPS diagnostic tests returned errors or unknown results |
| Performance | Input Line Voltage | The input line voltage is outside the valid boundaries |
| Performance | Input Line Frequency | The input line frequency is outside the valid boundaries |
| Performance | Output Load | The UPS load is higher than the threshold |
| Performance | Output Voltage | The output line voltage is outside the valid boundaries |
| Performance | Battery Capacity | The battery charge capacity is lower than the threshold |
| Performance | Battery Runtime Remaining | The UPS remaining time is lower than the predefined value |
| Performance | Battery Temperature | The temperature exceeds the threshold |
| Performance | Output contact | Output contact shorted |
| Performance | Input relays | Input relay contacted |
| Performance | Probe temperature | The external temperature is higher than the threshold |
| Security | UPS Password | The UPS access password has been changed |
| Security | UPS HTTP Access | N wrong password attempts within time m via HTTP |
| Security | UPS Console Access | N wrong password attempts within time m via console |

We continue to extend the management pack functionality and expect to make it available for non-APC UPSs that are RFC 1628 compliant. However, the main advantages will remain:

  • Management pack installation is the same as for any other management pack. No special preparation is required, and no external connectors or software are needed. Once installed, the package detects UPSs among the devices discovered by SCOM and detects their embedded components.
  • The management pack supports UPSs connected via Ethernet, USB and COM ports.
  • Alerting is highly customizable, and the management pack can be easily adjusted to particular needs.