VMware Alarms

Summary: Information and background on vSphere alarms as well as how to set them up.
Date: Around 2010
Refactor: 1 May 2025: Checked links and formatting.


With the introduction of vSphere 4.0, the alarm capabilities in VMware have greatly improved. This article describes some of the options that are now available when using alarms. We'll go through some of the default alarms and customize them for our environment to make sure they do what we want. Note that we'll only use email notification in our environment. SNMP traps are also supported, but we don't use them.
Since an alarm definition consists of four tabs, we'll go through them tab by tab:

General
1. Naming, description, etc.
2. Alarm Type & Trigger
  1. Monitor for specific condition
  2. Monitor for specific event
3. Enabling Alarm
Trigger Configuration
Reporting Configuration
1. Range
2. Frequency
Actions

After going through all the options I'll tell you how to configure vCenter for email notification and I'll give you my minimal customizations in the vCenter Alarm Definitions.

General Alarm Configuration

By default alarm definitions are configured at the vCenter level. So in your vCenter select the object representing the vCenter server, select the tab Alarms and select the Definitions view:

During this walkthrough we'll focus on the predefined alarm “Cannot connect to storage”. On the General tab we'll keep the default alarm name, but we'll modify the description so other system administrators know that I changed it:
Default is:

Default alarm to monitor host connectivity to storage device

We'll change it to:

Customized alarm to monitor host connectivity to storage device - sjoerd, 7 June 2011.

As you can see, it's really obvious that the alarm was changed, and by whom.

Alarm Type

The alarm type we use now is for hosts. Note that it's possible to create alarms for:

Virtual Machines
Hosts
Clusters
Datacenters
Datastores
Networks
Distributed switches
Distributed virtual port groups

Now note that for an alarm to work it needs to be triggered. In VMware the triggering can be done in two different ways:

On a specific condition or state
- Examples of states are “Power States” and “Connection States” (among others)
- Examples of conditions are performance metrics such as CPU and disk usage (among others)
On a specific event
- Events are always on managed objects
- Examples of events for a Virtual Machine may include “cloned”, “created”, “deleted”, “deployed” and “migrated”

General Alarm Configuration (Continued)

This is the result:

Trigger Configuration

In the tab “Triggers” there are already three events added that can be configured to trigger the alarm:

Lost Storage Connectivity
Lost Storage Path Redundancy
Degraded Storage Path Redundancy

Each of them has a default status of “unset” and can have extra conditions, so it's possible to activate the trigger only when the event happens on a specific datacenter, datastore, host, etc. The default status is not really helpful: it means the event will never trigger the alarm. We'll set the events like this:

Lost Storage Connectivity : Alert
Lost Storage Path Redundancy : Warning
Degraded Storage Path Redundancy : Warning

These statuses are chosen according to the amount of trouble the events cause. Lost storage connectivity means end users will not be able to work anymore, while lost or degraded path redundancy can impact performance but end users will still be able to work. We won't set any conditions since we want the alarm to work on the entire environment.
This gives this result:

Note: As you can see in the screenshot, the alarm will be triggered if ANY of the specified events occur. Since this is a default alarm that we are only slightly customizing, this option cannot be changed. If you want the alarm to be triggered only when all events occur, you'll have to create the alarm manually. Then you'll have the option to customize this.
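
The combination of per-event statuses and the ANY rule can be illustrated with a small Python model. This is only a sketch of the behaviour described above, not vCenter code: the names `TRIGGER_STATUS` and `alarm_status` are made up for illustration, and it assumes that the most severe status wins when several events arrive together.

```python
# Hypothetical model of the trigger table above; this is not a vCenter API.
SEVERITY = {"unset": 0, "warning": 1, "alert": 2}

TRIGGER_STATUS = {
    "Lost Storage Connectivity": "alert",
    "Lost Storage Path Redundancy": "warning",
    "Degraded Storage Path Redundancy": "warning",
}


def alarm_status(events, triggers=TRIGGER_STATUS):
    """Return the most severe status raised by ANY of the given events.

    Events whose trigger status is "unset", or that have no trigger at all,
    never raise the alarm, so the function returns None for them.
    """
    statuses = [triggers.get(event, "unset") for event in events]
    # Assumption: when several events match, the most severe status wins.
    worst = max(statuses, key=SEVERITY.__getitem__, default="unset")
    return None if worst == "unset" else worst
```

With this table a single “Lost Storage Path Redundancy” event yields a warning, while the same event left at the default “unset” status would yield nothing at all.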

Reporting Configuration

As you can see below, this tab is not customizable when monitoring for events. That makes sense: an event such as losing storage connectivity either happens or it doesn't, so it cannot fluctuate around a threshold the way CPU usage can:

It is, however, interesting to dive a little deeper into these options and explain what they do.

Using Range and Frequency with Alarms

The Range parameter specifies a tolerance percentage above or below the configured threshold. For example, the built-in alarm for virtual machine CPU usage specifies a warning threshold of 75 percent with a range of 0. This means that the trigger will activate the alarm at exactly 75 percent. However, if the Range parameter were set to 5 percent, the trigger would not activate the alarm until 80 percent (75 percent threshold + 5 percent tolerance range). This tolerance helps prevent the alarm from changing state because of brief fluctuations in a condition.
The Frequency parameter controls the period of time during which a triggered alarm is not reported again. Using the built-in VM CPU usage alarm as our example, the Frequency parameter is set, by default, to five minutes. This means that a virtual machine whose CPU usage triggers the alarm won't get reported again for five minutes, assuming the condition or state is still true.
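
The arithmetic behind Range and Frequency can be written down in a few lines of Python. This is a minimal sketch of the description above, not of vCenter's internals: the function names are hypothetical, and treating “exactly at the trigger point” as triggered is an assumption based on the 75 percent example.

```python
def trigger_point(threshold_pct, range_pct=0):
    """Usage level at which an 'is above' trigger activates the alarm."""
    return threshold_pct + range_pct


def is_triggered(usage_pct, threshold_pct, range_pct=0):
    """True once usage reaches the threshold plus the tolerance range."""
    return usage_pct >= trigger_point(threshold_pct, range_pct)


def should_report(now_min, last_report_min, frequency_min=5):
    """True if a still-active alarm may be reported (again) at now_min.

    last_report_min is None when the alarm has not been reported yet.
    """
    if last_report_min is None:
        return True
    return now_min - last_report_min >= frequency_min
```

For example, `trigger_point(75, 5)` gives 80, so a VM at 78 percent CPU does not trigger the alarm, and `should_report(3, 0)` is false because only three of the five minutes have passed.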

Action Configuration

In the action tab it's possible to define the specific action that should be taken when the alarm gets triggered. This can be done on four different alarm state changes:

From a green circle to a yellow triangle
From a yellow triangle to a red diamond
From a red diamond to a yellow triangle
From a yellow triangle to a green circle

For every action you can define these options:

empty: there is no interest in the transition
once: the action gets performed only once
repeat: the action gets repeated on the frequency defined (from 1 minute to 2 days, 5 minute default)
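
How the three options play out per transition can be modelled as follows. This is an illustrative sketch only: the action table is a hypothetical example in which just the transition to the red (alert) state repeats, and the function name is made up.

```python
GREEN, YELLOW, RED = "green", "yellow", "red"

# Hypothetical action table: one entry per alarm state change.
# None stands for the "empty" option (no interest in the transition).
ACTIONS = {
    (GREEN, YELLOW): "once",
    (YELLOW, RED): "repeat",
    (RED, YELLOW): None,
    (YELLOW, GREEN): None,
}


def notification_minutes(transition, minutes_in_state, repeat_every=5, actions=ACTIONS):
    """Minutes since the transition at which the action would be performed."""
    mode = actions.get(transition)
    if mode is None:
        return []
    if mode == "once":
        return [0]
    # "repeat": run immediately and then every repeat_every minutes
    # for as long as the alarm stays in the new state.
    return list(range(0, minutes_in_state + 1, repeat_every))
```

With a 240-minute repeat interval, for example, an alarm that stays red for ten hours would send mail at minutes 0, 240 and 480, while a warning set to “once” sends a single mail.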

Now the question is: after how many minutes is it acceptable to have a notification sent again? The assumption is that whoever gets the first notification will work on it as fast as possible, since it is a severe warning or alert. However, some repetition is useful in case somebody accidentally overlooks the email. I decided to set it to 240 minutes.
Also, considering what I've set in the trigger configuration, I only want the alert to be repeated, not the warnings. All this gives me this result:

Other Actions

Note that there are other actions available as well:
Every Alarm has these actions available:

Send a notification email
Send a notification trap
Run a command

VM- and host-alarms have more actions:

Power on a virtual machine
Power off a virtual machine
Suspend a virtual machine
Reboot host
Shut down host

Email Configuration vCenter

Before vCenter is capable of sending email it needs to know some email settings. Go to Administration → vCenter Server Settings → Mail and fill in the correct values:

Note: There is no way in vCenter to test this configuration. The best way to test it is to create a custom alarm on something like VM CPU usage and set it to send an email when usage is above, say, 20%. That will be triggered pretty quickly, so emails will be sent.
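
Separately from vCenter, the mail relay itself can be checked with a few lines of Python run from the vCenter server, so that a missing alarm email is not mistaken for an SMTP problem. This is a minimal sketch using only the standard library. The function names are hypothetical, and it assumes the relay accepts unauthenticated mail on port 25, which may not hold in your environment.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build a plain-text test mail."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "vCenter SMTP relay test"
    msg.set_content("Test message sent outside vCenter to verify the SMTP relay.")
    return msg


def send_test_message(smtp_host, sender, recipient, port=25, timeout=10):
    """Send one test mail through smtp_host.

    Assumes an unauthenticated relay; raises an smtplib or OSError
    exception if the server cannot be reached or refuses the mail.
    """
    with smtplib.SMTP(smtp_host, port, timeout=timeout) as smtp:
        smtp.send_message(build_test_message(sender, recipient))
```

Calling `send_test_message` with the same SMTP server and sender address that were entered in the vCenter mail settings should deliver one mail. If it does, the relay is fine and any remaining problem lies in the alarm configuration.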

Overview Customized Alarms

This is an overview of the default alarms as defined in vCenter 4.1 that need to be customized as described above or as described below:

Service Availability

Host connectivity
- Alarm Name: Host connection and power state
- Description: Customized alarm to monitor host connection and power state - sjoerd - 8 June 2011
- Alarm type: Hosts - monitor conditions or state
- Alert: Host connection state is equal to Not responding: Send notification email every 4 hours
- Alert: Host connection state is equal to Disconnected: Send notification email every 4 hours
- Send email to:
  - it_getshifting_com
  - sjoerd_getshifting_com
HA operations and errors
- Alarm Name: Cluster high availability error
- Description: Customized alarm to monitor high availability errors on a cluster - sjoerd, 21 December 2011
- Alarm type: Clusters - monitor events
- Alert: HA host isolated: Send notification email every 4 hours
- Alert: All HA hosts isolated: Send notification email every 4 hours
- Alert: HA host failed: Send notification email every 4 hours
- Send email to:
  - it_getshifting_com
  - sjoerd_getshifting_com

Resource Monitoring

Host CPU usage
- Alarm Name: Host cpu usage
- Description: Customized alarm to monitor host CPU usage - sjoerd, 21 December 2011
- Alarm type: Hosts - monitor conditions or state
- Alert: Host CPU usage is above 90% for 15 minutes: Send notification email every 4 hours
- Warning: Host CPU usage is above 75% for 60 minutes: Send notification email once
- Send email to:
  - it_getshifting_com
  - sjoerd_getshifting_com
Host Memory usage
- Alarm Name: Host memory usage
- Description: Customized alarm to monitor host memory usage - sjoerd - 8 June 2011
- Alarm type: Hosts - monitor conditions or state
- Alert: Host memory usage is above 90% for 15 minutes: Send notification email every 4 hours
- Warning: Host memory usage is above 75% for 60 minutes: Send notification email once
- Send email to:
  - it_getshifting_com
  - sjoerd_getshifting_com
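
The “above X% for N minutes” conditions used for the host alarms can be read as a sustained-threshold check. The sketch below models that reading in Python. vCenter does its own sampling, so the sample format and function names here are purely illustrative assumptions.

```python
def sustained_above(samples, threshold_pct, duration_min):
    """Check whether usage has stayed above the threshold long enough.

    samples is a list of (minute, usage_pct) tuples in ascending time order.
    Returns True if every sample in the final stretch of the series is above
    threshold_pct and that stretch spans at least duration_min minutes.
    """
    start = None
    for minute, usage in samples:
        if usage > threshold_pct:
            if start is None:
                start = minute  # beginning of the current high-usage stretch
        else:
            start = None  # usage dropped, so the stretch is broken
    return start is not None and samples[-1][0] - start >= duration_min


def host_usage_status(samples):
    """Apply the thresholds from the overview: 90% for 15 min, 75% for 60 min."""
    if sustained_above(samples, 90, 15):
        return "alert"
    if sustained_above(samples, 75, 60):
        return "warning"
    return "normal"
```

A host sampled every five minutes at 95 percent reaches the alert state after 15 minutes, whereas a short spike of five minutes leaves the status at normal.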

Storage Monitoring

Storage capacity
- Alarm Name: Datastore overallocation
- Description: New alarm to replace “datastore usage on disk”. Since we use guaranteed storage only, datastore usage is not of any use. Instead we monitor for accidental overallocation of datastores - sjoerd, 21 December 2011
- Alarm type: Datastores - monitor conditions or state
- Alert: Datastore Disk Overallocation is above 125%: Send notification email every 8 hours
- Warning: Datastore Disk Overallocation is above 110%: Send notification email once
- Send email to:
  - it_getshifting_com
  - sjoerd_getshifting_com

Storage connectivity
- Alarm Name: Cannot connect to storage
- Description: Customized alarm to monitor host connectivity to storage device - sjoerd, 7 June 2011.
- Alarm type: Hosts - monitor events
- Alert: Lost Storage Connectivity: Send notification email every 4 hours
- Warning: Lost Storage Path Redundancy: Send notification email once
- Warning: Degraded Storage Path Redundancy: Send notification email once
- Send email to:
  - it_getshifting_com
  - sjoerd_getshifting_com

Note that the default alarm “Datastore usage on disk” has been disabled and replaced by “Datastore overallocation”.
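
As a worked example of the overallocation thresholds, the sketch below assumes that the overallocation percentage is the provisioned space expressed as a percentage of the datastore capacity. That definition and the function names are assumptions made for illustration; check how your vCenter version reports the metric.

```python
def overallocation_pct(provisioned_gb, capacity_gb):
    """Provisioned space as a percentage of capacity (assumed definition)."""
    return provisioned_gb * 100 / capacity_gb


def datastore_status(provisioned_gb, capacity_gb, warning_pct=110, alert_pct=125):
    """Map an overallocation percentage to the statuses used in the overview."""
    pct = overallocation_pct(provisioned_gb, capacity_gb)
    if pct > alert_pct:
        return "alert"
    if pct > warning_pct:
        return "warning"
    return "normal"
```

Under this assumption a 500 GB datastore with 600 GB provisioned is at 120 percent and raises a warning, and with 650 GB provisioned (130 percent) it raises an alert.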
