Monitoring and Metrics

It takes people, processes and technology to ensure uptime. Contegix monitoring solutions provide all three – whether we’re monitoring devices and systems on your premises, across your network, or hosted in our Data Center.

As a cloud infrastructure, managed service, and data center hosting provider, we serve customers with a broad range of needs, not the least of which are flexibility, security, and integration. Monitoring and metrics collection are the foundation for meeting all of these requirements, and Contegix takes our promise to Go Beyond for our customers seriously. That’s why we designed the Noctane Monitoring System (NMS) for our managed service customers. Below are some details of our monitoring system that can be found on the Contegix Customer Portal.

 

The Sensu Monitoring and Metrics Framework

We use the Sensu monitoring framework under the hood to provide the message bus through which we aggregate monitoring events, metrics, and artifacts in various ways. The system scales extraordinarily well with industry-proven AMQP servers, which provide the communication backbone for all status checks flowing through the system. A specially packaged agent is installed on every managed device, and these agents are the producers of virtually all content within the system. On the consumer side, we use event dashboards and multiple escalation schemes to identify problems and get them into the hands of the right people, so that your business-critical services stay available.

 

Check Scripts Versus Check Extensions

Within monitoring systems like Nagios and even Sensu, the sky is the limit in terms of the types of services you’re able to check. These checks typically come by way of arbitrary scripts that can use any available tool or resource on the system to perform the check. This approach works reasonably well, if somewhat inefficiently, because it requires a large number of fork system calls (one or more for each check per check interval). An alternative to running the checks as scripts is to perform the check from inside the client process (in our case with Sensu, the Ruby VM). Without the volume of process forking, we can massively scale volume and concurrency, but of course we’re limited to the libraries available to the Ruby VM, so we sacrifice some flexibility for speed. On a few internal Contegix monitoring hosts, we’re running over 2000 checks every minute as Sensu check extensions.
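To make the trade-off concrete, here is a minimal sketch of the two styles. It is written in Python rather than Ruby purely for illustration and is not Sensu's extension API; the function names, the stand-in measurement, and the threshold are all hypothetical. The only convention borrowed from real tooling is the Nagios-style exit status (0 for OK, 2 for CRITICAL).

```python
import subprocess
import sys
import time

def check_in_process(threshold: float = 0.5) -> int:
    """Hypothetical check run inside the agent process: no fork/exec needed."""
    value = 0.25  # stand-in for a real measurement
    return 0 if value < threshold else 2  # 0 = OK, 2 = CRITICAL

# The same check written as a standalone script that reports via its exit status.
SCRIPT = "import sys; value = 0.25; sys.exit(0 if value < 0.5 else 2)"

def check_as_script() -> int:
    """Script-style check: every run spawns a fresh child process."""
    return subprocess.run([sys.executable, "-c", SCRIPT]).returncode

def time_checks(run, count: int) -> float:
    """Return the wall-clock seconds taken to call `run` `count` times."""
    start = time.perf_counter()
    for _ in range(count):
        run()
    return time.perf_counter() - start

if __name__ == "__main__":
    n = 10
    print(f"script:     {time_checks(check_as_script, n):.3f}s for {n} runs")
    print(f"in-process: {time_checks(check_in_process, n):.3f}s for {n} runs")
```

Both functions return the same status; the difference is only in how much work the operating system does per run, which is where the scaling headroom of in-process checks comes from.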

 

Metric-Based Checks

Standard checks are point-in-time snapshots of a particular system state. Suppose you want to know if the current load average on your system is unusual. We can’t quantify “unusual” as a fixed threshold to check against at a single point in time, but if we look at the historical load average values, we can calculate their mean and standard deviation and see how far the current value falls from the norm. This is exactly what metric-based checks implement. Access to the historical metric data is clearly required for this functionality. Since we store metric data external to the client machine, we query the remote database via the managed services network to compute the tolerance levels and then compare the current value against them.
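The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the NMS implementation: the function name `is_unusual`, the three-sigma tolerance, and the sample values are hypothetical, and the history is passed in as a plain list rather than fetched from the remote metric database.

```python
from statistics import mean, stdev

def is_unusual(history, current, max_sigmas=3.0):
    """Flag `current` if it lies more than `max_sigmas` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # perfectly flat history: any change is unusual
    return abs(current - mu) > max_sigmas * sigma

# Hypothetical one-minute load averages previously collected for a host.
recent_load = [0.8, 1.0, 1.2, 0.9, 1.1]
for load in (1.3, 4.0):
    verdict = "unusual" if is_unusual(recent_load, load) else "normal"
    print(load, verdict)
```

The tolerance adapts to each host: a machine whose load normally swings widely has a large standard deviation and therefore a wide band, while a quiet machine is flagged on much smaller changes.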

 

Portal Integration

All monitoring checks that are executed on a given managed host are available from within the Contegix Customer Portal. From the portal, customers can see:

  • Status of the check, loaded dynamically on page refresh
  • Type of check, either metric collection or a check for availability/status
  • Check configurations including thresholds, interval, etc.
  • Output status if the check is ever in a non-OK state

 

Metric Data Collection and Graphing

No modern monitoring system would be complete without metric collection and reporting. Contegix uses Graphite and accompanying tools to store, report, and graph managed customer metric data. Our default set of metrics for a typical setup consists of over 500 metric data points per host, per minute. This is by no means a limit; we encourage all managed customers to talk to us about any custom metric data they would like to have collected.

In addition to having Contegix add metrics to be collected on each host, managed customers can also submit their own application-level metrics via a statsd daemon implementation running within our monitoring agent. This is an effective means of instrumenting your applications, and the metric values automatically become available for graphing in the Contegix Customer Portal within a matter of minutes.
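As a rough illustration of what submitting an application metric can look like, the Python sketch below builds a datagram in the common statsd line format (`name:value|type`) and sends it over UDP. The metric names and helper functions are hypothetical, and the address `127.0.0.1:8125` is only the conventional statsd default; the address the Contegix agent actually listens on, and the metric types it accepts, are assumptions to confirm with Contegix.

```python
import socket

def format_statsd(name: str, value, metric_type: str = "c") -> bytes:
    """Build one statsd line: c = counter, ms = timer, g = gauge."""
    return f"{name}:{value}|{metric_type}".encode("ascii")

def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Send a single metric as a UDP datagram (fire-and-forget).

    host and port are assumed defaults, not documented Contegix settings.
    """
    payload = format_statsd(name, value, metric_type)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload

if __name__ == "__main__":
    send_metric("myapp.logins", 1)                  # count one login
    send_metric("myapp.render_time", 320, "ms")     # a 320 ms page render
```

Because UDP is connectionless, the application does not block or fail if the daemon is briefly unavailable, which keeps instrumentation cheap.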

We also have a set of default metric templates that are automatically graphed on our Customer Portal. Users can augment and extend this set of templates to show, for example, only the metric data pertinent to a particular host.

Want to learn more about the Contegix Customer Portal and how our Monitoring and Metrics Reporting can benefit your infrastructure? Contact us today!