Enhanced Monitoring for Fastly's CA
This is the latest in a series of posts describing how we built Certainly, Fastly’s publicly-trusted Certification Authority (CA). In earlier posts we covered some of our architectural decisions and explained why we chose a traditional primary-replica database design. Today we will share our path to a robust yet sustainable internal monitoring and alerting implementation.
A trusted CA should always employ comprehensive logging and monitoring to enable rapid detection of service failures and suspicious activities within the environment. This not only enhances the integrity of the CA but also provides operations teams with the necessary data to respond to incidents.
At Fastly we have always cared about providing best-in-class real-time logging and observability, and about integrating that real-time data into other tools, so when we started building our own CA we knew we had to hold it to the same standard.
Certainly’s redundant and purpose-built architecture meant that we needed logs and metrics from many different systems, including physical hosts, network devices, cameras and more. We needed to not only collect this data but also coalesce it into a meaningful alert signal. Throughout this post, we’ll use the term ‘alert’ to refer to events that trigger a push notification to an on-call engineer and require their immediate attention. Noisy and non-actionable alerts trigger needless wake-up pages, pull staff away from other tasks and personal time, cause burnout and eventually decrease service reliability. This post discusses the challenge of creating meaningful alerts and the approach Certainly used to address it.
Before any alert can be defined, logs and metrics need to be securely assembled with the right information. Certainly chose Splunk (with PagerDuty integration) as its monitoring platform and standardized on the popular open source tools rsyslog and Telegraf for log and metric collection. Once this data was collected and available in Splunk, the Certainly team could create alerts and perform incident investigation and reporting.
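The collection pipeline itself is just rsyslog and Telegraf configuration, but the end result is structured events landing in Splunk where searches and alerts can act on them. As a rough, hedged illustration of what that ingestion boils down to, here is a minimal Python sketch that pushes a single event to Splunk’s HTTP Event Collector (HEC); the hostname, token, and field names are assumptions for the example and are not Certainly’s actual configuration.

```python
import json
import urllib.request

# Assumed values for illustration only -- not a real endpoint or token.
SPLUNK_HEC_URL = "https://splunk.example.internal:8088/services/collector/event"
SPLUNK_HEC_TOKEN = "00000000-0000-0000-0000-000000000000"


def send_event(event: dict, sourcetype: str = "certainly:service") -> None:
    """Push one structured event to Splunk's HTTP Event Collector."""
    payload = json.dumps({"event": event, "sourcetype": sourcetype}).encode("utf-8")
    request = urllib.request.Request(
        SPLUNK_HEC_URL,
        data=payload,
        headers={
            "Authorization": f"Splunk {SPLUNK_HEC_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # HEC replies with a small JSON acknowledgement


# Example (requires a reachable HEC endpoint): report a failed service restart
# so that a saved search can later turn it into an alert.
# send_event({"host": "ca-host-01", "service": "ocsp-responder", "status": "restart_failed"})
```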
We started by building an initial set of approximately 60 alerts, targeting service failure messages or errors in the supporting toolset. When we first deployed these alerts, many false positives were triggered because the service or host was in the process of being rebuilt (recreated), which is part of Certainly’s ephemeral design. We soon noticed that it didn’t take many false positives to burn out support staff and threaten support reliability. We knew we had to do better, so we took the following steps to correct the issues.
We went through each alert and questioned if it was really needed. Some example questions:
Do we need to alert on backup failures if we already have an alert for restore failures?
Should we alert when only 1 of 2 redundant (HA) services fails?
We combined alerts where possible:
Since our data center cage zones are always monitored by multiple redundant motion sensors, we only trigger an alert when at least two sensors are tripped, which weeds out false positives caused by vibrations.
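In Splunk this kind of rule is implemented as a correlation search, but the underlying logic is simple enough to sketch in Python. The zone names, five-minute window, and two-sensor threshold below are illustrative assumptions, not the exact values we use.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical correlation rule: page only when two or more distinct sensors
# in the same zone trip within a five-minute window.
WINDOW = timedelta(minutes=5)
MIN_SENSORS = 2

recent_trips = defaultdict(list)  # zone -> list of (timestamp, sensor_id)


def should_page(zone: str, sensor_id: str, tripped_at: datetime) -> bool:
    """Record a sensor trip and decide whether it warrants a page."""
    recent_trips[zone].append((tripped_at, sensor_id))
    # Keep only trips inside the correlation window.
    recent_trips[zone] = [(t, s) for t, s in recent_trips[zone] if tripped_at - t <= WINDOW]
    distinct_sensors = {s for _, s in recent_trips[zone]}
    return len(distinct_sensors) >= MIN_SENSORS


# A lone vibration-triggered trip stays quiet; a second sensor confirms it.
now = datetime.now()
assert should_page("cage-zone-a", "motion-01", now) is False
assert should_page("cage-zone-a", "motion-02", now + timedelta(seconds=30)) is True
```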
We adjusted timing thresholds to account for common scenarios:
Patch deployments and host rebuilds mean that monitoring individual instances of a service can trigger many alerts in the course of normal operations, so we adjusted the timing threshold on each alert to account for this.
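The idea is simply to require a failure to persist before it pages anyone, so an instance that disappears and returns during a rebuild never crosses the threshold. A minimal sketch, assuming a 15-minute grace period (the actual thresholds vary per alert):

```python
from datetime import datetime, timedelta

# Assumed grace period: an instance being rebuilt can report unhealthy for up
# to 15 minutes before anyone gets paged.
GRACE_PERIOD = timedelta(minutes=15)

first_failure: dict[str, datetime] = {}  # service instance -> when it first failed


def should_page(instance: str, healthy: bool, observed_at: datetime) -> bool:
    """Page only when an instance has been failing longer than the grace period."""
    if healthy:
        first_failure.pop(instance, None)  # recovery resets the clock
        return False
    started = first_failure.setdefault(instance, observed_at)
    return observed_at - started > GRACE_PERIOD
```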
We temporarily disabled alerts when necessary:
There was no need to alert on physical security breaches when authorized site visits were underway. Suppressing these alerts also eliminated the busywork of acknowledging and resolving them.
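Operationally this amounts to a scheduled suppression window checked before an event is allowed to page. The sketch below illustrates the shape of that check; the window times and zone names are made up for the example.

```python
from datetime import datetime

# Hypothetical schedule of authorized site visits during which physical-security
# alerts are logged but not paged.
maintenance_windows = [
    (datetime(2023, 10, 12, 14, 0), datetime(2023, 10, 12, 18, 0), "cage-zone-a"),
]


def suppressed(zone: str, observed_at: datetime) -> bool:
    """Return True if the event falls inside an authorized visit window for the zone."""
    return any(start <= observed_at <= end and zone == window_zone
               for start, end, window_zone in maintenance_windows)


def handle_motion_event(zone: str, observed_at: datetime) -> None:
    if suppressed(zone, observed_at):
        print(f"logged only: motion in {zone} during authorized visit")
    else:
        print(f"PAGE: unexpected motion in {zone}")
```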
After going through this process we saw a dramatic decrease in alerts, and the alerts that did fire were truly actionable. Support staff morale began to improve and the signs of burnout started to fade. Less time was spent on operational support and more on improving service reliability and the surrounding infrastructure.
In the end, the effort required to tune our alerts was time well spent. Certainly is now more capable than ever of handling additional logging and monitoring needs as our customer base increases.