Photo by Tim Gouw on Unsplash

Painless Kubernetes monitoring and alerting

Roman Frey

Kubernetes is hard, but let’s make monitoring and alerting for Kubernetes simple!

At iLert we build architectures composed of microservices and serverless functions that scale massively and seamlessly to guarantee our customers uninterrupted access to our services. Like many others in the industry, we rely on Kubernetes to orchestrate our services. Responding fast to any kind of incident is an absolute must for our uptime platform, as every second counts, and we like to reduce complexity wherever possible. iLert’s Kubernetes agent and incident actions help us respond to issues extremely fast.

Content

  1. Introduction
  2. Requirements
  3. Understanding the setup
  4. iLert setup instructions
  5. Kubernetes Agent setup instructions
  6. Incident action setup instructions
  7. Responding to alerts
  8. Conclusion

Introduction

As of today there are many solutions for collecting Kubernetes cluster metrics, such as Prometheus, and for turning them into graphs and alerts with the help of Grafana. These are great solutions with lots of features and contributions, but some problems go hand in hand with them:

  • Complexity (infrastructure setup)
  • Complexity (alerting based on graph thresholds)
  • Huge resource usage
  • Maintenance
  • Time to react to an incident

First, the complexity of such monitoring systems means that our goal (to get an alert if a problem occurs) cannot be achieved if one of the monitoring components itself fails. Let’s take a closer look. Suppose I want to be notified if my service or my Kubernetes node has reached its memory limit. What do I need for this? At a minimum, I need Prometheus, Grafana, kube-state-metrics and node exporter running in my cluster. Then I need to create a Grafana dashboard with the metrics I am interested in and individual alerting conditions for each service, because each service uses a different amount of memory. This is far from easy and adds a whole layer of services that require additional monitoring. And if I want to be notified when one of my monitoring components fails, I need an HA solution for it, like Cortex, which again adds a whole level of complexity to my system.

Besides added complexity and maintenance, the resources used by monitoring systems often exceed the resources consumed by business services (especially in smaller companies or start-ups). This might be acceptable if we want transparency into the resource usage and performance of our services, or if we need to visualize business metrics. But if we just want an alert notification about a standard problem in our cluster, e.g. a pod being down, it is clearly too much hassle.

The open source community is very active and we are very grateful for that, but it also means that we get several updates a month. Of course we want to use them, so we install them on our cluster over and over again. Unfortunately these updates don’t always go smoothly, and we have to take the time to deal with the consequences.

Finally, if a problem occurs, we need to act very quickly so that customers are not impacted. This is not always easy, especially if you have been asleep for a couple of hours in the middle of the night and need to react to an incoming alert as soon as possible; you really want to keep distraction and friction to a minimum. Besides having to log in and search for issues, another problem can be the way the monitoring system collects metrics and how alerts based on aggregated metrics work. For example, suppose I want to be notified when my service has been terminated (e.g. by a bug that causes an NPE, or by our favorite, OOMKilled). Prometheus collects metrics at a certain frequency, usually once every 15 seconds, so in the worst case we lose those 15 seconds if the problem occurs right after Prometheus has collected data. After that, Grafana usually evaluates the data every minute over a window of the last 5 minutes to decide whether to raise an alarm. That means in the best case we get an alarm about a minute after our critical service has been terminated, and in some cases only after 5 or 10 minutes. Why waste so much time when we can get the information directly from Kubernetes in the same second and react much faster?
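The timing argument above can be written down as a small calculation. This is a deliberately simplified model, not a description of how Prometheus or Grafana schedule their work internally: it assumes the failure has to wait for the next scrape, then for the next rule evaluation, and then optionally for a "pending" period during which the condition must keep holding. The function name and the `pending_for_s` parameter are illustrative assumptions; the 15 second and 60 second values are the typical intervals mentioned in the text.

```python
def alert_delay_bounds(scrape_interval_s, eval_interval_s, pending_for_s=0):
    """Return (best, worst) delay in seconds between a failure and its alert.

    Simplified model (an assumption, not tool internals):
      - best case: the failure happens just before a scrape that is itself
        just before a rule evaluation, so only the pending period remains;
      - worst case: the failure happens just after a scrape, and that scrape
        lands just after an evaluation, so both full intervals are lost.
    """
    best = pending_for_s
    worst = scrape_interval_s + eval_interval_s + pending_for_s
    return best, worst


# Intervals taken from the text: 15 s scrape, 60 s evaluation.
print(alert_delay_bounds(15, 60))        # (0, 75)

# Hypothetical: the condition must additionally hold for 5 minutes.
print(alert_delay_bounds(15, 60, 300))   # (300, 375)
```

Even in this optimistic model the worst case is well over a minute, and any pending period pushes the delay into the multi-minute range.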

So let’s see how we can make our lives easier and separate alerts from metrics with the iLert Kubernetes agent.

Requirements

You will need:

  • a Mac / Unix machine
  • an iLert account (sign up now, it’s free)
  • a Kubernetes cluster
  • potentially an AWS account with AWS Lambda access (if you want to trigger incident actions)

Understanding the setup

Before we start, let’s first understand how it works. In the following example, I want to be notified about problems in my cluster via SMS, voice or push notifications, and respond as quickly as possible to fix the problem.


ilert-kube-agent is a service that listens to the Kubernetes API server and generates incidents about the health state of pods and nodes. The agent detects the most common problems in a Kubernetes cluster and does not require additional configuration. However, you can pass a simple configuration if, for example, you don’t want to receive a specific type of alert. As soon as a problem is detected, the agent creates an incident in iLert. Next comes the reaction to the incident, in this case using a Lambda function, to quickly fix the potential problem without requiring another context switch.
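To make "getting the information directly from Kubernetes" more concrete, here is a minimal sketch of the kind of check that becomes possible once you look at pod status instead of aggregated metrics. It is not the agent’s actual implementation. It only inspects a pod object shaped like the JSON returned by `kubectl get pod <name> -o json` and reports containers that terminated with a non-zero exit code or are stuck in a well-known waiting state. The function name and the set of waiting reasons are assumptions chosen for illustration.

```python
# Waiting reasons treated as problems in this sketch (illustrative selection).
BAD_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def find_pod_problems(pod):
    """Return a list of (container_name, reason) for unhealthy containers.

    `pod` is a dict shaped like the output of `kubectl get pod -o json`;
    only status.containerStatuses[].state is inspected.
    """
    problems = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        terminated = state.get("terminated")
        waiting = state.get("waiting")
        if terminated and terminated.get("exitCode", 0) != 0:
            # e.g. reason "OOMKilled" with exit code 137
            problems.append((cs["name"], terminated.get("reason", "Error")))
        elif waiting and waiting.get("reason") in BAD_WAITING_REASONS:
            problems.append((cs["name"], waiting["reason"]))
    return problems


oom_pod = {
    "status": {
        "containerStatuses": [
            {"name": "app",
             "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}}}
        ]
    }
}
print(find_pod_problems(oom_pod))  # [('app', 'OOMKilled')]
```

Because the reason is already present in the pod status the moment Kubernetes records it, no scrape interval or evaluation window sits between the failure and the notification.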

iLert setup instructions

  1. Go to the “Alert sources” tab and click Create a new alert source.
go to alert source
  2. Enter a name and select your desired escalation policy. Select Kubernetes as the Integration Type and click on Save.
create new alert source
  3. On the next page, an API Key is generated. You will need it below when setting up the ilert-kube-agent deployment.
view alert source

Kubernetes Agent setup instructions

In this example I will use the Helm setup. If you would rather install with a plain YAML manifest, Terraform or Serverless (Lambda), refer to our installation instructions.

  1. Add the Helm charts repo and update it
helm repo add ilert https://ilert.github.io/charts/
helm repo update
  2. Deploy the agent with the API Key that you have generated in iLert
helm upgrade --install --namespace kube-system \
    ilert-kube-agent ilert/ilert-kube-agent \
    --set config.settings.apiKey="<YOUR KEY HERE>"
  3. Verify your agent installation
$ kubectl --namespace kube-system get pod -l app=ilert-kube-agent -w
NAME                                READY   STATUS    RESTARTS   AGE
ilert-kube-agent-57d8747dd5-b7z1x   1/1     Running   0          37s
ilert-kube-agent-57d8747dd5-kzx8t   1/1     Running   0          25s

Finished! Now Kubernetes events will create incidents in iLert.

Incident action setup instructions

Note: for the sake of this demonstration we are using AWS Lambda, but you may use any other externally triggerable resource.

To respond to an incident with incident actions in our Kubernetes cluster, we need to create a Lambda connector and an incident action for our Kubernetes alert source in iLert. The first thing to do is to create a Lambda function in AWS. For demonstration purposes, I have created a repository that contains a simple Lambda function with the utilities to scale a deployment or statefulset.

  1. Deploy Lambda API
# Clone repo
git clone git@github.com:iLert/kubernetes-alerting-lambda-sample.git
cd kubernetes-alerting-lambda-sample
# Install dependencies
npm install
# Build binary
make build
# Deploy
serverless deploy --cluster=<CLUSTER NAME HERE> --region=<REGION HERE>
  2. After the deployment, a Lambda URL and an Authorization API Key are generated. You will need them below when setting up the connector and incident action in iLert.
go to new connector
  3. Go to the Connectors page of your iLert account and click on Create connector.
go to new connector
  4. On the next page, name the connector (e.g. Serverless API), choose AWS Lambda as the type and paste the Authorization API Key that was generated in step 2.
create new connector
  5. Go to the Kubernetes alert source that you created earlier and navigate to the incident actions tab, then click on the Create first incident action button.
go to alert source connection
  6. On the next page, choose AWS Lambda as the type, choose the connector that you have just created, choose Manually via incident action as the trigger mode, name the incident action (e.g. Scale my service), paste the Lambda URL that was generated in step 2, and paste the following custom content template:
{
  "name": "my-super-service",
  "namespace": "default",
  "type": "deployment",
  "replicas": 20
}
create new connection

We are now able to react to each Kubernetes incident with a quick incident action from within iLert.
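For readers wondering what a function on the receiving end might do with the content template above, the following is a minimal sketch under stated assumptions. It is not the code of the sample repository. It validates the payload and translates it into a request against the Kubernetes scale subresource, which the API server exposes for deployments and statefulsets under `/apis/apps/v1/.../scale`. The helper name `build_scale_request` is hypothetical, and actually sending the request (authentication, HTTP PATCH with a merge patch) is left out.

```python
import json

# Maps the payload's "type" field to the API resource name.
SCALABLE = {"deployment": "deployments", "statefulset": "statefulsets"}


def build_scale_request(payload_json):
    """Turn the incident action payload into (api_path, patch_body).

    The result is meant for a PATCH against the scale subresource;
    sending it is intentionally out of scope for this sketch.
    """
    payload = json.loads(payload_json)
    kind = str(payload["type"]).lower()
    if kind not in SCALABLE:
        raise ValueError(f"unsupported type: {payload['type']!r}")
    replicas = payload["replicas"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(replicas, bool) or not isinstance(replicas, int) or replicas < 0:
        raise ValueError("replicas must be a non-negative integer")
    path = (
        f"/apis/apps/v1/namespaces/{payload['namespace']}"
        f"/{SCALABLE[kind]}/{payload['name']}/scale"
    )
    return path, {"spec": {"replicas": replicas}}


template = """{
  "name": "my-super-service",
  "namespace": "default",
  "type": "deployment",
  "replicas": 20
}"""
path, body = build_scale_request(template)
print(path)  # /apis/apps/v1/namespaces/default/deployments/my-super-service/scale
print(body)  # {'spec': {'replicas': 20}}
```

Validating the payload before touching the cluster matters here because the action is triggered manually, possibly from a phone in the middle of the night, and a malformed template should fail loudly rather than scale the wrong resource.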

Responding to alerts

Now let’s see our setup in practice. In this example we get an alert that our service uses an increased amount of memory, probably because it has to process more traffic than usual. I want to take load off this service as soon as I get a notification about the problem and analyse the issue afterwards.

mobile responding to alert

As you can see I used the incident action that we created before to solve the problem, right from my smartphone.

mobile incident action

Checking the status of our service in Kubernetes, we see that it has more replicas now.

$ kubectl --namespace default get pod -l app=my-super-service
NAME                               READY   STATUS    RESTARTS   AGE
my-super-service-cb49f6b58-98m2j   1/1     Running   0          4d5h
my-super-service-cb49f6b58-cvkxg   0/1     Running   0          3s
my-super-service-cb49f6b58-gzx7t   1/1     Running   0          2d23h
my-super-service-cb49f6b58-n58vm   0/1     Running   0          3s
my-super-service-cb49f6b58-zzg6t   1/1     Running   0          90m
...
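If you want to check a listing like this from a script rather than by eye, a small parser over the text output is enough for a quick sanity check. The sketch below is an illustration only: it assumes the default column layout of `kubectl get pod` (NAME, READY, ...), and the function name is made up for this example. For anything more robust, the JSON output of kubectl would be the better input.

```python
def count_ready(kubectl_output):
    """Return (ready, total) pods from plain `kubectl get pod` text output.

    Skips the header row and any line without a READY column such as "1/1".
    """
    ready = total = 0
    for line in kubectl_output.strip().splitlines()[1:]:
        cols = line.split()
        if len(cols) < 2 or "/" not in cols[1]:
            continue  # e.g. a trailing "..." line
        have, want = cols[1].split("/")
        total += 1
        if have == want:
            ready += 1
    return ready, total


sample = """NAME                               READY   STATUS    RESTARTS   AGE
my-super-service-cb49f6b58-98m2j   1/1     Running   0          4d5h
my-super-service-cb49f6b58-cvkxg   0/1     Running   0          3s
my-super-service-cb49f6b58-gzx7t   1/1     Running   0          2d23h
my-super-service-cb49f6b58-n58vm   0/1     Running   0          3s
my-super-service-cb49f6b58-zzg6t   1/1     Running   0          90m
..."""
print(count_ready(sample))  # (3, 5)
```

In the excerpt above, the two pods that are 3 seconds old are the new replicas; they are running but not yet ready.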

Conclusion

Of course not all problems are solved by scaling or rolling back, but response time is always critical to a successful business.

Our experience with different sized clusters and monitoring systems shows that you should not always rely on a single solution, even a very popular and well-established one. Solving a day-to-day problem in the shortest and most efficient way can make your life a lot easier.