---
title: "SSL certificate monitoring pitfalls"
date: 2022-01-31T23:00:00+01:00
authors:
- robert-kaussow
tags:
- Sysadmin
- Today I learned
resources:
  - name: feature
    src: "images/feature.jpg"
    params:
      anchor: Center
      credits: >
        [Erik Mclean](https://unsplash.com/@introspectivedsgn) on
        [Unsplash](https://unsplash.com/photos/cVJWdOncbm8)
---
Certificates are a fundamental part of the Internet's security. Ever since Let's Encrypt, a free and automated Certificate Authority, started its service, SSL has been used nearly everywhere. To avoid certificate issues and possible service outages, it's a good idea to monitor the SSL certificates used by your services, especially as Let's Encrypt certificates have a short lifetime of 90 days.

I'm using Prometheus to monitor my infrastructure, and with Prometheus there are multiple ways to get started. Most of the tutorials and posts on the internet cover the case of expired certificates, which is pretty easy to handle. I prefer to use Telegraf, a plugin-based metrics collector that also provides Prometheus-compatible outputs, instead of dedicated Prometheus exporters. To monitor SSL certificates, I'm using Telegraf's `x509_cert` input plugin, which provides a metric called `x509_cert_expiry` that can be used to write simple alerting rules. That's already pretty cool, as Prometheus will send out alerts a few weeks before a certificate expires if something goes wrong in the automatic renewal process.

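
To illustrate what an expiry-based check boils down to, here is a minimal Python sketch. It is not how Telegraf implements `x509_cert`; the function names and the three-week threshold are assumptions made up for this example. It only uses the standard library and compares a certificate's `notAfter` date with the current time.

```python
import socket
import ssl
import time
from typing import Optional

# Assumption for this sketch: warn three weeks before the certificate expires.
WARN_THRESHOLD_SECONDS = 21 * 24 * 3600


def seconds_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the remaining lifetime in seconds for a notAfter string.

    The string uses the format returned by SSLSocket.getpeercert(),
    for example 'Jan  5 09:34:43 2018 GMT'.
    """
    if now is None:
        now = time.time()
    return ssl.cert_time_to_seconds(not_after) - now


def expires_soon(
    not_after: str,
    now: Optional[float] = None,
    threshold: float = WARN_THRESHOLD_SECONDS,
) -> bool:
    """Return True if the certificate expires within the threshold."""
    return seconds_until_expiry(not_after, now) < threshold


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the certificate served by host:port.

    The default context validates the chain and the hostname, but it does
    not check whether the certificate has been revoked.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `expires_soon(fetch_not_after("example.org"))` would check a live endpoint, with `example.org` as a placeholder host. A check like this only looks at the validity period of the certificate.
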
A week ago, Let's Encrypt informed affected users that it had to [revoke faulty certificates](https://community.letsencrypt.org/t/questions-about-renewing-before-TLS-ALPN-01-revocations/170449) issued and validated with the `TLS-ALPN-01` challenge. Even though I'm using `DNS-01` for almost all of my certificates, I also received a mail and started to look into it. Sadly, the notification mail only contained a "random" ACME registration ID, and I was not able to find the matching client. As mentioned, I don't really use `TLS-ALPN-01`, so I decided to stop the research and leave it to my monitoring to tell me which forgotten service was the evil one after the certificates were revoked. Nothing happened after the revocation, and the monitoring was not complaining. Good - well no, a user reported that one of the services was not reachable anymore, and of course this was the one missing client that was using `TLS-ALPN-01` verified certificates - dang. While the issue itself was easy to resolve by forcing a renewal of the certificate, I was still wondering why the monitoring had not caught it.

Well, this was the first time that I had to deal with _revoked_ certificates instead of _expired_ certificates. To be honest, I had never thought about detecting revoked certificates in my monitoring setup before, and therefore this case wasn't covered. But it looks like a fix is not as straightforward as expected either. The Telegraf input `x509_cert` that I use is not able to detect revoked certificates yet, and the common Prometheus [`blackbox_exporter`](https://github.com/prometheus/blackbox_exporter/issues/6) does not want to handle this case either. The only way I have found so far is to use the [`ssl_exporter`](https://github.com/ribbybibby/ssl_exporter), which provides some revocation information about the certificates using OCSP. If you are already running multiple exporters, that might be the way to go for you. Personally, I prefer to handle as much as possible using Telegraf, so I might look into a [fix](https://github.com/influxdata/telegraf/issues/10550) for the `x509_cert` plugin during the next weeks. Anyway, lesson learned :blue_book: