
Chapter 1. What Is Prometheus?

Prometheus is an open source, metrics-based monitoring system. Of course,
Prometheus is far from the only one of those out there, so what makes it
notable?

Prometheus does one thing and it does it well. It has a simple yet powerful
data model and a query language that lets you analyse how your applications
and infrastructure are performing. It does not try to solve problems outside of
the metrics space, leaving those to other more appropriate tools.

Since its beginnings with no more than a handful of developers working in
SoundCloud in 2012, a community and ecosystem has grown around Prometheus.
Prometheus is primarily written in Go and licensed under the Apache 2.0
license. There are hundreds of people who have contributed to the project
itself, which is not controlled by any one company. It is always hard to tell
how many users an open source project has, but I estimate that as of 2018, tens of thousands of organisations are using Prometheus in production. In
2016 the Prometheus project became the second member1 of the Cloud Native Computing Foundation (CNCF).

For instrumenting your own code, there are client libraries in all the popular
languages and runtimes, including Go, Java/JVM, C#/.Net, Python, Ruby, Node.js,
Haskell, Erlang, and Rust. Software like Kubernetes and Docker are already
instrumented with Prometheus client libraries. For third-party software that
exposes metrics in a non-Prometheus format, there are hundreds of
integrations available. These are called exporters, and include HAProxy, MySQL,
PostgreSQL, Redis, JMX, SNMP, Consul, and Kafka. A friend of mine even added
support for monitoring Minecraft servers, as he cares a lot about his frames
per second.

A simple text format makes it easy to expose metrics to Prometheus. Other
monitoring systems, both open source and commercial, have added support for
this format. This allows all of these monitoring systems to focus more on core
features, rather than each having to spend time duplicating effort to support
every single piece of software a user like you may wish to monitor.
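
For a flavour of what this looks like, here is a small, hypothetical fragment of the text format as an application might expose it over HTTP (the metric name and labels are made up for illustration):

    # HELP http_requests_total The total number of HTTP requests handled.
    # TYPE http_requests_total counter
    http_requests_total{code="200",path="/api"} 1027
    http_requests_total{code="500",path="/api"} 3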

The data model identifies each time series not just with a name, but also with
an unordered set of key-value pairs called labels. The PromQL query language
allows aggregation across any of these labels, so you can analyse not just per
process but also per datacenter and per service or by any other labels that you
have defined. These can be graphed in dashboard systems such as Grafana.
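
As a taste of PromQL, assuming a counter named http_requests_total whose time series carry a datacenter label (both names are illustrative), a query aggregating the per-second request rate by datacenter might look like:

    sum by(datacenter)(rate(http_requests_total[5m]))

The same expression could equally aggregate by service, instance, or any other label you have defined.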

Alerts can be defined using the exact same PromQL query language that you use
for graphing. If you can graph it, you can alert on it. Labels make maintaining
alerts easier, as you can create a single alert covering all possible label
values. In some other monitoring systems you would have to individually create
an alert per machine/application. Relatedly, service discovery can
automatically determine what applications and machines should be scraped from
sources such as Kubernetes, Consul, Amazon Elastic Compute Cloud (EC2), Azure,
Google Compute Engine (GCE), and OpenStack.

For all these features and benefits, Prometheus is performant and simple to
run. A single Prometheus server can ingest millions of samples per second. It
is a single statically linked binary with a configuration file. All components
of Prometheus can be run in containers, and they avoid doing anything fancy that
would get in the way of configuration management tools. It is designed to be
integrated into the infrastructure you already have and built on top of, not to
be a management platform itself.

Now that you have an overview of what Prometheus is, let’s
step back for a minute and look at what is meant by “monitoring” in order to
provide some context. Following that I will look at what the main components of
Prometheus are, and what Prometheus is not.

What Is Monitoring?

In secondary school one of my teachers told us that if you were to ask ten
economists what economics means, you’d get eleven answers. Monitoring has a
similar lack of consensus as to what exactly it means. When I tell others what I do, people think my job entails everything from
keeping an eye on temperature in factories, to employee monitoring where I was
the one to find out who was accessing Facebook during working hours, and even
detecting intruders on networks.

Prometheus wasn’t built to do any of those things.2 It was built to aid
software developers and administrators in the operation of production computer
systems, such as the applications, tools, databases, and networks backing
popular websites.

So what is monitoring in that context? I like to narrow this sort of
operational monitoring of computer systems down to four things:

Alerting

Knowing when things are going wrong is usually the most important thing that
you want monitoring for. You want the monitoring system to call in a human
to take a look.

Debugging

Now that you have called in a human, they need to investigate to determine the
root cause and ultimately resolve whatever the issue is.

Trending

Alerting and debugging usually happen on time scales on the order of minutes
to hours. While less urgent, the ability to see how your systems are being used
and changing over time is also useful. Trending can feed into design decisions
and processes such as capacity planning.

Plumbing

When all you have is a hammer, everything starts to look like a nail. At the
end of the day all monitoring systems are data processing pipelines. Sometimes
it is more convenient to appropriate part of your monitoring system for
another purpose, rather than building a bespoke solution. This is not strictly
monitoring, but it is common in practice so I like to include it.

Depending on who you talk to and their background, they may consider only some
of these to be monitoring. This leads to many discussions about monitoring going
around in circles, leaving everyone frustrated. To help you understand where others
are coming from, I’m going to look at a
small bit of the history of monitoring.

A Brief and Incomplete History of Monitoring

While monitoring has seen a shift toward tools including Prometheus in the
past few years, the dominant solution remains some combination of Nagios and
Graphite or their variants.

When I say Nagios I am including any software within the same broad family, such
as Icinga, Zmon, and Sensu. They work primarily by regularly executing scripts
called checks. If a check fails by returning a nonzero exit code, an alert is
generated. Nagios was initially started by Ethan Galstad in 1996, as an MS-DOS
application used to perform pings. It was first released as NetSaint in 1999, and
renamed Nagios in 2002.

To talk about the history of Graphite, I need to go back to 1994. Tobias
Oetiker created a Perl script that became Multi Router Traffic Grapher, or MRTG
1.0, in 1995. As the name indicates, it was mainly used for network monitoring
via the Simple Network Management Protocol (SNMP). It could also obtain metrics
by executing scripts.3 The year 1997 brought big changes with a move of some code to C, and the
creation of the Round Robin Database (RRD) which was used to store metric data.
This brought notable performance improvements, and RRD was the basis for other
tools including Smokeping and Graphite.

Started in 2006, Graphite uses Whisper for metrics storage, which has a
similar design to RRD. Graphite does not collect data itself, rather it is sent
in by collection tools such as collectd and Statsd, which were created in 2005
and 2010, respectively.

The key takeaway here is that graphing and alerting were once completely separate concerns performed by different tools. You could write a
check script to evaluate a query in Graphite and generate alerts on that basis,
but most checks tended to be on unexpected states such as a process not running.

Another holdover from this era is the relatively manual approach to
administering computer services. Services were deployed on individual machines
and lovingly cared for by systems administrators. Alerts that might potentially
indicate a problem were jumped upon by devoted engineers. As cloud and cloud
native technologies such as EC2, Docker, and Kubernetes have come to prominence,
treating individual machines and services like pets with each getting individual attention does not scale. Rather, they
should be looked at more as cattle and administered and monitored as a group. In
the same way that the industry has moved from doing management by hand, to
tools like Chef and Ansible, to now starting to use technologies like
Kubernetes, monitoring also needs to make a similar transition from checks on
individual processes on individual machines to monitoring based on service health as a whole.

You may have noticed that I didn’t mention logging. Historically, logs were something that you would use tail, grep, and awk on by hand. You might have had an
analysis tool such as AWStats to produce reports once an hour or day. In more
recent years they have also been used as a significant part of monitoring, such
as with the Elasticsearch, Logstash, and Kibana (ELK) stack.

Now that we have looked a bit at graphing and alerting, let’s look at how metrics and
logs fit into things. Are there more categories of monitoring than those two?

Categories of Monitoring

At the end of the day, most monitoring is about the same thing: events. Events
can be almost anything, including:

  • Receiving a HTTP request

  • Sending a HTTP 400 response

  • Entering a function

  • Reaching the else of an if statement

  • Leaving a function

  • A user logging in

  • Writing data to disk

  • Reading data from the network

  • Requesting more memory from the kernel

All events also have context. A HTTP request will have the IP address it is
coming from and going to, the URL being requested, the cookies that are set,
and the user who made the request. A HTTP response will have how long the
response took, the HTTP status code, and the length of the response body. Events
involving functions have the call stack of the functions above them, and
whatever triggered this part of the stack such as a HTTP request.

Having all the context for all the events would be great for debugging and
understanding how your systems are performing in both technical and business
terms, but that amount of data is not practical to process and store. Thus
there are what I would see as roughly four ways to approach reducing that volume of data to something workable, namely profiling, tracing, logging, and metrics.

Profiling

Profiling takes the approach that you can’t have all the context for all of the
events all of the time, but you can have some of the context for limited
periods of time.

Tcpdump is one example of a profiling tool. It allows you to record network traffic
based on a specified filter. It’s an essential debugging tool, but you can’t
really turn it on all the time as you will run out of disk space.

Debug builds of binaries that track profiling data are another example. They
provide a plethora of useful information, but the performance impact of
gathering all that information, such as timings of every function call, means
that it is not generally practical to run it in production on an ongoing basis.

In the Linux kernel, enhanced Berkeley Packet Filters (eBPF) allow detailed
profiling of kernel events from filesystem operations to network oddities.
These provide access to a level of insight that was not generally available
previously, and I’d recommend reading
Brendan Gregg’s writings on the subject.

Profiling is largely for tactical debugging. If it is being used on a longer
term basis, then the data volume must be cut down in order to fit into one of
the other categories of monitoring.

Tracing

Tracing doesn’t look at all events, rather it takes some proportion of events
such as one in a hundred that pass through some functions of interest. Tracing
will note the functions in the stack trace of the points of interest, and
often also how long each of these functions took to execute. From this you can
get an idea of where your program is spending time and which code paths are
most contributing to latency.

Rather than doing snapshots of stack traces at points of interest, some tracing
systems trace and record timings of every function call below the
function of interest. For example, one in a hundred user HTTP requests might be
sampled, and for those requests you could see how much time was spent talking
to backends such as databases and caches. This allows you to see how timings
differ based on factors like cache hits versus cache misses.

Distributed tracing takes this a step further. It makes tracing work across
processes by attaching unique IDs to requests that are passed from one process
to another in remote procedure calls (RPCs) in addition to whether this request
is one that should be traced. The traces from different processes and
machines can be stitched back together based on the request ID. This is a vital
tool for debugging distributed microservices architectures. Technologies in
this space include OpenZipkin and Jaeger.

For tracing, it is the sampling that keeps the data volumes and instrumentation
performance impact within reason.

Logging

Logging looks at a limited set of events and records some of the context for
each of these events. For example, it may look at all incoming HTTP requests, or
all outgoing database calls. To avoid consuming too many resources, as a rule of
thumb you are limited to somewhere around a hundred fields per log entry.
Beyond that, bandwidth and storage space tend to become a concern.

For example, for a server handling a thousand requests per second, a log entry
with a hundred fields each taking ten bytes works out as a megabyte per second.
That’s a nontrivial proportion of a 100 Mbit network card, and 84 GB of storage
per day just for logging.
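
A minimal Python sketch of the arithmetic behind those figures (the exact daily total depends on rounding and on decimal versus binary gigabytes):

    # Back-of-the-envelope check of the logging volume example above.
    requests_per_second = 1000
    fields_per_entry = 100
    bytes_per_field = 10

    bytes_per_second = requests_per_second * fields_per_entry * bytes_per_field
    print(bytes_per_second)                # 1,000,000 bytes, i.e., about a megabyte per second
    print(bytes_per_second * 86400 / 1e9)  # roughly 86 GB per day, before any compression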

A big benefit of logging is that there is (usually) no sampling of events, so
even though there is a limit on the number of fields, it is practical to determine how slow requests are affecting one particular user talking to one
particular API endpoint.

Just as monitoring means different things to different people, logging also
means different things depending on who you ask, which can cause confusion. Different types of logging have
different uses, durability, and retention requirements. As I see it, there are four
general and somewhat overlapping categories:

Transaction logs

These are the critical business records that you must keep safe at all costs,
likely forever. Anything touching on money or that is used for critical
user-facing features tends to be in this category.

Request logs

If you are tracking every HTTP request, or every database call, that’s a request log.
They may be processed in order to implement user facing features, or just for
internal optimisations. You don’t generally want to lose them, but it’s not the
end of the world if some of them go missing.

Application logs

Not all logs are about requests; some are about the process itself. Startup
messages, background maintenance tasks, and other process-level log lines are
typical. These logs are often read directly by a human, so you should try to avoid
having more than a few per minute in normal operations.

Debug logs

Debug logs tend to be very detailed and thus expensive to create and store.
They are often only used in very narrow debugging situations, and are tending
towards profiling due to their data volume. Reliability and retention requirements tend to be low, and debug logs
may not even leave the machine they are generated on.

Treating the differing types of logs all in the same way can end you up in the worst of all worlds,
where you have the data volume of debug logs combined with the extreme
reliability requirements of transaction logs. Thus as your system grows you
should plan on splitting out the debug logs so that they can be handled separately.

Examples of logging systems include the ELK stack and Graylog.

Metrics

Metrics largely ignore context, instead tracking aggregations over time of
different types of events. To keep resource usage sane, the number of different numbers being tracked needs to be limited: ten thousand per process is a
reasonable upper bound for you to keep in mind.

Examples of the sort of metrics you might have would be the number of times you
received HTTP requests, how much time was spent handling requests, and how many
requests are currently in progress. By excluding any information about context,
the data volumes and processing required are kept reasonable.

That is not to say, though, that context is always ignored. For a HTTP request
you could decide to have a metric for each URL path. But the ten thousand metric
guideline has to be kept in mind, as each distinct path now counts as a
metric. Using context such as a user’s email address would be unwise, as they
have an unbounded cardinality.4
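
To make the cardinality point concrete, here is a minimal sketch using the Python client library, prometheus_client (the metric and label names are illustrative). Each distinct path value creates its own time series, which is fine for a bounded set of paths but would not be for something like email addresses:

    from prometheus_client import Counter

    # One time series is created per distinct value of the path label.
    REQUESTS = Counter('hello_requests_total',
                       'Total HTTP requests received.',
                       labelnames=['path'])

    REQUESTS.labels(path='/home').inc()
    REQUESTS.labels(path='/about').inc()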

You can use metrics to track the latency and data volumes handled by each of
the subsystems in your applications, making it easier to determine what exactly
is causing a slowdown. Logs could not record that many fields, but once you
know which subsystem is to blame, logs can help you figure out which exact user
requests are involved.

This is where the tradeoff between logs and metrics becomes most apparent.
Metrics allow you to collect information about events from all over your
process, but with generally no more than one or two fields of context with
bounded cardinality. Logs allow you to collect information about all of one
type of event, but can only track a hundred fields of context with unbounded
cardinality. This notion of cardinality and the limits it places on metrics is
important to understand, and I will come back to it in later chapters.

As a metrics-based monitoring system, Prometheus is designed to track
overall system health, behaviour, and performance rather than individual
events. Put another way, Prometheus cares that there were 15 requests in
the last minute that took 4 seconds to handle, resulted in 40 database
calls, 17 cache hits, and 2 purchases by customers. The cost and code
paths of the individual calls would be the concern of profiling or logging.

Now that you have an understanding of where Prometheus fits in the overall
monitoring space, let’s look at the various components of Prometheus.

Prometheus Architecture

Figure 1-1 shows the overall architecture of Prometheus. Prometheus
discovers targets to scrape from service discovery. These can be your own
instrumented applications or third-party applications you can scrape via
an exporter. The scraped data is stored, and you can use it in
dashboards using PromQL or send alerts to the Alertmanager, which will convert
them into pages, emails, and other notifications.

Figure 1-1. The Prometheus architecture

Client Libraries

Metrics do not typically magically spring forth from applications; someone has
to add the instrumentation that produces them. This is where client libraries
come in. With usually only two or three lines of code, you can both define a
metric and add your desired instrumentation inline in code you control. This is
referred to as direct instrumentation.
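
As a minimal sketch of what direct instrumentation can look like with the Python client library, prometheus_client (the metric name, ports, and toy HTTP handler are all illustrative):

    import http.server
    from prometheus_client import Counter, start_http_server

    REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.')

    class MyHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            REQUESTS.inc()  # the instrumentation itself: one line in code you control
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"Hello World")

    if __name__ == '__main__':
        start_http_server(8000)  # expose /metrics for Prometheus to scrape
        server = http.server.HTTPServer(('localhost', 8001), MyHandler)
        server.serve_forever()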

Client libraries are available for all the major languages and runtimes. The
Prometheus project provides official client libraries in Go, Python, Java/JVM, and
Ruby. There are also a variety of third-party client libraries, such as for
C#/.Net, Node.js, Haskell, Erlang, and Rust.

Official Versus Unofficial

Don’t be put off by integrations such as client libraries being unofficial or
third party. With hundreds of applications and systems that you may wish to
integrate with, it is not possible for the Prometheus project team to have the time and expertise to create and maintain them all. Thus the vast majority of
integrations in the ecosystem are third party. In order to keep things
reasonably consistent and working as you would expect, guidelines are available
on how to write integrations.

Client libraries take care of all the nitty-gritty details such as
thread-safety, bookkeeping, and producing the Prometheus text exposition format
in response to HTTP requests. As metrics-based monitoring does not track
individual events, client library memory usage does not increase the more
events you have. Rather, memory is related to the number of metrics you have.

If one of the library dependencies of your application has Prometheus
instrumentation, it will automatically be picked up. Thus by instrumenting a
key library such as your RPC client, you can get instrumentation for it in all
of your applications.

Some metrics, such as CPU usage and garbage collection statistics, are typically
provided out of the box by client libraries, depending on the library and runtime environment.

Client libraries are not restricted to outputting metrics in the Prometheus
text format. Prometheus is an open ecosystem, and the same APIs used to
generate the text format can be used to produce metrics in other formats or
to feed into other instrumentation systems. Similarly, it is possible to take
metrics from other instrumentation systems and plumb them into a Prometheus
client library, if you haven’t quite converted everything to Prometheus
instrumentation yet.

Exporters

Not all code you run is code that you can control or even have access to, and thus
adding direct instrumentation isn’t really an option. For example, it is
unlikely that operating system kernels will start outputting Prometheus-formatted metrics over HTTP anytime soon.

Such software often has some interface through which you can access metrics.
This might be an ad hoc format requiring custom parsing and handling, such as is
required for many Linux metrics, or a well-established standard such as SNMP.

An exporter is a piece of software that you deploy right beside the application
you want to obtain metrics from. It takes in requests from Prometheus, gathers
the required data from the application, transforms it into the correct format, and finally
returns it in a response to Prometheus. You can think of an exporter as a
small one-to-one proxy, converting data between the metrics interface of an
application and the Prometheus exposition format.

Unlike the direct instrumentation you would use for code you control, exporters use a different style of instrumentation known as custom collectors or
ConstMetrics.5
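
As a rough illustration of the custom collector style in Python (the class, metric name, and value are made up; a real exporter would fetch the number from the application it sits beside at scrape time):

    from prometheus_client.core import GaugeMetricFamily, REGISTRY

    class QueueCollector:
        """A custom collector: values are gathered when Prometheus scrapes,
        rather than being maintained inline by the application."""
        def collect(self):
            queue_length = 42  # in a real exporter, fetched from the application
            yield GaugeMetricFamily('myapp_queue_length',
                                    'Items currently in the queue.',
                                    value=queue_length)

    REGISTRY.register(QueueCollector())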

The good news is that given the size of the Prometheus community, the exporter you need probably already exists and can be used with little
effort on your part. If the exporter is missing a metric you are interested in, you can
always send a pull request to improve it, making it better for the
next person to use it.

Service Discovery

Once you have all your applications instrumented and your exporters running,
Prometheus needs to know where they are. This is so Prometheus will know what it
is meant to monitor, and be able to notice if something it is meant to be monitoring is not responding. With
dynamic environments you cannot simply provide a list of applications and
exporters once, as it will get out of date. This is where service discovery comes in.

You probably already have some database of your machines,
applications, and what they do. It might be inside Chef’s database, an inventory
file for Ansible, based on tags on your EC2 instance, in labels and annotations
in Kubernetes, or maybe just sitting in your documentation wiki.

Prometheus has integrations with many common service discovery mechanisms, such as
Kubernetes, EC2, and Consul. There is also a generic integration for those whose
setup is a little off the beaten path (see “File”).

This still leaves a problem though. Just because Prometheus has a list of machines
and services doesn’t mean we know how they fit into your architecture. For example, you might be using the EC2 Name tag6 to indicate what application
runs on a machine, whereas others might use a tag called app.

As every organisation does it slightly differently, Prometheus allows you to
configure how metadata from service discovery is mapped to monitoring targets
and their labels using relabelling.

Scraping

Service discovery and relabelling give us a list of targets to be
monitored. Now Prometheus needs to fetch the metrics. Prometheus does this by
sending a HTTP request called a scrape. The response to the scrape is parsed and ingested
into storage. Several useful metrics are also added in, such as if the scrape
succeeded and how long it took. Scrapes happen regularly; usually you would
configure it to happen every 10 to 60 seconds for each target.
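
If you are curious what a scrape looks like on the wire, you can perform one yourself. A minimal Python sketch, assuming an application exposing metrics on localhost port 8000:

    import urllib.request

    # A scrape is an HTTP GET of the target's metrics endpoint; Prometheus
    # parses the returned text format and ingests the samples into its storage.
    with urllib.request.urlopen('http://localhost:8000/metrics') as response:
        print(response.read().decode('utf-8'))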

Pull Versus Push

Prometheus is a pull-based system. It decides when and what to scrape,
based on its configuration. There are also push-based systems, where the
monitoring target decides if it is going to be monitored and how often.

There is vigorous debate online about the two designs, which often bears
similarities to debates around Vim versus EMACS. Suffice to say both have pros
and cons, and overall it doesn’t matter much.

As a Prometheus user you should understand that pull is ingrained in the core
of Prometheus, and attempting to make it do push instead is at best unwise.

Storage

Prometheus stores data locally in a custom database. Distributed systems
are challenging to make reliable, so Prometheus does not attempt to do any form
of clustering. In addition to reliability, this makes Prometheus easier to run.

Over the years, storage has gone through a number of redesigns, with the storage
system in Prometheus 2.0 being the third iteration. The storage system can
handle ingesting millions of samples per second, making it possible to monitor
thousands of machines with a single Prometheus server. The compression
algorithm used can achieve 1.3 bytes per sample on real-world data. An SSD is
recommended, but not strictly required.

Dashboards

Prometheus has a number of HTTP APIs that allow you to both request raw data
and evaluate PromQL queries. These can be used to produce graphs and dashboards.
Out of the box, Prometheus provides the expression browser. It uses these APIs
and is suitable for ad hoc querying and data exploration, but it is not a
general dashboard system.
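
As a minimal sketch of using those APIs, the following evaluates the PromQL expression up against a hypothetical Prometheus server listening on localhost port 9090:

    import json
    import urllib.parse
    import urllib.request

    # Evaluate a PromQL expression via the query API and print the results.
    params = urllib.parse.urlencode({'query': 'up'})
    url = 'http://localhost:9090/api/v1/query?' + params
    with urllib.request.urlopen(url) as response:
        result = json.load(response)
    print(result['data']['result'])  # one entry per time series matching the expression

Grafana and other dashboarding tools use these same APIs under the hood.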

It is recommended that you use Grafana for dashboards. It has a wide variety of
features, including official support for Prometheus as a data source. It can
produce a wide variety of dashboards, such as the one in Figure 1-2.
Grafana supports talking to multiple Prometheus servers, even within a single
dashboard panel.

Figure 1-2. A Grafana dashboard

Recording Rules and Alerts

Although PromQL and the storage engine are powerful and efficient, aggregating
metrics from thousands of machines on the fly every time you render a graph can
get a little laggy. Recording rules allow PromQL expressions to be evaluated on
a regular basis and their results ingested into the storage engine.

Alerting rules are another form of recording rules. They also evaluate PromQL
expressions regularly, and any results from those expressions become alerts.
Alerts are sent to the Alertmanager.
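
For a flavour of the expressions involved (the first metric name is illustrative, and the rule files that wrap these expressions are configured separately):

    # A recording-rule style expression: precompute an aggregated request rate
    # so dashboards do not aggregate thousands of series on every render.
    sum without(instance)(rate(http_requests_total[5m]))

    # An alerting-rule style expression: every result it returns becomes an alert,
    # here one per scrape target that is currently down.
    up == 0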

Alert Management

The Alertmanager receives alerts from Prometheus servers and turns them
into notifications. Notifications can include email, chat applications such as
Slack, and services such as PagerDuty.

The Alertmanager does more than blindly turn alerts into notifications on a one-to-one basis. Related alerts can be aggregated into one notification, throttled
to reduce pager storms,7 and
different routing and notification outputs can be configured for each of your different
teams. Alerts can also be silenced, perhaps to snooze an issue you are already aware of, or in advance when you know maintenance is scheduled.

The Alertmanager’s role stops at sending notifications; to manage human
responses to incidents you should use services such as PagerDuty and ticketing
systems.

Tip

Alerts and their thresholds are configured in Prometheus, not in the
Alertmanager.

Long-Term Storage

Since Prometheus stores data only on the local machine, you are limited by how
much disk space you can fit on that machine.8 While you usually care only about the most recent
day or so’s worth of data, for long-term capacity planning a longer retention
period is desirable.

Prometheus does not offer a clustered storage solution to store data
across multiple machines, but there are remote read and write APIs
that allow other systems to hook in and take on this role. These allow PromQL
queries to be transparently run against both local and remote data.

What Prometheus Is Not

Now that you have an idea of where Prometheus fits in the broader monitoring
landscape and what its major components are, let’s look at some use
cases for which Prometheus is not a particularly good choice.

As a metrics-based system, Prometheus is not suitable for storing event logs or individual events. Nor is it the best choice for high cardinality data,
such as email addresses or usernames.

Prometheus is designed for operational monitoring, where small inaccuracies and
race conditions due to factors like kernel scheduling and failed scrapes are a
fact of life. Prometheus makes tradeoffs and prefers giving you data that is
99.9% correct over your monitoring breaking while waiting for perfect data. Thus in applications involving money or billing, Prometheus should be used with caution.

In the next chapter I will show you how to run Prometheus and do some basic
monitoring.

1 Kubernetes was the first member.

2 Temperature monitoring of machines and datacenters is actually not uncommon. There are even a few users using Prometheus to track the weather for fun.

3 I have fond memories of setting up MRTG in the early 2000s, writing scripts to report temperature and network usage on my home computers.

4 Email addresses also tend to be personally identifiable information (PII), which bring with them compliance and privacy concerns that are best avoided in monitoring.

5 The term ConstMetric is colloquial, and comes from the Go client library’s MustNewConstMetric function used to produce metrics by exporters written in Go.

6 The EC2 Name tag is the display name of an EC2 instance in the EC2 web console.

7 A page is a notification to an oncall engineer which they are expected to promptly investigate or deal with. While you may receive a page via a traditional radio pager, these days it more likely comes to your mobile phone in the form of an SMS, notification, or phone call. A pager storm is when you receive a string of pages in rapid succession.

8 However, modern machines can hold rather a lot of data locally, so a separate clustered storage system may not be necessary for you.