
Load Averages#

A load average is an interesting metric. It's holistic in that it tells you more about the system than just CPU usage, more than you might first realise. As a metric, it's valid to use it to determine whether your system is under heavy load and needs investigating, or whether it's underutilised and can be scaled down to a smaller size.
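
You can check the load averages at any time with uptime, or by reading /proc/loadavg directly. The output below is illustrative, so your numbers will differ:

```bash
# The three numbers after "load average:" are the
# 1, 5, and 15 minute averages
uptime
#  14:02:11 up 12 days,  3:44,  1 user,  load average: 0.52, 0.34, 0.21

# The same three values, straight from the kernel
cat /proc/loadavg
# 0.52 0.34 0.21 1/423 18192
```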

My knowledge on this is based on the amazing (investigative) work of Brendan Gregg. I'll link to his blog post in the resources part of the book, not here, as I want you to focus on my interpretation of his work (which I've heavily summarised) and then visit his work later on. The reason for this is simple: you need to focus your study time, and his work is quite advanced.

Here's a fact about load averages on Linux: they're not just about what's running on the CPU right now. The averages also include tasks that are waiting for other things to finish, like a file being read from the disk or a remote network connection responding with some data. Even though those tasks are technically doing very little, because they're waiting around, they're still consuming some resources.

So, Linux load averages are a holistic view of the system's usage. They're not just about what is currently running on the CPU at that moment in time.

Brendan Gregg puts it like this:

"On Linux, load averages are (or try to be) "system load averages", for the system as a whole, measuring the number of threads that are working and waiting to work (CPU, disk, uninterruptible locks). Put differently, it measures the number of threads that aren't completely idle. Advantage: includes demand for different resources."

So if a thread is completely and utterly idle, it's excluded from the calculation that measures the system's load average. If it's waiting for disk access, or some other resource, then it's doing something and is therefore included.

Brendan finalises his article with this summary, which is perfect:

"In 1993, a Linux engineer found a nonintuitive case with load averages, and with a three-line patch changed them forever from "CPU load averages" to what one might call "system load averages." His change included tasks in the uninterruptible state, so that load averages reflected demand for disk resources and not just CPUs. These system load averages count the number of threads working and waiting to work, and are summarized as a triplet of exponentially-damped moving sum averages that use 1, 5, and 15 minutes as constants in an equation. This triplet of numbers lets you see if load is increasing or decreasing, and their greatest value may be for relative comparisons with themselves."

So, a load average is useful because it tells us if demand for system resources is going up or down - but you have to compare the three averages (1, 5, and 15 minute) against each other to determine whether demand is rising or falling over time. This means the load average isn't a one-stop shop and we need more information to get a real view of the system's situation.
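
To make that concrete, here's how the triplet reads as a trend (the numbers are made up for illustration):

```bash
cat /proc/loadavg
# 4.10 2.05 0.80 ...  1 min > 5 min > 15 min: demand is rising
# 0.80 2.05 4.10 ...  1 min < 5 min < 15 min: demand is falling
```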

Better metrics#

So, if the load averages on Linux are a sort of mix of things, how do you truly measure specific metrics about what is actually using up your system's CPU time? We're going to look at the following five things:

  1. per-CPU utilisation
  2. per-process CPU utilisation
  3. per-thread run queue (scheduler) latency
  4. CPU run queue latency
  5. CPU run queue length

These give us a much better idea of what's happening on a system, but they're also quite an advanced topic, so we'll only cover them at a high level here and won't explore them too much. For now, we're fine with the load average and a few tools to help us see what's happening on a system.

Note

In "ITOps: Level Two" we cover the above in a lot of detail and use awesome tools like sysdig to not only extract the information, but also store it and graph it Grafana with cool looking graphs!

per-CPU utilisation#

This metric lets us measure the utilisation of a specific CPU (or a CPU core, which will show up as a CPU to the system). Now we get an actual per-CPU (core) measurement that directly reflects the utilisation of that CPU, without anything else being included in the results.
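
One way to get this on Ubuntu Server is mpstat, from the sysstat package (you may need to install it first); pressing 1 inside top gives a similar per-core breakdown:

```bash
# Report utilisation for every CPU (core), refreshing every second
mpstat -P ALL 1
# %usr, %sys, %iowait and %idle are broken out per core, so a
# single saturated core can't hide behind a system-wide average
```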

per-process CPU utilisation#

This is even more powerful in my opinion, because now we know which specific process or processes are using what amount of CPU time. Knowing that a CPU (core) is heavily utilised is one thing, but knowing why it's heavily utilised, combined with who is utilising it, allows you to pin down the problem and therefore solve it.
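
As a sketch, here are two stock ways of answering the "who is using the CPU?" question (pidstat comes from the same sysstat package as mpstat above):

```bash
# Snapshot: the five hungriest processes right now
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 6

# Ongoing: per-process CPU usage, sampled every second
pidstat 1
```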

Run queues#

Finally, we have: per-thread run queue (scheduler) latency, CPU run queue latency, and CPU run queue length. I've grouped these together into one explanation.

Threads are what processes create to complete some unit of work for them. They can be short lived or long lived. A good example of using threads is a web server: it (usually) has a main process but then uses threads to handle inbound HTTP(S) connections. This allows the software to scale up and up, dealing with multiple connections at once.
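
You can see a process's threads with stock tools; nginx here is just an example process name, so swap in whatever is running on your system:

```bash
# ps -L adds an LWP (thread) column; filter for the web server
ps -eLf | grep nginx

# Or run top in threads mode ('H' toggles it while top is running)
top -H
```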

This "per-thread run queue latency" metric tells us one thing: are threads having to queue up before being able (scheduled) to execute? If so, then we have a performance problem. The same applies to the CPU run queue latency and the CPU run queue length - how long is it taking to schedule tasks on the CPU and how long is the queue of waiting threads?

Using these latency and queue measurements you can determine if there's too much demand for the resources of your system and therefore you might want to increase the amount of resources. This can be done in two ways: vertically or horizontally.

Vertically simply means the system itself gets bigger: more CPU, RAM and (faster?) disk. Horizontally means you add more systems to the pool of resources and move workloads to the other systems.

We cover this level of monitoring (and alerting) in "ITOps Level Two".

Next#

Let's explore measuring the utilisation of each resource on our Ubuntu Server. We're going to start with a classic: top.