Fri 03 February 2017

Using and understanding top to measure CPU and memory consumption

by Rob Kendrick, 2017

This is a quick email detailing what the output from the "top" tool means. top is a utility that shows a list of running processes on a UNIX system, along with statistics on their memory and CPU consumption.

I've attached to this email an image of some example top output from NGI. I've made it an image just in case people's mail clients decide to render it in a variable-pitch font, destroying readability.

I'll dismantle the lines one by one, from the top. I'm sure a lot of this is already known to you all, or is obvious, but it may be worth exploring it all just in case, as there's a lot of confusion about some of the data here. As such, this email is a little longer than I had originally intended.

If anyone has any questions, feel free to contact me directly, or reply to this posting on the mailing list.

Line 1: time, uptime, logged in users, load averages.

- time: Current local time
- uptime: time since the system booted
- logged in users: number of interactive sessions (console, ssh, etc)
- load averages:
  There are three load average numbers, and these are the average "load"
  of the system over 1 minute, 5 minutes, and 15 minutes.   Load is
  calculated by adding 1 for any process or thread that is one of:
    - Runnable, i.e. running or ready to run as soon as a CPU is
      available (R-state)
    - Waiting on IO (D-state)
  In the attached image, the instantaneous load is well north of 20,
  but a single instantaneous figure is not useful to know, which is
  why the numbers are averaged.
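To illustrate why the averaged figures lag behind the instantaneous load, here is a minimal sketch of an exponentially damped average. It assumes a sample is taken every 5 seconds and uses floating point; the kernel's own calculation uses fixed-point arithmetic, so treat this as an approximation. The names update_load and simulate, and the workload itself, are invented for this example.

```python
import math


def update_load(previous, active, interval=5.0, window=60.0):
    """Fold one sample of the active task count into a damped average.

    previous: the load average before this sample
    active:   tasks that are runnable or in D-state right now
    interval: seconds between samples (assumed to be 5 here)
    window:   averaging period in seconds (60, 300 or 900)
    """
    decay = math.exp(-interval / window)
    return previous * decay + active * (1.0 - decay)


def simulate(active_counts, window):
    """Run a sequence of samples through the average, starting from idle."""
    load = 0.0
    for active in active_counts:
        load = update_load(load, active, window=window)
    return load


# Hypothetical workload: 20 tasks runnable or in D-state for one minute
# (12 samples at 5-second intervals), starting from an idle system.
burst = [20] * 12
for label, window in (("1 min", 60.0), ("5 min", 300.0), ("15 min", 900.0)):
    print(label, round(simulate(burst, window), 2))
```

With this workload the 1-minute figure has only climbed to roughly 12.6 after a full minute at a constant load of 20, and the 5 and 15 minute figures trail well behind it. Comparing the three numbers therefore tells you whether load is rising or falling.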

Line 2: Task statistics

- Running tasks (or more accurately, runnable, as you can only run
  as many tasks as you have hardware threads).
- Sleeping tasks, which are not currently requesting any CPU or IO
  resources at all (perhaps waiting on a socket, for example).
- Stopped tasks, which have been sent SIGSTOP to temporarily remove
  them from the scheduler's task list.
- Zombie tasks.  These have died (naturally or otherwise), but their
  parent process is yet to call wait() on them to collect their
  exit status.
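As a rough illustration of how the task line can be derived, the sketch below tallies single-letter process states into the four categories. The function name is invented for this example, and the grouping (anything that is not R, T, t or Z counts as sleeping, which includes D-state) is my assumption about how top buckets them rather than something confirmed here.

```python
def summarise_states(states):
    """Tally process state letters into the categories on top's task line."""
    counts = {"total": 0, "running": 0, "sleeping": 0, "stopped": 0, "zombie": 0}
    for state in states:
        counts["total"] += 1
        if state == "R":
            counts["running"] += 1
        elif state in ("T", "t"):
            # T = stopped by a signal, t = stopped by a debugger (tracing)
            counts["stopped"] += 1
        elif state == "Z":
            counts["zombie"] += 1
        else:
            # Assumption: everything else (S, D, ...) is reported as sleeping
            counts["sleeping"] += 1
    return counts


# Invented sample: one runnable, two sleeping, one in D-state,
# one zombie and one stopped task.
sample_states = ["R", "S", "S", "D", "Z", "T"]
summary = summarise_states(sample_states)
print(summary)
```

On a Linux system the state letters could be gathered from the "State:" line of each /proc/<pid>/status file.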

Line 3: Total userland usage, kernel/system usage, "nice" process usage, idle time, IO wait, hardware interrupt time, software interrupt time, stolen time.

The important thing to remember is that this is all real time, not
CPU time, and they don't always add up.  They are all expressed as
percentages of available CPU time used since top last refreshed.

- Total userland usage: The amount of available CPU time used by
  userland processes.
- Kernel/system usage: The amount of available CPU time used by
  Linux itself.
- "Nice" process usage: The amount of available CPU time used by
  low-priority processes (manipulated by the 'nice' command).
- Idle time: The amount of available CPU time that went unused.
- IO wait: The amount of available CPU time spent idle while
  waiting for outstanding IO to complete.
- Hardware interrupt time: The amount of available CPU time used
  servicing hardware interrupts.
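- Software interrupt time: The amount of available CPU time used
  servicing software interrupts.  (Deferred interrupt work such as
  network packet processing.  System calls are not counted here;
  they fall under kernel/system usage.)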
- Stolen time: Estimate of the time taken away from this system by
  the hypervisor when running in a VM.  This should always
  be zero on NGI.
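The percentages on this line can be approximated from two readings of the aggregate "cpu" line in /proc/stat, which holds cumulative tick counts per category. The sketch below is one way to do that, not necessarily how top does it; the function names and sample numbers are made up for illustration. It divides by the sum of all the deltas, so unlike top's rounded display its figures always add up to 100.

```python
# Order of the first eight counters on the "cpu" line of /proc/stat:
# user, nice, system, idle, iowait, irq, softirq, steal.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Turn a 'cpu ...' line from /proc/stat into a dict of tick counts."""
    parts = line.split()
    # parts[0] is the label "cpu"; the next eight values are the counters.
    return dict(zip(FIELDS, (int(value) for value in parts[1:9])))


def cpu_percentages(before, after):
    """Percentage of elapsed ticks spent in each category between samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}


# Two invented samples, as if taken one refresh interval apart.
first = parse_cpu_line("cpu 1000 50 300 8000 100 10 40 0 0 0")
second = parse_cpu_line("cpu 1600 50 500 8100 150 20 80 0 0 0")
pct = cpu_percentages(first, second)
print(pct)
```

On a real system the two lines would come from reading the first line of /proc/stat, sleeping for the refresh interval, and reading it again.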

Line 4: Total hardware memory, total used by processes and kernel, unused memory, kernel buffers.

- Total hardware memory is what is wired to the CPU and not
  reserved for other parts of the system.  In this case, 4GB minus
  the memory used by PCI and the GPU.
- Total used by processes is what is consumed in userland, ie
- Unused memory is spare.
- Buffers is essentially memory used by Linux itself.

Line 5: Total swap, used swap, free swap, cache memory

- Total/used/free swap is hopefully self-explanatory.
- Cache memory is trickier.  This is *physical* memory that has
  been used to cache the contents of block devices (eMMC, SSD,
  USB sticks, etc).  It is not "used" in a traditional sense;
  the instant a userland process requires more RAM and there
  is no unused RAM, this RAM is instantly raided to satisfy the
  request.
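Because cache memory is not "used" in the traditional sense, a low free figure on its own says little. The sketch below parses text in the /proc/meminfo format and adds the buffers and cache back onto the free figure. The function names and sample values are invented, and the result is only a rough upper bound: not every cached page can be dropped immediately, and on newer kernels the MemAvailable field is a better estimate.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Combine free memory with buffers and cache (all values in kB)."""
    cache = info["Buffers"] + info["Cached"]
    return {
        "total": info["MemTotal"],
        "free": info["MemFree"],
        "buffers_and_cache": cache,
        # Rough upper bound on what could be handed to a process.
        "free_plus_cache": info["MemFree"] + cache,
    }


# Invented sample in the /proc/meminfo format.
sample = """MemTotal:        3900000 kB
MemFree:          200000 kB
Buffers:          100000 kB
Cached:          1500000 kB
SwapTotal:             0 kB
SwapFree:              0 kB
"""
mem = memory_summary(parse_meminfo(sample))
print(mem)
```

In this made-up sample only about 5% of RAM is strictly free, yet close to half of it could be made available if a process asked for it.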

Now comes a table of processes and statistics. The columns are as follows:

- PID: Process identifier.  Typically a 16 bit number, 0 is 
  forbidden, 1 is 'init', which is the parent of last resort.
  NGI uses systemd as init.

- USER: The user the process is running as.

- PR: The process's priority.  For normal processes top shows 20
  plus the niceness, so this normally runs from 0 to 39, where
  the *lower* the number, the higher the priority.
  There is also an 'RT' real-time priority which trumps all.
  It is not "real time", however.

- NI: Niceness.  This is an offset to apply to the normal
  priority, from -20 (highest priority) to 19 (lowest).  It is
  normally zero.

- VIRT: Virtual memory size.  This is the size of the process's
  address space.  This doesn't mean it is all memory used,
  however: some may be maps of files on disc, or may be yet
  unused and not had real RAM allocated to back it.

- RES: Resident size.  This is how much RAM the process actually
  has allocated to it.  This may be backed by physical RAM or by

- SHR: Sharable memory.  This is the amount of memory being used
  that could possibly be shared with other processes.  This includes
  shared memory used for IPC, as well as memory maps of files on
  disc (such as the executable itself, shared libraries, etc.)

- S: State.  S = sleeping, R = runnable, D = blocked waiting for
  IO, Z = waiting for parent process to collect corpse/exit status.

- CPU: Percentage of CPU used by this process since the last refresh.
  Note that 100% is 100% of one thread.  In a quad-core system, this
  could reach the high 300s.

- MEM: Percentage of physical RAM that this process is using,
  calculated from its RES, described above.

- TIME: CPU seconds used by this process since it started.  What this
  means is that if it were to use a whole CPU thread for a second (or
  50% of a thread for two seconds), 1 will be added.

- COMMAND: Title of process.
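The relationship between the CPU and TIME columns can be sketched as follows: the CPU percentage is the CPU seconds a process gained between two refreshes, divided by the wall-clock time between them. This is a simplified illustration with invented names and numbers, not top's actual code.

```python
def cpu_percent(cpu_seconds_before, cpu_seconds_after, wall_seconds):
    """CPU usage between two samples, where 100.0 means one whole thread.

    cpu_seconds_before/after: cumulative CPU seconds (the TIME column)
    wall_seconds:             real time elapsed between the two samples
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return 100.0 * (cpu_seconds_after - cpu_seconds_before) / wall_seconds


# Hypothetical single-threaded process: half a thread for 3 seconds.
light = cpu_percent(10.0, 11.5, 3.0)

# Hypothetical process with four busy threads on a quad-core system:
# it gains 11.4 CPU seconds in only 3 seconds of real time.
heavy = cpu_percent(100.0, 111.4, 3.0)

print(round(light, 1), round(heavy, 1))
```

The second case shows how a multi-threaded process ends up in the high 300s: its TIME grows faster than the clock on the wall.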

Note: This document is licensed under CC-BY-SA and was originally created by Codethink Ltd.