Metricspedia
System checks are implemented per OS:
We pull system.mem.free directly from what the OS gives us, and we precompute system.mem.used as system.mem.total - system.mem.free. To exclude cached memory, you can subtract cached memory from used memory:
avg:system.mem.used{host:myhost} - avg:system.mem.cached{host:myhost}
We also provide a convenience metric called system.mem.usable, which is the sum of free, buffered and cached memory, on the assumption that the OS will give up buffered and cached memory to other applications that need it. That metric is also available as a percentage, system.mem.pct_usable, which is useful for alerting.
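As a rough illustration of the arithmetic described above, here is a minimal Python sketch. It is not the agent's actual code: the function name derive_memory_metrics and the key used_excluding_cached are hypothetical, all inputs are assumed to be in the same unit, and pct_usable is computed here as a fraction of total (the text above does not say whether the agent reports it on a 0-1 or 0-100 scale).

```python
def derive_memory_metrics(total, free, buffered, cached):
    """Derive the memory metrics described above from raw OS values.

    All arguments must use the same unit (for example MB).
    This is an illustrative sketch, not the agent's implementation.
    """
    used = total - free                # system.mem.used = total - free
    usable = free + buffered + cached  # memory the OS could hand to other apps
    return {
        "system.mem.used": used,
        # Hypothetical key: mirrors the "used - cached" query shown above.
        "used_excluding_cached": used - cached,
        "system.mem.usable": usable,
        # Assumption: expressed as a fraction of total; guard against total == 0.
        "system.mem.pct_usable": float(usable) / total if total else 0.0,
    }
```

For example, a host with 8000 MB total, 1000 MB free, 500 MB buffered and 2500 MB cached has 7000 MB used, 4500 MB used once cached memory is excluded, and 4000 MB usable, which is half of total.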
If you're interested in seeing how we compute these memory metrics, this is a link to the code
The system.load family of metrics is collected from the operating system and gives a high-level measure of how backed up the machine's CPUs are. The number is roughly the average number of processes running or waiting for CPU time over the last N minutes, where N is the numeric suffix of the load metric, i.e. system.load.5 refers to the last 5 minutes.
A healthy system should have a load value of about the number of CPUs it has. That means the CPUs are well utilized without being overloaded. Since machines have many CPUs these days, a load of 4, for example, could be good or bad depending on how many CPUs that machine has. For convenience, we've created a derived metric family, system.load.norm, which is system.load divided by the number of CPUs on that machine. This value is useful for alerting, since a value greater than 1 always means there is more work queued than the CPUs can handle.
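The normalization above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the agent's code: the helper names normalized_load and load_metrics are hypothetical, and the 1, 5 and 15 suffixes on system.load.norm are assumed to mirror the system.load family.

```python
def normalized_load(load, cpu_count):
    """Return a load average divided by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load / float(cpu_count)


def load_metrics(loadavg, cpu_count):
    """Build raw and normalized load metrics from a (1, 5, 15) minute tuple.

    Metric names with the norm prefix are assumed to follow the same
    suffix convention as system.load.
    """
    metrics = {}
    for suffix, value in zip(("1", "5", "15"), loadavg):
        metrics["system.load." + suffix] = value
        metrics["system.load.norm." + suffix] = normalized_load(value, cpu_count)
    return metrics
```

On a Unix-like host, the inputs could come from the standard library, for instance load_metrics(os.getloadavg(), os.cpu_count() or 1). With a 1-minute load of 4.0 on a 4-CPU machine, the normalized value is 1.0, which is right at the threshold described above.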
If you're interested in seeing how we compute these load metrics, this is a link to the code