SAP’s own Java VM SapMachine comes with a handy monitoring tool you may not know: SapMachine Vitals.
Think sysstat tools – pidstat, iostat, etc. – plus JVM stats, rolled into one and baked into our JVM for your convenience. Curious? Read on.
Our SapMachine Vitals give you a gap-less history of metrics: JVM, process, and system data, periodically sampled and buffered inside the JVM. No need to go back in time to start collecting data – it's right there. The record buffer spans ten days, with the last hour kept in higher resolution.
Our developers use SapMachine Vitals for the first-response analysis of JVM issues. Its charm lies in its convenience: while you *could* get all this data with a combination of different tools, Vitals are just very practical and painless.
Enabling the Vitals
Nothing to do: SapMachine Vitals are enabled by default.
Their overhead is very low: they cost about 300 KB of memory, a bit more for long-running JVMs, and the CPU overhead is negligible. Note that you can still disable them manually with -XX:-EnableVitals.
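If you want them off anyway, it is just a matter of adding that flag to the command line (my-java-program stands in for your own launch command, as in the examples further below):
java -XX:-EnableVitals my-java-program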
Accessing the Vitals Report
The Vitals report can be accessed via jcmd, via the error log, or at JVM exit.
Via jcmd
To print the Vitals report, use:
jcmd <pid> VM.vitals
The VM.vitals command has several useful suboptions. help is your friend:
jcmd <pid> help VM.vitals
Syntax : VM.vitals [options]
Options: (options must be specified using the <key> or <key>=<value> syntax)
scale : [optional] Memory usage in which to scale. Valid values are: k, m, g (fixed scale) or "dynamic" for a dynamically chosen scale. (STRING, dynamic)
csv : [optional] csv format. (BOOLEAN, false)
no-legend : [optional] Omit legend. (BOOLEAN, false)
reverse : [optional] Reverse printing order. (BOOLEAN, false)
now : [optional] sample now values (BOOLEAN, false)
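For example, to dump the whole history in CSV form, scaled to megabytes and in oldest-first order (all suboptions as listed above), you could run:
jcmd <pid> VM.vitals scale=m csv reverse > vitals.csv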
Via error log (hs_err_pid.log)
Vitals are part of the hs_err_pid.log file (following the NMT section) that the JVM generates in case of an abnormal abort.
That can be highly useful for investigating JVM issues: Vitals show you the development of JVM and system stats right up to the crash point. Since the history buffer spans the last ten days, we usually get a gap-less history, too. That makes spotting issues like slowly rising resource leaks a lot easier.
At JVM exit
Start the JVM with -XX:+PrintVitalsAtExit, and it will print the Vitals report to stdout just before it exits.
Its companion -XX:+DumpVitalsAtExit instead dumps two files into the current directory: a text file and a CSV file (sapmachine_vitals_pid.txt and sapmachine_vitals_pid.csv, respectively). Their names can be overridden with -XX:VitalsFile=your-filename.
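For instance, to collect the exit dumps under a custom name (the base name my-vitals is just a made-up example, and my-java-program stands in for your own launch command):
java -XX:+DumpVitalsAtExit -XX:VitalsFile=my-vitals my-java-program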
Both switches are very handy for quick analysis, especially if your program is short-lived.
For example, want to know the RSS envelope of your program? Its heap usage, how many classes it loads, how many threads it spawns? Answering these questions would normally need several tools and could be fiddly: for an RSS envelope, say, you would fire up pidstat and then have fun isolating its output. With SapMachine, you just run:
java -XX:+DumpVitalsAtExit my-java-program
And since -XX:+DumpVitalsAtExit gives you a CSV file, too, you can drop that one directly into Excel to get quick footprint diagrams.
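If you prefer the shell to a spreadsheet, here is a rough sketch for pulling a single metric out of that CSV – it assumes plain comma-separated values and that the header row uses the short column names from the legend further below (rss-all in this case):
awk -F',' 'NR==1 { for (i=1; i<=NF; i++) if ($i == "rss-all") c = i; next } c { print $c }' sapmachine_vitals_pid.csv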
For very short-lived programs, consider increasing the Vitals sample frequency. By default, Vitals take one data sample every 10 seconds. Using -XX:VitalsSampleInterval, you can change that interval:
java -XX:+DumpVitalsAtExit -XX:VitalsSampleInterval=1 my-java-program
Note that the Vitals report always contains at least one sample, taken at JVM exit, regardless of how short-lived the program was.
How to read the Vitals report
The Vitals report has three sub-sections: system (and container) stats, process stats, and JVM stats:
------------system------------
avail: Memory available without swapping [host] [krn]
comm: Committed memory [host]
crt: Committed-to-Commit-Limit ratio (percent) [host]
swap: Swap space used [host]
si: Number of pages swapped in [host] [delta]
so: Number of pages swapped out [host] [delta]
p: Number of processes
t: Number of threads
tr: Number of threads running
tb: Number of threads blocked on disk IO
cpu-us: CPU user time [host]
cpu-sy: CPU system time [host]
cpu-id: CPU idle time [host]
cpu-st: CPU time stolen [host]
cpu-gu: CPU time spent on guest [host]
cgroup-lim: cgroup memory limit [cgrp]
cgroup-slim: cgroup memory soft limit [cgrp]
cgroup-usg: cgroup memory usage [cgrp]
cgroup-kusg: cgroup kernel memory usage (cgroup v1 only) [cgrp]
-----------process------------
virt: Virtual size
rss-all: Resident set size, total
rss-anon: Resident set size, anonymous memory [krn]
rss-file: Resident set size, file mappings [krn]
rss-shm: Resident set size, shared memory [krn]
swdo: Memory swapped out
cheap-usd: C-Heap, in-use allocations (may be unavailable if RSS > 4G) [glibc]
cheap-free: C-Heap, bytes in free blocks (may be unavailable if RSS > 4G) [glibc]
cpu-us: Process cpu user time
cpu-sy: Process cpu system time
io-of: Number of open files
io-rd: IO bytes read from storage or cache
io-wr: IO bytes written
thr: Number of native threads
-------------jvm--------------
heap-comm: Java Heap Size, committed
heap-used: Java Heap Size, used
meta-comm: Meta Space Size (class+nonclass), committed
meta-used: Meta Space Size (class+nonclass), used
meta-csc: Class Space Size, committed [cs]
meta-csu: Class Space Size, used [cs]
meta-gctr: GC threshold
code: Code cache, committed
nmt-mlc: Memory malloced by hotspot [nmt]
nmt-map: Memory mapped by hotspot [nmt]
nmt-gc: NMT "gc" (GC-overhead, malloc and mmap) [nmt]
nmt-oth: NMT "other" (typically DBB or Unsafe.allocateMemory) [nmt]
nmt-ovh: NMT overhead [nmt]
jthr-num: Number of java threads
jthr-nd: Number of non-daemon java threads
jthr-cr: Threads created [delta]
jthr-st: Total reserved size of java thread stacks [nmt] [linux]
cldg-num: Classloader Data
cldg-anon: Anonymous CLD
cls-num: Classes (instance + array)
cls-ld: Classes loaded [delta]
cls-uld: Classes unloaded [delta]
By default, the report prints its records chronologically in descending order (youngest first). The last hour will be shown in 10-second intervals, dropping down to 10-minute intervals after that.
So, what can I do with it?
It won’t replace a full profiler (of course). But it can alleviate the need for more complex tools when doing first-response analysis. Some examples:
My JVM process shows high RSS. My JVM crashed with a native OOM error. My JVM keeps getting OOM killed. What’s going on?
SapMachine Vitals gives you:
- Machine-wide memory stats: how much memory is available and committed, how far overcommitted are we, swap activity
- Container memory statistics and limits
- Process memory stats: virtual size, RSS (sub-categories, too, on newer kernels), process swap size
- What NMT thinks we allocated – both C-heap and virtual memory
- What we really allocated according to *glibc* – which can be hilariously off from the NMT numbers (think libc-internal overhead or non-JVM allocations)
- How much memory the C-heap retains internally
- JVM memory areas: Java heap, code heap, and Metaspace sizes
- Some of the more useful NMT sub-categories, e.g., the combined size of GC-related structures outside the Java heap or how much we use for DirectByteBuffers.
These data points help narrow down the cause of OOMs and help us avoid “Tuning By Folklore.” For example, a high RSS may be caused by issues outside the core JVM, in which case NMT cannot tell you anything and would be the wrong first-responder tool.
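A typical first step, assuming NMT was switched on at startup, is to put the Vitals history side by side with an NMT summary – if rss-all keeps climbing while the NMT numbers stay flat, the culprit is probably outside the JVM's bookkeeping:
jcmd <pid> VM.vitals reverse          # oldest-first history, including rss-all, nmt-mlc, nmt-map
jcmd <pid> VM.native_memory summary   # NMT breakdown (needs -XX:NativeMemoryTracking=summary at startup)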
JVM ran out of Metaspace. What’s going on?
Vitals show you the number of class loaders, classes, class load activity, Metaspace sizes, and Metaspace GC threshold.
All of this data can point you in the right direction:
If the number of classes makes sense to you and they use up the expected amount of space [3], you are either dealing with Metaspace fragmentation (Java versions older than 16; later versions defragment a lot better), or Metaspace is just plain too small.
If the number of classes seems abnormally high, you may have a class loader leak or misjudged how many classes the application loads.
Depending on the answers, your next analysis tool may be VM.metaspace [4] or a Java profiler of your choice.
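If you go the class loader route, the next drill-down could be as simple as the jcmd command mentioned above:
jcmd <pid> VM.metaspace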
Other insights
We see the CPU time spent in kernel and userland. We also see the JVM’s IO activity. These data points give us a first indication of whether we face a problem with the OS or whether the JVM itself burns CPU – for the latter, the next step would be to check the GC activity.
Vitals tell us how many threads are alive – both JVM threads and native threads created outside the JVM. The latter is interesting in the context of a JVM embedded in a native application. We also see the thread creation activity.
We also see the number of open file handles and how many bytes we have read and written.
Why is this SapMachine-only? Why not contribute it to mainline OpenJDK?
We tried. Unfortunately, we were unable to convince upstream [5].
But since Vitals proved so valuable as a lightweight first-responder tool, we kept the feature as a downstream-only addition in SapMachine.
[1] SapMachine.io
[2] SapMachine Vitals Documentation
[3] Sizing Metaspace
[4] Analyzing Metaspace with jcmd
[5] https://mail.openjdk.org/pipermail/serviceability-dev/2018-November/025909.html