This blog post tells the story of how we tackled a particular Java memory issue which caused our Cloud Foundry applications to crash. The issue itself might be rather special, but the principles and techniques we used can be applied in a broader context.
The Problem
We are running in SAP BTP Cloud Foundry, which so far has served us well, because it comes close to a “deploy it and it just runs” experience. Indeed, the platform took care of almost all of the infrastructure details. That was about to change.
What Happened?
One day, out of the blue, our integration test suite, which is executed after each deployment, started to fail: suddenly, the application under test was no longer available. When we restarted the test suite afterwards, it usually finished successfully.
Checking the application logs just revealed the following:
[APP/PROC/WEB/0] OUT Exit status 137 (out of memory)
That was all. Using the metrics provided by our application, we checked the size of the Java heap space, but there was nothing unusual. If anything, it was even smaller than usual, because the application was freshly started.
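As an aside, the heap usage we monitor can be read at runtime via the standard management API. The snippet below is only a sketch of that idea, not our actual metrics code:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapMetrics {
  public static void main(String[] args) {
    // Query the JVM's own view of the heap (used vs. committed vs. max).
    MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    System.out.printf(
        "heap used=%dMB committed=%dMB max=%dMB%n",
        heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
  }
}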
The message above actually means that the application was killed by the dreaded Linux out-of-memory (OOM) killer. The Java Virtual Machine (JVM) never realized that it was running out of memory (otherwise it would have told us). Nevertheless, the total memory consumption of the container had exceeded its limit, and the container was killed without a fuss.
The Analysis
In the Dark
The first thing you want to do in such a situation is to review your memory configuration. Could it be that your JVM thinks it is allowed to use more memory than the container actually provides? Could it be that the headroom for the so-called native memory consumption is too small?
Confused? That is understandable, but we have got you covered: there is a great series of blog posts by Tim Gerlach which explains this intricate domain.
So we grabbed a pencil and did some calculations, but it turned out that the Java Buildpack Memory Calculator actually does a good job for our application.
To be on the safe side, we added more headroom by freezing the Java heap size while increasing the overall memory available to the container. But the OOM killer was not impressed; the application was killed all the same. We then doubled and even quadrupled the available memory. No dice, the application still got killed.
In the meantime, other colleagues on my team got busy deploying older versions of the application in an attempt to identify a change which could have caused the problem. However, the problem stubbornly kept happening no matter what we did.
“Fascinating,” to quote Mr. Spock, but to the human members of the crew it felt more like a minor existential crisis.
Some First Clues
Sometimes it is useful to take a step back and reiterate what one does know for certain.
We were sure that the surge in memory consumption was due to high native memory consumption, because we had all other variables under control.
The JVM provides a feature for tracking native memory consumption. It is activated with:
-XX:NativeMemoryTracking=[off | summary | detail]
Once it is active, the JVM can be prompted to print consumption reports, for example by using the “jcmd” command line tool.
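For example, a report can be requested like this (the JVM’s process id inside the container is an assumption and depends on your setup):

jcmd <pid> VM.native_memory summary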
For us, that was not practical, because things were happening way too fast. One moment everything was normal, the next moment the application got killed. Whatever happened, happened too quickly for us to observe manually.
We ended up running the following method every 200 ms:
import java.lang.management.ManagementFactory;
import java.util.regex.Pattern;
import javax.management.ObjectName;

public void report() throws Exception {
  // Ask the JVM for a native memory tracking summary via the
  // DiagnosticCommand MBean (the same data "jcmd <pid> VM.native_memory" prints).
  String[] args = {"summary"};
  String report =
      (String)
          ManagementFactory.getPlatformMBeanServer()
              .invoke(
                  new ObjectName("com.sun.management:type=DiagnosticCommand"),
                  "vmNativeMemory",
                  new Object[] {args},
                  new String[] {String[].class.getName()});

  // The first "committed=<n>KB" occurrence is the total committed native memory.
  Pattern p = Pattern.compile("committed=(\\d+)KB");
  var matcher = p.matcher(report);
  var found = matcher.find();
  assert found;
  int committedInKb = Integer.parseInt(matcher.group(1));

  // Only print the full report when consumption exceeds the threshold.
  if (committedInKb > 800000 /* ~800 megabytes */) {
    log.warn("MEMORY SURGE {}", report);
  }
}
This checks the current memory consumption. If it is above a certain threshold, a summary of the native memory consumption is printed.
This code snippet is meant only for illustration. In fact, there is a much better way to achieve the same thing using SapMachine. More about that later.
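For completeness, here is a minimal sketch of how such a check could be scheduled every 200 ms; the class and method names are illustrative and not taken from our code base:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NativeMemoryWatchdog {

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // Runs the given check immediately and then every 200 ms.
  public void start(Runnable reportCheck) {
    scheduler.scheduleAtFixedRate(reportCheck, 0, 200, TimeUnit.MILLISECONDS);
  }
}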
Having this in place actually yielded some insights. Here is an excerpt of the report the application printed right before it was killed:
Total: reserved=1597177KB, committed=969081KB

- Java Heap (reserved=331776KB, committed=205196KB)
            (mmap: reserved=331776KB, committed=205196KB)

-     Class (reserved=323953KB, committed=144661KB)
            (classes #26089)
            (  instance classes #24462, array classes #1627)
            (malloc=5309KB #78670)
            (mmap: reserved=318644KB, committed=139352KB)
            (  Metadata:   )
            (    reserved=120832KB, committed=120792KB)
            (    used=118117KB)
            (    free=2675KB)
            (    waste=0KB =0.00%)
            (  Class space:)
            (    reserved=197812KB, committed=18560KB)
            (    used=16720KB)
            (    free=1840KB)
            (    waste=0KB =0.00%)

-    Thread (reserved=152845KB, committed=16925KB)
            (thread #148)
            (stack: reserved=152136KB, committed=16216KB)
            (malloc=540KB #890)
            (arena=170KB #292)

-      Code (reserved=252877KB, committed=66985KB)
            (malloc=5189KB #19959)
            (mmap: reserved=247688KB, committed=61796KB)

-        GC (reserved=1856KB, committed=1444KB)
            (malloc=768KB #3419)
            (mmap: reserved=1088KB, committed=676KB)

-  Compiler (reserved=492597KB, committed=492597KB)
            (malloc=391KB #2570)
            (arena=492206KB #13)

-  Internal (reserved=1056KB, committed=1056KB)
            (malloc=1024KB #2409)
            (mmap: reserved=32KB, committed=32KB)
The “Compiler” section stands out: it consumes almost 500 MB (committed=492597KB), whereas a value of around 300 KB is normal for our application.
What does “Compiler” actually mean in this context? It refers to the Java just-in-time (JIT) compiler, which compiles bytecode into machine code at runtime and thereby boosts the performance of a Java application by an order of magnitude. For performance reasons, disabling the JIT compiler is not an option.
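If you want to see the JIT compiler at work in your own application, its activity can be printed with a standard (and rather verbose) diagnostic flag:

-XX:+PrintCompilation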
That was our first clue into the nature of the problem. From here on, two different workstreams developed.
Going deeper with SapMachine
We got in contact with the Java experts @SAP. To begin with, they proposed to use SapMachine instead of OpenJDK, because SapMachine has extended monitoring capabilities. What we had tried with the code snippet above can be achieved much more simply using the SapMachine High Memory Reports feature: once activated via JVM parameters, it can be configured to produce all kinds of reports when memory consumption exceeds certain thresholds.
In our case, we used:
-XX:NativeMemoryTracking=detail
-XX:+HiMemReport
-XX:HiMemReportExec=VM.info
-XX:HiMemReportDir=/var/vcap/data/652110e7-75e2-4a3c-9aea-53fef192229c
-Xmx330575K
-XX:+UnlockDiagnosticVMOptions
-XX:+LogCompilation
-XX:LogFile=/var/vcap/data/652110e7-75e2-4a3c-9aea-53fef192229c/jit.log
This logs both memory consumption data and the activities of the suspected culprit: the JIT compiler.
But where to store those logs? The problem is that in Cloud Foundry the filesystem of an application is discarded when the application is killed, so there is no persistence out of the box. We used the Cloud Foundry volume service to solve this. The paths you see in the options above refer to a persistent volume which has been attached to the application container; if the application is killed, the content of the volume remains. Once the application is up again, one can connect using “cf ssh” and retrieve the gathered data with a classic tar pipe.
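For example, a tar pipe along these lines does the job (the application name is made up, and the volume path is a placeholder for the directory the reports were written to):

cf ssh my-app -c "tar cf - -C /var/vcap/data/<volume-id> ." > reports.tar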
Once all the reports had been gathered, the SapMachine colleagues got started on the detailed analysis. They identified the problem as an instance of this issue: JDK-8291665.
Implicit JRE Version Upgrades
In parallel, we tried to understand why the problem hit us out of the blue, without any explicit change on our side. This revealed some facts about the Cloud Foundry Java Buildpack.
Here is how we specified which JRE we want to use:
open_jdk_config: '{ jre: { version: 11.+ } }'
Our assumption was that the actual minor and patch version of JRE 11 is fixed by the Java buildpack. That turned out not to be so. Instead, we got the latest available version of JRE 11, which at that moment was 11.0.16 and no longer 11.0.15. So this implicit upgrade was the “trigger” for the issue we were facing.
That led us to a trivial mitigation:
open_jdk_config: '{ jre: { version: 11.0.15_10 } }'
I would not recommend pinning the minor and patch version in general, because you will no longer automatically get security fixes. However, it fixed our pressing woes, and now we can wait until a fixed Java version is released.
Conclusion
In application development, every so often one encounters problems which lie outside one’s technical domain. It can require a lot of time and research effort to drive the analysis to a point where the problem can be handed over to an expert.
While these unplanned efforts are certainly problematic from a management perspective, they also present a great opportunity for the team to gain knowledge. In our example, this includes:
- Java memory management, in particular native memory consumption
- Java Runtime Compilation
- Memory monitoring capabilities of SapMachine
- Some operational tricks when working with Cloud Foundry
I hope some of this information was useful to you. If you have questions or feedback, please leave a comment.