Using strace to understand a 10x Java performance improvement

Mar 14, 2017 • packagecloud

TL;DR

In this blog post, we’ll examine the UseLinuxPosixThreadCPUClocks command line flag to the JVM. Starting in a patch update of the Sun JVM 1.6, the default value of this switch was changed to true, yielding a nice performance boost, roughly 10x in our test. Users of more recent JVMs get this behavior by default, but exhuming this performance bug is interesting as it shows how strace can be useful even for Java programs.

Examining old performance bugs is important because it is instructive for us as programmers. This sort of computer “archaeology” allows us to examine how tools available to us today can detect problems that were significant in the past. We can also learn interesting history lessons about the development and evolution of our computer systems, which can help us understand how we got to where we are today.


IBM JVM executes a piece of code 25x faster than Sun JVM

This very thorough bug report against the Sun JVM shows a small piece of Java code that, according to the reporter, runs about 25x slower on the Sun JVM than on the IBM JVM. (Note: our tests on a modern JVM show a 10x improvement; results and analysis below.)

The code is simply attempting to get the current thread’s CPU time by calling the getCurrentThreadCpuTime function in a tight loop, like so:

private final static long cputime() {
    return java.lang.management.ManagementFactory.getThreadMXBean()
    .getCurrentThreadCpuTime();
}

“Fix” by setting command line parameter

The performance issue can be mitigated on these earlier JVM versions by simply setting the command line flag -XX:+UseLinuxPosixThreadCPUClocks. This flag instructs the JVM to attempt to obtain the thread timing information by using some system calls and flags that are outlined in POSIX. As a result of this bug report, this option’s default value was changed to true.

Why was this option not enabled sooner and what, exactly, does it do? Before we can dive into the specifics of this flag and the purpose it serves, let’s go over some important prerequisite information.

Prerequisite information

The internals of measuring time on Linux changed both in the kernel and in glibc sometime around kernel 2.6.12. We’ll need to understand in more detail what these changes entailed to fully understand what the JVM does with the UseLinuxPosixThreadCPUClocks flag.

Measuring time on Linux systems

There are a series of functions user programs can use to measure time for processes and threads. Understanding these functions and how they relate to each other is important for understanding the performance bug we are investigating.

The key functions in this particular case are:

  • clock_gettime: A system call that implements several different types of clocks, specified by a clock id.
  • clock_getcpuclockid: A glibc function which gets the clock id of a specified process (by process id).
  • pthread_getcpuclockid: A glibc function which gets the clock id of a specified thread (by pthread_t).

Depending on whether a program wants to get the process time or the thread time, it will call clock_getcpuclockid or pthread_getcpuclockid and pass the returned clock id to clock_gettime. Note that the API provides a more streamlined method for getting the current process or thread CPU time with CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID, respectively. Read the man page for more details.

Allowing CPU time to be obtained for processes and threads is part of what makes this API on Linux POSIX compliant. It was not always this way. There was a time when clock_gettime returned the total time since a thread or process was created (instead of the total CPU time).

TSC madness and POSIX compliant clocks

Originally, glibc’s implementation of POSIX clocks did not meet the definition put forth in v3 of the Single Unix Specification. In particular, the specification dictates that clock_gettime should return the amount of CPU processing time. The original glibc implementation was simply returning the amount of time that had passed since the process or thread started its execution.

Moreover, the original implementation was based on reading the value of the Timestamp Counter register (TSC) on the CPU. Reading this value is problematic for numerous reasons: many CPUs at the time (and still today) do not have stable TSCs. This means that:

  • The TSC of the CPU may increment at a rate that varies with the CPU frequency, which may be adjusted by various power saving systems.
  • The TSC of various CPUs on SMP systems may not all be synchronized. If a process or thread migrates from one CPU to another the values returned by clock_gettime may be bogus.

This is documented in the “NOTES” section of the clock_gettime man page, which is an entertaining and interesting read.

Eventually, a series of patches were proposed to fix this, making clock_gettime POSIX compliant and safe to use even if threads are migrated between CPUs with different TSC values.

This is all well and good, but how does this relate back to UseLinuxPosixThreadCPUClocks?

What does UseLinuxPosixThreadCPUClocks do?

Since glibc versions before 2.4 and Linux kernels before 2.6.12 did not implement POSIX compliant CPU timing, the JVM added the flag UseLinuxPosixThreadCPUClocks to control which of two CPU timing methods would be used on Linux.

During the initialization of the HotSpot Java VM a function called os::Linux::fast_thread_clock_init is called which begins by first checking the UseLinuxPosixThreadCPUClocks flag:

void os::Linux::fast_thread_clock_init() {
  if (!UseLinuxPosixThreadCPUClocks) {
    return;
  }

As you can see, this flag serves as a way to bail out of the fast_thread_clock_init function. What exactly is this flag protecting?

If you read through the code, you’ll find that the JVM will use the value of UseLinuxPosixThreadCPUClocks to determine whether or not it should attempt to detect if the pthread_getcpuclockid function exists and works on the system.

If the checks pass, an internal flag is set to true (_supports_fast_thread_cpu_time) which the VM will check whenever the Java program attempts to check the CPU time:

  • If the flag is true: thread CPU time will be checked with calls to pthread_getcpuclockid and clock_gettime.
  • If the flag is false: thread CPU time will be checked by opening the file /proc/[pid]/stat and parsing the output.

As the original bug report suggests: using pthread_getcpuclockid and clock_gettime is much faster (about 10x in our test below) than opening a file, reading the contents, parsing it, cleaning up, and returning the CPU time.

Java program strace comparison

Let’s compare the strace output of the test program provided in the original bug report on OpenJDK 1.8 with the UseLinuxPosixThreadCPUClocks flag enabled and disabled:

class Test {
  public static void main(String[] args) {
    long base = cputime();

    for(int k = 0; k < 100 * 1000; k++) {
            cputime();
    }

    System.out.println("Took: " + ((cputime() - base) / 1000.0 / 1000.0) + " ms.");
  }

  private final static long cputime() {
      return java.lang.management.ManagementFactory.getThreadMXBean()
      .getCurrentThreadCpuTime();
  }
}

Build this program with: javac Test.java.

With UseLinuxPosixThreadCPUClocks disabled

On early JVM 1.6 releases, this was the default. On more recent JVMs, this option needs to be explicitly disabled, so let’s do that. We’ll expect to see lots of calls to open, read, etc. as the proc filesystem will be consulted when trying to determine the CPU time.

Run the program with strace -fceclock_gettime,open,read,close,fstat java -XX:-UseLinuxPosixThreadCPUClocks Test:

% strace -fceclock_gettime,open,read,close,fstat java -XX:-UseLinuxPosixThreadCPUClocks Test
Took: 12170.0 ms.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 35.53    0.022333           0    201007           read
 28.99    0.018224           0    100057           close
 24.44    0.015365           0    100107        41 open
 10.89    0.006847           0    100066           fstat
  0.14    0.000090           0     17979           clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00    0.062859                519216        41 total

The program completed in about 12.2 seconds with over 200 thousand calls to read and over 100 thousand calls to each of close, open, and fstat.

Just to get the current thread’s CPU time! Wow.

Note that the ~18 thousand clock_gettime calls here are unrelated to our test program, but we encourage any motivated readers to take a look and find out where those come from, because it’s interesting ;)

With UseLinuxPosixThreadCPUClocks enabled

On JVM versions since the fix, this is the default, so we don’t need to specify anything. We expect to see far fewer calls to open, read, et al. and many more calls to clock_gettime. Let’s be explicit about enabling UseLinuxPosixThreadCPUClocks in case anyone supporting a legacy JVM wants to see how to do it.

Run the program with strace -fceclock_gettime,open,read,close,fstat java -XX:+UseLinuxPosixThreadCPUClocks Test:

Took: 1670.643325 ms.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.91    0.007688           0    115472           clock_gettime
  0.09    0.000007           0      1003           read
  0.00    0.000000           0       104        41 open
  0.00    0.000000           0        54           close
  0.00    0.000000           0        64           fstat
------ ----------- ----------- --------- --------- ----------------
100.00    0.007695                116697        41 total

The program completed in about 1.7 seconds (over 7x faster under strace) with over 100k calls to clock_gettime, but far, far fewer calls to open, read, et al.

Conclusion

As we’ve mentioned in other posts, we’re big fans of using strace on anything and everything, including Java programs! strace adds significant overhead, so you can’t run it constantly. However, you can run strace from time to time to ensure that your programs are doing what you expect. Doing this will often highlight interesting behavior and possibly reveal incorrect assumptions you’ve made about underlying systems and libraries.

