In this blog post, we’ll examine the
UseLinuxPosixThreadCPUClocks command line flag to the JVM. Starting in a patch update of the Sun JVM 1.6, the default value of this switch was changed to
true, yielding a nice performance boost, roughly 10x in our test. Users of more recent JVMs get this behavior by default, but exhuming this performance bug is interesting as it shows how
strace can be useful even for Java programs.
Examining old performance bugs is important because it is instructive for us as programmers. This sort of computer “archaeology” allows to examine how we can use tools available to us today to detect problems that were significant in the past. We can also learn interesting history lessons about the development and evolution of our computer systems which can help us understand how we got to where we are today.
IBM JVM executes a piece of code 25x faster than Sun JVM
This very thorough bug report against the Sun JVM shows a small piece of Java code that runs about 25x slower on the Sun JVM vs the IBM JVM, according to the person who wrote the report (Note: our tests on a modern JVM show a 10x improvement, results and analysis below).
The code is simply attempting to get the current thread’s CPU time by calling the
getCurrentThreadCpuTime function in a tight loop, like so:
“Fix” by setting command line parameter
The performance issue can be mitigated on these earlier JVM versions by simply setting the command line flag
-XX:+UseLinuxPosixThreadCPUClocks. This flag instructs the JVM to attempt to obtain the thread timing information by using some system calls and flags that are outlined in POSIX. As a result of this bug report, this option’s default value was changed to true.
Why was this option not enabled sooner and what, exactly, does it do? Before we can dive into the specifics of this flag and the purpose it serves, let’s go over some important prerequisite information.
The internals of measuring time on Linux changed both in the kernel and in glibc sometime around kernel 2.6.12. We’ll need to understand in more detail what these changes entailed to fully understand what the JVM does with the
Measuring time on Linux systems
There are a series of functions user programs can use to measure time for processes and threads. Understanding these functions and how they related to each other is important for understanding the performance bug we are investigating.
The key functions in this particular case are:
clock_gettime: A system call that implements several different types of clocks, specified by a clock id.
clock_getcpuclockid: A glibc function which gets the clock id of a specified procces (by process id).
pthread_getcpuclockid: A glibc function which gets the clock id of a specified thread (by
Depending on whether a program wants to get the process time or the thread time, it will call
pthread_getcpuclockid and pass the returned clock id to
clock_gettime. Note that the API provides a more streamlined method for getting the current process or thread CPU time with
CLOCK_THREAD_CPUTIME_ID, respectively. Read the man page for more details.
Allowing CPU time to be obtained for processes and threads is part of what makes this API on Linux POSIX compliant. It was not always this way. There was a time when
clock_gettime returned the total time since a thread or process was created (instead of the total CPU time).
TSC madness and POSIX compliant clocks
Originally, glibc’s implementation of POSIX clocks did not meet the definition put forth in v3 of the Single Unix Specification. In particular, the specification dictates that
clock_gettime should return the amount of CPU processing time. The original glibc implementation was simply returning the amount of time that had passed since the process or thread started its execution.
Moreover, the original implementation was based on reading the value of the Timestamp Counter register (TSC) on the CPU. Reading this value is problematic for numerous reasons: many CPUs at the time (and still today) do not have stable TSCs. This means that:
- The TSC of the CPU may increment at a rate with respect to the CPU frequency, which may be adjusted by various power saving systems.
- The TSC of various CPUs on SMP systems may not all be synchronized. If a process or thread migrates from one CPU to another the values returned by
clock_gettimemay be bogus.
This is documented in the “NOTES” section of the
clock_gettime man page, which is an entertaining and interesting read.
Eventually, a series of patches were proposed to fix this, making
clock_gettime POSIX compliant and safe to use even if threads are migrated between CPUs with different TSC values.
This is all well and good, but how does this related back to
Since glibc before 2.4 and the Linux kernel before 2.6.12 did not implement POSIX compliant CPU timing, the JVM added the flag
UseLinuxPosixThreadCPUClocks to control which of two CPU timing methods would be used on Linux.
During the initialization of the HotSpot Java VM a function called
os::Linux::fast_thread_clock_init is called which begins by first checking the
As you can see, this flag serves as a way to bail out of the
fast_thread_clock_init function. What exactly is this flag protecting?
If you read through the code, you’ll find that the JVM will use the value of
UseLinuxPosixThreadCPUClocks to determine whether or not it should attempt to detect if the
pthread_getcpuclockid function exists and works on the system.
If the checks pass, an internal flag is set to
_supports_fast_thread_cpu_time) which the VM will check whenever the Java program attempts to check the CPU time:
- If the flag is
true: thread CPU time will be checked with calls to
- If the flag is
false: thread CPU time will be checked by opening the file
/proc/[pid]/statand parsing the output.
As the original bug report suggests: using
clock_gettime is much faster (about 10x in our test below) than opening a file, reading the contents, parsing it, cleaning up, and returning the CPU time.
Let’s compare the
strace output of the test program provided in the original bug report on OpenJDK 1.8 with the
UseLinuxPosixThreadCPUClocks flag enabled and disabled:
Build this program with:
On JVM 1.6, this is the default. On more recent JVMs, this option needs to be explicitly disabled, so let’s do that. We’ll expect to see lots of calls to
read, etc as the proc filesystem will be consulted when trying to determine the CPU time.
Run the program with
strace -fceclock_gettime,open,read,close,fstat java -XX:-UseLinuxPosixThreadCPUClocks Test:
The program completed in 11.61 seconds with over 200 thousand calls to
read and over 100 thousand calls to each of
Just to get the current thread’s CPU time! Wow.
Note that the ~18 thousand
clock_gettime calls here are unrelated to our test program, but we recommend any motivated readers to take a look and find where those come from because it’s interesting ;)
On JVM versions above 1.6, this is the default, so we don’t need to specify anything. We expect to see far fewer calls to
read, et al. and many more calls to
clock_gettime. Let’s be explicit bout enabling
UseLinuxPosixThreadCPUClocks incase anyone supporting a legacy JVM wants to see how to do it.
Run the program with
strace -fceclock_gettime,open,read,close,fstat java -XX:+UseLinuxPosixThreadCPUClocks Test:
The program completed in 1.2 seconds (nearly 10x faster) with over 100k calls to
clock_gettime, but far, far fewer calls to
read, et al.
As we’ve mentioned in other posts, we’re big fans of using
strace on anything and everything, including Java programs!
strace isn’t free, so you can’t run it constantly. However, you can run
strace from time to time to ensure that your programs are doing what you expect. Doing this will often highlight interesting behavior and possibly reveal incorrect assumptions you’ve made about underlying systems and libraries.
If you enjoyed this post, you may enjoy some of our other low-level technical posts:
- Two frequently used system calls are ~77% slower on AWS EC2
- How setting the TZ environment variable avoids thousands of system calls
- Monitoring and Tuning the Linux Networking Stack: Sending Data
- Monitoring and Tuning the Linux Networking Stack: Receiving Data
- Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data
- The Definitive Guide to Linux System Calls
- How does
- How does
- APT Hash sum mismatch
- HOWTO: GPG sign and verify deb packages and APT repositories
- HOWTO: GPG sign and verify RPM packages and yum repositories