TL;DR
In this blog post, we’ll examine the `UseLinuxPosixThreadCPUClocks` command line flag of the JVM. Starting in a patch update of the Sun JVM 1.6, the default value of this switch was changed to `true`, yielding a nice performance boost, roughly 10x in our test. Users of more recent JVMs get this behavior by default, but exhuming this performance bug is interesting because it shows how `strace` can be useful even for Java programs.
Examining old performance bugs is worthwhile because it is instructive for us as programmers. This sort of computer “archaeology” allows us to examine how the tools available to us today can detect problems that were significant in the past. We can also learn interesting history lessons about the development and evolution of our computer systems, which can help us understand how we got to where we are today.
IBM JVM executes a piece of code 25x faster than Sun JVM
This very thorough bug report against the Sun JVM shows a small piece of Java code that runs about 25x slower on the Sun JVM than on the IBM JVM, according to the report’s author (note: our tests on a modern JVM show a 10x improvement; results and analysis below).
The code simply attempts to get the current thread’s CPU time by calling the `getCurrentThreadCpuTime` method in a tight loop, like so:
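The exact source is attached to the bug report; the sketch below is our reconstruction of the idea, using the standard `ThreadMXBean` API (the iteration count and variable names are ours):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class Test {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long start = System.currentTimeMillis();
        long sink = 0;
        // Call getCurrentThreadCpuTime() in a tight loop. On affected JVMs,
        // each call opens, reads, and parses /proc/[pid]/stat.
        for (int i = 0; i < 100_000; i++) {
            sink += bean.getCurrentThreadCpuTime();
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("100000 calls took " + elapsed + " ms (sink=" + sink + ")");
    }
}
```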
“Fix” by setting command line parameter
The performance issue can be mitigated on these earlier JVM versions by simply setting the command line flag `-XX:+UseLinuxPosixThreadCPUClocks`. This flag instructs the JVM to attempt to obtain the thread timing information by using some system calls and flags that are outlined in POSIX. As a result of this bug report, this option’s default value was changed to `true`.
Why was this option not enabled sooner and what, exactly, does it do? Before we can dive into the specifics of this flag and the purpose it serves, let’s go over some important prerequisite information.
Prerequisite information
The internals of measuring time on Linux changed both in the kernel and in glibc sometime around kernel 2.6.12. We’ll need to understand in more detail what these changes entailed to fully understand what the JVM does with the `UseLinuxPosixThreadCPUClocks` flag.
Measuring time on Linux systems
There is a set of functions user programs can use to measure time for processes and threads. Understanding these functions and how they relate to each other is important for understanding the performance bug we are investigating.
The key functions in this particular case are:
- `clock_gettime`: A system call that implements several different types of clocks, specified by a clock id.
- `clock_getcpuclockid`: A glibc function which gets the clock id of a specified process (by process id).
- `pthread_getcpuclockid`: A glibc function which gets the clock id of a specified thread (by `pthread_t`).
Depending on whether a program wants to get the process time or the thread time, it will call `clock_getcpuclockid` or `pthread_getcpuclockid` and pass the returned clock id to `clock_gettime`. Note that the API provides a more streamlined method for getting the current process or thread CPU time with `CLOCK_PROCESS_CPUTIME_ID` and `CLOCK_THREAD_CPUTIME_ID`, respectively. Read the man page for more details.
Allowing CPU time to be obtained for processes and threads is part of what makes this API on Linux POSIX compliant. It was not always this way. There was a time when `clock_gettime` returned the total time since a thread or process was created (instead of the total CPU time).
TSC madness and POSIX compliant clocks
Originally, glibc’s implementation of POSIX clocks did not meet the definition put forth in v3 of the Single Unix Specification. In particular, the specification dictates that `clock_gettime` should return the amount of CPU processing time. The original glibc implementation was simply returning the amount of time that had passed since the process or thread started its execution.
Moreover, the original implementation was based on reading the value of the Timestamp Counter register (TSC) on the CPU. Reading this value is problematic for numerous reasons: many CPUs at the time (and still today) do not have stable TSCs. This means that:
- The TSC of the CPU may increment at a rate that varies with the CPU frequency, which may be adjusted by various power saving systems.
- The TSCs of the various CPUs on SMP systems may not all be synchronized. If a process or thread migrates from one CPU to another, the values returned by `clock_gettime` may be bogus.
This is documented in the “NOTES” section of the `clock_gettime` man page, which is an entertaining and interesting read.
Eventually, a series of patches was proposed to fix this, making `clock_gettime` POSIX compliant and safe to use even if threads are migrated between CPUs with different TSC values.
This is all well and good, but how does this relate back to `UseLinuxPosixThreadCPUClocks`?
What does `UseLinuxPosixThreadCPUClocks` do?
Since glibc before 2.4 and the Linux kernel before 2.6.12 did not implement POSIX compliant CPU timing, the JVM added the flag `UseLinuxPosixThreadCPUClocks` to control which of two CPU timing methods would be used on Linux.
During the initialization of the HotSpot Java VM, a function called `os::Linux::fast_thread_clock_init` is called, which begins by first checking the `UseLinuxPosixThreadCPUClocks` flag:
As you can see, this flag serves as a way to bail out of the `fast_thread_clock_init` function. What exactly is this flag protecting?
If you read through the code, you’ll find that the JVM uses the value of `UseLinuxPosixThreadCPUClocks` to determine whether or not it should attempt to detect if the `pthread_getcpuclockid` function exists and works on the system.
If the checks pass, an internal flag (`_supports_fast_thread_cpu_time`) is set to `true`, which the VM will check whenever the Java program attempts to read the CPU time:
- If the flag is `true`: thread CPU time will be obtained with calls to `pthread_getcpuclockid` and `clock_gettime`.
- If the flag is `false`: thread CPU time will be obtained by opening the file `/proc/[pid]/stat` and parsing the output.
As the original bug report suggests: using `pthread_getcpuclockid` and `clock_gettime` is much faster (about 10x in our test below) than opening a file, reading the contents, parsing it, cleaning up, and returning the CPU time.
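To get a feel for how much work the slow path does, here is a rough Java sketch of reading and parsing `/proc/self/stat`. The actual slow path lives in HotSpot’s native code, so this is only an illustration; the field positions are taken from the proc(5) man page:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class ProcStatCpuTime {
    // Illustration of the slow path: read /proc/self/stat and pull out
    // utime (field 14) and stime (field 15), measured in clock ticks.
    static long[] readCpuTicks() throws Exception {
        String stat = new String(Files.readAllBytes(Paths.get("/proc/self/stat")));
        // The comm field (field 2) is wrapped in parentheses and may itself
        // contain spaces, so only split the text after the last ')'.
        String rest = stat.substring(stat.lastIndexOf(')') + 2);
        String[] fields = rest.split(" ");
        // rest begins at field 3 (state), so fields 14 and 15 of the full
        // line land at indexes 11 and 12 here.
        long utime = Long.parseLong(fields[11]);
        long stime = Long.parseLong(fields[12]);
        return new long[] { utime, stime };
    }

    public static void main(String[] args) throws Exception {
        long[] ticks = readCpuTicks();
        System.out.println("utime=" + ticks[0] + " stime=" + ticks[1]);
    }
}
```

Every single CPU-time query on the slow path pays for an `open`, several `read`s, string parsing, and a `close`, which is exactly the syscall traffic we will see in the `strace` output below.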
Java program `strace` comparison
Let’s compare the `strace` output of the test program provided in the original bug report on OpenJDK 1.8 with the `UseLinuxPosixThreadCPUClocks` flag enabled and disabled:
Build this program with `javac Test.java`.
With `UseLinuxPosixThreadCPUClocks` disabled
On JVM 1.6, this is the default. On more recent JVMs, this option needs to be explicitly disabled, so let’s do that. We expect to see lots of calls to `open`, `read`, etc. as the proc filesystem will be consulted when trying to determine the CPU time.
Run the program with `strace -fceclock_gettime,open,read,close,fstat java -XX:-UseLinuxPosixThreadCPUClocks Test`:
The program completed in 11.61 seconds with over 200 thousand calls to `read` and over 100 thousand calls to each of `close`, `open`, and `fstat`.
Just to get the current thread’s CPU time! Wow.
Note that the ~18 thousand `clock_gettime` calls here are unrelated to our test program, but we encourage motivated readers to take a look and find where those come from, because it’s interesting ;)
With `UseLinuxPosixThreadCPUClocks` enabled
On JVM versions above 1.6, this is the default, so we don’t need to specify anything. We expect to see far fewer calls to `open`, `read`, et al. and many more calls to `clock_gettime`. Let’s be explicit about enabling `UseLinuxPosixThreadCPUClocks` in case anyone supporting a legacy JVM wants to see how to do it.
Run the program with `strace -fceclock_gettime,open,read,close,fstat java -XX:+UseLinuxPosixThreadCPUClocks Test`:
The program completed in 1.2 seconds (nearly 10x faster) with over 100k calls to `clock_gettime`, but far, far fewer calls to `open`, `read`, et al.
Conclusion
As we’ve mentioned in other posts, we’re big fans of using `strace` on anything and everything, including Java programs! `strace` isn’t free, so you can’t run it constantly. However, you can run `strace` from time to time to ensure that your programs are doing what you expect. Doing this will often highlight interesting behavior and possibly reveal incorrect assumptions you’ve made about underlying systems and libraries.