TL;DR
This blog post is a follow-up to our previous post, How setting the TZ environment variable avoids thousands of system calls. In this post, we’ll explore a particularly prominent case where a micro-optimization (like removing a system call in a hot path) had a drastic effect on software performance.
What is a “noticeable” improvement?
We previously described an environment variable that an application can set to avoid thousands of extra system calls.
Our post was met with some reasonable questions and a bit of skepticism:
- “This is neat, but does removing a single system call result in any noticeable performance improvement for a program?”
- “This seems unnecessary; system calls are fast on Linux, anyway.”
It’s difficult to say what “noticeable” might mean for each developer and their application. Kernel and driver developers often devote considerable time to micro-optimizing their code and data structures to maximize CPU cache usage and shave off CPU time, even in amounts most programmers would consider tiny. Should application programmers pay less mind to these sorts of optimizations? After all, one might argue that these micro-optimizations aren’t “noticeable” either.
For the purposes of this article, let’s take “noticeable” to mean “easily measurable and very obvious.” Can we show a real world example of when removing a slow (non-vDSO enabled) system call from a code path was easily measurable and very obvious?
It turns out that there are in fact many instances of this in the wild: everything from packet sniffers to programming language runtimes. Let’s look at a particularly infamous case: `sigprocmask` and its effect on the Ruby language runtime.
What is `sigprocmask`?
`sigprocmask` is a system call that is used to check or set the current process’ signal mask. This allows a program to block or permit signals and is useful when a program needs to execute a critical section of code that cannot be interrupted.
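For example, here is a minimal sketch (not from the original post) of a program that blocks SIGINT around a critical section:

```c
/* Block SIGINT while running a critical section, then restore the
 * previous signal mask. */
#include <signal.h>
#include <stdio.h>

int main(void)
{
    sigset_t block, old;

    sigemptyset(&block);
    sigaddset(&block, SIGINT);

    /* Block SIGINT and save the previous mask in 'old'. */
    sigprocmask(SIG_BLOCK, &block, &old);

    puts("critical section: SIGINT is blocked here");

    /* Restore the previous signal mask. */
    sigprocmask(SIG_SETMASK, &old, NULL);

    return 0;
}
```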
It is not a particularly complex system call. The kernel code for `sigprocmask` shows that it writes a `sigset_t` to a C structure that contains the current process’ state (called a `task_struct` in the kernel); this is a very fast operation.
Let’s create a simple test program that calls `sigprocmask` in a tight loop and use `strace` and `time` to measure it:
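(The original test code isn’t reproduced here; what follows is a sketch of such a program, assuming a loop of one million calls that simply read back the current signal mask.)

```c
/* test.c: call sigprocmask() in a tight loop. */
#include <signal.h>

int main(void)
{
    sigset_t set;
    int i;

    sigemptyset(&set);

    /* Query the current signal mask one million times. */
    for (i = 0; i < 1000000; i++) {
        sigprocmask(SIG_BLOCK, NULL, &set);
    }

    return 0;
}
```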
Compile with `gcc -o test test.c` and execute first with `time` and then again with `strace` and `time`.
On my test system:
- Running simply with `time ./test` yields on the order of 0m0.047s, 0m0.012s, and 0m0.036s real, user, and system time, respectively.
- Running again with `time strace -ttT ./test` shows the overhead imposed by `strace` quite clearly, with 0m52.364s, 0m9.313s, and 0m14.349s real, user, and system time, respectively.
The `strace` output will show approximate execution times for each `sigprocmask` call (which will likely show up as `rt_sigprocmask` on your system). The approximate times are very small. On my test system, I get something like 0.000003 most of the time, with an occasional jump to 0.000074.
(By the way: it turns out that measuring exact system call execution time is difficult for a number of reasons, all of which are far outside the scope of this article. We can assume, though, that all of these measurements are equally incorrect.)
Let’s review what we know:
- `sigprocmask` is a system call used to set or check the current process’ signal mask.
- The kernel-side code is simple and should execute quickly.
- The `time` and `strace` output both show that executing 1 million `sigprocmask` calls in a tight loop terminates quickly, with each call taking a very small amount of time.
With numbers like this, how could removing extra calls to `sigprocmask` ever be worth the effort?
Look deeper
Sometimes, other code we use in our applications (system libraries, the kernel, glibc, etc.) does things we don’t quite expect or has side effects that aren’t obvious up front. To illustrate this point, I’m going to show a severe performance penalty that was incurred by indirectly using `sigprocmask` in a test program, and then show how this manifested in a real world application.
One of the most prominent examples of extra `sigprocmask` calls leading directly to an obvious and easily measurable performance degradation was in our old friend Ruby 1.8.7. The performance loss occurs when Ruby 1.8.7 is compiled with one particular `configure` flag.
We’ll start by using Ruby 1.8.7 with the default `configure` flags used on most operating systems (Debian, Ubuntu, et al.) at the time Ruby 1.8.7 was widely used. We’ll examine a test case illustrating the performance loss, followed by a real world example and an explanation of how this happened in Ruby.
`sigprocmask` test case
Let’s take a look at the following test code:
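(The original script isn’t reproduced here; below is a minimal reconstruction based on the description that follows.)

```ruby
# test.rb: two threads, each pushing to and popping from an array
# 10 million times.
threads = []

2.times do
  threads << Thread.new do
    arr = []
    10_000_000.times do
      arr.push(1)
      arr.pop
    end
  end
end

threads.each { |t| t.join }
```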
This code is simple: it creates two threads, each of which adds and removes data to an array 10 million times. If we run this with the default Ruby 1.8.7 `configure` flags and `strace`, we’ll see some surprising results:
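(The original `strace` output isn’t reproduced here, but you can gather a similar per-system-call summary yourself with something like `strace -c ruby test.rb`, where `test.rb` is the reconstruction above.)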
Over 20 million system calls for `sigprocmask` were generated by the Ruby VM from the code I pasted above. You may think: “But, look at how little time was consumed! There’s no way this matters!”
As mentioned before, measuring system call time is a bit complicated. Let’s re-run the test program with `time` (but not `strace`) and get a reading on how long this test program takes to complete on my system:
So, about 6 seconds of real time to execute. This means the code is making roughly 3.3 million `sigprocmask` calls per second. Wow.
If we adjust one `configure` flag for building Ruby slightly, we can get the Ruby VM build system to avoid making calls to `sigprocmask`!
Let’s re-run the `strace` and `time` tests, but with a Ruby that has been adjusted slightly to avoid making `sigprocmask` calls, and see what we get:
Whoa! We’ve reduced the calls to `sigprocmask` from over 20 million to 3. Looks like `strace` had trouble computing the system call execution time :(
Let’s see what `time` says:
Roughly 40% faster (real time) than our previous example, and less than 1 `sigprocmask` call per second.
So, that’s a cool result, but it raises a few important questions:
- Isn’t this example a bit contrived?
- When would this actually matter in the real world?
- What is going on when adjusting the `configure` flags that is actually causing this?
Let’s look at a real world example of when this mattered and then we’ll dive into some of the details of what is actually going on.
Real world example: Puppet
A bug filed against Puppet shows precisely the impact of extra `sigprocmask` system calls in the Ruby VM:
> I’m currently facing what I consider serious performance issues. In fact, Puppet seems to be very slow in general.
>
> Let me show you a first example:
>
>     $ time puppet --version
>     0.24.5
>
>     real    0m0.718s
>     user    0m0.576s
>     sys     0m0.140s
>
> During that time, there is hundreds, if not thousands “rt_sigprocmask(SIG_BLOCK, NULL, [], 8)” calls being made. All this to show the version.
If you follow the thread, you can find more comments complaining about the poor performance of Puppet, and further searches show Stack Overflow questions expressing the same concern.
It’s not just Ruby and Puppet, though. You can find similar bug reports for other projects, with users reporting 100% CPU usage and hundreds of thousands of calls to `sigprocmask` melting their CPUs.
How does this happen? Shouldn’t this be easy to fix?
What’s really going on here is that the calls to `sigprocmask` are coming from inside a pair of glibc functions (which are not themselves system calls): `getcontext` and `setcontext`.
These two functions are used to save and restore the CPU state and are commonly used by programs and libraries implementing exception handling or userland threads. In the case of Ruby 1.8.7, `setcontext` and `getcontext` are used as part of the userland thread implementation to context switch between threads.
You’d imagine that this pair of functions would execute quickly; after all, they just need to save or restore a small set of CPU registers. It turns out that, in addition to saving the set of CPU registers (a very fast operation), the glibc implementations of these functions also use the `sigprocmask` system call to save or restore the signal mask.
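You can observe this yourself with a small sketch (not Ruby’s code, just an illustration; the file and binary names here are placeholders) that calls `getcontext` in a loop and is run under `strace`:

```c
/* ctx.c: each call to glibc's getcontext() saves the CPU registers
 * and the signal mask; saving the mask is what issues the
 * rt_sigprocmask system call. Running this under `strace -c ./ctx`
 * should show roughly one million rt_sigprocmask calls. */
#include <ucontext.h>

int main(void)
{
    ucontext_t ctx;
    int i;

    for (i = 0; i < 1000000; i++) {
        getcontext(&ctx);
    }

    return 0;
}
```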
Recall that Linux provides a mechanism (the vDSO) for certain system calls to be executed in userland instead of in the kernel, in order to reduce the cost associated with making those system calls. Unfortunately, `sigprocmask` is not one of the system calls provided by the vDSO. All `sigprocmask` system calls will cause a transition from userland to the kernel.
The cost of this transition is much higher than the cost of the other operations in `setcontext` or `getcontext` (which are a few simple memory writes). If you are calling these functions very often, you are now performing a slow operation (in this case a system call to `sigprocmask`, which does not pass through the vDSO) every time you intended to do something quick (like saving or restoring the CPU register set to switch threads).
Why does changing a configure flag make this go away?
At the time Ruby 1.8.7 was widely used, it was by default configured with `--enable-pthread`, which enabled a separate OS-level timer thread that would occasionally run to pre-empt the Ruby VM. This is how the Ruby VM knew when it was time to switch between the userland threads that were mapped to threads created in Ruby programs. Using `--enable-pthread` also caused the `configure` script to search for and use the functions `getcontext` and `setcontext`. If you didn’t use `--enable-pthread`, the `configure` script would instead search for and use `_setjmp` and `_longjmp` (note the underscores). These functions do not save or restore the signal mask, and thus they do not make the system call to `sigprocmask`.
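As a quick illustration of the difference (a sketch, not Ruby’s actual thread-switching code), `_setjmp` and `_longjmp` save and restore an execution context without touching the signal mask, so no `rt_sigprocmask` call shows up under `strace`:

```c
/* _setjmp/_longjmp save and restore an execution context without
 * saving or restoring the signal mask, so no rt_sigprocmask system
 * call is made (unlike getcontext/setcontext). */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

int main(void)
{
    if (_setjmp(env) == 0) {
        puts("context saved");
        _longjmp(env, 1);   /* jump back to the _setjmp call site */
    } else {
        puts("context restored");
    }

    return 0;
}
```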
So:
- `--enable-pthread` and `--disable-pthread` were intended to control whether the Ruby VM’s userland threading implementation would use an OS-level thread or a simple `VTALRM` signal to notify the Ruby VM when it was time to switch execution to a different Ruby thread.
- The unintended side effect of switching between these two pre-emption methods is that the underlying primitives used to implement context switching are swapped between `setcontext`/`getcontext` and `_setjmp`/`_longjmp`.
- A further side effect of this low-level function swap is extra calls to `sigprocmask` if the `setcontext`/`getcontext` pair is chosen, and none if it is not.
- That, in turn, means reduced performance when the pair which uses `sigprocmask` is chosen.
- The real world result is that users of Puppet (and other software) have a poor experience.
Phew; all of this from one configure flag that essentially enables or disables a single system call in a hot path.
Conclusion
Micro-optimizations do matter, but the extent to which they matter is, of course, application specific. Being aware that the libraries and code you depend on can do things you might not expect (like make a slow system call in a hot path) and knowing which tools to use to detect and correct these sorts of issues can have a vast impact on your users and, in some cases, your users’ users.