This blog post is a followup on our previous post How setting the TZ environment variable avoids thousands of system calls. In this post, we’ll explore a particularly prominent case where a micro-optimization (like removing a system call in a hot path) had a drastic effect on software performance.
What is a “noticeable” improvement?
We previously described an environment variable that an application can set to avoid thousands of extra system calls.
Our post was met with some reasonable questions and a bit of skepticism:
- “This is neat, but does removing a single system call result in any noticeable performance improvement for a program?”
- “This seems unnecessary; system calls are fast on Linux, anyway.”
It’s difficult to say what “noticeable” might mean for each developer and their application. Kernel and driver developers often devote a considerable amount of time to micro-optimize their code and data structures in such a way as to maximize usage of CPU cache and reduce the amount of CPU usage, even by what most programmers might consider tiny amounts. Should application programmers pay less mind to these sorts of optimizations? After all, one might consider these micro-optimizations as not being “noticeable” either.
For the purposes of this article, let’s take “noticeable” to mean “easily measurable and very obvious.” Can we show a real world example of when removing a slow (non-vDSO enabled) system call from a code path was easily measurable and very obvious?
It turns out that there are infact many instances of this in the wild: everything from packet sniffers to programming language runtimes. Let’s look at a particularly infamous case:
sigprocmask and its effect on the Ruby language runtime.
sigprocmask is system call that is used to check or set the current process’ signal mask. This allows a program to block or permit signals and is useful when a program needs to execute a critical section of code that cannot be interrupted.
It is not a particularly complex system call. The kernel code for
sigprocmask shows that it writes a
sigset_t to a C structure that contains the current process’ state (called a
task_struct in the kernel); this is a very fast operation.
Let’s create a simple test program that calls sigprocmask in a tight loop and use
time to measure it:
gcc -o test test.c and execute first with
time and then again with
On my test system:
- Running simply with
time ./test, yields on the order of
0m0.036sreal, user, and system time, respectively.
- Running again with
time stace -ttT ./test, shows overhead imposed by
stracequite clearly with
0m14.349sreal, user, and system time, respectively.
strace output will show approximate execution times for each
sigprocmask call (which will likely show up as
rt_sigprocmask on your system). The approximate times are very small. On my test system, I get something like
0.000003 most of the time, with an occasional jump to
(By the way: it turns out that measuring exact system call execution time is difficult for a number of reasons all of which are far outside the scope of this article. We can assume, though, that all of these measurements are equally incorrect.)
Let’s review what we know:
sigprocmaskis a system call used to set or check the current process’ signal mask.
- The kernel side code is simple and should execute quickly.
straceoutput both show that executing 1 million
sigprocmaskcalls in a tight loop terminates quickly with each call taking a very small amount of time.
With numbers like this how could removing extra calls to
sigprocmask ever be worth the effort?
Sometimes, other code we use in our applications (like system libraries, the kernel, glibc, etc) does things that we don’t quite expect or has side effects that don’t seem obvious up front. To illustrate this point, I’m going to show a severe performance penalty that was incurred by indirectly using
sigprocmask in a test program and then show how this manifested in a real world application.
One of the most prominent examples of extra
sigprocmask calls leading directly to an obvious and easily measurable performance degradation was in our old friend Ruby 1.8.7. The performance loss occurs when Ruby 1.8.7 is compiled with one particular
We’ll start by using Ruby 1.8.7 with the default configure flags used on most operation systems (Debian, Ubuntu, et. al.) at the time Ruby 1.8.7 was widely used. We’ll examine a test case illustrating the performance loss followed by a real world example and an explanation of how this happened in Ruby.
sigprocmask test case
Let’s take a look at the following test code:
This code is simple: it creates two threads, each of which adds and removes data to an array 10 million times. If we run this with the default Ruby 1.8.7 configure flags and
strace, we’ll see some surprising results:
Over 20 million system calls for
sigprocmask were generated by the Ruby VM from the code I pasted above. You may think: “But, look at how little time was consumed! There’s no way this matters!”
As mentioned before, measuring system call time is a bit complicated. Let’s re-run the test program with
time (but not
strace) and get a reading on how long this test program takes to complete on my system:
So, about 6 seconds real time to execute. This means the code is making roughly 3.3 million
sigprocmask calls per second. Wow.
If we adjust one
configure flag for building Ruby slightly, we can get the Ruby VM build system to avoid making calls to
Let’s re-run the
time tests, but with a Ruby that has been adjusted slightly to avoid making
sigprocmask calls and see what we get:
Whoa! We’ve reduced the calls to
sigprocmask from over 20 million to 3. Looks like
strace had trouble computing the system call execution time :(
Let’s see what
Roughly 40% faster (real time) than our previous example and less than 1
sigprocmask call per second.
So, that’s a cool result, but it raises a few important questions:
- Isn’t this example a bit contrived?
- When would this actually matter in the real world?
- What is going on when adjusting the
configureflags that is actually causing this?
Let’s look at a real world example of when this mattered and then we’ll dive into some of the details of what is actually going on.
Real world example: Puppet
A bug filed against Puppet shows precisely the impact of extra
sigprocmask system calls in the Ruby VM:
I’m currently facing what I consider serious performance issues. In fact, Puppet seems to be very slow in general.
Let me show you a first example:
$ time puppet —version 0.24.5
real 0m0.718s user 0m0.576s sys 0m0.140s
During that time, there is hundreds, if not thousands “rt_sigprocmask(SIG_BLOCK, NULL, , 8)” calls being made. All this to show the version.
It’s not just Ruby and Puppet, though. You can find similar bug reports for other projects with users reporting 100% CPU usage and hundreds of thousands of calls to
sigprocmask melting their CPUs.
How does this happen? Shouldn’t this be easy to fix?
What’s really going on here is that the calls to
sigprocmask are coming from inside of a pair of glibc functions (that are not system calls):
These two functions are used to save and restore the CPU state and are commonly used by programs and libraries implementing exception handling or userland threads. In the case of Ruby 1.8.7:
getcontext are used as part of the userland thread implementation to context switch between threads.
You’d imagine that this pair of functions would execute quickly, after all, they just need to save or restore a small set of CPU registers. It turns out that in addition to saving the set of CPU registers (a very fast operation) the implementations of these functions in glibc are such that the
sigprocmask system call is used to save or restore the signal mask.
Recall that Linux provides a mechanism (the vDSO) for certain system calls to be executed in userland instead of the kernerl in order to reduce the cost associated with making those system calls. Unfortunately,
sigprocmask is not one of the system calls that is provided by the vDSO. All
sigprocmask system calls will cause a transition from userland to the kernel.
The cost of this transition is much higher than the cost of other operations in
getcontext (which are a few simple memory writes). If you are calling these functions very often you are now performing a slow operation (in this case a system call to
sigprocmask, which does not pass through the vDSO) every time you intended to do something quickly (like saving or restoring the CPU register set to switch threads).
Why does changing a configure flag make this go away?
At the time Ruby 1.8.7 was widely used, it was by default configured with
--enable-pthread which enabled a separate OS-level timer thread that would occasionally run to pre-empt the Ruby VM. This is how the Ruby VM knew when it was time to switch between the userland threads that were mapped to threads created in Ruby programs. Using
--enable-pthread also caused the
configure script to search for and use the functions
setcontext. If you didn’t use
configure script would instead search for and use
_longjmp (note the underscores) instead. These functions do not save or restore the signal mask, and thus, thus they do not make the system call to
--disable-pthreadwere intended to be used to control whether the Ruby VM’s userland threading implementation would use an OS level thread or a simple
VTALRMsignal to notify the Ruby VM when it was time to switch execution to a different Ruby thread.
- The unintended side effect of switching between these two pre-emptiong methods is that the underlying primitives used to implement context switching are swapped between
- The additional side effect of this low level function swap is either additional calls to
getcontextpair is chosen) or not.
- The additional side effect being reduced performance if the pair which uses
- The real world result in which users of Puppet (and other software) have a poor experience.
Phew; all of this from one configure flag that essentially enables or disables a single system call in a hot path.
Micro-optimizations do matter, but the extent to which they matter is, of course, application specific. Being aware that the libraries and code you depend on can do things you might not expect (like make a slow system call in a hot path) and knowing which tools to use to detect and correct these sorts of issues can have a vast impact on your users and, in some cases, your users’ users.
If you enjoyed this post, you may enjoy some of our other low-level technical posts:
- Two frequently used system calls are ~77% slower on AWS EC2
- How setting the TZ environment variable avoids thousands of system calls
- Monitoring and Tuning the Linux Networking Stack: Sending Data
- Monitoring and Tuning the Linux Networking Stack: Receiving Data
- Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data
- The Definitive Guide to Linux System Calls
- How does
- How does
- APT Hash sum mismatch
- HOWTO: GPG sign and verify deb packages and APT repositories
- HOWTO: GPG sign and verify RPM packages and yum repositories