TL;DR
This blog post examines a tricky bug in the incredibly useful libarchive-ruby-swig Ruby gem. This gem wraps the libarchive C library, which can be used to read and write archives in many different formats.
The bug, in the C++ code of the RubyGem itself, causes Ruby’s GC to mistakenly free an in-use object, which later leads to a segfault.
A fix for this issue was sent as a pull request and subsequently merged.
The Segfault
packagecloud supports Debian source packages, often referred to as DSCs. The DSC files themselves are just plain text and are sometimes signed with a GPG key.
A portion of the backend software for packagecloud uses Ruby to determine the type of package the user is uploading. libarchive-ruby-swig is pointed at uploaded files as part of the file type detection process. During the development of support for DSCs, a recurrent segmentation fault was encountered that only seemed to be related to processing DSC packages.
Getting a reproducible test case
In order to sanely debug this, a reproducible test case was needed. I suspected the garbage collector because of previous encounters with similar bugs.
Getting a reproducible test case for a garbage collection related bug can be quite tricky as changes to the environment, directory structure, time of day, etc. can all affect when and how a garbage collection run is executed.
I realized that re-running the test suite only triggered this segfault for DSC files, which are plain text. So, I created a simple test program which used libarchive-ruby-swig and pointed it at a plain text file and forced a garbage collector run:
And we’ve got a winner:
So, what is going on here?
Investigating with GDB
The first step in any fun debugging session is to fire up GDB, get a backtrace, and see what’s what.
It’s a bit tricky if you need Bundler to run your test program, but not too bad:
(Some alerts from GDB about threads and libthread_db were removed for brevity, but the important pieces here are getting the program running and seeing the SIGSEGV come through.)
And now, for a backtrace courtesy of bt:
From the backtrace, we can see that:
- The read_open_filename class method is called (stack frames 8 and 9)
- An exception is raised at stack frame 7
- Internal MRI functions from frames 6 to 0 attempt to create an exception
- A lookup on a hash table via st_lookup causes a segfault
It’s important to note that libarchive-ruby-swig uses SWIG to autogenerate some wrapper code for interacting with the libarchive C library. This means that we’ll need to dive into some interesting generated C++ code to fully debug this issue.
So, we begin by first examining the source code described in stack frame 8, libarchive_wrap.cxx, line 2486:
At first glance, this code looks reasonable. An exception occurs, it is caught, and it is then raised in Ruby-land so that Ruby programs using this RubyGem can handle the error appropriately.
Why would this cause a segfault?
Read the assembly
Most of the time, it is far more useful and instructive to read the actual assembly code which is being executed, especially when debugging. In this case, once the assembly is examined, it’ll be a bit clearer why the segfault happens.
So, ask GDB to show some of the assembly instructions for the function in question:
If you’ve never disassembled C++ code before, the above output will surely look a bit overwhelming, but the key thing to notice about the above output is:
__cxa_guard_acquire
This function is part of the C++ runtime that ships with the compiler (in this case g++), and the compiler emits calls to it to implement the static storage class specifier we saw in the exception handling code in the generated SWIG wrapper code earlier:
This usage was intended to initialize the variables c_archive, e_archive, and o_except just one time so that any future exceptions raised to Ruby would not need to reinitialize c_archive, e_archive, or o_except. Their values are stored the first time and re-used.
The assembly code is a bit convoluted, but let’s walk through how static is implemented for o_except.
The C++ code:
The assembly code starts by calling the guard function to determine if o_except has been initialized. If not, control is transferred via a jump instruction (jne) to another piece of code:
The code that is jumped to initializes o_except. You’ll see a call to rb_exc_new_cstr below; rb_exc_new2 is actually just a macro in the Ruby VM source and is replaced with rb_exc_new_cstr.
After the function is called, its return value is written to o_except and control is transferred back to the mov instruction above, which appears after the jne.
And, in this way, static is implemented for variables defined within functions in C++.
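To make the one-time initialization concrete, here is a minimal C++ sketch of roughly what the compiler arranges for a function-local static. The names make_exception, with_static, hand_rolled, guard, storage, and init_calls are hypothetical and exist only for this illustration. The real guard functions also handle thread safety, which this single-threaded sketch ignores.

```cpp
#include <cassert>

// Counts how often the (hypothetical) initializer actually runs.
int init_calls = 0;

// Stand-in for an initializer such as the rb_exc_new2(...) call.
int make_exception() {
    ++init_calls;
    return 42;
}

// What the programmer writes: initialized on the first call only.
int with_static() {
    static int o_except = make_exception();
    return o_except;
}

// Roughly what the compiler arranges, ignoring thread safety:
// a hidden guard flag plus storage that outlives the call.
int hand_rolled() {
    static bool guard = false;  // plays the role of the guard variable
    static int storage;         // lives in static storage, not on the stack
    if (!guard) {               // branch to the init path, taken only once
        storage = make_exception();
        guard = true;
    }
    return storage;
}
```

Calling with_static() or hand_rolled() repeatedly runs make_exception() only on the first call. Every later call just reads the stored value back out of static storage.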
But, what does this have to do with the segfault?
Ruby’s GC implementation
In order to understand this bug, you must understand how Ruby’s garbage collector works. Ruby’s garbage collector is a conservative mark-and-sweep garbage collector.
It works by:
- Crawling in use objects, starting at a set of root objects, and marking them
- Checking the program stack and heap for any value that looks like it could be a Ruby object, and marking those.
- Checking the register set of the CPU for any value that looks like it could be a Ruby object, and marking those.
Due to implementation details, it is impossible for Ruby to know if a particular value found on the stack or heap is actually a Ruby object or not. Ruby acts “conservatively” in that when it finds a value in a CPU register or on the program stack that looks like it refers to a Ruby object, the Ruby object that could be referenced is marked as in-use just in case.
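The idea of a value that "looks like" an object can be sketched in a few lines of C++. This is a toy model, not MRI's implementation: toy_heap, Slot, looks_like_object, and conservative_mark are hypothetical names, and the array of words stands in for the program stack or saved registers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A toy object heap: a fixed array of slots, each with a mark bit.
struct Slot {
    bool marked;
    long payload;
};

constexpr std::size_t kSlots = 8;
Slot toy_heap[kSlots] = {};

// Conservative test: does this machine word *look like* a pointer
// to the start of a slot in the toy heap?
bool looks_like_object(std::uintptr_t word) {
    const auto begin = reinterpret_cast<std::uintptr_t>(&toy_heap[0]);
    const auto end = reinterpret_cast<std::uintptr_t>(&toy_heap[0] + kSlots);
    if (word < begin || word >= end) return false;
    return (word - begin) % sizeof(Slot) == 0;  // must hit a slot boundary
}

// Scan a range of words (a stand-in for the stack or registers) and
// mark every slot that some word appears to reference.
// Returns the number of slots newly marked.
std::size_t conservative_mark(const std::uintptr_t* words, std::size_t count) {
    std::size_t newly_marked = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (!looks_like_object(words[i])) continue;
        Slot* slot = reinterpret_cast<Slot*>(words[i]);
        if (!slot->marked) {
            slot->marked = true;
            ++newly_marked;
        }
    }
    return newly_marked;
}
```

The scan cannot tell a genuine reference from an integer that happens to have the same bit pattern, so it marks in both cases. The reverse also holds: an object whose address never appears in the scanned words is never marked.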
After all objects (and things which look like objects) are marked, a sweep phase begins freeing objects which are not marked.
This process is demonstrated in the animation below (taken from here):
The bug
The bug occurs because:
- The Ruby VM’s object allocator doesn’t know which objects are in use and which are not; all it can do is guess
- The variables marked as static are initialized once and afterward, never again
- The compiler has optimized the generated assembly to use the fewest number of registers possible. As such, references to Ruby objects aren’t always guaranteed to exist in registers, or on the program stack, if the compiler thinks it can complete a function call or other operation without it. Remember, your compiler just needs to satisfy the ABI of the target system - it doesn’t “know” anything about Ruby or Ruby objects and performs valid optimizations for the target system.
- The Ruby objects that are marked static are not stored on the heap or the program stack; they are stored in a different program memory segment entirely
- When Ruby’s garbage collector runs, it does not see the static objects because:
  - References to the objects aren’t found in the program stack where the Ruby GC will scan
  - References to the objects don’t exist in registers due to optimizations and the static initialization code path running once
And so, Ruby’s GC mistakenly frees this object even though it will be used when an exception is generated.
Writing a fix
The quickest and simplest fix for this is to remove the static storage class specifier:
There are two effects of removing static:
- The exception object is recreated in Ruby-land every time an exception occurs in C++
- A reference to the Ruby object will actually exist on the program stack so Ruby’s GC will see it when it scans
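The difference between the buggy and the fixed pattern can be modelled with a toy mark-and-sweep collector. This is a sketch under simplifying assumptions, not Ruby's GC: ToyGC, Obj, roots, cached_handle_survives, and fresh_handle_survives are hypothetical names. The roots vector stands in for the locations the real collector scans, namely the stack and registers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Obj {
    bool live;
    bool marked;
};

// Toy collector: "roots" models the places the collector scans.
struct ToyGC {
    std::vector<Obj> heap;
    std::vector<std::size_t> roots;  // indexes of objects visible to the scan

    std::size_t alloc() {
        heap.push_back(Obj{true, false});
        return heap.size() - 1;
    }

    // Mark everything referenced from the scanned locations, then
    // free (flag as not live) everything that was not marked.
    void collect() {
        for (auto& o : heap) o.marked = false;
        for (std::size_t r : roots) heap[r].marked = true;
        for (auto& o : heap) {
            if (!o.marked) o.live = false;
        }
    }

    bool alive(std::size_t id) const { return heap[id].live; }
};

// Buggy pattern: the handle is cached in static storage, which the
// toy collector never scans, so the object is swept while still in use.
// (The static is bound to the first gc passed in; call with one ToyGC only.)
bool cached_handle_survives(ToyGC& gc) {
    static std::size_t cached = gc.alloc();  // created once, never rooted
    gc.collect();
    return gc.alive(cached);
}

// Fixed pattern: a fresh object per call, held somewhere the collector
// does scan for as long as it is in use.
bool fresh_handle_survives(ToyGC& gc) {
    std::size_t local = gc.alloc();
    gc.roots.push_back(local);  // models the reference sitting on the stack
    gc.collect();
    bool ok = gc.alive(local);
    gc.roots.pop_back();        // models the stack frame going away
    return ok;
}
```

With a single ToyGC instance, cached_handle_survives returns false because the cached object is swept, while fresh_handle_survives returns true. The cost of the fixed pattern is one allocation per call, which matches the first effect listed above.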
A fix was sent via pull request on GitHub and merged to the project.
Deploying the fix
packagecloud uses packagecloud for maintaining internal dependencies. A fix for this issue was deployed immediately so that development of the feature could continue in development environments without waiting for a fix to be accepted and merged to the general project.
Our exact steps were:
- The original repository was forked
- The fix was committed
- A version bump was committed
- The RubyGem was rebuilt and pushed to packagecloud
Rebuilding the RubyGem and pushing to packagecloud was quick and easy:
then:
Conclusion
Ruby’s garbage collector can prove to be a tricky adversary when writing or using C or C++ based RubyGems. You need to carefully consider how the garbage collector will interact with Ruby objects created and allocated in C/C++ and what the implications are when using storage class specifiers such as static.
Any complex system will eventually require the developers to manage their own set of dependencies in order to get bugfixes, performance improvements, or new features that can be used in the application.
Having a place for these objects to live and be tracked is crucial for ensuring that production, development, and test environments are using the same versions of every piece of software in the stack.