Debugging an MRI Ruby GC segfault caused by a buggy RubyGem

Nov 16, 2014 · 12 min read

rubygem-how-to

TL;DR

This blog post examines a tricky bug in the incredibly useful libarchive-ruby-swig ruby gem. This gem wraps the libarchive C library which can be used to read and write archives of many different formats.

The bug in the C++ code of the RubyGem itself causes Ruby’s GC to mistakenly free an in-use object which later leads to a segfault.

A fix for this issue was sent as a pull request and subsequently merged.

The Segfault

packagecloud supports Debian source packages, often referred to as DSCs. The DSC files themselves are just plain text and are sometimes signed with a GPG key.

A portion of the backend software for packagecloud uses Ruby to determine the type of package the user is uploading. libarchive-ruby-swig is pointed at uploaded files as part of the file type detection process. During the development of support for DSCs, a recurrent segmentation fault was encountered that only seemed to be related to processing DSC packages.

Getting a reproducible test case

In order to sanely debug this, a reproducible test case was needed. I suspected the garbage collector because of previous encounters with similar bugs.

Getting a reproducible test case for a garbage collection related bug can be quite tricky as changes to the environment, directory structure, time of day, etc. can all affect when and how a garbage collection run is executed.

I realized that re-running the test suite only triggered this segfault for DSC files, which are plain text. So, I created a simple test program which used libarchive-ruby-swig and pointed it at a plain text file and forced a garbage collector run:

require 'rubygems'
require 'libarchive_rs'

5.times do
  begin
    GC.start
    Archive.read_open_filename("hi.txt") do |archive|
      while entry = archive.next_header
        puts "hi: #{entry.pathname}"
      end
    end
  rescue Archive::Error => e
    puts "Got an error: #{e.inspect}"
  end
end

And we’ve got a winner:

/usr/lib/x86_64-linux-gnu/ruby/gems/2.1.4/gems/libarchive-ruby-swig-0.6.0/lib/libarchive_rs.rb:169: [BUG] Segmentation fault at 0x00000000000008

-- Control frame information -----------------------------------------------
c:0006 p:---- s:0023 e:000022 CFUNC  :read_open_filename
c:0005 p:0043 s:0017 e:000016 METHOD /usr/lib/x86_64-linux-gnu/ruby/gems/2.1.4/gems/libarchive-ruby-swig-0.6.0/lib/libarchive_rs.rb:169
c:0004 p:0029 s:0010 e:000009 BLOCK  hi.rb:7 [FINISH]
c:0003 p:---- s:0007 e:000006 CFUNC  :times
c:0002 p:0022 s:0004 E:0001d8 EVAL   hi.rb:4 [FINISH]
c:0001 p:0000 s:0002 E:000178 TOP    [FINISH]

-- Ruby level backtrace information ----------------------------------------
hi.rb:4:in `<main>'
hi.rb:4:in `times'
hi.rb:7:in `block in <main>'
/usr/lib/x86_64-linux-gnu/ruby/gems/2.1.4/gems/libarchive-ruby-swig-0.6.0/lib/libarchive_rs.rb:169:in `read_open_filename'
/usr/lib/x86_64-linux-gnu/ruby/gems/2.1.4/gems/libarchive-ruby-swig-0.6.0/lib/libarchive_rs.rb:169:in `read_open_filename'
...

So, what is going on here?

Investigating with GDB

The first step in any fun debugging session is to fire up GDB, get a backtrace, and see what’s what.

It’s a bit tricky if you need Bundler to run your test program, but not too bad:

(Some alerts from GDB about threads and libthread_db were removed for brevity, but the important pieces here are getting the program running and seeing the SIGSEGV come thorugh)

vagrant@precise64:~/libarchive-ruby-swig$ gdb -q `which ruby`
Reading symbols from /usr/bin/ruby...done.
(gdb) r `which bundle` exec ruby hi.rb hi.txt
Starting program: /usr/bin/ruby `which bundle` exec ruby hi.rb hi.txt
process 1884 is executing new program: /usr/bin/ruby
Got an error: #<Archive::Error: Unrecognized archive format>

Program received signal SIGSEGV, Segmentation fault.
0x0000555555623a4f in st_lookup (table=0x555555c26040, key=151, value=0x7fffffffd558) at st.c:414
414 st.c: No such file or directory.
(gdb)

And now, for a backtrace courtesy of bt:

(gdb) bt 14
#0  0x0000555555623a4f in st_lookup (table=0x555555c26040, key=151, value=0x7fffffffd558) at st.c:414
#1  0x000055555566e714 in lookup_method_table (id=151, klass=93824996867840) at vm_method.c:191
#2  search_method (defined_class_ptr=<synthetic pointer>, id=151, klass=93824996867840) at vm_method.c:541
#3  rb_method_entry_get_without_cache (klass=93824996867840, id=151, defined_class_ptr=0x7fffffffd618) at vm_method.c:566
#4  0x000055555567760e in check_funcall_respond_to (th=0x5555559b55b0, klass=<optimized out>, recv=93824996867880, mid=4840) at vm_eval.c:359
#5  0x0000555555677a91 in rb_check_funcall (recv=93824996867880, mid=4840, argc=0, argv=0x7fffffffd780) at vm_eval.c:416
#6  0x0000555555577c31 in make_exception (argc=1, argv=0x7fffffffd778, isstr=<optimized out>) at eval.c:659
#7  0x0000555555579030 in rb_exc_raise (mesg=93824996867880) at eval.c:567
#8  0x00007ffff42b5385 in _wrap_Reader_read_open_filename__SWIG_0 (argv=0x7ffff7eeb0a0, argc=3, self=<optimized out>) at libarchive_wrap.cxx:2486
#9  _wrap_Reader_read_open_filename (nargs=<optimized out>, args=0x7ffff7eeb0a0, self=<optimized out>) at libarchive_wrap.cxx:2628
#10 0x000055555566dfca in vm_call_cfunc_with_frame (th=0x5555559b55b0, reg_cfp=0x7ffff7feae80, ci=<optimized out>) at vm_insnhelper.c:1468
#11 0x0000555555672201 in vm_exec_core (th=<optimized out>, initial=<optimized out>) at insns.def:1028
#12 0x0000555555676b39 in vm_exec (th=0x5555559b55b0) at vm.c:1304
#13 0x00005555556784c9 in invoke_block_from_c (th=0x5555559b55b0, block=<optimized out>, self=<optimized out>, argc=<optimized out>, argv=<optimized out>,
    blockptr=<optimized out>, cref=0x0, defined_class=8) at vm.c:732
(More stack frames follow...)

From the backtrace, we can see that:

The read_open_filename class method is called (stack frames 8 and 9)
An exception is raised at stack frame 7
Internal MRI functions from frames 6 to 0 attempt to create an exception
A lookup on a hash table via st_lookup causes a segfault

It’s important to note that libarchive-ruby-swig uses SWIG to autogenerate some wrapper code for interacting with the libarchive C library. This means that we’ll need to dive into some interesting generated C++ code to fully debug this issue.

So, we begin by first examining the source code described in stack frame 8, libarchive_wrap.cxx, line 2486:

  {
    try {
      result = (Reader *)Reader::read_open_filename((char const *)arg1,(char const *)arg2,arg3);
    } catch(Error &err) {
      static VALUE c_archive = rb_define_module("Archive");
      static VALUE e_archive =
      rb_define_class_under(c_archive, "Error", rb_eStandardError);
      static VALUE o_except = rb_exc_new2(e_archive, err.what());
      rb_exc_raise(o_except);   // <==== line 2486
    }
  }

At first glance, this code looks reasonable. An exception occurs, it is caught and then raised in Ruby-land so that Ruby programs using this RubyGem can deal with error that was raised appropriately.

Why would this cause a segfault?

Read the assembly

Most of the time, it is far more useful and instructive to the read the actual assembly code which is being executed, especially when debugging. In this case, once the assembly is examined, it’ll be a bit more clear why the segfault happens.

So, ask GDB to show some of the assembly instructions for the function in question:

(gdb) disas 0x00007ffff42b5339, +150
Dump of assembler code from 0x7ffff42b5339 to 0x7ffff42b53cf:
   0x00007ffff42b5339: mov    %rax,%rdi
   0x00007ffff42b533c: jne    0x7ffff42b5309 <_wrap_Reader_read_open_filename(int, VALUE*, VALUE)+985>
   0x00007ffff42b533e: callq  0x7ffff42af200 <__cxa_begin_catch@plt>
   0x00007ffff42b5343: cmpb   $0x0,0x208a0e(%rip)
   0x00007ffff42b534a: mov    %rax,%rbx
   0x00007ffff42b534d: je     0x7ffff42b53e8 <_wrap_Reader_read_open_filename(int, VALUE*, VALUE)+1208>
   0x00007ffff42b5353: cmpb   $0x0,0x208a06(%rip)
   0x00007ffff42b535a: je     0x7ffff42b5385 <_wrap_Reader_read_open_filename(int, VALUE*, VALUE)+1109>
   0x00007ffff42b535c: cmpb   $0x0,0x208a05(%rip)
   0x00007ffff42b5363: jne    0x7ffff42b5379 <_wrap_Reader_read_open_filename(int, VALUE*, VALUE)+1097>
   0x00007ffff42b5365: lea    0x2089fc(%rip),%rdi
   0x00007ffff42b536c: callq  0x7ffff42aea30 <__cxa_guard_acquire@plt>
   0x00007ffff42b5371: test   %eax,%eax
   0x00007ffff42b5373: jne    0x7ffff42b5420 <_wrap_Reader_read_open_filename(int, VALUE*, VALUE)+1264>
   0x00007ffff42b5379: mov    0x208a00(%rip),%rdi
   0x00007ffff42b5380: callq  0x7ffff42aed40 <rb_exc_raise@plt>
=> 0x00007ffff42b5385: lea    0x2089d4(%rip),%rdi
   0x00007ffff42b538c: callq  0x7ffff42aea30 <__cxa_guard_acquire@plt>
   0x00007ffff42b5391: test   %eax,%eax
   0x00007ffff42b5393: je     0x7ffff42b535c <_wrap_Reader_read_open_filename(int, VALUE*, VALUE)+1068>
   0x00007ffff42b5395: mov    0x207bec(%rip),%rax
   0x00007ffff42b539c: mov    0x2089cd(%rip),%rdi
   0x00007ffff42b53a3: lea    0x3e43(%rip),%rsi
   0x00007ffff42b53aa: mov    (%rax),%rdx
   0x00007ffff42b53ad: callq  0x7ffff42aea90 <rb_define_class_under@plt>
   0x00007ffff42b53b2: lea    0x2089a7(%rip),%rdi
   0x00007ffff42b53b9: mov    %rax,0x2089b8(%rip)
   0x00007ffff42b53c0: callq  0x7ffff42aed70 <__cxa_guard_release@plt>
   0x00007ffff42b53c5: jmp    0x7ffff42b535c <_wrap_Reader_read_open_filename(int, VALUE*, VALUE)+1068>
   0x00007ffff42b53c7: mov    %rax,%rbx
   0x00007ffff42b53ca: callq  0x7ffff42af170 <__cxa_end_catch@plt>
End of assembler dump.
(gdb)

If you’ve never disassembled C++ code before, the above output will surely look a bit overwhelming, but the key thing to notice about the above output is:

__cxa_guard_acquire

This function is provided by the compiler (in this case g++) and it is used above to implement the static storage class qualifier we saw in the exception handling code in the generated SWIG wrapper code earlier:

static VALUE c_archive = rb_define_module("Archive");
static VALUE e_archive = rb_define_class_under(c_archive, "Error", rb_eStandardError);
static VALUE o_except = rb_exc_new2(e_archive, err.what());

This usage was intended to initialize the variables c_archive, e_archive, and o_except just one time so that any future exceptions raise to Ruby would not need to reinitialize c_archive, e_archive, or o_except. Their values are stored the first time and re-used.

The assembly code is a bit convoluted, but let’s walk through how static is implemented for o_except.

The C++ code:

static VALUE o_except = rb_exc_new2(e_archive, err.what());

The assembly code starts by calling the guard function to determine if o_except has been initialized. If not, control is transferred via a jump instruction (jne) to another piece of code:

0x00007ffff42b5365:  lea    0x2089fc(%rip),%rdi        # 0x7ffff44bdd68 <_ZGVZL39_wrap_Reader_read_open_filename__SWIG_0iPmmE8o_except>
0x00007ffff42b536c:  callq  0x7ffff42aea30 <__cxa_guard_acquire@plt>
0x00007ffff42b5371:  test   %eax,%eax
0x00007ffff42b5373:  jne    0x7ffff42b5420 <_wrap_Reader_read_open_filename(int, VALUE*, VALUE)+1264>
0x00007ffff42b5379:  mov    0x208a00(%rip),%rdi        # 0x7ffff44bdd80 <_ZZL39_wrap_Reader_read_open_filename__SWIG_0iPmmE8o_except>

The code that is jumped to initializes o_except. You’ll see a call to rb_exc_new_cstr below; rb_exc_new2 is actually just a macro in the Ruby VM source and is replaced with rb_exc_new_cstr.

After the function is called, its return value is written to o_except and control is transferred back to the mov instruction above which appears after the jne.

0x00007ffff42b5420 <+1264>:  mov    (%rbx),%rax
0x00007ffff42b5423 <+1267>:  mov    %rbx,%rdi
0x00007ffff42b5426 <+1270>:  callq  *0x10(%rax)
0x00007ffff42b5429 <+1273>:  mov    0x208948(%rip),%rdi
0x00007ffff42b5430 <+1280>:  mov    %rax,%rsi
0x00007ffff42b5433 <+1283>:  callq  0x7ffff42af300 <rb_exc_new_cstr@plt>
0x00007ffff42b5438 <+1288>:  lea    0x208929(%rip),%rdi
0x00007ffff42b543f <+1295>:  mov    %rax,0x20893a(%rip)
0x00007ffff42b5446 <+1302>:  callq  0x7ffff42aed70 <__cxa_guard_release@plt>
0x00007ffff42b544b <+1307>:  jmpq   0x7ffff42b5379 <_wrap_Reader_read_open_filename(int, VALUE*, VALUE)+1097>

And, in this way, static is implemented for variables defined within functions in C++.

But, what does this have to do with the segfault?

Ruby’s GC implementation

In order to understand this bug, you must understand how Ruby’s garbage collector works. Ruby’s garbage collector is a conservative mark-and-sweep garbage collector.

It works by:

Crawling in use objects, starting at a set of root objects, and marking them
Checking the program stack and heap for any value that looks like it could be a Ruby object, and marking those.
Checking the register set of the CPU for any value that looks like it could be a Ruby object, and marking those.

Due to implementation details, it is impossible for Ruby to know if a particular value found on the stack or heap is actually a Ruby object or not. Ruby acts “conservatively” in that when it finds a value in a CPU register or on the program stack that looks like it refers to a Ruby object, the Ruby object that could be referenced is marked as in-use just incase.

After all objects (and things which look like objects) are marked, a sweep phase begins freeing objects which are not marked.

This process is demonstrated in the animation below (taken from here):

mark and sweep animation

The bug

The bug occurs because:

The Ruby VM’s object allocator doesn’t know which objects are in use or not, all it can do is guess
The variables marked as static are initialized once and afterward, never again
The compiler has optimized the generated assembly to use the fewest number of registers possible. As such, references to Ruby objects aren’t always guaranteed to exist in registers, or on the program stack, if the compiler thinks it can complete a function call or other operation without it. Remember, your compiler just needs to satisfy the ABI of the target system - it doesn’t “know” anything about Ruby or Ruby objects and performs valid optimizations for the target system.
The Ruby objects that are marked static are not stored on the heap or the program stack; they are stored in a different program memory segment entirely
When Ruby’s garbage collector runs, it does not see the static objects because:
- References to the objects aren’t found in the program stack where the Ruby GC will scan
- References to the objects don’t exist in registers due to optimizations and the static initialization code path running once

And so, Ruby’s GC mistakenly frees this object even though it will be used when an exception is generated.

Writing a fix

The quickest and simplest fix for this is to remove the static storage class qualifier:

-        static VALUE o_except = rb_exc_new2(e_archive, err.what());
+        VALUE o_except = rb_exc_new2(e_archive, err.what());

There are two effects of removing static:

The execption object is recreated in Ruby-land everytime an exception occurrs in C++
A reference to the Ruby object will actually exist on the program stack so Ruby’s GC will see it when it scans

A fix was sent via pull request on GitHub and merged to the project.

Deploying the fix

packagecloud uses packagecloud for maintaining internal dependencies. A fix for this issue was deployed immediately so that development of the feature could continue in development environments without waiting for a fix to be accepted and merged to the general project.

Our exact step-by-step for this:

The original repository was forked
The fix was committed
A version bump was committed
The RubyGem was rebuilt and pushed to packagecloud

Rebuilding the RubyGem and pushing to packagecloud was quick and easy:

$ gem build libarchive-ruby-swig.gemspec
WARNING:  licenses is empty, but is recommended.  Use a license abbreviation from:
http://opensource.org/licenses/alphabetical
WARNING:  See http://guides.rubygems.org/specification-reference/ for help
  Successfully built RubyGem
  Name: libarchive-ruby-swig
  Version: 0.6.0
  File: libarchive-ruby-swig-0.6.0.gem

then:

$ package_cloud push computology/internal libarchive-ruby-swig-0.6.0.gem
Looking for repository at computology/internal success!
Pushing libarchive-ruby-swig-0.6.0.gem... success!

Conclusion

Ruby’s garbage collector can prove to be a tricky adversary when writing or using C or C++ based RubyGems. You need to carefully consider how the garbage collector will interact with Ruby objects created and allocated in C/C++ and what the implications are when using storage class qualifiers.

Any complex system will eventually require the developers to manage their own set of dependencies in order to get bugfixes, performance improvements, or new features that can be used in the application.

Having a place for these objects to live and be tracked is crucial for ensuring that production, development, and test environments are using the same versions of every piece of software in the stack.