Lilblue Linux: release 20141212. dlclose() is a problem.

I pushed out another version of Lilblue Linux a few days ago but I don’t feel as good about this release as previous ones.  If you haven’t been following my posts, Lilblue is a fully featured amd64, hardened, XFCE4 desktop that uses uClibc instead of glibc as its standard C library.  The name is a bit misleading because Lilblue is Gentoo but departs from the mainstream in this one respect only.  In fact, I strive to make it as close to mainstream Gentoo as possible so that everything will “just work”.  I’ve been maintaining Lilblue for years as a way of pushing the limits of uClibc, which is mainly intended for embedded systems, to see where it breaks and fix or improve it.

As with all releases, there are always a few minor problems, little annoyances that are not exactly show stoppers.  One minor oversight that I found after releasing was that I hadn’t configured smplayer correctly.  That’s the GUI front end to mplayer that you’ll find on the toolbar at the bottom of the desktop.  It works, just not out-of-the-box.  In the preferences, you need to switch from mplayer2 to mplayer and set the video out to x11.  I’ll add that to the build scripts to make sure it’s in the next release [1].  I’ve also been migrating away from gnome-centered applications, which have been pulling in more and more bloat.  A couple of releases ago I switched from gnome-terminal to xfce4-terminal, and for this release, I finally made the leap from epiphany to midori as the main browser.  I like midori better although it isn’t as popular as epiphany.  I hope others approve of the choice.

But there is one issue I hit which is serious.  It seems with every release I hit at least one of those.  This time it was in uClibc’s implementation of dlclose().  Along with dlopen() and dlsym(), this is how shared objects can be loaded into a running program during execution rather than at load time.  This is probably more familiar to people as “plugins”, which are just shared objects loaded while the program is running.  When building the latest Lilblue image, gnome-base/librsvg segfaulted while running gdk-pixbuf-query-loaders [2].  The latter links against glib and calls g_module_open() and g_module_close() on many shared objects as it constructs a cache of loadable objects.  g_module_{open,close} are just glib’s wrappers around dlopen() and dlclose() on systems that provide them, like Linux.  A preliminary backtrace obtained by running gdb on `/usr/bin/gdk-pixbuf-query-loaders ./` pointed to the segfault happening in gcc’s __deregister_frame_info() in unwind-dw2-fde.c, which didn’t sound right.  I rebuilt the entire system with CFLAGS+="-fno-omit-frame-pointer -O1 -ggdb" and turned on uClibc’s SUPPORT_LD_DEBUG=y, which emits debugging info to stderr when running with LD_DEBUG=y, and DODEBUG=y, which prevents symbol stripping in uClibc’s libraries.  A more complete backtrace gave:

Program received signal SIGSEGV, Segmentation fault.
__deregister_frame_info (begin=0x7ffff22d96e0) at /var/tmp/portage/sys-devel/gcc-4.8.3/work/gcc-4.8.3/libgcc/unwind-dw2-fde.c:222
222 /var/tmp/portage/sys-devel/gcc-4.8.3/work/gcc-4.8.3/libgcc/unwind-dw2-fde.c: No such file or directory.
(gdb) bt
#0 __deregister_frame_info (begin=0x7ffff22d96e0) at /var/tmp/portage/sys-devel/gcc-4.8.3/work/gcc-4.8.3/libgcc/unwind-dw2-fde.c:222
#1 0x00007ffff22c281e in __do_global_dtors_aux () from /lib/
#2 0x0000555555770da0 in ?? ()
#3 0x0000555555770da0 in ?? ()
#4 0x00007fffffffdde0 in ?? ()
#5 0x00007ffff22d8a2f in _fini () from /lib/
#6 0x00007fffffffdde0 in ?? ()
#7 0x00007ffff6f8018d in do_dlclose (vhandle=0x7ffff764a420 <__malloc_lock>, need_fini=32767) at ldso/libdl/libdl.c:860
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

The problem occurred when running the global destructors while dlclose()-ing a shared object.  Line 860 of libdl.c has DL_CALL_FUNC_AT_ADDR (dl_elf_fini, tpnt->loadaddr, (int (*)(void))); which is a macro that calls the function at address dl_elf_fini with signature int(*)(void).  If you’re not familiar with ctors and dtors, these are the global constructors/destructors whose code lives in the .ctors and .dtors sections of an ELF object, which you can see with readelf -S <obj>.  The ctors are run when a library is first linked or opened via dlopen(), and similarly the dtors are run when dlclose()-ing.  Here’s some code to demonstrate this:

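# Makefile
all: tmp.so test

# Build the shared object from tmp.c: compile as PIC, then link with -shared.
tmp.so: tmp.c
	gcc -fPIC -c $^ -o tmp.o
	gcc -shared -Wl,-soname,$@ -o $@ tmp.o

test: test-dlopen.c
	gcc -o $@ $^ -ldl

clean:
	rm -f *.so *.o test
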
// tmp.c
#include <stdio.h>

void my_init() __attribute__ ((constructor));
void my_fini() __attribute__ ((destructor));

void my_init() { printf("Global initialization!\n"); }
void my_fini() { printf("Global cleanup!\n"); }
void doit() { printf("Doing it!\n"); }

// test-dlopen.c
// This has very bad error handling, sacrificed for readability.
#include <stdio.h>
#include <dlfcn.h>

int main() {
        int (*mydoit)();
        void *handle = NULL;

        handle = dlopen("./", RTLD_LAZY);   // the ctor my_init() runs here
        mydoit = dlsym(handle, "doit");
        mydoit();                           // prints "Doing it!"
        dlclose(handle);                    // the dtor my_fini() runs here

        return 0;
}

When run, this code gives:

# ./test 
Global initialization!
Doing it!
Global cleanup!

So, my_init() is run on dlopen() and my_fini() is run on dlclose().  Basically, upon dlopen()-ing a shared object as you would a plugin, the library is first mmap()-ed into the process’s address space using the PT_LOAD addresses which you can see with readelf -l <obj>.  Then, one walks through all the global constructors and runs them.  Upon dlclose()-ing the opposite process is done.  One first walks through the global destructors and runs them, and then one munmap()-s the same mappings.

Figuring I wasn’t the only person to see a problem here, I googled and found that Natanael Copa of Alpine Linux hit a similar problem [3] back when Alpine used uClibc (it now uses musl).  He identified a problematic commit and I wrote a patch which retains the new behavior introduced by that commit when the environment variable NEW_START is set, but otherwise reverts to the old behavior.  I also added some extra diagnostics to LD_DEBUG to better see what was going on.  I’ll add my patch to a comment below, but the gist of it is that it toggles between the old and new way of calculating the size of the munmap()-ings, which is done by subtracting a start address from an end address.  The old behavior used a mapaddr for the start address that is totally wrong and basically causes every munmap()-ing to fail with EINVAL.  This is corrected by the commit, as a simple strace -e trace=munmap shows.

My results when running with LD_DEBUG=1 were interesting to say the least.  With the old behavior, the segfault was gone:

# LD_DEBUG=1 /usr/bin//gdk-pixbuf-query-loaders
do_dlclose():859: running dtors for library /lib/ at 0x7f26bcf39a26
do_dlclose():864: unmapping: /lib/
do_dlclose():869: before new start = 0xffffffffffffffff
do_dlclose():877: during new start = (nil), vaddr = (nil), type = 1
do_dlclose():877: during new start = (nil), vaddr = 0x219c90, type = 1
do_dlclose():881: after new start = (nil)
do_dlclose():987: new start = (nil)
do_dlclose():991: old start = 0x7f26bcf22000
do_dlclose():994: dlclose using old start
do_dlclose():998: end = 0x21b000
do_dlclose():1013: removing loaded_modules: /lib/
do_dlclose():1031: removing symbol_tables: /lib/

Of course, all of the munmap()-ings failed.  The dtors were run, but no shared object got unmapped.  When running the code with the correct value of start, I got:

# NEW_START=1 LD_DEBUG=1 /usr/bin//gdk-pixbuf-query-loaders
do_dlclose():859: running dtors for library /lib/ at 0x7f5df192ba26
Segmentation fault

What’s interesting here is that the segfault occurs at DL_CALL_FUNC_AT_ADDR, which is before the munmap()-ing and so before any effect that the new value of start should have!  This seems utterly mysterious until you realize that there is a whole set of dlopens/dlcloses as gdk-pixbuf-query-loaders does its job: I counted 40 in all!  This is as far as I’ve gotten narrowing down this mystery, but I suspect some previous munmap()-ing is breaking the dtors of a shared object that is closed later, so that when the call is made to that address, it’s no longer valid, leading to the segfault.

Rich Felker, aka dalias, the developer of musl, made an interesting comment to me on IRC when I told him about this issue.  He said that the unmappings are dangerous and that musl actually doesn’t do them.  For now, I’ve intentionally left the unmappings in uClibc’s dlclose() “broken” in the latest release of Lilblue, so you can’t hit this bug, but for the next release I’m going to look carefully at what glibc and musl do and try to get this fix upstream.  As I said when I started this post, I’m not totally happy with this release because I didn’t nail the issue, I just implemented a workaround.  Any hints would be much appreciated!

[1] The build scripts can be found in the releng repository at git:// under tools-uclibc/desktop.  The scripts begin with a hardened amd64 uclibc stage3 tarball and build up the desktop.

[2] The purpose of librsvg and gdk-pixbuf is not essential to the problem with dlclose(), but for completeness I state them here: librsvg is a library for rendering scalable vector graphics and gdk-pixbuf is an image loading library for gtk+.  gdk-pixbuf-query-loaders reads a libtool .la file and generates a cache of loadable shared objects to be consumed by gdk-pixbuf.

[3] See He suggested that the following commit was doing evil things: