Getting stuck in the mud

I bet you can relate. You find a nasty bug, but every avenue you try leaves your tires spinning in mud. Google searches are mostly fruitless (or there just isn’t a way to formulate a good query), and no one else seems to be able to reproduce the problem. You are alone. Tracebacks leave you thinking, “Oh…NO!” And you finally realize that the best plan is to get some sleep, start fresh in the morning, and use that time in the shower to come up with the next great plan of attack.

Just when you wonder if you will ever make headway, that little breakthrough happens, and it changes the nature of the problem. The new puzzle might be even harder than the first, but at least it’s different now. Your tires can finally grab some of that mud, returning hope and bringing new energy to the problem.

This happened to me recently on my Gentoo/FreeBSD system. It started when I was running a script I wrote to create slide shows on my web site. I needed to use the “mogrify” command from ImageMagick (a wonderful package for those of us who work with images). What has always been a trusty utility threw me a curve ball: SEGFAULT – crap! It was the worst kind of bug – down deep in an OS threading function. Ugh, I had a bad feeling about this one.

For a little background, FreeBSD is transitioning to a more efficient threading library called “libthr”. A couple of the old legacy libs, libpthread and libc_r, are now mapped to this new one via a mechanism controlled by “/etc/libmap.conf”. This file lets you “drop in” replacement shared libs and tell all programs to start using them instead.
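To make the mapping idea concrete, here is a rough Python sketch of how a libmap.conf-style table redirects one library name to another. It is a simplified model, not the real run-time linker code: the library names and version numbers in the sample are illustrative assumptions, the function names are made up for this example, and only the global section of the file is modeled.

```python
# Simplified model of a libmap.conf lookup. This is NOT the real rtld code.
# The library names and version numbers below are illustrative assumptions.

SAMPLE_LIBMAP = """\
# origlib        newlib
libpthread.so.2  libthr.so.2
libpthread.so    libthr.so
libc_r.so.6      libthr.so.2
libc_r.so        libthr.so
"""


def parse_libmap(text):
    """Return a dict mapping each requested library name to its replacement.

    Only entries in the global section are kept; anything that follows a
    "[constraint]" header is ignored in this sketch.
    """
    mapping = {}
    in_constraint = False
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("["):
            in_constraint = True  # per-program sections are not modeled
            continue
        fields = line.split()
        if not in_constraint and len(fields) == 2:
            original, replacement = fields
            mapping[original] = replacement
    return mapping


def resolve(mapping, requested):
    """Return the name that would be loaded in place of the requested one."""
    return mapping.get(requested, requested)


if __name__ == "__main__":
    libmap = parse_libmap(SAMPLE_LIBMAP)
    for name in ("libpthread.so.2", "libc_r.so.6", "libm.so.4"):
        print(name, "->", resolve(libmap, name))
```

The point to notice is that the table only rewrites names at lookup time. The old library files are still sitting on disk, so anything that reaches them by another route is not covered by the mapping.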

On the gentoo-bsd IRC channel, UberLord suggested that I try turning off the libmap.conf mapping to libthr. Brilliant! The problem went away, which was a big clue. My first thought was that either there was a bug in libthr (yikes!) or a problem in mogrify’s use of threading. So I started down the dreaded debugging path, instrumenting the libthr and mogrify code with printf statements, using gdb, etc. I finally determined that after many calls to pthread_mutex_lock and pthread_mutex_unlock, the CPU register containing the “current thread” (%gs on i386) suddenly changed. I assumed that mogrify was creating a new thread and perhaps not initializing it, but no such luck: mogrify does not normally use multiple threads (I verified with the authors). Something was stomping on that register, and it was not random: the new thread address was typically 0x100 larger than the original one, hinting at interference from other code that also used %gs.

So here I was. Stuck. Trying to imagine new ways to instrument libthr, wondering if I would ever get to the bottom of this. No one else seemed to see the problem (UberLord later did reproduce the bug), and even the FreeBSD thread developer I emailed a couple of times did not provide that “magic idea” (like, “Oh yeah, I know what’s going on!” – don’t we all hope to hear this?). I even traded executables and libs with others who didn’t see the problem, but this uncovered nothing new (yeah, I figured). It’s no fun trying things that you almost know won’t help but have to be tried anyway – it really is like spinning your tires.

Then it happened. I did another grep through the FreeBSD code to see what else could possibly modify %gs. Well, libpthread does, of course (the older lib usually used for threading), but that’s mapped to libthr by libmap.conf, right? All this time I had assumed libpthread was “locked out” and not used. But oh hell, I’ve seen many other “but that can’t happen” bugs in my life, so I moved aside libpthread.so, completely removing its involvement. With great anticipation, I hit enter to run mogrify again – no segfault! YES! The nature of the problem had now changed, and my tires were getting traction again. With a new level of energy, I started my search for why this would be happening. According to a reply to my inquiry on the freebsd-threads mailing list, there have been cases in which some symbols from an old library are picked up even when libmap.conf is set up to prevent this.

So part of this is still under investigation (i.e. why doesn’t libmap.conf provide a water-tight mapping?), but one thing is clear: libthr itself is not the problem. The big troublemaker is the mixing of threading libs. In fact, symlinking the legacy threading libs (libpthread and libc_r, .a and .so) to libthr has proven to be a stable solution during my testing.
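For anyone who wants to check whether their own legacy threading libs really point at libthr, a small audit along these lines might help. It is only a sketch: the function name, the “/usr/lib” directory, and the unversioned file names are assumptions for illustration, so adjust them to whatever your system actually has.

```python
import os


def audit_thread_libs(lib_dir, expected_links):
    """Report whether each legacy library file resolves to its expected target.

    expected_links maps a legacy file name to the file it should point at,
    both relative to lib_dir.
    """
    report = {}
    for legacy, target in expected_links.items():
        legacy_path = os.path.join(lib_dir, legacy)
        target_path = os.path.realpath(os.path.join(lib_dir, target))
        if not os.path.exists(target_path):
            report[legacy] = "target missing"
        elif not os.path.lexists(legacy_path):
            report[legacy] = "missing"
        elif os.path.realpath(legacy_path) == target_path:
            report[legacy] = "ok"  # symlink (or same file) resolving to target
        else:
            report[legacy] = "separate library"  # old lib can still be loaded
    return report


if __name__ == "__main__":
    # Hypothetical directory and file names; adjust to the system at hand.
    expected = {
        "libpthread.so": "libthr.so",
        "libpthread.a": "libthr.a",
        "libc_r.so": "libthr.so",
        "libc_r.a": "libthr.a",
    }
    for name, status in audit_thread_libs("/usr/lib", expected).items():
        print(name, status)
```

Any entry reported as “separate library” is a file that could still be mixed in alongside libthr, which is exactly the situation that bit me.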

One thing experience has taught me is that you should never give up. Even when those tires are really spinning and getting nowhere, all you need is a new angle on the problem, and the little ideas that spark change often come at unexpected times.