{"id":135,"date":"2014-12-16T23:16:12","date_gmt":"2014-12-16T23:16:12","guid":{"rendered":"http:\/\/blogs.gentoo.org\/blueness\/?p=135"},"modified":"2014-12-16T23:16:12","modified_gmt":"2014-12-16T23:16:12","slug":"lilblue-linux-release-20141212-dlclose-is-a-problem","status":"publish","type":"post","link":"https:\/\/blogs.gentoo.org\/blueness\/2014\/12\/16\/lilblue-linux-release-20141212-dlclose-is-a-problem\/","title":{"rendered":"Lilblue Linux: release 20141212.  dlclose() is a problem."},"content":{"rendered":"<p>I pushed out another version of Lilblue Linux a few days ago but I don&#8217;t feel as good about this release as previous ones. \u00a0If you haven&#8217;t been following my posts, Lilblue\u00a0is a fully featured amd64, hardened, XFCE4 desktop that uses uClibc instead of glibc as its standard C library. \u00a0The name is a bit misleading because Lilblue\u00a0<strong>is<\/strong>\u00a0Gentoo but departs from the mainstream in this one respect only. \u00a0In fact, I strive to make it as close to mainstream Gentoo as possible so that everything will &#8220;just work&#8221;. \u00a0I&#8217;ve been maintaining Lilblue for years as a way of pushing the limits of uClibc, which is mainly intended\u00a0for embedded systems, to see where it breaks and fix or improve it.<\/p>\n<p>As with all releases, there are always\u00a0a few minor problems, little annoyances that are not exactly show stoppers. \u00a0One minor oversight that I found after releasing\u00a0was that I hadn&#8217;t configured smplayer correctly. \u00a0That&#8217;s the GUI front end to mplayer that you&#8217;ll find on the toolbar at the bottom of the desktop. It works, just\u00a0not out-of-the-box.\u00a0 In the preferences, you need to switch from mplayer2 to mplayer and set the video out to x11. \u00a0I&#8217;ll add that to the build scripts to make sure it&#8217;s in the next release [1]. 
\u00a0I&#8217;ve also been migrating away from gnome-centered applications which have been pulling in more and more bloat. \u00a0A couple of releases ago I switched\u00a0from gnome-terminal to xfce4-terminal, and for this release, I finally made the leap from epiphany to midori as the main browser. \u00a0I like midori\u00a0better although\u00a0it isn&#8217;t as popular as epiphany. \u00a0I hope others approve of the choice.<\/p>\n<p>But there is one issue I hit which is serious. \u00a0It seems with every release I hit at least one of those. \u00a0This time it was in uClibc&#8217;s implementation of dlclose(). \u00a0Along with dlopen() and dlsym(), this is how shared objects can be loaded into a running program during execution rather than at load time. \u00a0This is probably\u00a0more familiar to people\u00a0as\u00a0&#8220;plugins&#8221;, which are just shared objects loaded while the program is running. \u00a0When building the latest Lilblue image,\u00a0the build of gnome-base\/librsvg segfaulted while running\u00a0gdk-pixbuf-query-loaders [2]. \u00a0The latter links against glib and calls g_module_open() and g_module_close() on many shared objects as\u00a0it constructs a\u00a0cache of loadable objects. \u00a0g_module_{open,close} are just glib&#8217;s wrappers around dlopen() and dlclose() on systems that provide them, like Linux. \u00a0A preliminary backtrace obtained\u00a0by running gdb on `\/usr\/bin\/gdk-pixbuf-query-loaders .\/libpixbufloader-svg.la` pointed to the segfault happening in gcc&#8217;s\u00a0__deregister_frame_info() in unwind-dw2-fde.c, which didn&#8217;t sound right. \u00a0I rebuilt the entire system with\u00a0CFLAGS+=\"-fno-omit-frame-pointer -O1 -ggdb\" and turned on two uClibc options: SUPPORT_LD_DEBUG=y, which emits debugging info to stderr when running with LD_DEBUG=y, and DODEBUG=y, which prevents\u00a0symbol stripping in uClibc&#8217;s libraries. 
\u00a0A more complete backtrace gave:<\/p>\n<pre>Program received signal SIGSEGV, Segmentation fault.\r\n__deregister_frame_info (begin=0x7ffff22d96e0) at \/var\/tmp\/portage\/sys-devel\/gcc-4.8.3\/work\/gcc-4.8.3\/libgcc\/unwind-dw2-fde.c:222\r\n222 \/var\/tmp\/portage\/sys-devel\/gcc-4.8.3\/work\/gcc-4.8.3\/libgcc\/unwind-dw2-fde.c: No such file or directory.\r\n(gdb) bt\r\n#0 __deregister_frame_info (begin=0x7ffff22d96e0) at \/var\/tmp\/portage\/sys-devel\/gcc-4.8.3\/work\/gcc-4.8.3\/libgcc\/unwind-dw2-fde.c:222\r\n#1 0x00007ffff22c281e in __do_global_dtors_aux () from \/lib\/libbz2.so.1\r\n#2 0x0000555555770da0 in ?? ()\r\n#3 0x0000555555770da0 in ?? ()\r\n#4 0x00007fffffffdde0 in ?? ()\r\n#5 0x00007ffff22d8a2f in _fini () from \/lib\/libbz2.so.1\r\n#6 0x00007fffffffdde0 in ?? ()\r\n#7 0x00007ffff6f8018d in do_dlclose (vhandle=0x7ffff764a420 &lt;__malloc_lock&gt;, need_fini=32767) at ldso\/libdl\/libdl.c:860\r\nBacktrace stopped: previous frame inner to this frame (corrupt stack?)\r\n<\/pre>\n<p>The problem occurred when\u00a0running the global destructors while dlclose()-ing libbz2.so.1. \u00a0Line 860 of libdl.c has\u00a0DL_CALL_FUNC_AT_ADDR (dl_elf_fini, tpnt-&gt;loadaddr, (int (*)(void))); which is a macro that calls a function at address dl_elf_fini with signature int(*)(void). \u00a0If you&#8217;re not familiar with ctors and dtors, these are the global constructors\/destructors whose code lives in the .ctors and .dtors sections of an ELF object, which you can see when doing\u00a0readelf -S &lt;obj&gt;. \u00a0The ctors are run when a library is first loaded, either at program startup or via dlopen(), and similarly the dtors are run when dlclose()-ing. 
\u00a0Here&#8217;s some code to demonstrate this:<\/p>\n<pre># Makefile\r\nall: tmp.so test\r\ntmp.o: tmp.c\r\n        gcc -fPIC -c $^\r\ntmp.so: tmp.o\r\n        gcc -shared -Wl,-soname,$@ -o $@ $^\r\ntest: test-dlopen.c\r\n        gcc -o $@ $^ -ldl\r\nclean:\r\n        rm -f *.so *.o test\r\n<\/pre>\n<pre>\/\/ tmp.c\r\n#include &lt;stdio.h&gt;\r\n\r\nvoid my_init() __attribute__ ((constructor));\r\nvoid my_fini() __attribute__ ((destructor));\r\n\r\nvoid my_init() { printf(\"Global initialization!\\n\"); }\r\nvoid my_fini() { printf(\"Global cleanup!\\n\"); }\r\nvoid doit() { printf(\"Doing it!\\n\"); }\r\n<\/pre>\n<pre>\/\/ test-dlopen.c\r\n\/\/ This has very bad error handling, sacrificed for readability.\r\n#include &lt;stdio.h&gt;\r\n#include &lt;dlfcn.h&gt;\r\n\r\nint main() {\r\n        void (*mydoit)(void);\r\n        void *handle = NULL;\r\n\r\n        handle = dlopen(\".\/tmp.so\", RTLD_LAZY);\r\n        mydoit = dlsym(handle, \"doit\");\r\n        mydoit();\r\n        dlclose(handle);\r\n\r\n        return 0;\r\n}\r\n<\/pre>\n<p>When run, this code gives:<\/p>\n<pre># .\/test \r\nGlobal initialization!\r\nDoing it!\r\nGlobal cleanup!\r\n<\/pre>\n<p>So, my_init() is run on dlopen() and my_fini() is run on dlclose(). \u00a0Basically, upon dlopen()-ing a shared object as you would a plugin, the library is first mmap()-ed into the process&#8217;s address space using\u00a0the PT_LOAD addresses which you can see with\u00a0readelf -l &lt;obj&gt;. \u00a0Then, the loader walks through all the global constructors and runs them. \u00a0Upon dlclose()-ing, the opposite is done. \u00a0The loader first walks through the global destructors and runs them, and then\u00a0it munmap()-s the same mappings.<\/p>\n<p>Figuring I wasn&#8217;t the only person to see a problem here, I googled and found that Natanael Copa of Alpine Linux hit a similar problem [3] back when Alpine used uClibc (it now uses musl). 
\u00a0He identified a problematic commit and I wrote a patch which would retain the new behavior introduced by\u00a0that commit\u00a0upon setting an environment variable NEW_START, but would otherwise revert to the old behavior if NEW_START is unset. \u00a0I also added some extra diagnostics to LD_DEBUG to better see what was going on. \u00a0I&#8217;ll add my\u00a0patch to a comment below, but the gist of it is that it toggles between the old and new way of calculating\u00a0the size of the munmap()-ings by subtracting an end and start\u00a0address. \u00a0The old behavior used a mapaddr for the start address that\u00a0is totally wrong and basically causes every munmap()-ing to fail with\u00a0EINVAL. \u00a0This is corrected by the commit as a simple strace -e trace=munmap shows.<\/p>\n<p>My results when running with LD_DEBUG=1 were\u00a0interesting to say the least. \u00a0With the old behavior, the segfault was gone:<\/p>\n<pre># LD_DEBUG=1 \/usr\/bin\/\/gdk-pixbuf-query-loaders libpixbufloader-svg.la\r\n...\r\ndo_dlclose():859: running dtors for library \/lib\/libbz2.so.1 at 0x7f26bcf39a26\r\ndo_dlclose():864: unmapping: \/lib\/libbz2.so.1\r\ndo_dlclose():869: before new start = 0xffffffffffffffff\r\ndo_dlclose():877: during new start = (nil), vaddr = (nil), type = 1\r\ndo_dlclose():877: during new start = (nil), vaddr = 0x219c90, type = 1\r\ndo_dlclose():881: after new start = (nil)\r\ndo_dlclose():987: new start = (nil)\r\ndo_dlclose():991: old start = 0x7f26bcf22000\r\ndo_dlclose():994: dlclose using old start\r\ndo_dlclose():998: end = 0x21b000\r\ndo_dlclose():1013: removing loaded_modules: \/lib\/libbz2.so.1\r\ndo_dlclose():1031: removing symbol_tables: \/lib\/libbz2.so.1\r\n...\r\n<\/pre>\n<p>Of course, all\u00a0of the munmap()-ings failed. \u00a0The dtors were run, but no shared object got unmapped. 
\u00a0When running the code with the correct value of start, I got:<\/p>\n<pre># NEW_START=1 LD_DEBUG=1 \/usr\/bin\/\/gdk-pixbuf-query-loaders libpixbufloader-svg.la\r\n...\r\ndo_dlclose():859: running dtors for library \/lib\/libbz2.so.1 at 0x7f5df192ba26\r\nSegmentation fault\r\n<\/pre>\n<p>What&#8217;s interesting here is that the segfault occurs at DL_CALL_FUNC_AT_ADDR, which is <strong>before<\/strong>\u00a0the munmap()-ing and so before any effect that the new value of start should have! This seems utterly mysterious until you realize that there is a whole set of dlopens\/dlcloses as gdk-pixbuf-query-loaders does its job. I counted 40 in all! \u00a0This is as far as I&#8217;ve gotten narrowing down this mystery, but I suspect some previous munmap()-ing is breaking the dtors for libbz2.so.1, so that when the call is made to that address, it&#8217;s no longer valid, leading to the segfault.<\/p>\n<p>Rich Felker, aka dalias, the developer of musl, made an interesting comment to me in IRC when I told him about this issue. \u00a0He said that the unmappings are dangerous and that musl actually doesn&#8217;t do them. \u00a0For now, I&#8217;ve intentionally left\u00a0the unmappings in uClibc&#8217;s dlclose() &#8220;broken&#8221; in the latest release of Lilblue, so you can&#8217;t hit this bug, but for the next release I&#8217;m going to look carefully at what glibc and musl do and try to get a fix upstream. \u00a0As I said when I started this post, I&#8217;m not totally happy with this release because I didn&#8217;t nail the issue, I just implemented a workaround. \u00a0Any hints would be much appreciated!<\/p>\n<p>[1] The build scripts can be found in the releng repository at\u00a0git:\/\/git.overlays.gentoo.org\/proj\/releng.git under tools-uclibc\/desktop. 
\u00a0The scripts begin with a <a href=\"http:\/\/distfiles.gentoo.org\/releases\/amd64\/autobuilds\/current-stage3-amd64-uclibc-hardened\/\">hardened amd64 uclibc stage3<\/a> tarball and build up the desktop.<\/p>\n<p>[2] The purpose of librsvg and gdk-pixbuf is not essential to the problem with dlclose(), but for completeness we state it here:\u00a0librsvg is a library for rendering scalable vector graphics and\u00a0gdk-pixbuf is an image loading library for gtk+. \u00a0gdk-pixbuf-query-loaders reads a libtool .la file and generates a cache of loadable shared objects to be consumed by gdk-pixbuf.<\/p>\n<p>[3] See http:\/\/lists.uclibc.org\/pipermail\/uclibc\/2012-October\/047059.html. He suggested that the following commit was doing evil things: http:\/\/git.uclibc.org\/uClibc\/commit\/ldso?h=0.9.33&amp;id=9b42da7d0558884e2a3cc9a8674ccfc752369610<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I pushed out another version of Lilblue Linux a few days ago but I don&#8217;t feel as good about this release as previous ones. \u00a0If you haven&#8217;t been following my posts, Lilblue\u00a0is a fully featured amd64, hardened, XFCE4 desktop that uses uClibc instead of glibc as its standard C library. \u00a0The name is a bit &hellip; <a href=\"https:\/\/blogs.gentoo.org\/blueness\/2014\/12\/16\/lilblue-linux-release-20141212-dlclose-is-a-problem\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Lilblue Linux: release 20141212.  
dlclose() is a problem.&#8221;<\/span><\/a><\/p>\n","protected":false},"author":141,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[1,3],"tags":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blogs.gentoo.org\/blueness\/wp-json\/wp\/v2\/posts\/135"}],"collection":[{"href":"https:\/\/blogs.gentoo.org\/blueness\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.gentoo.org\/blueness\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/blueness\/wp-json\/wp\/v2\/users\/141"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/blueness\/wp-json\/wp\/v2\/comments?post=135"}],"version-history":[{"count":46,"href":"https:\/\/blogs.gentoo.org\/blueness\/wp-json\/wp\/v2\/posts\/135\/revisions"}],"predecessor-version":[{"id":181,"href":"https:\/\/blogs.gentoo.org\/blueness\/wp-json\/wp\/v2\/posts\/135\/revisions\/181"}],"wp:attachment":[{"href":"https:\/\/blogs.gentoo.org\/blueness\/wp-json\/wp\/v2\/media?parent=135"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/blueness\/wp-json\/wp\/v2\/categories?post=135"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/blueness\/wp-json\/wp\/v2\/tags?post=135"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}