{"id":2418,"date":"2025-10-12T11:14:26","date_gmt":"2025-10-12T09:14:26","guid":{"rendered":"https:\/\/blogs.gentoo.org\/mgorny\/?p=2418"},"modified":"2025-10-12T11:14:26","modified_gmt":"2025-10-12T09:14:26","slug":"how-we-incidentally-uncovered-a-7-year-old-bug-in-gentoo-ci","status":"publish","type":"post","link":"https:\/\/blogs.gentoo.org\/mgorny\/2025\/10\/12\/how-we-incidentally-uncovered-a-7-year-old-bug-in-gentoo-ci\/","title":{"rendered":"How we incidentally uncovered a 7-year old bug in gentoo-ci"},"content":{"rendered":"<p>&#8220;Gentoo CI&#8221; is the\u00a0service providing periodic linting for the\u00a0Gentoo repository.  It is a part of\u00a0the\u00a0<a rel=\"external\" href=\"https:\/\/wiki.gentoo.org\/wiki\/Project:Repository_mirror_and_CI\">Repository mirror and\u00a0CI<\/a> project that I started in\u00a02015.  Of course, it all started as\u00a0a\u00a0temporary third-party solution, but it persisted, was integrated into Gentoo Infrastructure and\u00a0grew organically into quite a\u00a0monstrosity.<\/p>\n<p>It&#8217;s imperfect in\u00a0many ways.  In\u00a0particular, it has only some degree of\u00a0error recovery, and\u00a0when things go wrong beyond that, it requires a\u00a0manual fix.  Often the\u00a0&#8220;fix&#8221; is to stop mirroring a\u00a0problematic repository.  Over time, I started having serious doubts about the\u00a0project, and\u00a0<a rel=\"external\" href=\"https:\/\/archives.gentoo.org\/gentoo-dev\/6b358608f6e244cb96ce527ad47b3e0483eaf0c6.camel@gentoo.org\/\">proposed sunsetting most of\u00a0it<\/a>.<\/p>\n<p>Lately, things have been getting worse.  What started as\u00a0a\u00a0minor change in\u00a0behavior of\u00a0Git triggered a\u00a0whole cascade of\u00a0failures, leading me to finally announce the\u00a0deadline for\u00a0sunsetting the\u00a0mirroring of\u00a0third-party repositories, and\u00a0to start ripping non-critical bits out of\u00a0it.  
Interestingly enough, this whole process led me to finally discover the\u00a0root cause of\u00a0most of\u00a0these failures \u2014 a\u00a0bug that has existed since a\u00a0very early version of\u00a0the\u00a0code, but\u00a0happened to be hidden by\u00a0the\u00a0hacky error recovery code.  Here&#8217;s the\u00a0story of\u00a0it.<\/p>\n<p><!--more--><\/p>\n<hr \/>\n<p>Repository mirror and\u00a0CI is\u00a0basically a\u00a0bunch of\u00a0shell scripts with\u00a0Python helpers run via a\u00a0cronjob (<a rel=\"external\" href=\"https:\/\/github.com\/projg2\/repo-mirror-ci\/\">repo-mirror-ci code<\/a>).  The\u00a0scripts are responsible for\u00a0syncing the\u00a0whole lot of\u00a0public Gentoo repositories, generating caches for\u00a0them, publishing them onto our mirror repositories, and\u00a0finally running pkgcheck on\u00a0the\u00a0Gentoo repository.  Most of\u00a0the\u00a0&#8220;unexpected&#8221; error handling is <kbd>set -e -x<\/kbd>, with dumb logging to\u00a0a\u00a0file, and\u00a0mailing on a\u00a0cronjob failure.  Some common errors are handled gracefully though \u2014 sync errors, pkgcheck failures and\u00a0so on.<\/p>\n<p>The\u00a0whole cascade started when Git was upgraded on\u00a0the\u00a0server.  The\u00a0upgrade involved a\u00a0change in\u00a0behavior where <kbd>git checkout -- ${branch}<\/kbd> stopped working; you could only specify files after the\u00a0<kbd>--<\/kbd>.  The\u00a0fix was trivial enough.<\/p>\n<p>However, once the\u00a0issue was fixed, I started periodically seeing sync failures from\u00a0the\u00a0Gentoo repository.  The\u00a0scripts had a\u00a0very dumb way of\u00a0handling sync failures: if\u00a0syncing failed, they removed the\u00a0local copy entirely and\u00a0tried again.  This generally made sense \u2014 say, if\u00a0upstream renamed the\u00a0main branch, <kbd>git pull<\/kbd> would fail, but a\u00a0fresh clone would be\u00a0a\u00a0cheap fix.  
However, the\u00a0Gentoo repository is quite big, and\u00a0when it got removed due to a\u00a0sync failure, cloning it afresh from\u00a0the\u00a0Gentoo infrastructure would fail.<\/p>\n<p>So when it failed, I\u00a0did a\u00a0quick hack \u2014 I\u00a0cloned the\u00a0repository manually from\u00a0GitHub, replaced the\u00a0remote and\u00a0put it in\u00a0place.  Problem solved.  Except a\u00a0while later, the\u00a0same issue surfaced.  This time I\u00a0kept an additional\u00a0local clone, so I wouldn&#8217;t have to\u00a0fetch it from the\u00a0server, and\u00a0added it again.  But then, it got removed once more, and\u00a0this was really getting tedious.<\/p>\n<p>What I assumed then was that the\u00a0repository was failing to\u00a0sync due to\u00a0some temporary problem, either network- or\u00a0infrastructure-related.  If\u00a0that were the\u00a0case, it really made no\u00a0sense to\u00a0remove it and\u00a0clone afresh.  On\u00a0top of\u00a0that, since we are sunsetting support for\u00a0third-party repositories anyway, there was no\u00a0need for\u00a0automatic recovery from\u00a0issues such as\u00a0branch name changes.  So I removed that logic, to\u00a0have sync fail immediately, without removing the\u00a0local copy.<\/p>\n<p>Now, this had important consequences.  Previously, any failed sync would result in\u00a0the\u00a0repository being removed and\u00a0cloned again, leaving no trace of\u00a0the\u00a0original error.  On\u00a0top of\u00a0that, logic stopping the\u00a0script early when the\u00a0Gentoo repository failed meant that the\u00a0actual error wasn&#8217;t even saved, leaving me only with the\u00a0subsequent clone failures.<\/p>\n<p>When the\u00a0sync failed again (and\u00a0of\u00a0course it did), I was able to\u00a0actually investigate what was wrong.  What actually happened was that the\u00a0repository wasn&#8217;t on\u00a0a\u00a0branch \u2014 the\u00a0checkout was detached at\u00a0some commit.  
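<\/p>\n<p>For illustration, this failure mode is easy to reproduce in\u00a0a\u00a0throwaway repository (the\u00a0names below are made up, not the\u00a0actual CI setup): once the\u00a0checkout is detached, a\u00a0plain <kbd>git pull<\/kbd> has no\u00a0branch to update, so a\u00a0sync script sees a\u00a0hard failure.<\/p>

```shell
# A minimal sketch (not the actual repo-mirror-ci code): make a tiny
# upstream repository, clone a mirror of it, and detach its checkout.
set -e
tmp=$(mktemp -d)
cd $tmp
git init -q -b master upstream
git -C upstream -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m first
git clone -q upstream mirror
cd mirror
# Detach the checkout at the current commit, the state the CI scripts
# kept ending up in.
git checkout -q $(git rev-parse HEAD)
# On a detached HEAD there is no branch to update, so a plain pull fails;
# a sync wrapper running under set -e would treat this as a hard sync error.
if git pull -q 2>/dev/null; then echo sync-ok; else echo sync-failed; fi
```

<p>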
Initially, I\u00a0assumed this was some fluke, perhaps also related to\u00a0the\u00a0Git upgrade.  I\u00a0switched manually to\u00a0<kbd>master<\/kbd>, and\u00a0that fixed it.  Then it broke again.  And\u00a0again.<\/p>\n<p>So far, I had mostly been dealing with the\u00a0failures asynchronously \u2014 I wasn&#8217;t around at\u00a0the\u00a0time of\u00a0the\u00a0initial failure, and\u00a0only started working on\u00a0it after a\u00a0few failed runs.  Eventually, though, the\u00a0issue resurfaced so\u00a0fast that I was able to\u00a0connect the\u00a0dots.  The\u00a0problem likely happened immediately after gentoo-ci hit an\u00a0issue and\u00a0bisected it!  So I started suspecting that there was another issue in\u00a0the\u00a0scripts, perhaps another case of\u00a0a\u00a0missed <kbd>--<\/kbd>, but I couldn&#8217;t find anything relevant.<\/p>\n<p>Finally, I started looking at\u00a0the\u00a0post-bisect code.  What we were doing was calling <kbd>git rev-parse HEAD<\/kbd> prior to\u00a0bisect, and\u00a0then using that result in\u00a0<kbd>git checkout<\/kbd> to restore the\u00a0original state.  Since <kbd>git rev-parse HEAD<\/kbd> resolves to a\u00a0commit hash rather than a\u00a0branch name, this meant that after every bisect, we ended up with a\u00a0detached HEAD, i.e. precisely the\u00a0issue I was seeing.  So why didn&#8217;t I notice this before?<\/p>\n<p>Of\u00a0course, because of\u00a0the\u00a0sync error handling.  Once bisect broke the\u00a0repository, the\u00a0next sync failed, the\u00a0repository got cloned again, and\u00a0we never noticed anything was wrong.  We only started noticing once cloning started failing.  So after a\u00a0few days of\u00a0confusion and\u00a0false leads, I finally fixed a\u00a0bug that had been present in\u00a0production code for\u00a0over 7 years, and\u00a0had caused the\u00a0Gentoo repository to be\u00a0cloned over and\u00a0over again whenever a\u00a0bad commit happened.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;Gentoo CI&#8221; is the\u00a0service providing periodic linting for the\u00a0Gentoo repository. 
It is a part of\u00a0the\u00a0Repository mirror and\u00a0CI project that I&#8217;ve started in\u00a02015. Of course, it all started as\u00a0a\u00a0temporary third-party solution, but it persisted, was integrated into Gentoo Infrastructure and\u00a0grew organically into quite a\u00a0monstrosity. It&#8217;s imperfect in\u00a0many ways. In\u00a0particular, it has only some degree of\u00a0error recovery &hellip; <a href=\"https:\/\/blogs.gentoo.org\/mgorny\/2025\/10\/12\/how-we-incidentally-uncovered-a-7-year-old-bug-in-gentoo-ci\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;How we incidentally uncovered a 7-year old bug in gentoo-ci&#8221;<\/span><\/a><\/p>\n","protected":false},"author":137,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[3],"tags":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts\/2418"}],"collection":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/users\/137"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/comments?post=2418"}],"version-history":[{"count":19,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts\/2418\/revisions"}],"predecessor-version":[{"id":2437,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts\/2418\/revisions\/2437"}],"wp:attachment":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/media?parent=2418"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/cate
gories?post=2418"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/tags?post=2418"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}