{"id":496,"date":"2016-07-11T14:41:03","date_gmt":"2016-07-11T12:41:03","guid":{"rendered":"https:\/\/blogs.gentoo.org\/mgorny\/?p=496"},"modified":"2016-07-14T15:06:34","modified_gmt":"2016-07-14T13:06:34","slug":"common-filesystem-io-pitfalls","status":"publish","type":"post","link":"https:\/\/blogs.gentoo.org\/mgorny\/2016\/07\/11\/common-filesystem-io-pitfalls\/","title":{"rendered":"Common filesystem I\/O pitfalls"},"content":{"rendered":"<p>Filesystem I\/O is one of\u00a0the\u00a0key elements of the\u00a0standard library in\u00a0many programming languages. Most of\u00a0them derive it from the\u00a0interfaces provided by the\u00a0standard C library, potentially wrapped in\u00a0some portability and\/or OO sugar. Most of\u00a0them share an\u00a0impressive set of pitfalls for careless programmers.<\/p>\n<p>In\u00a0this article, I would like to briefly go over a\u00a0few\u00a0more or\u00a0less common pitfalls that come to my mind.<\/p>\n<p><!--more--><\/p>\n<h2>Overwriting the\u00a0file in-place<\/h2>\n<p>I will remember this one as the\u00a0\u2018setuptools screwup\u2019 for quite some time. Consider the\u00a0following snippet:<\/p>\n<pre><code>if not self.dry_run:\r\n    ensure_directory(target)\r\n    f = open(target,\"w\"+mode)\r\n    f.write(contents)\r\n    f.close()<\/code><\/pre>\n<p>This is the\u00a0code that setuptools used to install scripts. At first glance, it looks good \u2014 and\u00a0seems to work well, too. However, think of what happens if the\u00a0file at\u00a0target exists already.<\/p>\n<p>The\u00a0obvious answer would be: it is overwritten. The\u00a0more commonly noticed pitfall here is that the\u00a0old contents are discarded before the\u00a0new ones are written. If the\u00a0user happens to run the\u00a0script before it is completely written, they&#8217;ll get unexpected results. 
If writes fail for some reason, the\u00a0user will be left with a\u00a0partially written new script.<\/p>\n<p>While in the\u00a0case of installations this is not very significant (after all, a\u00a0failure in\u00a0the\u00a0middle of\u00a0installation is never a\u00a0good thing, mid-file or not), this becomes very important when dealing with data. Imagine a\u00a0program that updates your data this way: a\u00a0failure to add new data (or an\u00a0unexpected program termination, power loss\u2026) would instantly erase all previous data.<\/p>\n<p>However, there is another problem with this concept. In\u00a0fact, it does not strictly overwrite the\u00a0file \u2014 it opens it in-place and\u00a0implicitly truncates it. This causes more important issues in a\u00a0few cases:<\/p>\n<ul>\n<li>if the\u00a0file is hardlinked to other files or is a\u00a0symbolic link, then the\u00a0contents of all the\u00a0linked files are overwritten,<\/li>\n<li>if the\u00a0file is a\u00a0named pipe, the\u00a0program will hang waiting for the\u00a0other end of the\u00a0pipe to be open for reading,<\/li>\n<li>other special files may cause other unexpected behavior.<\/li>\n<\/ul>\n<p>This is exactly what happened in\u00a0Gentoo. Package-installed script wrappers were symlinked to python-exec, and\u00a0setuptools used by\u00a0pip attempted to install new scripts on top of\u00a0those wrappers. But instead of overwriting the\u00a0wrappers, it overwrote python-exec and\u00a0broke everything relying on\u00a0it.<\/p>\n<p>The\u00a0lesson is simple: don&#8217;t overwrite files like this. The\u00a0easy way around it is to unlink the\u00a0file first \u2014 ensuring that any links are broken, and\u00a0special files are removed. The\u00a0more correct way is to use a\u00a0temporary file (created safely), and\u00a0use the\u00a0atomic <kbd>rename()<\/kbd> call to replace the\u00a0target with it (no unlinking needed then). 
However, note that the\u00a0rename can fail (for example, when the\u00a0temporary file and\u00a0the\u00a0target are on different filesystems), so fallback code that unlinks the\u00a0target and\u00a0copies explicitly is necessary.<\/p>\n<h2>Path canonicalization<\/h2>\n<p>For some reason, many programmers have taken a\u00a0fancy to canonicalizing paths. While canonicalization itself is not that bad, it&#8217;s easy to do it wrong and\u00a0cause a\u00a0major headache. Let&#8217;s take a\u00a0look at the\u00a0following path:<\/p>\n<pre>\/\/foo\/..\/bar\/example.txt<\/pre>\n<p>You could say it&#8217;s ugly. It has a double slash, and\u00a0a\u00a0parent directory reference. It is tempting to canonicalize it to the\u00a0prettier:<\/p>\n<pre>\/bar\/example.txt<\/pre>\n<p>However, this path is not necessarily the\u00a0same as the\u00a0original.<\/p>\n<p>For a\u00a0start, let&#8217;s imagine that <kbd>foo<\/kbd> is actually a\u00a0symbolic link to <kbd>baz\/ooka<\/kbd>. In\u00a0this case, its parent directory referenced by <kbd>..<\/kbd> is actually <kbd>\/baz<\/kbd>, not <kbd>\/<\/kbd>, and\u00a0the\u00a0obvious canonicalization fails.<\/p>\n<p>Furthermore, double slashes can be meaningful. For example, on\u00a0Windows a\u00a0leading double slash (yes, yes, backslashes are used normally) would mean a\u00a0network resource. In\u00a0this case, stripping the\u00a0adjacent slash would change the\u00a0path to a\u00a0local one.<\/p>\n<p>So, if you are really into canonicalization, first make sure to understand all the\u00a0rules governing your filesystem. On\u00a0POSIX systems, you really need to take symbolic links into consideration \u2014 usually you start with the\u00a0left-most path component and\u00a0expand all symlinks recursively (keeping in mind that a\u00a0link target path may itself contain more symlinks). 
Once all symbolic links are expanded, you can safely start interpreting the\u00a0<kbd>..<\/kbd> components.<\/p>\n<p>However, if you are going to do that, think of another path:<\/p>\n<pre>\/usr\/lib\/foo<\/pre>\n<p>If you expand it on a\u00a0common old-style Gentoo multilib system, you&#8217;ll get:<\/p>\n<pre>\/usr\/lib64\/foo<\/pre>\n<p>However, now imagine that the\u00a0<kbd>\/usr\/lib<\/kbd> symlink is\u00a0replaced with a\u00a0directory, and\u00a0the\u00a0appropriate files are moved to it. At this point, the\u00a0path recorded by your program is no longer correct since it relies on a\u00a0canonicalization done using a\u00a0different directory structure.<\/p>\n<p>To summarize: think twice before canonicalizing. While it may seem beneficial to have pretty paths or\u00a0use real filesystem paths, you may end up discarding the\u00a0user&#8217;s preferences (if I set a\u00a0symlink somewhere, I don&#8217;t want a\u00a0program automagically switching to another path). If you really insist on it, consider all the\u00a0consequences and\u00a0make sure you do it correctly.<\/p>\n<h2>Relying on\u00a0xattr as an\u00a0implementation for ACL\/caps<\/h2>\n<p>Since common C libraries do not provide proper file copying functions, many people have attempted to implement their own with better or worse results. While copying the\u00a0data is a\u00a0minor problem, preserving the\u00a0metadata requires a\u00a0more complex solution.<\/p>\n<p>The\u00a0simpler programs focused on copying the\u00a0properties retrieved via <kbd>stat()<\/kbd> \u2014 modes, ownership and\u00a0times. The\u00a0more correct ones also added support for copying extended attributes (xattrs).<\/p>\n<p>Now, it is a\u00a0known fact that Linux filesystems implement many metadata extensions using extended attributes \u2014 ACLs, capabilities, security contexts. Sadly, this causes many people to assume that copying extended attributes is guaranteed to copy all of that extended metadata as well. 
This is a\u00a0bad assumption to make, even though it is correct on Linux. It will cause your program to work fine on\u00a0Linux but silently fail to copy ACLs on other systems.<\/p>\n<p>Therefore: always use explicit APIs, and\u00a0never rely on implementation details. If you want to work on\u00a0ACLs, use the\u00a0ACL API (provided by libacl on Linux). If you want to use capabilities, use the\u00a0capability API (libcap or libcap-ng).<\/p>\n<h2>Using incompatible APIs interchangeably<\/h2>\n<p>Now for something less common. There are at least three different file locking mechanisms on\u00a0Linux \u2014 the\u00a0somewhat portable, non-standardized <kbd>flock()<\/kbd> function, the\u00a0POSIX <kbd>lockf()<\/kbd> and\u00a0(also POSIX) <kbd>fcntl()<\/kbd> commands. The\u00a0Linux manpage says that <kbd>lockf()<\/kbd> is commonly implemented on top of\u00a0<kbd>fcntl()<\/kbd> locking. However, this is not guaranteed and\u00a0mixing the\u00a0mechanisms can give unpredictable results on\u00a0different systems.<\/p>\n<p>Dealing with the\u00a0two standard file APIs is even more curious. On the\u00a0one hand, we have high-level stdio interfaces including <kbd>FILE*<\/kbd> and\u00a0<kbd>DIR*<\/kbd>. On the\u00a0other, we have all fd-oriented interfaces from unistd. Now, POSIX officially supports converting between the\u00a0two \u2014 using <kbd>fileno()<\/kbd>, <kbd>dirfd()<\/kbd>, <kbd>fdopen()<\/kbd> and\u00a0<kbd>fdopendir()<\/kbd>.<\/p>\n<p>Note that the\u00a0result of such a\u00a0conversion reuses the\u00a0same underlying file descriptor (rather than duplicating it). This has two major consequences:<\/p>\n<ol>\n<li>There is no well-defined way to destroy a\u00a0<kbd>FILE*<\/kbd> or\u00a0<kbd>DIR*<\/kbd> without closing the\u00a0descriptor, nor any guarantee that <kbd>fclose()<\/kbd> or\u00a0<kbd>closedir()<\/kbd> will work correctly on a\u00a0closed descriptor. 
Therefore, you should not create more than one <kbd>FILE*<\/kbd> (or\u00a0<kbd>DIR*<\/kbd>) for an\u00a0fd, and\u00a0if you have one, always close it rather than the\u00a0fd itself.<\/li>\n<li>The\u00a0stdio streams are explicitly stateful, buffered and\u00a0have some extra magic on\u00a0top (like <kbd>ungetc()<\/kbd>). Once you start using stdio I\/O operations on a\u00a0file, you should not try to use low-level I\/O (e.g. <kbd>read()<\/kbd>) or the\u00a0other way around since the\u00a0results are pretty much undefined. Supposedly <kbd>fflush()<\/kbd> + <kbd>rewind()<\/kbd> could help, but there are no guarantees.<\/li>\n<\/ol>\n<p>So, if you want to do I\/O, decide whether you want stdio or\u00a0fd-based I\/O. Convert between the\u00a0two types only when you need to use additional routines not available for the\u00a0other one; but if those routines involve some kind of content-related operations, avoid using the\u00a0other type for I\/O. If you need to do separate I\/O, use <kbd>dup()<\/kbd> to get a\u00a0clone of the\u00a0file descriptor.<\/p>\n<p>To summarize: avoid combining different APIs. If you really insist on doing that, check if it is supported and\u00a0what the\u00a0requirements for doing so are. You have to be especially careful not to run into undefined results. And\u00a0as usual \u2014 remember that different systems may implement things differently.<\/p>\n<h2>Atomicity of operations<\/h2>\n<p>Finally, something commonly known, and\u00a0even more commonly repeated \u2014 race conditions due to non-atomic operations. In short, these are all the\u00a0unexpected results that stem from the\u00a0assumption that nothing can happen to the\u00a0file between successive calls to functions.<\/p>\n<p>I think the\u00a0most common mistake is the\u00a0\u2018does the\u00a0file exist?\u2019 problem. 
It is awfully common for programs to use some wrappers over <kbd>stat()<\/kbd> (like <kbd>os.path.exists()<\/kbd> in\u00a0Python) to check if a\u00a0file exists, and\u00a0then immediately proceed with opening or creating it. For example:<\/p>\n<pre><code>import os\r\n\r\ndef do_foo(path):\r\n    if not os.path.exists(path):\r\n        return False\r\n\r\n    f = open(path, 'r')<\/code><\/pre>\n<p>Usually, this will work. However, if the\u00a0file gets removed between the\u00a0precondition check and\u00a0the\u00a0<kbd>open()<\/kbd>, the\u00a0program will raise an\u00a0exception instead of returning False. In\u00a0practice, this can happen if the\u00a0file is part of a\u00a0large directory tree being removed via <kbd>rm -r<\/kbd>.<\/p>\n<p>This bug could easily be fixed by introducing explicit error handling, which also renders the\u00a0precondition unnecessary:<\/p>\n<pre><code>import errno\r\n\r\ndef do_foo(path):\r\n    try:\r\n        f = open(path, 'r')\r\n    except OSError as e:\r\n        if e.errno == errno.ENOENT:\r\n            return False\r\n        raise<\/code><\/pre>\n<p>The\u00a0new snippet ensures that the\u00a0file will be opened if it exists at the\u00a0point of\u00a0<kbd>open()<\/kbd>. If it does not, <kbd>errno<\/kbd> will indicate an\u00a0appropriate error. For other errors, we re-raise the\u00a0exception. If the\u00a0file is removed after <kbd>open()<\/kbd>, the\u00a0fd will still be valid.<\/p>\n<p>We could extend this to a\u00a0few generic rules:<\/p>\n<ol>\n<li>Always check for errors, even if you asserted that they should not happen. Proper error checks make many (unsafe) precondition checks unnecessary.<\/li>\n<li>Open file descriptors will remain valid even when the\u00a0underlying files are removed; paths can become invalid (i.e. referencing non-existing files or directories) or start pointing to another file (created using the\u00a0same path). 
So, prefer opening the\u00a0file as\u00a0early as\u00a0possible and\u00a0using <kbd>fstat()<\/kbd>, <kbd>fchown()<\/kbd>, <kbd>futimes()<\/kbd>\u2026 over <kbd>stat()<\/kbd>, <kbd>chown()<\/kbd>, <kbd>utimes()<\/kbd>\u2026<\/li>\n<li>Open directory descriptors will continue to reference the\u00a0same directory even when the\u00a0underlying path is removed or\u00a0replaced; paths may start referencing another directory. When performing operations on multiple files in a\u00a0directory, prefer opening the\u00a0directory and\u00a0using <kbd>openat()<\/kbd>, <kbd>unlinkat()<\/kbd>\u2026 However, note that the\u00a0directory can still be removed and\u00a0therefore further calls may return <kbd>ENOENT<\/kbd>.<\/li>\n<li>If you need to atomically overwrite a\u00a0file with another one, use <kbd>rename()<\/kbd>. To atomically create a\u00a0new file, use <kbd>open()<\/kbd> with <kbd>O_CREAT<\/kbd> and\u00a0<kbd>O_EXCL<\/kbd>. Usually, you will want to use the\u00a0latter to create a\u00a0temporary file, then the\u00a0former to replace the\u00a0actual file with it.<\/li>\n<li>If you need to use temporary files, use <kbd>mkstemp()<\/kbd> or\u00a0<kbd>mkdtemp()<\/kbd> to create them securely. The\u00a0former can be used when you only need an\u00a0open fd (you can unlink the\u00a0file right after creating it), the\u00a0latter if you need visible files. If you want to use <kbd>tmpnam()<\/kbd>, put it in a\u00a0loop and\u00a0try opening with <kbd>O_EXCL<\/kbd> to ensure you do not accidentally overwrite something.<\/li>\n<li>When you can&#8217;t guarantee atomicity, use locks to prevent simultaneous operations. For file operations, you can lock the\u00a0file in\u00a0question. For directory operations, you can create and\u00a0lock lock files (however, do not rely on the\u00a0existence of lock files alone). Note though that the\u00a0POSIX locks are non-mandatory \u2014 i.e. 
only prevent other programs from acquiring the\u00a0lock explicitly but do not block them from performing I\/O ignoring the\u00a0locks.<\/li>\n<li>Think about the\u00a0order of operations. If you create a\u00a0world-readable file, and\u00a0afterwards <kbd>chmod()<\/kbd> it, it is possible for another program to open it before the\u00a0<kbd>chmod()<\/kbd> and\u00a0retain the\u00a0open handle while secure data is being written. Instead, restrict the\u00a0access via mode parameter of <kbd>open()<\/kbd> (or <kbd>umask()<\/kbd>).<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Filesystem I\/O is one of\u00a0the\u00a0key elements of the\u00a0standard library in\u00a0many programming languages. Most of\u00a0them derive it from the\u00a0interfaces provided by the\u00a0standard C library, potentially wrapped in\u00a0some portability and\/or OO sugar. Most of\u00a0them share an\u00a0impressive set of pitfalls for careless programmers. In\u00a0this article, I would like to shortly go over a\u00a0few\u00a0more or\u00a0less common pitfalls that come &hellip; <a href=\"https:\/\/blogs.gentoo.org\/mgorny\/2016\/07\/11\/common-filesystem-io-pitfalls\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Common filesystem I\/O 
pitfalls&#8221;<\/span><\/a><\/p>\n","protected":false},"author":137,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[8],"tags":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts\/496"}],"collection":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/users\/137"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/comments?post=496"}],"version-history":[{"count":46,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts\/496\/revisions"}],"predecessor-version":[{"id":542,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/posts\/496\/revisions\/542"}],"wp:attachment":[{"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/media?parent=496"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/categories?post=496"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/mgorny\/wp-json\/wp\/v2\/tags?post=496"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}