Monthly Archives: April 2012

Five commandments for XML format designers

If you’re designing an XML-based data format, then I beg you, please read the following few rules and obey them. XML may look easy, and it even is easy, but that doesn’t mean that designing a good XML format is. And if you’re going to invent a second HTML, then please, just use JSON or any other random container. That will be easier for you, and easier for us.

1. Thou shalt always write a schema

Every XML format should be well described. And no, your ten-stanza poem is not enough. Neither is a complete, dedicated wiki. These usually describe nicely (or less nicely) how to write your XML. That could be great if that’s all you’re interested in. But if that’s supposed to be a public format, there is one more important thing…

It’s called reading. Or parsing. Or just transforming. If you need to handle random XML files, coming from various sources, written by random people, you have to know what you can expect and what you can assume. It’s not enough to say what <x/> does; I need to know where it can appear and what I can find inside.

There are already well-established XML description formats such as DTD, RELAX NG or XML Schema. Please use one of them, I will be grateful. Not only do they describe the format strictly and accurately, but they also provide a very simple means to validate XML files. That helps both us, who parse it, and the people who actually write such XML.

An XML format without a schema is one where every element can appear anywhere in the document. In other words, it’s not even XML but an ugly tag soup.
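To make the point concrete, here is a minimal sketch in Python of what a parser author gets from a schema: a machine-readable statement of which children each element may contain. The `ALLOWED_CHILDREN` table and the element names are made up for illustration. A real format would express this in DTD, RELAX NG or XML Schema and rely on an existing validator rather than on hand-written code like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: parent element name -> allowed child names.
# This is a toy stand-in for what a real schema language would declare.
ALLOWED_CHILDREN = {
    "book": {"chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": set(),
}


def find_violations(xml_text):
    """Return (parent, child) pairs that the content model does not allow."""
    root = ET.fromstring(xml_text)
    violations = []
    for parent in root.iter():  # document order, root included
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        for child in parent:
            # An unknown parent has no declared content, so any child is a violation.
            if allowed is None or child.tag not in allowed:
                violations.append((parent.tag, child.tag))
    return violations


good = "<book><chapter><title>One</title><para>Text</para></chapter></book>"
bad = "<book><para>Stray text</para><chapter><chapter/></chapter></book>"

print(find_violations(good))  # []
print(find_violations(bad))   # [('book', 'para'), ('chapter', 'chapter')]
```

Without such a table, both documents are equally "valid", and every reader has to guess what to do with the second one.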

2. Thy XML shalt be structured, not flat

XML provides the means to create neat, hierarchical structures. Use them. If your documents consist of logical parts like sections or chapters, put their complete content in a single <section/> or <chapter/>, or any other element that may come into your head. That’s the correct way of doing that in XML.

Random headings and separators are not enough. Even if your spec says they always and definitely start a new section, that’s not enough. If you don’t believe us, try splitting that thing into parts yourself. Especially when you have sub-headings, sub-sub-headings and so on.

A flat-structured XML is no real XML. It’s just a text file with a few unnecessary elements.
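A small Python sketch of the difference, using made-up element names (`h1`, `p`, `section`) that stand in for whatever a real format would define. With nested sections, the reader just walks the tree. With a flat document, the reader has to rebuild the grouping by remembering the last heading seen, and that is with a single heading level only.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat format: headings only mark where a section starts.
FLAT = """<doc>
  <h1>Intro</h1><p>a</p><p>b</p>
  <h1>Usage</h1><p>c</p>
</doc>"""

# Hypothetical structured format: each section contains its own content.
NESTED = """<doc>
  <section title="Intro"><p>a</p><p>b</p></section>
  <section title="Usage"><p>c</p></section>
</doc>"""


def sections_from_nested(xml_text):
    """Each <section> already holds its content: no state to track."""
    root = ET.fromstring(xml_text)
    return {
        section.get("title"): [p.text for p in section.findall("p")]
        for section in root.findall("section")
    }


def sections_from_flat(xml_text):
    """Rebuild the grouping by tracking the most recent heading."""
    root = ET.fromstring(xml_text)
    sections = {}
    current = None
    for elem in root:
        if elem.tag == "h1":
            current = elem.text
            sections[current] = []
        elif current is not None:
            sections[current].append(elem.text)
        # Content before the first heading belongs to no section and is dropped.
    return sections


print(sections_from_nested(NESTED))  # {'Intro': ['a', 'b'], 'Usage': ['c']}
print(sections_from_flat(FLAT))      # {'Intro': ['a', 'b'], 'Usage': ['c']}
```

Adding sub-headings to the flat variant means adding a stack of open headings to the reader, while the nested variant only needs a recursive call.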

3. Thou shalt split text into blocks using XML, not text delimiters

Even if you think that’ll make writing much easier, do not ever try to use simple character delimiters to split text into blocks. If you need a list, create a list of XML elements. Like the following:

<l>elem1</l>
<l>elem2</l>
<l>elem3</l>

And yes, I know elem1,elem2,elem3 is shorter and easier to type. But guess what — it’s hell to parse. It isn’t even XML — you either have to handle it externally or create a complex recursive template which will split it and handle each token separately. That’s very bad.

An XML which uses random delimiters to create lists is no XML. It’s called CSV.
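The same contrast as a short Python sketch. The `<list>` wrapper element is an assumption added so that each sample is a well-formed document. In the element-based form the XML parser does all the work; in the delimited form it hands back one opaque string, and the splitting rules live in extra, format-specific code.

```python
import xml.etree.ElementTree as ET

AS_ELEMENTS = "<list><l>elem1</l><l>elem2</l><l>elem3</l></list>"
AS_DELIMITED = "<list>elem1,elem2,elem3</list>"


def items_from_elements(xml_text):
    """The list structure is part of the XML, so the parser exposes it directly."""
    return [item.text for item in ET.fromstring(xml_text).findall("l")]


def items_from_delimited(xml_text):
    """The parser sees a single text node; splitting is our own problem."""
    return ET.fromstring(xml_text).text.split(",")


print(items_from_elements(AS_ELEMENTS))    # ['elem1', 'elem2', 'elem3']
print(items_from_delimited(AS_DELIMITED))  # ['elem1', 'elem2', 'elem3']

# As soon as an item contains the delimiter, the delimited form breaks.
print(items_from_elements("<list><l>Smith, John</l><l>Doe, Jane</l></list>"))
# ['Smith, John', 'Doe, Jane']
print(items_from_delimited("<list>Smith, John,Doe, Jane</list>"))
# ['Smith', ' John', 'Doe', ' Jane']
```

Fixing the last case means inventing an escaping scheme, which is exactly the job XML was supposed to do for you.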

4. Thou shalt not allow insane structures

Even if you think no one will create an insane structure in your document, that’s not enough. Saying it’s disallowed on your awesome wiki is not enough either. Forbid it if it’s supposed to be forbidden.

Otherwise, someone will eventually use it. They will deliberately ignore your warning because it works. And even if no one does, we will have to support it anyway in a compliant parser.

If you expect your data to be interchangeable with widely used formats, take a look at them. Don’t allow insane things which none of these formats do — or we’ll have to either refuse to convert some files, convert them incorrectly or waste our time writing complex blocks converting them to sane ones.

Simply, don’t do it. Even HTML doesn’t do that… well, that much.

5. Thou shalt write readable XML, not bytecode

The major point of using XML is that the data is readable to both machines and humans. Leave it that way. You have the whole human language at your disposal, so don’t write zeros, ones and other random numbers which are explained on your great wiki.

Say, an attribute called type should actually name some type. Say, article can be some type. 1 usually ain’t. And if that type only describes width of indent, then name it so! Calling it a type is as useful as calling it a thing. Or some-other-thing and a-third-thing.

XML without human-readable text is no XML. Hell, even byte-compiled XML should have readable element names! That’s the whole point of it. Otherwise, you just end up developing another custom, useless format.

The suggested dependencies problem

Optional runtime dependencies (or «suggested dependencies») are one of the recent problems we’re facing in Gentoo. There’s definitely a need for a standard solution, and it’d be great to put it in the next EAPI. Sadly, there’s no consensus on how to solve it.

The optional dependencies problem

Gentoo has had a very neat solution for handling optional dependencies and optional features, probably ever since the beginning. It is called «USE flags», and it works very well with the «traditional» optional dependencies. By that, I mean optional dependencies which are both build- and run-time.

Such dependencies have to be pulled in before the build process starts, and usually require passing specific options in the configure phase. Importantly, both enabling and disabling such features requires rebuilding the program in question because code branches are being switched. Thus, it’s perfectly fine if changing USE flags implies rebuilding the package.

Sadly, when it comes to optional runtime dependencies, USE flags are not a perfect solution. «Switching» such a dependency doesn’t require rebuilding the program anymore. It’s usually not even switching: the program can determine at runtime whether a particular dependency is available, and either enable or disable the respective features. Simple as that.
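As an illustration of that runtime check, here is a minimal Python sketch of a program probing for an optional module and switching a feature accordingly. The helper names and the feature table are hypothetical; `OpenSSL` is the module name that pyopenssl installs. Nothing here is rebuilt when the dependency appears or disappears, and the check simply gives a different answer on the next run.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def available_features(optional_modules):
    """Map each feature name to whether its backing module is installed."""
    return {
        feature: has_module(module)
        for feature, module in optional_modules.items()
    }


# Hypothetical feature table: feature name -> module providing it.
OPTIONAL_MODULES = {
    "ssl": "OpenSSL",  # provided by pyopenssl
}

for feature, enabled in available_features(OPTIONAL_MODULES).items():
    state = "enabled" if enabled else "disabled (optional dependency missing)"
    print(f"{feature}: {state}")
```

From the package manager's point of view, the only thing that changes between the two states is whether the dependency is installed.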

If one decides to use USE flags for that, they become partially meaningless. Unless the flags start stripping out code (which is a bad idea), feature availability is dependency- rather than flag-based. So, USE=-ssl is irrelevant if, say, pyopenssl is installed. What’s even worse, changing such a flag implies needlessly rebuilding the package just to pull in an additional dependency.

The simple hack — pkg_postinst() messages

The simplest solution right now is just listing the suggested dependencies in pkg_postinst() messages. Combined with the has_version helper, those messages can give a pretty nice output, pointing out already installed packages; just take a look at the sys-apps/systemd ebuild.

Of course, it’s not a real solution, as it relies on the user doing the hard work. The biggest disadvantage is that the dependencies often end up in @world. And then, if the user decides to unmerge our package, portage is unable to find and unmerge them as well.

The SDEPEND solution

A pretty common idea is to establish a new variable called SDEPEND (for «suggested»). Such a variable would simply list relevant dependencies, and let portage handle the UI part somehow. It is a minimalistic solution, quite consistent with other parts of PMS. Sadly, it has a few big shortcomings.

First, using our current dependency syntax, you can’t specify that a particular feature requires more than one package; in other words, that two or more suggested dependencies are supposed to be pulled in together. Of course, solving this one would be pretty easy, e.g. by allowing them to be grouped with parentheses.

A much more important issue is describing what particular dependencies do. Although sometimes this could be guessed by package descriptions pretty well, usually a more friendly text would be great. So, we end up having to implement that somehow.

And that’s usually when Ciaran comes in with ugly exherbism DEPENDENCIES. Sure, it solves most of the issues pointed out here but, hell, do we really want such a thing? Isn’t dependency syntax obscure enough already?

And it’s all rather dependency-oriented. In other words, the package comes first, and the feature description follows. «Pass dev-python/pyopenssl or dev-python/python-gnutls to enable secure connections support». I don’t think that’s the most user-friendly solution.

The USE flag solution

Another solution is bringing in a new category of USE flags. It’s not important whether they would be specified using a special variable, a common USE_EXPAND or some other magical feature. In fact, that could be a thing totally separate from USE flags. The point is that some of the package flags would be runtime-switchable.

Unlike traditional USE flags, such flags wouldn’t be stored in vdb. They would be evaluated in place instead, using package.use or similar files, and the dependency tree would use current state of such flags. Of course, they would be allowed for RDEPEND (PDEPEND) use only.

Why reuse USE flags for that? Because it’s the most user-friendly solution. The user doesn’t have to learn anything new. They enable a flag, run emerge -vDtN @world and notice that new dependencies are pulled in but the package doesn’t have to be rebuilt for that.
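The intended behaviour can be modelled in a few lines of Python. This is only a toy model of the idea, not how Portage works; the function, its parameters and the example flags are all made up for illustration. A rebuild is needed only when a build-time flag differs from what was recorded at install time, while runtime-switchable flags are read from the current configuration and merely add dependencies.

```python
def plan_update(build_flags, runtime_flags, installed_use, wanted_use, runtime_deps):
    """Decide whether a rebuild is needed and which runtime deps to pull in.

    build_flags   -- flags that change the built code (stored at install time)
    runtime_flags -- hypothetical runtime-switchable flags (never stored)
    installed_use -- flags that were enabled when the package was built
    wanted_use    -- flags enabled in the current configuration
    runtime_deps  -- runtime-switchable flag -> list of packages it pulls in
    """
    # Only build-time flags are compared against the recorded state.
    needs_rebuild = (installed_use & build_flags) != (wanted_use & build_flags)
    # Runtime flags are evaluated in place, from the current configuration.
    deps = sorted(
        dep
        for flag in wanted_use & runtime_flags
        for dep in runtime_deps.get(flag, [])
    )
    return needs_rebuild, deps


BUILD_FLAGS = {"gtk"}
RUNTIME_FLAGS = {"ssl"}
RUNTIME_DEPS = {"ssl": ["dev-python/pyopenssl"]}

# Enabling the runtime flag pulls in the dependency without a rebuild.
print(plan_update(BUILD_FLAGS, RUNTIME_FLAGS, {"gtk"}, {"gtk", "ssl"}, RUNTIME_DEPS))
# (False, ['dev-python/pyopenssl'])

# Dropping a build-time flag still forces a rebuild, as it does today.
print(plan_update(BUILD_FLAGS, RUNTIME_FLAGS, {"gtk"}, {"ssl"}, RUNTIME_DEPS))
# (True, ['dev-python/pyopenssl'])
```

A package manager that does not know about runtime-switchable flags would effectively treat every flag as a member of `build_flags`, which gives the rebuild-on-change behaviour of regular USE flags.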

If we just add some additional magic for regular USE flags, enabling runtime-dependent SSL support could be exactly the same as enabling the build-time one, even using the same USE=ssl.

And we could basically even provide «backwards» compatibility with older EAPIs. Package managers not supporting the new feature would simply treat runtime-switchable USE flags as regular USE flags, requiring rebuilds of the package.