Five commandments for XML format designers

If you’re designing an XML-based data format, then I beg you, please read the few following rules and obey them. XML may look easy, and even is easy but that doesn’t mean that writing a good one is. And if you’re going to invent second HTML, then please, just use JSON or any other random container. That will be easier for you, and easier for us.

1. Thou shalt always write a schema

Every XML format should be well described. And no, your ten-stanza poem is not enough. Complete, dedicated Wiki neither. These usually describe nicely (or less nicely) how to write your XML. That could be great if that’s all you’re interested in. But if that’s supposed to be some public format, there is one more important thing…

It’s called reading. Or parsing. Or just transforming. If you need to handle random XML files, coming from various sources, written by random people, you have to know what you can expect and what can you assume. It’s not enough to say what <x/> does — I need to know where it can appear and what I can find inside.

There are already well-deployed XML description formats such as DTD, Relax-NG or XML Schema. Please use one of them, I will be grateful. Not only they describe the format strictly and accurately but they also provide a very simple means to validate XML files. It’s helpful both to us, who parse it, and to people who actually write such XML.

An XML without spec is an XML where every element can appear anywhere in the document. In other words, it’s not even XML but an ugly tag soup.

2. Thy XML shalt be structured, not flat

XML provides means to create neat, hierarchical structures. Use them. If your documents consists of logical parts like sections or chapters, put their complete content in a single <section/> or <chapter>, or any other thing that may come into your head. That’s the correct way of doing that in XML.

Random headings and separators are not enough. Even if your spec says they always and definitely start a new section, that’s not enough. If you don’t believe us, try splitting that thing into parts yourself. Especially when you have sub-headings, sub-sub-headings and so on.

A flat-structured XML is no real XML. It’s just a text file with a few unnecessary elements.

3. Thou shalt split text into blocks using XML, not text delimeters

Even if you think that’ll make writing much easier, do not ever try to use simple character delimiters to split text into blocks. If you need a list, create a list of XML elements. Like the following:

<l>elem1</l>
<l>elem2</l>
<l>elem3</l>

And yes, I know elem1,elem2,elem3 is shorter and easier to type. But guess what — it’s hell to parse. It isn’t even XML — you either have to handle it externally or create a complex recursive template which will split it and handle each token separately. That’s very bad.

An XML which uses random delimeters to create lists is no XML. It’s called CSV.

4. Thou shalt not allow insane structures

Even if you think noone will create an insane structure in your document, it’s not enough. Saying it’s disallowed on your awesome Wiki is not enough either. Forbid it if it’s supposed to be forbidden.

Otherwise, someone finally will use it. He or she will deliberately ignore your warning because it works. And even if they don’t, we will have to support it anyway in a compliant parser.

If you expect your data to be interchangeable with widely used formats, take a look at them. Don’t allow insane things which none of these formats do — or we’ll have to either refuse to convert some files, convert them incorrectly or waste our time writing complex blocks converting them to sane ones.

Simply, don’t do it. Even HTML doesn’t do that… well, that much.

5. Thou shalt write readable XML, not bytecode

The major point of using XML is that the data is both readable to machines and humans. Leave it that way. You have the whole human language at your disposal, so don’t write zeros, ones and other random numbers which are explained on your great Wiki.

Say, an attribute called type should actually name some type. Say, article can be some type. 1 usually ain’t. And if that type only describes width of indent, then name it so! Calling it a type is as useful as calling it a thing. Or some-other-thing and a-third-thing.

XML without human-readable text is no XML. Hell, even byte-compiled XML should have readable element names! That’s the whole point with it. Otherwise, you just end up developing another custom, useless format.

Leave a Reply

Your email address will not be published.