XML deserves a second chance
Proposing XML 2.0
XML, or Extensible Markup Language, is quite an old language. The first edition was specified in 1998, and XML 1.1 came out six years later. The first decade after its introduction was the heyday of the language. Everything was done with XML at the time, which eventually turned out to be too much.
Now XML is usually considered outdated and other formats like JSON are favored. What went wrong? And why do I think it deserves a second chance?
What was XML about again?
XML was quite a relief as a replacement for data formats like CSV and plain ASCII text files. First of all, XML had a specification, a tree structure and tags with custom field names.
Around 2005 I bought a 1200-page book on XML. That’s when I first learned that XML itself only took two pages to explain. The rest of the book was about technologies that make use of XML. It’s like explaining what pasta is in two pages and then filling the remaining pages with a thousand pasta recipes.
So what are the basic rules of XML again?
- There is a single root element
- Every start-tag should be closed by an end-tag
- Elements are properly nested (tags may not overlap)
- The document should only contain Unicode characters.
- Tag names are case-sensitive
- Tag names cannot begin with “-”, “.”, or a numeric digit
- Tag names cannot contain certain characters, such as spaces, “<”, “>” or “&”
When we follow these rules the XML is well-formed, like the following example:
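A minimal sketch of what well-formedness means in practice, using Python's standard xml.etree.ElementTree parser. The <person> document and the helper name is_well_formed are made up for illustration; they are assumptions, not taken from the original text.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document as well-formed XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Hypothetical documents, for illustration only.
good = "<person><name>Ada</name><age>36</age></person>"
bad = "<person><name>Ada</person>"  # <name> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second document breaks the rule that every start-tag needs a matching end-tag, so the parser rejects it. A document with two root elements is rejected for the same kind of reason.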
What went wrong?
The given example is simple. Still, I think XML and most of its technologies are over-engineered.
Think, for example, of the declaration:
<?xml version='1.0' encoding='character encoding' standalone='yes|no'?>
The declaration is optional but often used, so programs need to take it into account. Besides, there has been no new version of the specification since 2004 (and there probably never will be).
Another example is that XML allows both elements and attributes (<element attribute="10">). So we could write our example more compactly:
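To sketch what this choice means for a program, assume a made-up <person> record (an assumption for illustration, not the original example) written once with child elements and once with attributes. A reader that wants to accept both has to look in two places. The helper name read_fields is hypothetical.

```python
import xml.etree.ElementTree as ET

# The same hypothetical record in both styles.
as_elements = "<person><name>Ada</name><age>36</age></person>"
as_attributes = '<person name="Ada" age="36"/>'


def read_fields(document: str) -> dict:
    """Collect fields from attributes and from child elements of the root."""
    root = ET.fromstring(document)
    fields = dict(root.attrib)        # attribute style
    for child in root:                # element style
        fields[child.tag] = child.text
    return fields


print(read_fields(as_elements))
print(read_fields(as_attributes))
```

Both calls yield the same name and age fields, but only because the reader handles two representations of the same data.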
Some prefer attributes, others prefer elements, and some even have rules for when to use which. In my opinion, elements alone would be clearer and easier for programming languages to handle.
This is also where it really went wrong. The intention was for XML to be readable by both machines and humans. Personally, I find it quite readable. The machine side is arguable, though: programming languages mostly have a hard time with it. And for a data format, that’s a problem…
When you use a programming language like XSLT, which is designed for XML, everything seems fine, but when you move to other programming languages like Java, simple things like getting a value are suddenly cumbersome. The default libraries in particular are hard to work with. Java also has better libraries, like JDOM2, that ‘get’ XML, but these are the exception rather than the rule.
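The contrast can be sketched in Python, under the assumption that the default libraries meant here are W3C DOM-style APIs (which is what Java's built-in parser returns). Python's standard library offers the same DOM interface in xml.dom.minidom, next to the friendlier ElementTree. The <person> document is made up for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Hypothetical document, for illustration only.
document = "<person><name>Ada</name></person>"

# DOM style: find the element, step down to its text node, read the node value.
dom = minidom.parseString(document)
name_dom = dom.getElementsByTagName("name")[0].firstChild.nodeValue

# ElementTree style: ask for the text of the child directly.
name_et = ET.fromstring(document).findtext("name")

print(name_dom, name_et)
```

Both variants read the same value. The DOM version has to know that the text lives in a separate child node, which is the kind of ceremony that makes a simple lookup feel cumbersome.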
Also, not every technology that is based on or works with XML is a good fit. Some, like XPath, XQuery, XSD and the already mentioned XSLT, suit XML excellently, but others, like build scripts with Ant or SOAP web services, were complicated and hard to read. They were really not a match made in heaven. Gradle for build scripts and REST for web services are a much better match.
JSON, with its arrays, maps and fields, tends to be much simpler for many languages to read. It is also less verbose and works well as a data format. However, when you work more intensively with JSON, you start missing the power of XML.
Why do I think it deserves a second chance?
XML suffers from a phenomenon that is common in IT: by the time a technology becomes mature and productive, most people have become disillusioned with it. Then some other, trendier technology takes its place.
That’s exactly why XML deserves another chance, but in a simpler form. It should not fall into the same trap as so much other software: covering more and more use cases and trying to be more and more powerful. XML 2.0 should be a simpler variant, a strict subset, so that every 2.0 document is also a valid 1.1 document (but not the other way around).
In XML 2.0:
- No XML declaration
- Only elements and no attributes
- No comments
- No empty-element tags, such as <element/>
So we can already do this without breaking the current standard. This way we are not reinventing the wheel, but giving it another powerful spin, one that is easy for machines to understand.
And most importantly, we can make use of the mature ecosystem and tools that are already available.
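As a rough sketch of how existing tools could enforce the proposed subset, the following uses Python's standard xml.parsers.expat module to flag three of the four restrictions: the declaration, attributes and comments. The function name check_subset and the sample documents are hypothetical. The empty-element rule is left out on purpose, because these handlers receive the same start and end events for <a/> and <a></a>.

```python
import xml.parsers.expat


def check_subset(document: str) -> list:
    """Return a list of violations of the proposed 'XML 2.0' subset."""
    problems = []
    parser = xml.parsers.expat.ParserCreate()

    def on_declaration(version, encoding, standalone):
        problems.append("XML declaration present")

    def on_start(name, attrs):
        if attrs:
            problems.append("attributes on <%s>" % name)

    def on_comment(data):
        problems.append("comment present")

    parser.XmlDeclHandler = on_declaration
    parser.StartElementHandler = on_start
    parser.CommentHandler = on_comment
    # Raises xml.parsers.expat.ExpatError if the input is not well-formed.
    parser.Parse(document, True)
    return problems


# Hypothetical documents, for illustration only.
print(check_subset("<person><name>Ada</name></person>"))
print(check_subset("<?xml version='1.0'?><person id='1'><!-- note --></person>"))
```

Because the subset only removes features, a stock XML parser still accepts every conforming document, and the check is a thin layer on top of it.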