Sunday, August 26, 2007

OOXML: defective, but don't exaggerate

In a recent article titled Microsoft Office XML Formats? Defective by design, self-proclaimed file format expert Stéphane Rodriguez explains 13 reasons why Microsoft's Office Open XML (OOXML) format should not become an ISO standard. Although I completely agree with his conclusion, Rodriguez seemingly got carried away by his rant, uttering nonsense in some places. Still, most of his arguments make sense, so below I'll cover only those that don't.

1) Self-exploding spreadsheets. Here, he modifies an Excel file manually, and is surprised that even this “simple” change breaks the file. He does not once refer to the specification to see whether the thing he changes may depend on things elsewhere in the file. So the file he created probably does not conform to the spec at all. Is it strange, then, that Excel goes boom? An Office document is a lot more complex than it looks at first sight, and the storage format is bound to reflect this.

2) Entered versus stored values. Clearly, the values get processed internally as (binary) floating-point numbers, which explains why numbers like 1234.1233999999999 crop up when they are converted back to the decimal text needed for XML. As long as an implementation simply uses the IEEE floating-point format, like pretty much any CPU does, that text should read back as the very same binary value, so there is no problem. Besides, the fact that Excel writes out (very slightly) inaccurate numbers does not mean that the OOXML standard is flawed, only Excel's implementation thereof.
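To see the effect outside Excel, here is a minimal Python sketch. It assumes the user typed 1234.1234 (a guess based on the stored value quoted above) and uses 17 significant digits as the write-out precision, which is enough to round-trip any IEEE double. It does not claim to reproduce Excel's actual serialisation code.

```python
from decimal import Decimal

entered = "1234.1234"        # what the user typed (assumed example value)
value = float(entered)       # nearest IEEE 754 double to that decimal

# The double is not exactly the decimal that was typed:
# 1234.1234 has no finite binary expansion.
exact = Decimal(value)
print(exact == Decimal(entered))   # False

# Writing 17 significant digits can expose the binary approximation...
stored = format(value, ".17g")
print(stored)

# ...but reading that text back yields the very same double.
print(float(stored) == value)      # True

# A writer may instead emit the shortest text that still round-trips.
print(repr(value))                 # 1234.1234
```

So the long digit string is an artefact of how the writer chooses to print the number, not a loss of information in the stored value.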

6) International, but US English first and foremost. Another of Rodriguez's complaints is that numbers get stored in US English locale format (1,234.56), and not in the localised format (e.g. 1 234,56). Also, formulas always use English function names, like SUM. Rodriguez claims that this canonicalisation makes processing more complex. Wait, what? You want us to store things differently depending on how some people would like to view them on the screen? You think a file will be easier to process when we introduce a diversity of locales?
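A small Python sketch of why one canonical storage format is the easier one to process. The helper `parse_localised` and the sample strings are made up for illustration, and the canonical form is assumed to use a decimal point without digit grouping; none of this is taken from the OOXML spec.

```python
def parse_canonical(text: str) -> float:
    """Parse a number stored in one fixed format: '.' as decimal point."""
    return float(text)


def parse_localised(text: str, decimal_sep: str, group_sep: str) -> float:
    """Hypothetical helper: parse a number written with locale separators.

    The caller has to know which locale wrote the file.
    """
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


# Canonical storage: one rule, no extra information needed.
print(parse_canonical("1234.56"))                                   # 1234.56

# Localised storage: every reader needs the writer's locale.
print(parse_localised("1 234,56", decimal_sep=",", group_sep=" "))  # 1234.56
print(parse_localised("1.234,56", decimal_sep=",", group_sep="."))  # 1234.56

# The same text means different numbers under different locales.
print(parse_localised("1,234", decimal_sep=".", group_sep=","))     # 1234.0
print(parse_localised("1,234", decimal_sep=",", group_sep="."))     # 1.234
```

The last two lines show the real cost: without a canonical form, the text alone does not determine the value.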

12) Document backwards compatibility subject to neutrino radioactivity. So, Excel 2007 cannot properly import graphs from earlier versions. What else is new? Anyway, I do not see how that can be regarded as a flaw in OOXML; the quality of the Microsoft Office suite is something quite different from the standardisation of the OOXML format.

Unrelated to Rodriguez's rant, there's one gem I would not want to keep from you. The folks at Google discovered [PDF link], buried in the over 6000 pages of the OOXML spec, 51 pages of stuff like the following:


If anyone dares to say these specs aren't bloated, point them to part 4, section 2.18.4, pages 1632–1682. I honestly don't know whether to laugh or cry.

If you like, you can have a look at the standard yourself. It is known as ECMA-376. The specs can be downloaded in either zipped PDF format or in a mysterious format called DOCX. For the latter, alas, no fully functional implementation appears to exist.

2 comments:

Georg Muntingh said...

You repeatedly argue that what Stéphane Rodriguez refers to are mistakes in the implementation of OOXML. However, that raises the question: if Microsoft isn't able to make a decent implementation of OOXML, then who is? Even Microsoft has yet to deliver a full implementation of OOXML. Although these arguments indeed concern Microsoft's implementation, they do say something about the specification indirectly.

Anyway, thanks for taking the time to examine his results in detail.

Mark IJbema said...

The first point comes up, of course, because we live in the age of the browser. Browsers are weird little programs which are able to make HTML trees out of random bytes. So you shouldn't be surprised that people nowadays are surprised when applications don't support corrupt documents ;)