As I mentioned in my previous blog post, I was planning to write something about the separation of document style (bold, italic etc.) from document content (text, tables, images etc.), and in this post I will act on that plan. The main reason is an event held two days ago at the Delft university called 'The war surrounding Microsoft's standard' (not a title that suggests an open discussion, but I digress). During the interesting discussions with some of the attendees one thing came up more than once: the notion of separating styles from document content, and how Open XML does this in a not-so-nice fashion compared to ODF. Something to investigate!
Note that I am not pro- or anti-ODF. If you use it to drive your business, may your business fare well.
Before going into the differences in implementation and the ramifications for developers of business solutions, let me first discuss how you can style a document in Open XML and ODF. Note that I am not an ODF expert, so if there are any misrepresentations please feel free to publicly chastise me, or just tell me so I can adapt this post as my knowledge increases. I will also focus on a simple document with text only, not on spreadsheets or presentations.
The Wordprocessing specification of Office Open XML supports three ways to define formatting for your text. After a simple inspection of the OpenOffice.org Writer application it seems that ODF supports the same methods.
Format type | Usage |
Default | Used when no direct format or style is applied. |
Direct formatting | A container for formatting which is applied to a single element. |
Styles | A container for formatting which can be applied to multiple elements. |
The defaults are pretty obvious: what font, alignment, color etc. will be used when the author of the document hasn't explicitly set a value. Direct formatting is something that you do as an author when you realize that you want some piece of text to be bold, select it, and press the Bold button, usually on some type of toolbar (or ribbon). Next are the styles, which are more usable because you create styles separately from the document content and apply them to various pieces of content. We probably all know 'Heading 1', 'Heading 2' etc.
Separating the formatting from the content basically makes it easy to change the way your document looks without needing to touch the content of the document (a bit duh-ish, I know). Note that I explicitly state that for the application of styles. In my opinion direct formatting is exactly the opposite. When you apply a style, you are basically saying 'I want this to be a heading', which is expressly different from 'I want this to be bold'. The first is document specific, the second content specific, i.e. styles should travel with the document, while direct formatting should travel with the content. You can see this in the way formatting and styles are stored and applied. ODF and Open XML use the same method of calculating the final picture for a piece of content which has a style applied as well as a direct format (and implicitly the document defaults).
The following sample is the same for Open XML as for ODF. First the document defaults are applied to the text, next the style is applied (which can be a style hierarchy), and finally the direct format is applied to the text. In the sample the style indicates that the text should be bold and italic. The direct format of the word 'some' says no bold and no italics, so the end result should equal the last item.
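The layering described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the property names, the default font and the function name `resolve_formatting` are hypothetical and are not taken from either specification.

```python
def resolve_formatting(defaults, style_chain, direct):
    """Merge formatting layers; each later layer overrides the earlier ones."""
    effective = dict(defaults)      # 1. document defaults
    for style in style_chain:       # 2. styles, base style first
        effective.update(style)
    effective.update(direct)        # 3. direct formatting wins
    return effective


# Hypothetical values, chosen to mirror the sample in the text.
defaults = {"bold": False, "italic": False, "font": "Times New Roman"}
style_chain = [{"bold": True, "italic": True}]
direct_some = {"bold": False, "italic": False}

styled_text = resolve_formatting(defaults, style_chain, {})
word_some = resolve_formatting(defaults, style_chain, direct_some)
```

Here `styled_text` ends up bold and italic, while `word_some` falls back to plain text because its direct format overrides the style.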
Now if this is all the same for ODF and Open XML, the difficulty with this separation must be in the way the format stores this information. Let's take a closer look.
Let's first look at an ODF document which uses a direct format. Note that I removed unrelated content and indented the XML (for all samples). The following XML is part of the main document body. The XML shows that a style definition is used to store the direct formatting for the span of text. The style is identified by a computer-generated name (the author pressed the bold button; no name dialog appeared, obviously). The style definition is stored inside the same XML part as the textual content.
ODF – Direct format
Stored in main document part |
<office:document-content>
<office:automatic-styles>
<style:style
style:name="T1"
style:family="text">
<style:text-properties
fo:font-weight="bold" />
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
<text:p>
<text:span
text:style-name="T1"> Text </text:span>
</text:p>
</office:text>
</office:body>
</office:document-content> |
Second sample: Open XML using direct formatting. There are some similarities to ODF, such as the usage of a container element for storing the properties (rPr versus text-properties). However, Open XML keeps the properties enclosed in the content that has been formatted by the author and hence lacks the separate style definition with the computer-generated name. You can also note the usage of the w:t element for storing the text, not used by ODF, which I think is due to the mixed versus non-mixed content model, but that is a discussion for another day.
Open XML – Direct format
Stored in main document part |
<w:document>
<w:body>
<w:p>
<w:r>
<w:rPr>
<w:b />
</w:rPr>
<w:t>Text</w:t>
</w:r>
</w:p>
</w:body>
</w:document> |
Now for the application of styles. First ODF. Two samples are necessary since the style itself is stored in a different part than the content of the document (styles allow easy changing without touching content, remember). The first sample shows the document content referring to the style; the second XML sample shows the style part of an ODF document. In the first sample you can see that the exact same attribute is used to indicate the style as in the sample using direct formatting. Also note that the name of the style is now functional, since the author chose it when creating the style. The second sample is largely the same as the first ODF sample, only moved to a separate part in the ODF ZIP package.
ODF – Styles
Stored in main document part |
<office:document-content>
<office:body>
<office:text>
<text:p>
<text:span
text:style-name="MyStyle"> Text </text:span>
</text:p>
</office:text>
</office:body>
</office:document-content> |
ODF – Styles
Stored in styles part |
<office:document-styles>
<office:styles>
<style:style
style:name="MyStyle"
style:family="text">
<style:text-properties
fo:font-weight="bold" />
</style:style>
</office:styles>
</office:document-styles> |
Next of course are the same samples for Open XML. Some similar things are going on here. The style itself is stored outside of the document content in its own part in the Open XML ZIP package. The style has a (somewhat) useful name and is referenced by the formatted content using that name. A big difference is that Open XML uses separate markup to reference a style (w:rStyle) and to express direct formatting, whereas ODF uses the same attribute for both.
Open XML – Styles
Stored in main document part |
<w:document>
<w:body>
<w:p>
<w:r>
<w:rPr>
<w:rStyle w:val="MyStyle" />
</w:rPr>
<w:t>Text</w:t>
</w:r>
</w:p>
</w:body>
</w:document> |
Open XML – Styles
Stored in styles part |
<w:styles>
<w:style
w:type="character"
w:styleId="MyStyle">
<w:rPr>
<w:b />
</w:rPr>
</w:style>
</w:styles> |
So now that we have examined some of the XML samples for direct formatting and styles let's talk about the results of this for you as a developer of software that uses this stuff. Five scenarios that I want to investigate.
Copying document content
Let's say I want to copy a paragraph from one document to another. Since I want to copy only a little bit of content, I want the style of the target document to apply to the copied paragraph (styles travel with documents, direct formatting with content). When using Open XML you can look up the element that you need to copy, say a paragraph, and copy this element to the target document. This will copy not only the paragraph, but also the direct formatting applied to the paragraph and all its inner content. When using ODF, you look up the element that you need to copy, a paragraph again, and copy this element to the target document. Next you need to parse this paragraph and look up all direct-formatting styles that are used in the paragraph and in its inner content, then copy each of these separate styles into your target document, of course making sure to prevent naming collisions and adjusting the content accordingly. And in all likelihood you will have naming collisions, since the name 'T1' used by the OpenOffice.org Writer application will probably always be used (I think even a GUID for the name would have made life easier here for ODF developers).
Open XML | ODF |
- Look up element | - Look up element |
- Copy to target document | - Copy to target document |
 | - Parse element |
 | - Look up direct formatting styles |
 | - Prevent style name collisions |
 | - Copy styles |
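To make the ODF steps listed above a bit more concrete, here is a rough Python sketch using only the standard library. It is a simplification under my own assumptions: the helper names (`make_doc`, `copy_paragraph`) and the `_1` renaming scheme are hypothetical, and the sketch ignores style families and any named styles in the styles part of the target.

```python
import copy
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
FO = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
STYLE_NAME = f"{{{STYLE}}}name"
TEXT_STYLE_NAME = f"{{{TEXT}}}style-name"
AUTO_STYLES = f"{{{OFFICE}}}automatic-styles"


def make_doc(weight, content):
    """Build a tiny content.xml-like document with one automatic style 'T1'."""
    return (
        f'<office:document-content xmlns:office="{OFFICE}" xmlns:style="{STYLE}" '
        f'xmlns:text="{TEXT}" xmlns:fo="{FO}">'
        '<office:automatic-styles>'
        '<style:style style:name="T1" style:family="text">'
        f'<style:text-properties fo:font-weight="{weight}"/></style:style>'
        '</office:automatic-styles>'
        '<office:body><office:text>'
        f'<text:p><text:span text:style-name="T1">{content}</text:span></text:p>'
        '</office:text></office:body></office:document-content>'
    )


def copy_paragraph(source_root, paragraph, target_root):
    """Copy a paragraph plus the automatic styles it references."""
    src_auto = source_root.find(AUTO_STYLES)
    src_styles = {} if src_auto is None else {s.get(STYLE_NAME): s for s in src_auto}
    dst_auto = target_root.find(AUTO_STYLES)
    if dst_auto is None:
        dst_auto = ET.Element(AUTO_STYLES)
        target_root.insert(0, dst_auto)
    taken = {s.get(STYLE_NAME) for s in dst_auto}

    new_paragraph = copy.deepcopy(paragraph)
    renamed = {}
    for element in new_paragraph.iter():
        name = element.get(TEXT_STYLE_NAME)
        if name not in src_styles:
            continue  # a named style or no style at all: leave it alone
        if name not in renamed:
            new_name, counter = name, 1
            while new_name in taken:  # prevent style name collisions
                new_name = f"{name}_{counter}"
                counter += 1
            style_copy = copy.deepcopy(src_styles[name])
            style_copy.set(STYLE_NAME, new_name)
            dst_auto.append(style_copy)
            taken.add(new_name)
            renamed[name] = new_name
        element.set(TEXT_STYLE_NAME, renamed[name])

    target_root.find(f"{{{OFFICE}}}body/{{{OFFICE}}}text").append(new_paragraph)
    return new_paragraph


source = ET.fromstring(make_doc("bold", "Text"))
target = ET.fromstring(make_doc("normal", "Other"))
paragraph = source.find(f".//{{{TEXT}}}p")
copied = copy_paragraph(source, paragraph, target)
copied_span = copied.find(f"{{{TEXT}}}span")
```

Both documents use 'T1' for different formatting, so the copied span has to be pointed at a renamed copy of the source style. The Open XML equivalent would be little more than the `copy.deepcopy` and `append` calls.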
Adding new styles programmatically
If you create a new style programmatically, you need to prevent naming collisions again. In Open XML I open the styles part and check that single part. For ODF I need to open two parts, and check in both parts for naming collisions.
Open XML | ODF |
- Open styles part | - Open main part |
- Prevent style name collisions | - Prevent style name collisions |
- Create style | - Open styles part |
 | - Prevent style name collisions |
 | - Create style |
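The difference in the number of parts to check can be sketched as follows. This is a minimal illustration, not a complete implementation: the XML is passed in as strings rather than read from the ZIP packages, and the function names and the `_1` suffix scheme are my own assumptions.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
STYLE = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def open_xml_style_ids(styles_part):
    """Open XML: every style id lives in the single styles part."""
    root = ET.fromstring(styles_part)
    return {s.get(f"{{{W}}}styleId") for s in root.iter(f"{{{W}}}style")}


def odf_style_names(main_part, styles_part):
    """ODF: names come from both the main part and the styles part."""
    names = set()
    for part in (main_part, styles_part):
        root = ET.fromstring(part)
        names.update(s.get(f"{{{STYLE}}}name") for s in root.iter(f"{{{STYLE}}}style"))
    return names


def unique_name(wanted, taken):
    """Return 'wanted', or 'wanted' with a numeric suffix if already taken."""
    candidate, counter = wanted, 1
    while candidate in taken:
        candidate = f"{wanted}_{counter}"
        counter += 1
    return candidate


open_xml_styles = f'<w:styles xmlns:w="{W}"><w:style w:styleId="MyStyle"/></w:styles>'
odf_main = (
    f'<office:document-content xmlns:office="{OFFICE}" xmlns:style="{STYLE}">'
    '<office:automatic-styles><style:style style:name="T1"/></office:automatic-styles>'
    '</office:document-content>'
)
odf_styles = (
    f'<office:document-styles xmlns:office="{OFFICE}" xmlns:style="{STYLE}">'
    '<office:styles><style:style style:name="MyStyle"/></office:styles>'
    '</office:document-styles>'
)

open_xml_taken = open_xml_style_ids(open_xml_styles)
odf_taken = odf_style_names(odf_main, odf_styles)
new_odf_name = unique_name("T1", odf_taken)
```

Checking only the ODF styles part would miss the 'T1' in the main part, which is exactly the collision discussed in the next scenario.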
Adding new styles through the UI
Take a look at the ODF sample of direct formatting and notice the generated name 'T1' being used. First of all, is this naming convention in the spec? (Seriously, I haven't looked yet.) Now what if I go into OpenOffice.org Writer and add a style using that exact same name (since I think T1 is a functional name for my style… somehow)? The office application must now find and change that piece of direct-formatted text, since the name of the new style collides with the name of its direct format style. As a result the name of the direct format style changes. Hence you cannot easily rely on the name for identifying some formatted text. Given that the same attribute is used to identify a direct format as well as a style, you must check whether that style is a direct format, in which case you shouldn't cache the name somewhere in your code since it might change after the document is edited. Now you could reason that you should never rely on the name of a direct format: since it is only applied to a single piece of content, the name shouldn't matter, right? Wrong! The issue is that the same attribute is used to identify style and direct formatting, so you are stuck with checking this programmatically. This means you will need to learn more about ODF before doing this correctly, and in my opinion this makes things more difficult, not less. I can already imagine the naïve implementation saying that each style named T plus a number will probably be a direct format and not a real style.
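Instead of guessing from the name, a program can check where the style is defined. The following Python sketch assumes, based on the samples in this post, that direct formats live under office:automatic-styles in the main part. The function name `is_direct_format` is hypothetical.

```python
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
FO = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"

CONTENT = f"""<office:document-content xmlns:office="{OFFICE}"
    xmlns:style="{STYLE}" xmlns:text="{TEXT}" xmlns:fo="{FO}">
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text">
      <style:text-properties fo:font-weight="bold"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <office:text>
      <text:p>
        <text:span text:style-name="MyStyle">
          <text:span text:style-name="T1">Text</text:span>
        </text:span>
      </text:p>
    </office:text>
  </office:body>
</office:document-content>"""


def is_direct_format(content_root, style_name):
    """True if style_name is defined as an automatic style in the main part."""
    auto = content_root.find(f"{{{OFFICE}}}automatic-styles")
    if auto is None:
        return False
    return any(
        s.get(f"{{{STYLE}}}name") == style_name
        for s in auto.iter(f"{{{STYLE}}}style")
    )


root = ET.fromstring(CONTENT)
kinds = {
    span.get(f"{{{TEXT}}}style-name"): is_direct_format(
        root, span.get(f"{{{TEXT}}}style-name")
    )
    for span in root.iter(f"{{{TEXT}}}span")
}
```

The lookup classifies 'T1' as a direct format and 'MyStyle' as a real style without relying on the 'T plus a number' pattern.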
(funny side-note is that at the event one of the speakers actually said that open source makes your life easy since you can peek into the source code instead of reading or defining things in a spec, which opens the floodgates for these types of implementation stupidities)
Combining style and direct format
While thinking about these issues I suddenly thought of another interesting detail. How can you apply both a direct format and a style to some content if the same attribute name is used to identify both levels of formatting? Given the output of OpenOffice.org Writer I would suspect you can't. End result? Just put a span in a span! The outer span points to the real style being used, the inner span points to the direct formatting style. In Open XML this is a no-brainer since there are different elements for indicating the style and the direct format.
ODF – Direct format and style |
<office:document-content>
<office:body>
<office:text>
<text:p>
<text:span
text:style-name="MyStyle"> <text:span
text:style-name="T1"> Text </text:span>
</text:span>
</text:p>
</office:text>
</office:body>
</office:document-content> |
And for Open XML:
Open XML – Direct format and style |
<w:document>
<w:body>
<w:p>
<w:r>
<w:rPr>
<w:b />
<w:rStyle w:val="MyStyle" />
</w:rPr>
<w:t>Text</w:t>
</w:r>
</w:p>
</w:body>
</w:document> |
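Because the style reference and the direct format are different elements, telling them apart in code takes only a few lines. The sketch below is my own illustration using Python's standard library. The function name `describe_run` is hypothetical, and it only reports element names rather than interpreting their values.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

RUN = f"""<w:r xmlns:w="{W}">
  <w:rPr>
    <w:rStyle w:val="MyStyle"/>
    <w:b/>
  </w:rPr>
  <w:t>Text</w:t>
</w:r>"""


def describe_run(run):
    """Return (style id or None, list of direct-format element names)."""
    style_id = None
    direct = []
    run_properties = run.find(f"{{{W}}}rPr")
    if run_properties is not None:
        for child in run_properties:
            local_name = child.tag.rpartition("}")[2]
            if local_name == "rStyle":
                style_id = child.get(f"{{{W}}}val")  # reference to a style
            else:
                direct.append(local_name)            # direct formatting
    return style_id, direct


style_id, direct = describe_run(ET.fromstring(RUN))
```

For the run above this yields the style id 'MyStyle' and a single direct format element, 'b'.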
Amount of text content used
One thing I heard earlier about Open XML's way of doing things is that it is more verbose. Take a look at the following character counts for the Open XML and ODF samples.
(Note that this is a somewhat lame example, real verbosity checks will probably take a bit more than a few small pieces of XML)
Sample | Open XML | ODF |
Direct format | 100 characters | 326 characters |
Styles | 218 characters | 365 characters |
In conclusion
Both ODF and Open XML use the same ideas for separating formatting from content, and support similar notions, only Open XML takes a vastly different route. In my humble opinion, Open XML takes an approach which is better suited to development of solutions and makes it easier to work with the markup.
Hope it helps