Recently Malte Timmerman of Sun blogged about working with ODF document encryption. With my experience building Open XML solutions using Microsoft .NET, I thought it would be a nice test to see if I could write some code to decrypt documents created by Open Office, purely using information in the ODF specification or any referenced materials.
During my hour-long research of the ODF spec I kept a record of my experiences. Often I don't do this and then forget the exact order of things…really annoying. This is the write-up of that hour. This post reflects my genuine experience with finding out about ODF encryption. I have not seen this part of ODF before, or the specification of the package structure.
Learning about ODF encryption
I started out from the blog post of Mr Timmerman, and downloaded the ODF specification from his hyperlink. I could start reading at chapter 17, but as anyone would do, I skipped right to 17.3 mentioned by Malte. ODF encryption it is.
The first thing that struck me is that the entire text on ODF encryption was about half an A4 page in size. Not what I was expecting given the complexity and importance of this topic.
The first sentence was also not what I expected. It says:
The encryption process takes place in the following multiple stages:
BAM. There you have it. These are the steps, now shush. (FYI, stages are always multiple, that's why it's stages.)
This is seriously the first time I have seen this little information in a specification of this importance. The Dutch government, for instance, mandates that software should be ODF enabled, but I think there was probably little research on what that means and how that will pan out when you start building real software. This is what I jotted down at that point:
- What can be encrypted
- Goal
- How
  - Who file
  - Per zip entry
  - Per zip folder
  - Per logical part
- Process
I can't place the 'who file' remark, so let's scrap that one (hey, it was late. I had a DevDays session to present, stuff happens). However, the other things are quite relevant. What can be encrypted? Binaries, XML, everything? How? Per zip entry, folder, or logical structures? What is the general process? Why even encrypt? Can't we encrypt the entire file?
As you can imagine, here I am, lost after the first sentence.
Now don't take this the wrong way, but can't someone at least write some words of explanation before showing the steps on how to do it? I am a trainer, and this is not how you train. And IMHO, a spec is used to train people. People bad-mouth Open XML for lots of things, but if there is anything it has on ODF it is documentation. Size does matter.
So, steps it is.
Going through the steps
1: A 20-byte SHA1 digest of the user entered password is created and passed to the package component.
Ok, I know SHA1. There is support available in .NET, no worries. But what is a package component, and who passes what to whom? If it is my application, doesn't it already have the password? So what's up with the 'passing'? Let's do a search for 'package component': nope, this is the first mention. It'll probably just be 'application'.
Then something more critical struck me. In what encoding must the password be turned into bytes before hashing? Not sure, and definitely not sure if it is mentioned somewhere in the ODF specification. And even if it is mentioned, the password is not exactly part of the document, so does the ODF specification even apply to that?
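To see why the encoding question matters, here is a minimal Python sketch (Python rather than .NET, purely for brevity). The helper name `password_digest` and the candidate encodings are assumptions for illustration; the step quoted above names no encoding at all. It hashes the same password under three encodings.

```python
import hashlib


def password_digest(password: str, encoding: str) -> bytes:
    """Return the 20-byte SHA-1 digest of the password bytes.

    The encoding is a parameter because the quoted step does not name one
    (hypothetical helper, not taken from the ODF specification).
    """
    return hashlib.sha1(password.encode(encoding)).digest()


# A password with one non-ASCII character (a-umlaut), written as an escape
# so the example does not depend on the source file's own encoding.
password = "p\u00e4ssword"

utf8_digest = password_digest(password, "utf-8")
latin1_digest = password_digest(password, "latin-1")
utf16_digest = password_digest(password, "utf-16-le")

# All three are valid 20-byte SHA-1 digests, yet they differ from each other,
# so two implementations that pick different encodings cannot interoperate.
print(utf8_digest.hex())
print(latin1_digest.hex())
print(utf16_digest.hex())
```

For a pure ASCII password, UTF-8 and Latin-1 produce identical bytes, so a test suite that only uses simple passwords would never catch a mismatch.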
2. The package component initializes a random number generator with the current time.
Ok. What's up? There are many pseudo random number generators. Do I need to use a cryptographically strong generator, or just do a rand()? I am no expert, please enlighten me.
3. The random number generator is used to generate a random 8-byte initialization vector and 16-byte salt for each file.
For each file? When did files come into play? Ok. So first there is no explanation, just steps. This is probably one of those in-line explanations. Does this mean I always have to encrypt each file in the ZIP, or just each ODF file I want to encrypt?
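Steps 2 and 3 can be read in at least two ways, sketched below in Python. Both function names are hypothetical, and neither is claimed to be what Open Office does. The first follows the text literally with an ordinary generator seeded from a timestamp; the second uses the operating system's cryptographic generator, which the quoted text does not ask for.

```python
import random
import secrets
import time
from typing import Tuple


def time_seeded_iv_and_salt(seed: int) -> Tuple[bytes, bytes]:
    """Literal reading of steps 2 and 3: a plain PRNG seeded with a time value."""
    rng = random.Random(seed)
    iv = bytes(rng.getrandbits(8) for _ in range(8))     # 8-byte initialization vector
    salt = bytes(rng.getrandbits(8) for _ in range(16))  # 16-byte salt
    return iv, salt


def csprng_iv_and_salt() -> Tuple[bytes, bytes]:
    """Alternative reading: values from a cryptographically strong source."""
    return secrets.token_bytes(8), secrets.token_bytes(16)


# Anyone who can guess the seed (for example the save time, to the second)
# can regenerate exactly the same IV and salt.
seed = int(time.time())
first = time_seeded_iv_and_salt(seed)
second = time_seeded_iv_and_salt(seed)
print(first == second)

iv, salt = csprng_iv_and_salt()
print(len(iv), len(salt))
```

Since the IV and salt have to be stored somewhere readable for decryption to work at all, predictability may matter less here than for a key, but the text does not say which kind of generator is expected.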
4. This salt is used together with the 20-byte SHA1 digest of the password to derive a unique 128-bit key for each file. The algorithm used to derive the key is PBKDF2 using HMAC-SHA-1 (see [RFC2898]) with an iteration count of 1024.
WOW. Hold ON!
The salt is "used together" with the 20-byte SHA1 digest???? WTF??? How does that work? What kind of term is "used together" for a specification? I could guess now that we do PASS+SALT, but what if it is SALT+PASS (or SSTAPALS :) lol)? Safe to say that at that point in time I was not feeling that confident about getting my implementation right without access to Open Office. Not happy.
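One plausible reading: PBKDF2 as defined in RFC 2898 takes the password and the salt as two separate inputs, so no concatenation order has to be guessed if the SHA1 digest simply goes in as the "password" argument. The Python sketch below assumes exactly that, plus a UTF-8 password encoding; both are assumptions, and `derive_file_key` is a hypothetical name.

```python
import hashlib
import os


def derive_file_key(password: str, salt: bytes, encoding: str = "utf-8") -> bytes:
    """Derive a 128-bit per-file key as steps 1 and 4 seem to describe.

    Assumptions (not confirmed by the quoted text): the password is encoded
    with `encoding`, and the SHA-1 digest is passed as PBKDF2's password input.
    """
    digest = hashlib.sha1(password.encode(encoding)).digest()  # step 1: 20 bytes
    # step 4: PBKDF2 with HMAC-SHA-1, 1024 iterations, 16-byte (128-bit) output
    return hashlib.pbkdf2_hmac("sha1", digest, salt, 1024, dklen=16)


salt_a = os.urandom(16)
salt_b = os.urandom(16)

key_a = derive_file_key("secret", salt_a)
key_b = derive_file_key("secret", salt_b)

# Same password, different salt: a different key for each file.
print(key_a.hex())
print(key_b.hex())
```

Whether this matches what Open Office actually writes can only be confirmed by decrypting a real document, which is the interoperability testing the text complains about.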
5. The derived key is used together with the initialization vector to encrypt the file using the Blowfish algorithm in cipher-feedback (CFB) mode.
Blowfish? Isn't the inventor on record saying that he is amazed that Blowfish is still used? Not expecting too much support for it in .NET. That'll be a library to build or license. A comment on the blog of Malte indicates that it is also an issue for people working in government.
Ok. I have lots of questions on the steps. If I don't find this information I'd better debug my code :) The downside is that I'll now have to test not only my code, but also how it works against all the office apps that I want to interoperate with. More work for me. Not happy again.
Compression of encrypted files
The encryption section ends with a small blurb on compression.
Each file that is encrypted is compressed before being encrypted.
What, what, what? Even binaries? Using what algorithm? Earlier I felt that all files should be encrypted, while this sentence makes me feel only some files should. Which is it? There are at least a few options here (take into account that I am not a zipping expert):
- ZIP allows files to be stored compressed or uncompressed individually.
- I need to manually ZIP each entry in the (parent) ZIP, using the .NET ZipStream or similar.
- They actually mean to ZIP the entire package, not files inside the package.
Is it just me, or do we all feel the point of a ZIP container is to ZIP stuff in it? Why do I need to do that again? This will be some more integration tests.
Since it was the first mention of compression that I found, perhaps the next section will hold some information on this topic. So I go scan section 17.4, only to find the second paragraph mentioning 'magic number' mechanisms such as the Unix file/magic utility.
And here I am, yet a little more lost than before. Now I also think ODF might be a little dependent on one platform.
To allow the contents of the package file to be verified, it is necessary that encrypted files are flagged as 'STORED' rather than 'DEFLATED'.
Ok, probably something that is defined by the ZIP specification. No worries.
As entries which are 'STORED' must have their size equal to the compressed size, it is necessary to store the uncompressed size in the manifest.
I love sentences starting with 'as'. I understand only a little bit of this sentence. It first says 'because zipped entries have a size equal to their size' (yeah, duh), 'you need to store the uncompressed size too'. Not sure why I should store the original size separately. Normally when I send a ZIP file to someone, I don't send over an extra text file with that information either. Why is this so necessary, and why don't you please tell me about it instead of making me feel more stupid than I am?
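One way to read the two quoted sentences, sketched in Python: if the file is compressed and encrypted before it goes into the ZIP and is then written as 'STORED', the ZIP entry only records the size of the encrypted bytes, so the original length would have to live somewhere else. Everything here is an assumption for illustration: raw DEFLATE as the compression, a throwaway XOR in place of Blowfish/CFB (which Python's standard library does not provide), and the helper names.

```python
import io
import zipfile
import zlib


def deflate_raw(data: bytes) -> bytes:
    """Compress with raw DEFLATE (no zlib header); assumed, not stated in the text."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def placeholder_cipher(data: bytes) -> bytes:
    """Stand-in for the real cipher. XOR is NOT encryption; it only keeps the sketch runnable."""
    return bytes(b ^ 0x5A for b in data)


original = b"<office:document/>" * 100
payload = placeholder_cipher(deflate_raw(original))  # compress first, then "encrypt"

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    # STORED: the ZIP layer writes the payload as-is, without compressing it again.
    package.writestr("content.xml", payload, compress_type=zipfile.ZIP_STORED)

with zipfile.ZipFile(buffer) as package:
    info = package.getinfo("content.xml")
    stored = package.read("content.xml")

# The ZIP entry only knows the payload size; the original size is lost
# unless it is recorded elsewhere (the manifest, according to the quote).
print(info.file_size, info.compress_size, len(original))

recovered = zlib.decompress(placeholder_cipher(stored), -15)
print(recovered == original)
```

Under this reading, the first of the three options above (ZIP's own per-entry compression) is ruled out for encrypted files, because the ZIP layer is told not to compress them.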
Here's one thing I love about the Open XML specification. There are many anchors in the PDF and 'see… blah' links. You can easily navigate using clicks and Alt+←.
Conclusion
So after my initial research of ODF encryption I found it:
- Grossly underspecified
- Incomplete on important details
- No mention of decryption
- Using algorithms which even the author does not want you to use, without platform support in at least .NET, Java perhaps too.
- Hard to navigate since the specification does not use anchors / references.
As you can imagine ODF does not hold up to scrutiny that well. How can this important topic be so under-documented?
I will continue to build the decryption application and let you know about my progress.