Last week, Microsoft published the binary file formats for Office. These formats appear to be almost completely insane. The Excel 97-2003 file format is a 349-page PDF file. But wait, that’s not all there is to it! This document includes the following interesting comment: “Each Excel workbook is stored in a compound file.” You see, Excel 97-2003 files are OLE compound documents, which are, essentially, file systems inside a single file. These are sufficiently complicated that you have to read another 9-page spec to figure that out. And these “specs” look more like C data structures than what we traditionally think of as a spec.

If you started reading these documents with the hope of spending a weekend writing some spiffy code that imports Word documents into your blog system, or creates Excel-formatted spreadsheets with your personal finance data, the complexity and length of the spec probably cured you of that desire pretty darn quickly. A normal programmer would conclude that Office’s binary file formats are the product of a demented Borg mind, were created by insanely bad programmers, and are impossible to read or create correctly.

With a little bit of digging, I’ll show you how those file formats got so unbelievably complicated, why it doesn’t reflect bad programming on Microsoft’s part, and what you can do to work around it.

The first thing to understand is that the binary file formats were designed with very different design goals than, say, HTML.

They were designed to be fast on very old computers. For the early versions of Excel for Windows, 1 MB of RAM was a reasonable amount of memory, and an 80386 at 20 MHz had to be able to run Excel comfortably. There are a lot of optimizations in the file formats that are intended to make opening and saving files much faster.

These are binary formats, so loading a record is usually a matter of just copying (blitting) a range of bytes from disk to memory, where you end up with a C data structure you can use. There’s no lexing or parsing involved in loading a file. Lexing and parsing are orders of magnitude slower than blitting.

The file format is contorted, where necessary, to make common operations fast. For example, Excel 95 and 97 have something called “Simple Save” which they use sometimes as a faster variation on the OLE compound document format, which just wasn’t fast enough for mainstream use. Word had a feature called Fast Save. To save a long document quickly, 14 out of 15 times, only the changes are appended to the end of the file, instead of rewriting the whole document from scratch. On the hard drives of the day, this meant saving a long document took one second instead of thirty. (It also meant that deleted data in a document was still in the file. This turned out to be not what people wanted.)

If you wanted to write a from-scratch binary importer, you’d have to support things like the Windows Metafile Format (for drawing things) and OLE Compound Storage. If you’re running on Windows, there’s library support for these that makes it trivial… using these features was a shortcut for the Microsoft team. But if you’re writing everything on your own from scratch, you have to do all that work yourself.

Office has extensive support for compound documents; for example, you can embed a spreadsheet in a Word document. A perfect Word file format parser would also have to be able to do something intelligent with the embedded spreadsheet.

They were not designed with interoperability in mind. The assumption, and a fairly reasonable one at the time, was that the Word file format only had to be read and written by Word. That means that whenever a programmer on the Word team had to make a decision about how to change the file format, the only thing they cared about was (a) what was fast and (b) what took the fewest lines of code in the Word code base. The idea of things like SGML and HTML (interchangeable, standardized file formats) didn’t really take hold until the Internet made it practical to interchange documents in the first place; this was a decade after the Office binary formats were first invented. There was always an assumption that you could use importers and exporters to exchange documents. In fact Word does have a format designed for easy interchange, called RTF, which has been there almost since the beginning.

They have to reflect all the complexity of the applications.
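To make the point about blitting concrete, here is a minimal Python sketch of reading and writing fixed-layout binary records with no lexing or parsing step. The record layout (`CELL_RECORD`, with a type, row, column, and value) and the helper names are invented for illustration; they are not taken from the actual Excel or Word specs.

```python
import io
import struct

# Hypothetical fixed-layout record (an assumption, not the real Excel layout):
# 2-byte type, 2-byte row, 2-byte column, 8-byte double; little-endian, unpadded.
CELL_RECORD = struct.Struct("<HHHd")


def write_cell(stream, rec_type, row, col, value):
    """Write one record by dumping its fixed binary layout to the stream."""
    stream.write(CELL_RECORD.pack(rec_type, row, col, value))


def read_cells(stream):
    """Read records back by copying fixed-size byte ranges; nothing is tokenized."""
    cells = []
    while True:
        chunk = stream.read(CELL_RECORD.size)
        if len(chunk) < CELL_RECORD.size:
            break  # end of data (or a truncated trailing record)
        cells.append(CELL_RECORD.unpack(chunk))
    return cells


def demo():
    """Round-trip two records through an in-memory file."""
    buf = io.BytesIO()
    write_cell(buf, 1, 0, 0, 3.5)
    write_cell(buf, 1, 0, 1, 42.0)
    buf.seek(0)
    return read_cells(buf)
```

The reader never inspects the bytes to decide where a field starts or ends; the layout is fixed in advance, which is what makes this style of loading cheap and also what ties the file so tightly to the program's in-memory structures.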