Save, auto-save or always-save?

This article has some valid points; one of those is: "why do we still hit ctrl-s?" or, in more general terms, "why is saving a document a distinct operation?". Since I'm interested in storage, it got me brainstorming about how a save-less world would work in real life.

Firstly, the article is about word processors - WYSIWYG editors for writing (sometimes large) amounts of formatted text, which with graphics and decorations can grow to be quite big. Actually, today's generation of "office applications" is pretty good at document recovery. After some disasters around the time of Word 95, Microsoft, and following it everyone and his dog, has really invested visible effort to make documents in progress practically non-losable. MS Office and OpenOffice today will write temporary files with "current content" even if Auto-save is turned off, to ensure something can be recovered if the program, the operating system or the computer dies while the document is being authored. In the Unix world, VIM does this with its swap file - I believe it mmaps its buffers to the temporary file so every change is by definition recorded in the file system.

The mmap approach is actually very nice, if it can be pulled off. It means that, barring catastrophic OS or hardware failure, there will always be something awaiting recovery if the application itself dies. Unfortunately, it cannot be used when the buffers are in a different format than the output file. In the VIM example, the temp file is binary and practically unusable without being recovered by VIM again. Microsoft's .DOC file format has two "d'oh" moments in this area: firstly, it's almost universally hated among developers because it's almost a straightforward memory dump of the application buffers, requiring tedious reverse engineering to read it; and secondly, even being so, it's not used as an mmap-ed buffer, so data is lost if the application crashes without saving.
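
To make the idea concrete, here is a minimal sketch of an mmap-backed edit buffer in Python. It is not how VIM or any word processor actually does it; the file name, the fixed BUF_SIZE and the open_buffer() helper are assumptions made up for the illustration.

```python
import mmap
import os
import tempfile

BUF_SIZE = 4096  # assumed fixed size; growing the buffer would need a remap, omitted here


def open_buffer(path):
    """Create (or reuse) a file of BUF_SIZE bytes and map it into memory."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    os.ftruncate(fd, BUF_SIZE)
    buf = mmap.mmap(fd, BUF_SIZE)  # shared, writable mapping by default
    os.close(fd)  # the mapping stays valid after the descriptor is closed
    return buf


path = os.path.join(tempfile.mkdtemp(), "draft.buf")  # hypothetical file name
buf = open_buffer(path)

text = b"Hello, save-less world"
buf[:len(text)] = text  # an "edit" is a plain memory write; there is no save step
buf.flush()             # only needed to survive an OS crash or power loss
buf.close()

# What a recovery tool would find in the file after the application is gone.
with open(path, "rb") as f:
    recovered = f.read().rstrip(b"\0")
print(recovered)
```

The catch from the paragraph above is visible here too: what lands in the file is the raw buffer, zero padding included, not a document in any exchange format.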

In another bad twist, the long-expected XML formats are particularly bad since they universally take an all-or-nothing approach: either all of the data needs to be canonically serialized, packed inside the ZIP container and safely written down, or none of it will be readable without serious recovery when reopened. Also, while saving .DOC files was fast since the buffers were practically directly passed to write(), XML serialization and compression are computationally significantly more complex.
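
The all-or-nothing behaviour is easy to reproduce with Python's zipfile module. This is only a sketch: the part name "content.xml" and the helpers make_container() and can_open() are assumptions for the example, and the truncation merely simulates a save that was interrupted half-way.

```python
import io
import zipfile


def make_container(parts):
    """Pack named parts into an in-memory ZIP container and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as container:
        for name, data in parts.items():
            container.writestr(name, data)
    return buf.getvalue()


def can_open(raw):
    """Return True if the bytes open as a ZIP archive whose members all check out."""
    try:
        with zipfile.ZipFile(io.BytesIO(raw)) as container:
            return container.testzip() is None
    except zipfile.BadZipFile:
        return False


full = make_container({"content.xml": "<doc>" + "<p>text</p>" * 1000 + "</doc>"})
truncated = full[: len(full) // 2]  # simulate a crash in the middle of saving

print(can_open(full), can_open(truncated))
```

The ZIP directory lives at the end of the file, so a half-written container does not open at all with standard tools, even though most of the text is physically there.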

The article's author wants, and after some thinking I agree with him, two things:

  1. Banishment of a distinct "save" operation
  2. Continuous versioning

These requirements are in contrast with how practically all software works today - from text editors and word processors to business applications - there are "save" or "commit" commands everywhere.

It doesn't have to be like that, though. Mmap is a good start but I think it's too limited, especially for complex documents, so something systemic needs to happen - possibly another layer between the file systems and the applications (it doesn't necessarily have to be a distinct out-of-the-blue layer; for example, exporting some ZFS features to applications might practically be the only thing that's needed). This layer would have a concept of a "complex document" - a document having multiple parts like text and images. It would accept "diffs" as its input - either structured ones, like XML, for the documents, or it would simply take over write() and generate diffs from there (the first approach is better since it allows complex multi-part changes as a single transaction). It would proceed to keep a "master" document - either materialized in a file or as a "view" of "latest changes" - and version everything.
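
A toy version of such a layer could look like the sketch below. Everything in it is hypothetical: the VersionedDocument class, its commit() and view() methods and the JSON-lines log format are invented for the illustration, and a real implementation would also have to cope with torn writes, locking and compaction.

```python
import json
import os
import tempfile


class VersionedDocument:
    """Hypothetical layer: a multi-part document kept as an append-only log of diffs."""

    def __init__(self, log_path):
        self.log_path = log_path
        open(log_path, "a", encoding="utf-8").close()  # make sure the log exists

    def commit(self, changes):
        """Record one transaction: a dict of part name -> new content (None deletes)."""
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(changes) + "\n")
            log.flush()
            os.fsync(log.fileno())  # ask the OS to put the new version on disk

    def view(self, version=None):
        """Materialize the document as of `version` (default: latest) by replaying diffs."""
        parts = {}
        with open(self.log_path, encoding="utf-8") as log:
            for number, line in enumerate(log, start=1):
                if version is not None and number > version:
                    break
                for name, content in json.loads(line).items():
                    if content is None:
                        parts.pop(name, None)
                    else:
                        parts[name] = content
        return parts


doc = VersionedDocument(os.path.join(tempfile.mkdtemp(), "report.log"))
doc.commit({"body": "Draft", "figure1": "chart-v1"})  # two parts, one transaction
doc.commit({"body": "Draft, revised"})                # a later edit touches one part
latest = doc.view()
first = doc.view(version=1)
print(latest, first)
```

Here the "master" document is a view computed from the log; materializing it in a file would be a matter of caching the result of view().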

This layer will need to be "blessed" to be infallible, or as infallible as an operating system kernel (in the sense that userland applications don't expect the kernel to give up the ghost for trivial reasons) by using every trick in the book. The database crowd has been doing these tricks for ages, and some file systems (like ZFS) are making use of them. Journalling, transactions, etc. must be used extensively.

There are some edge cases to work around, like the granularity of document versioning. I have a feeling that it can be similar to whatever the application uses for its "undo" records - if it generates an "undo" record, it's worth saving. In fact, application-specific undo records can be abandoned in favour of such versioning.

Complex documents with versioning introduce another problem - how will they be transferred and backed up? Effectively, the versioned data represents multiple documents, which is vaguely reminiscent of multiple-stream files. If Microsoft had stopped sitting on its collective ass and seen the writing on the wall, maybe today multiple-stream files would be common and applications would be used to dealing with them (I'm referencing MS here as a possible trend-setter; NTFS had multiple-stream files from day 1 but they are incredibly under-used), and versioning and similar features would be easier to integrate. As it stands, with the ancient Unix concept of "everything is a (single) stream of bytes" files winning, it looks like such complex documents must be files containing miniature databases in themselves to keep all the metadata.
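
As a rough sketch of the "miniature database in a file" idea, the example below uses SQLite through Python's sqlite3 module. The table layout and the helper names (open_document(), save_version(), current(), history()) are assumptions made for the illustration, not a proposed standard.

```python
import os
import shutil
import sqlite3
import tempfile


def open_document(path):
    """Open (or create) a document file that is really a small SQLite database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS versions ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " part TEXT NOT NULL,"
        " content BLOB NOT NULL)"
    )
    db.commit()
    return db


def save_version(db, part, content):
    """Append a new version of one part; older versions are never overwritten."""
    with db:  # commits on success, rolls back on an exception
        db.execute("INSERT INTO versions (part, content) VALUES (?, ?)", (part, content))


def current(db, part):
    """Return the latest content of a part, or None if the part does not exist."""
    row = db.execute(
        "SELECT content FROM versions WHERE part = ? ORDER BY id DESC LIMIT 1", (part,)
    ).fetchone()
    return row[0] if row else None


def history(db, part):
    """Return every stored version of a part, oldest first."""
    rows = db.execute("SELECT content FROM versions WHERE part = ? ORDER BY id", (part,))
    return [row[0] for row in rows]


folder = tempfile.mkdtemp()
original = os.path.join(folder, "report.doc")  # hypothetical file names
db = open_document(original)
save_version(db, "text", b"first draft")
save_version(db, "image1", b"\x89PNG...")  # placeholder bytes, not a real image
save_version(db, "text", b"second draft")
db.close()

# Transfer or backup is an ordinary copy of one single-stream file.
backup = os.path.join(folder, "report-backup.doc")
shutil.copy(original, backup)
copy = open_document(backup)
latest_text = current(copy, "text")
text_history = history(copy, "text")
copy.close()
print(latest_text, text_history)
```

The whole history travels with the file, which is convenient for backup and also exactly why old versions can leak if the file is handed out carelessly.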

The reason why I'm advocating another layer and not simply promoting "best practices" when dealing with files (like mmap(), judicious use of fsync(), etc.) is twofold: firstly, people, including programmers, are lazy, and having someone else do something as sensitive as this properly for them is of great importance; secondly, it's time operating systems stopped providing only the basic primitives and moved on to solving wider human-machine interface (HMI) problems. Apple with its humanizing approach (Time Machine, the Dock that doesn't differentiate much between started and non-started applications, etc.) is on the right track, though not really there yet. Also, this is an opportunity to get standardised on the versioning API and the file format.
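
To show how much ceremony the "best practices" involve, here is a sketch of the usual careful-save pattern (write a temporary file, fsync() it, then rename it over the old one). The atomic_save() helper and the file names are assumptions for the example; it follows the commonly recommended POSIX recipe and is not taken from any particular application.

```python
import os
import tempfile


def atomic_save(path, data):
    """Replace `path` with `data` so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data must be on disk before the rename
        os.replace(tmp_path, path)  # atomic rename over the old version
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave a stray temporary file behind
        raise
    if hasattr(os, "O_DIRECTORY"):  # POSIX only: make the rename itself durable
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


folder = tempfile.mkdtemp()
target = os.path.join(folder, "notes.txt")  # hypothetical file name
atomic_save(target, b"version 1")
atomic_save(target, b"version 2")
with open(target, "rb") as f:
    saved = f.read()
print(saved)
```

That is a couple of dozen lines for one safe write, without any versioning, which is the sort of thing I'd rather see done once, properly, below the application.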

I guess the other way to make it happen is to make it Somebody Else's Problem - to use server-side applications like Google Apps and simply keep it all there, where it will presumably be backed up forever. But I don't think that will be enough - local storage is not going to die in the foreseeable future.

Update: Two more issues need to be addressed: the behaviour of "dumb" applications with these complex files and the security of version records. "Dumb" applications can simply always access the last ("current") version of the file, with every write() optionally generating new version records. By "security of version records" I mean the problem that occasionally appears when some sensitive data leaks out in publicly available .DOC files (due to the way they are saved, they may contain old and seemingly previously overwritten data). Obviously, there must be a "purge old versions" command, but there could also be a mechanism to encrypt the version records so only the original author can see them (the mechanism can be extended to shared passwords, etc. for group authoring).

#1 Re: Save, auto-save or always-save?

Added on 2010-01-08T14:54 by John Leuner

This problem has also been addressed at an OS level, so that instead of having applications be aware of different versions or streams of documents, we can have the OS persist all the state transparently. The two examples mentioned in this blog post are the L3 system and persistent Java.

#2 Re: Save, auto-save or always-save?

Added on 2010-01-08T16:07 by Ivan Voras

It wouldn't work without some application support - at least some kind of transactional memory. The application still needs to say when the document is in a safe state and can be "checkpointed" by the OS.
