RESTful manipulation of versioned data

In this post, I’ll sketch a half-baked plan for making an idiomatic RESTful service that handles versioned data in a sensible way. I’m not claiming that the pattern I’m proposing is novel or non-obvious (it is, in fact, rather obviously inspired by git and functional dictionaries); but I haven’t seen it elsewhere, and I believe it might be a good pattern for other folks with similar applications to mine. I’ll briefly introduce my problem domain (specifically, a web-service interface for Wallaby) before detailing my solution. (Feel free to skip the first section if you aren’t interested in the Wallaby application, and skip the second if you have a pretty good idea of how git works.)

Background

Wallaby’s current API maps quite naturally to remote invocations over XML-RPC or SOAP, but a REST interface is more palatable to many client programmers. (And, perhaps more importantly, to this service developer: as a recovering functional programmer, I really appreciate the explicit state transfers inherent in RESTful interaction.) A more dramatic interface change affords the opportunity to think about improving some limitations of the old interface, particularly in terms of the versioning model.

Wallaby’s current versioning support is designed to support two major kinds of operations: saving and reloading the state of the entire system, and comparing the (activated) configuration of an individual node to any previously-activated configuration. These two classes of operations support a lot of interesting use cases really well, including most of the ones Wallaby users care about: providing named snapshots of system state, providing “undo” functionality, and differencing configurations to identify which Condor subsystems need to be restarted.

However, this model has some shortcomings: it treats service state as essentially linear (not as a tree or DAG), it is thus unable to support branches (or to safely handle concurrent modification), and it is unable to optimize cases involving identical snapshots or partial snapshots with distinct identities. In practice, these are not a major obstacle to using Wallaby, but it would be nice to eliminate these problems and, more generally, to support branching and tree-structured state histories.

A quick overview of the git model

The model of versioned data that git presents seems like a pretty sensible starting point: put simply, this model is a content-addressable memory, where objects are stored by hashes of their contents. Objects can be untyped data blobs (think of the contents of files); trees, which contain mappings from names to trees or blobs; or commits, which bundle metadata including a timestamp, an author, and the identity of the parent commit with a tree. The ancestor-of relation implies a DAG of commits. (git supports more features, and its data model is more complex than the brief explanation above, but this non-exhaustive summary should suffice for the purposes of this post.) Objects are write-once and can thus be shared: files and subtrees among multiple trees, commits as parents of multiple commits, etc.

git also supports naming particular commits: tags are names that are fixed to particular commits, and branches are names that can be updated to refer to a commit transitively descended from the commit that they currently point to. (Of course, it is possible in practice to “update” a branch to an unrelated commit, or to “update” a tag, by deleting and recreating it, but these operations are not directly provided by the interface.) All branches and tags, taken collectively, are known as the heads of a repository; typically (but not always), you can follow the ancestry relation back to a single primeval commit from any head.

One common pattern for multiple users collaboratively editing a git repository involves creating clones of that repository, one for each user. Each user then works in his or her clone repository, editing and making commits. When the time comes to share these changes, the user can then push changed branches and new tags from the clone to the original repository. Other users can then pull these changes from the original repository and merge these in to their private clones.

Clearly, the git model has some similarities to REST: client expectations and full object state are transferred explicitly between repositories (so that, e.g., you can’t push an updated branch that isn’t an ancestor of the remote branch you’re pushing to). However, the assumption that each client can make changes to a private, persistent working area and transmit them en masse does not generally hold in the web services domain, especially if we want clients to be essentially stateless.

RESTful versioned data

The basic idea behind this proposal is that “changing” an object simply creates a new object, as with git and functional data structures. The identities of data objects are taken from their content; tree objects contain mappings from names to particular tree or data objects. (These names provide ways to model change, since object identities are dependent on object contents.) Clients who access the state of the system are really accessing a particular named state — that is, the state of a tree with a given (mutable) tag. Clients can construct substantial changes to the state of the system in parallel to the version that other clients are accessing, and then change the whole “state of the system” atomically by changing the identity of the current-state tag. We’ll now spend some time unpacking this fairly dense overview.

The figure below presents an example git-like representation of an employee database; it will be our running example. Note that, as with a git repository, there is a commit object pointing to a tree object; trees contain references either to other tree objects or to data objects. (In the figure, commit objects are orange, tree objects are blue, and data objects are red.)

For clarity, this figure omits some data objects and does not provide explicit IDs for trees. Note that data objects are represented as strings containing structured data. (It would also be possible to represent data as trees mapping from attribute names to structured-data representations of values.) Note also that inter-object references, as in the object representing the Engineering department, are handled by name (i.e. the path through the tree to the referenced object) and not by hashes or other unique identifiers; this makes it possible to change a single object without affecting other objects.

We’ll now cover what happens when we access and modify data. We’ll present one way to access data and two ways to update it, using two different API approaches: a high-level approach and a low-level approach.

Accessing data

We will use HTTP GET requests to access objects or collections of objects. Consider the following examples:

GET /tag/master/employees/wilma: Redirects to a URL explicitly identifying the current state of the system on tag master, something like /commits/89d71040/employees/wilma. This URL points to a representation of the state of /employees/wilma as of commit 89d71040; that is, the object with identity 2c8ec116.
GET /objects/2c8ec116: Returns the representation of the object with identity 2c8ec116.
GET /commits/89d71040/employees/: Returns the name and object ID of every employee as of commit 89d71040, like {“wilma”:2c8ec116, “betty”:3fd513f6, …}
GET /commits/89d71040/employees/?displayname=Flintstone: Returns the name and object ID of every employee as of commit 89d71040 whose displayname attribute contains the string “Flintstone.”

Updating data: the high-level approach

The high-level approach corresponds fairly closely to the managed-object model: one object can be changed at a time, and the changes to that object are immediately committed to service state. However, because we don’t want to force users to update the externally-visible service state (that is, the state of the master tag) with each object update, we will not change the referent of the master tag. Instead, updating an object will create a “detached commit,” as in the following example, in which we want to move Wilma to the corner office next door:

Here’s how we’d update the system to reflect this new state for the wilma object:

A client issues a HTTP PUT request to /tag/master/employees/wilma; at the point of the request, the master tag points to commit 89d71040 and /employees/wilma points to object 2c8ec116. Included in the PUT request body is a representation of the updated state for wilma.
The service creates a new object with the supplied updated state (a30db7ac) for wilma.
The service then creates a new tree object for /employees with the wilma reference updated to point to a30db7ac.
The service then iteratively updates every tree containing an updated tree until it gets to the root.
The service creates a new commit object referencing the newly-created root tree with 89d71040 as a parent commit; say that this new commit has the ID 597028a0.
The PUT request returns, redirecting the client to /commits/597028a0/employees/wilma. Until this commit object becomes transitively reachable from a named tag, it will have a Content-Expires header indicating a point after which it may be garbage collected. The client may inspect the commit object or use it as a point of reference for further updates.

After making several changes, the client may wish to update the master tag, thus affecting the default externally-visible state of the system. In this approach, as in the low-level approach below, updating tags is done with a PUT request. To eliminate the possibility of inconsistent updates across clients, the PUT request body should include both the expected old hash of the tag and the new tag hash; this allows rejecting updates that would shadow other updates. Once a commit is tagged, all of its ancestors are marked as reachable so that they will not expire.

Updating data: the low-level approach

This model makes explicit what happens when changing the state of the system. Say we want to move Wilma to the corner office next door, as above. First, we’ll create a new object corresponding to Wilma’s new state, using HTTP POST to store the new payload, with a request like POST /objects and a payload of the representation of the new object. After this request succeeds, we have the following state:

Note that the /employees tree still points to the old wilma object. The client is responsible for creating a new tree object that points to the new wilma object, which it does by HTTP POST request to the /trees resource. The client must then create a new tree object for the tree containing the modified employees tree; this is the root tree in this case, but in general the process recursively creates parent trees until it creates a new root. Given the identity of the new root tree object, the client can then create a new (detached) commit object, proceeding as in the high-level model.

This low-level approach trades flexibility (specifically, the ability to make commits with multiple changes) for complexity, either in the client or (more likely) in a client library. For applications that must support atomic operations on multiple objects, it might be more suitable than the high-level model; as stated above, though, the high-level model is well-suited for migrating applications from a managed-objects model.

Improvements

There are a lot of features that git has that we’ve omitted or hinted at in the above discussion. Here, we’ll briefly examine a few that it makes sense to support:

Differencing and patching. It should be possible to construct the difference of two trees. (This may be impractical in general, but it should be sensible for trees of data objects that can be distinguished and differenced.) Furthermore, it should be possible to construct a patch object that encapsulates the difference between two trees (i.e. how to transform tree A into tree B) and can be applied to other trees to incorporate these changes.
Merging. Given differencing and patching, it makes sense to have first class support for constructing new commits that are the merge of two or more other commits.
Hooks. It should be possible to specify tests which must succeed before, e.g., a commit succeeds, or before a commit may receive a particular tag. This allows the system to enforce validity properties.
Garbage collection. The system should be space-efficient, automatically eliminating unreachable commits that have expired.

Going forward from here

It seems like this pattern makes sense for my application, but it also seems sufficiently general to be more broadly useful. I’d be inclined to implement this pattern as a framework that handles most of the “git plumbing”-type operations for developer-provided model or resource objects or structures. In fact, it might also make sense to support automatic generation of client libraries for certain classes of application! Dealing with versioned data is a tedious hassle. Anything we can do to make it more straightforward is a huge win.

I welcome constructive feedback on this proposal.

UPDATE: clarified a couple of confusing sentences; thanks to Erik Erlandson for bringing them to my attention.