Synchronizing REST Resources With Asynchronous Notification: A Practical Proposal

From CommerceNet Wiki

Synchronization with Weak Consistency

When you render a web page that contains pieces snarfed from other web pages, such as a transcluded RSS box, you would often prefer to start spitting out your web page without waiting for the other web pages to load first. So you maintain a local cache you can consult quickly, which may be out of date, but you try not to let it get too out of date.

You can do this by polling the current state of the other web pages; in fact, when you begin to depend on a new web page, polling for its current state is unavoidable. But it would be nice to push change notifications instead of polling for them. If you run both web sites, you could install software on both of them to do the change-notification pushing; but how does that software work?

We will term the association between a web page and a recipient of change notifications a "subscription".

Reliability and Protocol

The first desideratum for distributed systems is that communications between failure domains be either side-effect-free or idempotent. A notification that a particular web page is in a particular state is idempotent, so that would be a reasonable thing to send.
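As a minimal sketch of why a whole-state notification is idempotent, consider a receiver that simply overwrites its cached copy. The StateNotice type and apply_notice function are hypothetical names invented for this illustration, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateNotice:
    """Hypothetical message: 'resource `url` is currently in this state'."""
    url: str
    etag: str
    body: bytes


def apply_notice(cache: dict, notice: StateNotice) -> None:
    # Overwrite whatever we had. The message carries the whole state,
    # not a delta, so replaying it cannot change the outcome.
    cache[notice.url] = (notice.etag, notice.body)


cache = {}
notice = StateNotice("http://example.org/feed", '"v2"', b"<rss/>")
apply_notice(cache, notice)
after_first = dict(cache)
apply_notice(cache, notice)  # a retransmission of the same message
after_second = dict(cache)
```

Delivering the notice once or twice leaves the cache in the same state, which is what makes retransmission safe.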

We have to consider failure scenarios beyond those that retransmission handles, however. The sender, the receiver, or both may fail for some period of time and lose their state, and the network in between may fail for some period of time as well, which the sender and receiver won't automatically know about. We assume that if the sender or receiver does fail, it will initiate some sort of recovery process when it is fixed, and that recovery process is what we consider here.

Dealing with limited network failures is simple: retransmit unacknowledged messages for a while. Requiring that unacknowledged messages be retransmitted forever would impose an unlimited resource requirement on senders of notifications. Receivers of notifications may have the same problem, but they are the ones deciding to take up their own resources, so they can delete unproductive subscriptions or add more resources.
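Retransmitting "for a while" could look roughly like the sketch below. The attempt limit and the exponential backoff are assumptions made for this example; the proposal only requires that retransmission eventually stop. The send callable is a hypothetical stand-in for whatever actually POSTs the notification and reports whether it was acknowledged.

```python
import time
from typing import Callable


def send_with_retries(send: Callable[[], bool],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() until it is acknowledged or the attempts run out.

    Returns True if some attempt was acknowledged, False if we gave up.
    """
    for attempt in range(max_attempts):
        if send():
            return True
        # Back off before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

A sender that gets False back can drop the message, or the whole subscription, and rely on the receiver's own polling to repair the gap.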

Unavoidably, senders that run out of resources will forget some subscriptions.

We could require that senders and receivers never acknowledge a message until they have ensured that it will not be lost in a failure, but this "durability" constraint increases the resource requirements of a system like this by two or three orders of magnitude while dramatically complicating the code.

From a sender's point of view, a receiver failure looks sort of like a network failure. But from the receiver's point of view, it's different because the receiver knows that it has had a failure and may therefore miss messages. So, during recovery, the receiver should re-poll each web page to which it is subscribed; otherwise it will have to wait for retransmission, and that could take quite a long time.

Dealing with long-term network failures requires either that receivers re-poll all of their idle subscriptions periodically, or that senders re-poll all of their idle subscriptions periodically, in order to discover when a previously disconnected peer has become reconnected. Either alternative costs the same quantity of network resources. If the receivers do it, it also handles recovery from sender crashes and sender resource exhaustion, and it is also more incentive-compatible: the receivers inherently commit some resources to each subscription anyway, and are in a better position to choose their tradeoff between worst-case notification latency and network bandwidth usage.
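On the receiver side, the periodic re-poll and the re-poll after a crash might be chosen as in the sketch below. The Subscription fields and the recovering flag are hypothetical; they stand in for whatever bookkeeping a real receiver keeps.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    url: str
    max_latency: float  # longest silence we tolerate, in seconds
    last_heard: float   # time of the last notification or poll


def subscriptions_to_poll(subs, now, recovering=False):
    """Return the subscriptions the receiver should re-poll right now."""
    if recovering:
        # After a receiver failure we may have missed anything, so poll all.
        return list(subs)
    # Otherwise poll only the subscriptions that have been idle too long.
    return [s for s in subs if now - s.last_heard >= s.max_latency]
```

Each receiver picks its own max_latency per subscription, which is how it trades worst-case notification latency against bandwidth.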

There is a possible race condition if the subscriber polls the current state of the resource and establishes the subscription to be notified of future states in two separate messages. The receiver would normally poll the current state of the resource with an HTTP GET, and this is likely to be handled by a different piece of software than the change-notification software on the sender. Obviously the race condition exists if the GET happens first; less obviously, if the subscription is established first, the change-notification-sending software may still have a newer version than the GET-handling software has a few milliseconds later. So the acknowledgment message for the initial establishment of a subscription should include the current state of the resource, perhaps contingent upon things like If-Modified-Since and ETag.

Probably the creation of a new subscription should be implemented as an HTTP POST to the resource being subscribed to that returns a "see other" redirect to a newly-created URL for that subscription; a GET on that subscription should return the current state of the resource subscribed to, and also indicate to the sender that the subscription is still being used. The sender then uses HTTP POST to send new states of the resource to the receiver, at a URI specified in the initial subscription request. Probably a multipart HTTP form post would be the most straightforward, including a URL, a current representation of the resource, an ETag, a modification time, perhaps an expiry time, and an optional "deleted" marker.
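The contents of one update notification could be collected as in this sketch. The field names are invented for illustration; the proposal lists only what information the form post should carry, not how the fields are named or encoded.

```python
from typing import Optional


def build_update_fields(url: str, representation: bytes, etag: str,
                        modified: str, expires: Optional[str] = None,
                        deleted: bool = False) -> dict:
    """Collect the form fields for one update notification.

    All field names here are hypothetical.
    """
    fields = {
        "url": url,
        "representation": representation,
        "etag": etag,
        "modified": modified,
    }
    # The expiry time and the "deleted" marker are optional.
    if expires is not None:
        fields["expires"] = expires
    if deleted:
        fields["deleted"] = "1"
    return fields
```

An HTTP client library would then encode these fields as a multipart form and POST them to the URI given in the subscription request.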

Sets

We would like to allow for a single subscription to transmit the state of multiple resources, perhaps including some that the receiver does not know about beforehand. For example, a site-mirroring application might want to receive all new versions of files that appear anywhere on a web site, and a blog-reading application might want to receive all new versions of blog front pages mentioned in a blogroll (or, perhaps, all new versions of blog posts in a blog, or in many blogs). A calendar-synchronization application might want to receive all new calendar entries that appear in any of a number of calendar sources.

We will call all of these "sets". We could model them as resources --- perhaps MIME multipart/mixed resources containing a current representation of each of the collected resources --- and subscribe to the entire resource as usual. However, this has a couple of unfortunate effects: a change to any item in the set causes the whole to be re-sent, and there is no way to explain to the set sender that you have the current versions of some of its collected resources, but not all.

So a GET on a subscription to a set returns an HTML document listing URLs, modification times, and ETags for the items in that set, enabling the receiver to poll the current state of any item they're out-of-date on; and the update notifications on the set are exactly the same update notifications that would be sent if the receiver had subscribed directly to the resources in the set, with the small addition of the URL, modification time, and ETag of the subscription that resulted in the update notification being sent.
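Given such a listing, the receiver can work out which items to fetch. The sketch below assumes the HTML listing has already been parsed into (url, modified, etag) tuples and that the receiver keeps a dictionary of the ETags it holds; both representations are assumptions made for this example.

```python
def out_of_date(listing, local_etags):
    """Return the URLs in a set listing whose ETag differs from ours.

    listing: iterable of (url, modified, etag) tuples from the set.
    local_etags: dict mapping url -> the ETag of our cached copy.
    """
    return [url for url, _modified, etag in listing
            if local_etags.get(url) != etag]
```

Items the receiver has never seen have no local ETag, so they are reported as out of date along with the changed ones.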

This immediately suggests a special class of sets that are themselves subscribers to other resources; they "contain" whatever resources they subscribe to, and mostly they forward update messages from those resources after verifying that they represent actual changes. We call such sets "topics".
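A topic's forwarding rule might be sketched as follows. Treating "an actual change" as "the ETag differs from the last one seen" is an assumption made here, and the Topic class with its forwarded list is a hypothetical stand-in for real delivery to subscribers.

```python
class Topic:
    """Hypothetical topic that forwards only updates that change something."""

    def __init__(self):
        self.known = {}      # url -> last ETag seen for that resource
        self.forwarded = []  # stand-in for messages sent on to subscribers

    def receive(self, url, etag, body):
        """Handle an incoming update; return True if it was forwarded."""
        if self.known.get(url) == etag:
            return False  # duplicate or retransmission: nothing new
        self.known[url] = etag
        self.forwarded.append((url, etag, body))
        return True
```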

Topic Subscriptions

Topics have to know about their incoming subscriptions so they can poll them from time to time to deal with failures. To cause a topic to subscribe to some URI, someone POSTs a request to that topic including the following pieces of information:

  • the URI to subscribe to
  • maximum desired update latency, i.e. maximum polling interval
  • credentials to send to that URI

The response is a redirect to a URL representing the receiving end of the subscription, describing the success or failure of the operation. That URL can be consulted later to see the current state of the subscription, such as the last update received.
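The body of the request that asks a topic to subscribe could be encoded as in this sketch. The field names uri, max_latency, and credentials are hypothetical, and a form-encoded body is only one possible encoding.

```python
from urllib.parse import urlencode


def topic_subscribe_body(uri: str, max_latency_seconds: int,
                         credentials: str) -> str:
    """Form-encode the three pieces of information a topic needs."""
    return urlencode({
        "uri": uri,                               # what to subscribe to
        "max_latency": str(max_latency_seconds),  # maximum polling interval
        "credentials": credentials,               # sent on to that URI
    })
```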

Credentials

Receivers will want to distinguish requested update notifications from spam and spoofing. If they use SSL to establish the initial subscription, they can send a shared secret to use to authenticate update messages. The shared secret could perhaps be used as a username:password pair for Basic or Digest authentication when sending updates.
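If the shared secret is used as a username:password pair for Basic authentication, the sender's header and the receiver's check could be sketched like this. The function names are hypothetical. Basic authentication protects the secret only when the update itself travels over SSL.

```python
import base64
import hmac


def basic_auth_header(shared_secret: str) -> str:
    """Build an Authorization header value from a 'username:password' secret."""
    token = base64.b64encode(shared_secret.encode("utf-8")).decode("ascii")
    return "Basic " + token


def is_authentic(header_value: str, shared_secret: str) -> bool:
    """Receiver-side check that an update carries the agreed secret."""
    expected = basic_auth_header(shared_secret)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(header_value.encode("utf-8"),
                               expected.encode("utf-8"))
```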

In order to prevent man-in-the-middle and receiver-spoofing attacks, the subscriber should be able to specify a set of SSL certificates the sender should accept when sending notifications to them over SSL.

Receivers may also need to authenticate themselves to senders when they subscribe, and their authentication information may determine what representations, or even what resources, they are allowed to see. Consequently their authentication information should be saved in their subscription and used to determine these things.

Browsers

Web browsers cannot receive HTTP POSTs or other inbound traffic except on an established TCP connection, and they cannot maintain more than 2-4 TCP connections to a particular other host on the network in order to receive event notifications, so they need some kind of aggregator to keep them apprised of changes they care about. The "topic" mentioned in a previous section appears to be exactly the desired construct; the only limitation, as described, is that forwarding your updates through a topic obscures their origin, so you cannot tell which sets they are members of.

To solve this problem, we add a new datum to update messages: a list of sets that the resource is part of. When a topic receives an update message, it merges the list of sets into the ones it already knows about, forwards the message if this results in a change, and always includes the full list on forwarded messages.
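The merge step could be sketched as below. The merge_sets function and its known dictionary are hypothetical; the point is only that the topic accumulates set memberships per resource and can tell whether a message added any.

```python
def merge_sets(known: dict, url: str, incoming: list) -> tuple:
    """Merge the sets named in an update into what the topic knows.

    known: dict mapping resource URL -> set of set URLs seen so far.
    Returns (full_list, changed), where full_list is the complete sorted
    membership to attach to a forwarded message and changed says whether
    this update added any set we did not already know about.
    """
    current = known.setdefault(url, set())
    before = len(current)
    current.update(incoming)
    return sorted(current), len(current) != before
```

A topic would forward the message when changed is true (or when the resource's state itself changed), always attaching the full list.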

Because topics manage subscriptions on behalf of the browser, the browser does not need to be able to speak the protocol as a participant, although of course it should be possible to use the browser to inspect objects such as subscriptions and sets. This means that the protocol is free to, for example, require certain HTTP headers or use HTTP PUT.

Mediators

If you try to subscribe to web pages in today's world in this way, you will find that almost none of them support this protocol --- the result of your HTTP POST won't be in the expected format. So we suggest a mediator that polls a large list of web pages periodically and serves up cached copies of them, while providing change notification. It can work either as a classic mediator along the lines of Shodouka and Crit, or as an HTTP proxy.

(If we executed the polling part of a subscription by GETting the original resource, rather than some subsidiary resource we created by POSTing, then the protocol would be backwards-compatible with ordinary web pages, and when applied to an ordinary web server, it would reduce to HTTP GET polling. In this case, a mediator and a topic are the same thing.)

HTTP POSTs can have unpredictable side effects, so subscribers should use a special GET to ask the resource whether it supports this subscription protocol --- perhaps to consider subscribing to http://whoever/whatever, the subscriber should GET http://whoever/whatever?do_method=subscription_supported, which is guaranteed (formally at least) not to have side effects.
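Building the probe URL takes a little care when the resource URL already has a query string. A sketch, using the do_method=subscription_supported parameter suggested above (the probe_url function name is hypothetical):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def probe_url(resource_url: str) -> str:
    """Return the URL to GET to ask whether a resource supports subscriptions."""
    parts = urlsplit(resource_url)
    # Keep any existing query parameters and append the probe parameter.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("do_method", "subscription_supported"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```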

Backbiting

This protocol consists largely of hearsay: one party repeating statements it alleges another has made, in the form of asserting that particular representations are valid for some resource owned by another authority. Recipients trust these assertions at their own risk; they can mitigate this risk by verifying some of the assertions by GETting the resources, by comparing different senders' versions of the resources, by maintaining reputations for different senders, and by using shared secrets as described earlier.

Unfortunately, "Hearsay" is not available as a name for the protocol, as there is already a "multi-application mobile services delivery system" or perhaps a "voice portal platform" called NMS Hearsay, some software for American English pronunciation training for foreign language speakers called HearSay, and a general-purpose communications package for Acorn RISC OS called Hearsay, with a new version called Hearsay II. "mod_pubsub" is also taken. Perhaps "watchset"?
