HTTP Caching for Dynamic Data
Recently I came across an article at InfoQ, Using ETags to Reduce Bandwith & Workload with Spring & Hibernate. Unfortunately, the implementation suggested in that article is such that validating the ETag takes as much time as generating the resource itself, if not more. This is an example of a premature and incorrect optimization. The article barely scratches the surface of the problem of computing caching headers (both expiry-related and validation-related) for dynamic resources.
Let me start with the response headers.
Last-Modified: If the resource is static (e.g. an image on the file system), this header can be generated from the modified date/time of the file. But when the resource is dynamic (e.g. fetched by selecting from multiple tables and applying some further transformation), how do you determine the Last-Modified value of the resource? In the simplest of cases, it can be done by storing a last modified date/time on each row in the database tables, and setting Last-Modified to the maximum (i.e. the most recent) of the last modified date/times of all the rows used to generate the resource.
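As a minimal sketch of that simplest case, the snippet below derives Last-Modified with a single aggregate query, taking the most recent row timestamp, since a change to any row makes the resource newer. The `address_book` table layout, the `updated_at` column (Unix seconds), and the function name are assumptions made for illustration; they do not come from the InfoQ article.

```python
import sqlite3
from datetime import datetime, timezone
from email.utils import format_datetime

def last_modified_header(conn, user):
    """Return an HTTP-date for the newest row behind the resource, or None."""
    # Hypothetical schema: every row carries an updated_at Unix timestamp.
    row = conn.execute(
        "SELECT MAX(updated_at) FROM address_book WHERE user = ?", (user,)
    ).fetchone()
    if row[0] is None:  # no rows, so there is nothing to date
        return None
    newest = datetime.fromtimestamp(row[0], tz=timezone.utc)
    return format_datetime(newest, usegmt=True)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address_book (user TEXT, name TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO address_book VALUES (?, ?, ?)",
    [("alice", "Bob", 1_000_000_000), ("alice", "Carol", 1_200_000_000)],
)
print(last_modified_header(conn, "alice"))  # prints "Thu, 10 Jan 2008 21:20:00 GMT"
```

Note that a timestamp per row does not capture deletions; an application that deletes rows would need something extra, such as a per-user "last changed" timestamp.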
Expires: Setting this header takes some experimentation and heuristics based on the kind of data and how often it is being changed/updated/deleted.
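One possible starting heuristic, offered only as a sketch, is to borrow the rule of thumb that RFC 7234 mentions for caches: treat a response as fresh for about 10% of the time since it was last modified. The function name, the `fraction` default, and the one-hour `cap` below are assumptions to be tuned against how the data actually changes.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def heuristic_expires(last_modified, now, fraction=0.1, cap=timedelta(hours=1)):
    """Return an Expires HTTP-date based on how long the data has been stable."""
    # Data that has not changed for a long time is assumed to stay stable
    # a little longer; the cap bounds how stale a client copy can get.
    age = max(now - last_modified, timedelta(0))
    lifetime = min(age * fraction, cap)
    return format_datetime(now + lifetime, usegmt=True)

modified = datetime(2008, 1, 10, 21, 20, tzinfo=timezone.utc)
print(heuristic_expires(modified, modified + timedelta(minutes=100)))
```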
ETag: Since this is a key that identifies a specific version of the resource, this header could be generated from the data returned by the query used to fetch the data and generate the resource. For instance, if the resource is a user’s address book, this header could be generated as follows:
ETag = f(select * from address_book where user='<user>')
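One way to read f, sketched here under assumptions, is a hash over the rows the query returns. The `name` and `email` columns and the function name are hypothetical. This still runs the query, but it skips the transformation and rendering steps; whether that is cheap enough depends on the application.

```python
import hashlib
import sqlite3

def address_book_etag(conn, user):
    """Return a quoted ETag derived from the rows behind the address book."""
    # ORDER BY keeps the hash stable regardless of storage order.
    rows = conn.execute(
        "SELECT name, email FROM address_book WHERE user = ? ORDER BY name, email",
        (user,),
    ).fetchall()
    digest = hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()
    return f'"{digest[:16]}"'  # entity tags are quoted strings
```

A cheaper variant would hash a per-user version counter or timestamp that the application bumps on every write, so that validation does not need to touch the rows at all.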
What good are these headers without an efficient validation scheme built on the server side? Let me now look at the issues that arise when the server receives conditional request headers.
If-Modified-Since: When an HTTP server receives this header, the question is how to determine whether the client's copy of the resource is stale. The application should be able to answer without regenerating the resource, because such regeneration would defeat the purpose of caching. There is no general-purpose answer to this question; the solution lies in somehow relating the queries used to fetch the data to how the resource is created, modified, and deleted by the application. Not all applications have this capability built in, and the application may have to use some heuristics to determine when to expire a resource.
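The comparison itself is simple once the application can produce a last-modified time cheaply. The sketch below assumes `last_modified` is a timezone-aware datetime obtained from such a cheap lookup (for example, a single aggregate query over a timestamp column) and not by rebuilding the resource. The function name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def not_modified_since(if_modified_since, last_modified):
    """Return True when a 304 Not Modified response is appropriate."""
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # unparseable header: ignore it and serve the resource
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    # HTTP dates have one-second resolution, so drop sub-second precision.
    return last_modified.replace(microsecond=0) <= since
```

When this returns True the server can reply with 304 and an empty body; otherwise it generates the resource as usual.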
If-Match or If-None-Match: The issues are similar here. The key is to determine whether the ETag still corresponds to the current state of the resource, and the answer may be to use some heuristics here as well.
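For a GET with If-None-Match, the check can be sketched as follows, assuming the application can obtain the current ETag without regenerating the resource. The function name is hypothetical, and splitting on commas is a simplification that would mis-parse a tag containing a comma.

```python
def if_none_match_hits(header_value, current_etag):
    """Return True when the client's cached copy is current (send 304).

    Assumes the resource exists and current_etag is its quoted entity tag.
    """
    def opaque(tag):
        # If-None-Match uses weak comparison: ignore a leading W/ marker.
        tag = tag.strip()
        return tag[2:] if tag.startswith("W/") else tag

    if header_value.strip() == "*":
        return True  # "*" matches any existing representation
    candidates = {opaque(t) for t in header_value.split(",") if t.strip()}
    return opaque(current_etag) in candidates
```

If-Match works the other way around: the request proceeds only when a listed tag matches, and it uses strong comparison, so weak tags never match.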
The key point I am trying to make here is that, as applications start to expose data over HTTP to clients, it is important to design the databases and queries for caching. This makes it possible to implement caching as and when required.
Caching cannot be just an afterthought. I would use the ability to build caching as a litmus test for the quality of HTTP-based web service APIs, e.g. those trying to support REST. Applications that build HTTP/REST layers on top of existing applications may find it hard to take advantage of HTTP caching, as those layers may have neither control over nor access to the underlying data’s life cycle. In those cases, I would start by looking for refactoring opportunities.


