subbu.org

Distributed Computing with the Browser

Most web-based systems are built from three pieces: (a) a user-agent, (b) a web server for processing, and (c) a database for storage. Of these three, the user-agent is typically treated as a presentation node, with almost all the computing performed on the web server node. In this model, each node has some special characteristics:

  • The user-agent is a multi-threaded, dedicated node with reasonable computing power, entirely under the control of the user
  • The web server is a multi-threaded or multi-process node that is shared across several users
  • The storage node is also a multi-threaded or multi-process node, shared across several users (via the web server)

In hosted environments, nodes (b) and (c) are expensive. For instance, while typical shared hosting of an application costs under $100 per year, dedicated servers cost over $200 per month per node, which is more than $2,400 per year. Hosting an application on shared hosting systems therefore introduces some scalability problems that cannot be solved cheaply. I faced this problem recently, and I had to make some choices to scale the application without requiring additional computing power on the server side.

The Problem

Each user has several large data sets, either loaded from a GPS device or captured from other GPS-enabled software. The problem is to upload this data to a server, process it to extract some key metrics, and store the data and the metrics in a database. The data is later served to the browser for presentation purposes, such as mapping and graphing. Because the data comes from GPS-enabled devices, it tends to be fairly large, ranging from 2-3 MB to 60-100 MB, or even more when you consider historical data. The data is structured and represented as XML documents.

What makes this problem interesting is the constraint that the solution needs to work with a shared web/application hosting service, where the web and database servers run on shared hardware with limited processing power and memory.

An obvious solution to this problem is as follows.

Approach 1: Let the web server be the sole computing node

  • Step 1: Upload the data to the server
  • Step 2: Parse the data and extract all necessary metrics
  • Step 3: Store the extracted metrics and the raw data in a database

For the problem at hand, this approach had serious performance issues. It took precious computing cycles and memory to parse the data and extract the necessary metrics from it. In fact, the Ruby on Rails environment in which I tried this approach crashed several times while testing with a large data set. Part of the problem may be attributed to parsing the XML into a DOM using the REXML parser. Using the pull-parser part of the REXML API might have reduced the processing overhead, but I have not tried that.
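To illustrate the difference between building a full DOM and streaming through the document, here is a minimal sketch in Python rather than Ruby. It is not the untried REXML pull-parser variant, only the same idea expressed with the standard library's `iterparse`. The GPX-style `trkpt` element with `lat` and `lon` attributes, and the two metrics computed, are assumptions made for the example; the actual XML format and metrics are not described here.

```python
import io
import math
import xml.etree.ElementTree as ET


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def track_metrics(source):
    """Stream over track points and accumulate metrics one element at a time."""
    count = 0
    distance = 0.0
    prev = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Strip any XML namespace so the sketch works with or without one.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "trkpt":  # assumed element name (GPX-style)
            point = (float(elem.get("lat")), float(elem.get("lon")))
            if prev is not None:
                distance += haversine_m(*prev, *point)
            prev = point
            count += 1
            # Drop the element's attributes and children once it has been used.
            # The emptied element stays attached to its parent.
            elem.clear()
    return {"points": count, "distance_m": distance}


# Usage with a tiny in-memory document; a real file object works the same way.
sample = b'<gpx><trk><trkseg><trkpt lat="0.0" lon="0.0"/><trkpt lat="0.0" lon="1.0"/></trkseg></trk></gpx>'
metrics = track_metrics(io.BytesIO(sample))
```

Because each point is discarded after it is processed, memory use grows far more slowly with file size than it does when the whole document is held as a tree.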

Given that this obvious approach did not scale to my needs, I had to consider alternatives.

Approach 2: Process data in the background on the web server

The second approach I considered was to offload the data processing to a background job. The purpose of this job was to periodically check for any unprocessed data in the database, process the data, perform any necessary cleanup, and wait for the next turn.

As simple as this sounds, this approach had some costs associated with it.

First, the background process needs to be aware of the current load on the server so that it does not wake up at the wrong time and slow down the whole server. It should extract the metrics only when the load on the server is relatively low. Second, the background job needs to be monitored for processing errors. Third, I needed to build a mechanism to report status and errors to the user and/or the administrator. These requirements introduce additional complexity without necessarily reducing the processing and memory requirements.
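The three requirements can be made concrete with a small sketch. It is illustrative only: the function names, the load threshold, and the `process` and `report` callbacks are hypothetical, and `os.getloadavg()` is available on Unix-like systems only.

```python
import os


def should_process(load_avg_1m, cpu_count, max_load_per_cpu=0.5):
    """Return True when the 1-minute load per CPU is below the assumed threshold."""
    return load_avg_1m / max(cpu_count, 1) < max_load_per_cpu


def run_once(pending, process, report, load_avg_1m=None):
    """Process queued items if the server is idle enough; report every failure.

    `pending` is an iterable of unprocessed items, `process` does the work,
    and `report` receives (item, exception) for each failure. All three are
    placeholders for application code.
    """
    if load_avg_1m is None:
        load_avg_1m = os.getloadavg()[0]  # Unix only
    if not should_process(load_avg_1m, os.cpu_count() or 1):
        return 0  # server is busy: skip this turn and wait for the next one
    done = 0
    for item in pending:
        try:
            process(item)
            done += 1
        except Exception as exc:
            # An error in one item must not stop the rest of the batch.
            report(item, exc)
    return done
```

Even this small version needs a load policy, an error channel, and something to schedule `run_once`, which is the extra complexity described above. It also leaves the parsing cost itself unchanged.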

This led me to the third approach.

Approach 3: Use the browser as a computing node

Why not offload some of this processing load to the browser itself? For the problem at hand, the browser seemed very attractive. As I mentioned above, the browser is a dedicated multi-threaded node, and each user brings in a new node along with him or her. Considering this, I revised the solution to the following:

  • Step 1: Process the data within the browser environment using JavaScript, Applets and the like to extract all metrics
  • Step 2: If necessary, break the data into smaller chunks for more efficient uploading. Given that upload speeds are typically lower than download speeds for most home users, chunking gives the user better control and reduces the cost of a broken upload. With properly chunked data, the user can also abort the process at any time and resume it later.
  • Step 3: Compress each chunk before uploading so that the data can be served back to the browser in the compressed format for presentation purposes. Given that most modern browsers support HTTP compression, there is no need to decompress the data on the server side. The browser will do the decompression based on the Content-Encoding header.
  • Step 4: Upload each chunk of data along with the extracted metadata to the server, while providing a meaningful status indicator in the UI allowing the user to monitor the activity as well as abort it whenever required.
  • Step 5: Store the data and metrics in the database
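Steps 2 and 3 can be sketched as follows. In the browser this logic would be written in JavaScript or an applet; Python is used here only to show the idea. The function names and the chunk record layout are hypothetical. For simplicity the sketch splits on byte boundaries, whereas a real implementation would more likely split on XML element boundaries so that each chunk can be served back as a usable document.

```python
import gzip


def make_chunks(data: bytes, chunk_size: int):
    """Split raw bytes into gzip-compressed chunks, each tagged with its index."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        raw = data[start:start + chunk_size]
        chunks.append({
            "index": index,               # position of the chunk in the upload
            "raw_size": len(raw),         # size before compression
            "body": gzip.compress(raw),   # bytes actually sent to the server
        })
    return chunks


def remaining(chunks, acknowledged):
    """Chunks still to be sent, given the set of indexes the server has confirmed."""
    return [c for c in chunks if c["index"] not in acknowledged]


# Usage: an upload aborted after two confirmed chunks resumes with the rest.
data = b"<trkpt/>" * 1000
chunks = make_chunks(data, 3000)
todo = remaining(chunks, {0, 1})
```

Tracking confirmed indexes is what makes abort and resume cheap: only the unconfirmed chunks are sent again.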

In this approach, the XML processing is done within the browser. The only processing done on the web server is to read the HTTP request, and store the data in the database.
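A minimal sketch of how little the server has to do in this design is shown below. It uses an in-memory SQLite database in place of the real one, and the table layout, function names, and `application/xml` content type are assumptions for illustration. The point is that the compressed body is stored and returned as-is, with the `Content-Encoding` header telling the browser to decompress it.

```python
import gzip
import json
import sqlite3


def open_store():
    """Create an in-memory database with a hypothetical chunk table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE chunks ("
        "upload_id TEXT, idx INTEGER, metrics TEXT, body BLOB, "
        "PRIMARY KEY (upload_id, idx))"
    )
    return db


def store_chunk(db, upload_id, idx, metrics, body):
    """Persist one uploaded chunk without parsing or decompressing it."""
    # INSERT OR REPLACE makes a re-sent chunk overwrite the earlier copy.
    db.execute(
        "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?)",
        (upload_id, idx, json.dumps(metrics), body),
    )
    db.commit()


def serve_chunk(db, upload_id, idx):
    """Return (headers, body) for sending the stored bytes back unchanged."""
    row = db.execute(
        "SELECT body FROM chunks WHERE upload_id = ? AND idx = ?",
        (upload_id, idx),
    ).fetchone()
    if row is None:
        return None
    headers = {"Content-Encoding": "gzip", "Content-Type": "application/xml"}
    return headers, row[0]


# Usage: the server never looks inside the compressed body.
db = open_store()
body = gzip.compress(b"<gpx/>")
store_chunk(db, "upload-1", 0, {"points": 0}, body)
headers, stored = serve_chunk(db, "upload-1", 0)
```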

The key benefits I realized with this approach include:

  • Lower CPU and memory consumption on the web server node
  • Better indication of the progress of the processing activity as well as more control to the user
  • Reduced complexity of the overall system. Instead of building complex monitoring tools as was required for the second approach, the user would monitor and control the process.

While this approach may not sound innovative, for the problem at hand, recruiting the browser for computation reduced the overall cost of the system, both in terms of the required processing power and memory and in terms of the cost of hosting the application. The fun part of this approach is designing the system within the constraints of a shared hosting service.

Written on January 14th, 2008 at 3:18 pm

1 Comment »

Comment by David Pratten
2008-01-16 03:10:41

Hi Subbu I have posted a response to your post at http://www.davidpratten.com/2008/01/16/distributed-computing-with-the-browser/

 