Yahoo’s Term Extraction + Google Search API

by Subbu Allamaraju on March 26, 2006

I just finished a major enhancement to my blog. The idea is to show links
from the web related to the current entry. To see how this works, just
click on the "Similar Entries from Google" link above. This feature uses
Yahoo’s Term Extraction service to extract keywords from my Movable Type
blog entries, and feeds these keywords to Google’s Search API to fetch
related entries. The feature is still experimental, and I need to make a
few more refinements.

The first step is to extract some keywords for each Movable Type blog
entry. I wanted this step to be automated. Yahoo has a term extraction API
available as a REST service now, and Nick Gerakines just wrote an
MT-KeywordExtractor plugin, which automatically generates keywords as you
save an entry. I had to make a few tweaks to this plugin for two reasons.
First, the plugin uses HTTP GET to fetch keywords for the given content,
with the blog entry content itself encoded in the request URL. This does
not work for large posts due to URL length restrictions, so I modified the
plugin to use HTTP POST to submit the entry content. The second issue is
the keywords returned from the service. In most cases I tested, the service
returned too many keywords with no apparent ranking. Since I was going to
feed these keywords into Google’s search API, I just picked the first few
keywords returned. The results are mixed: the search results did not always
make sense.
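
To illustrate the two tweaks, here is a minimal sketch in Python (the actual plugin is a Movable Type plugin, so this is not its code). The endpoint URL, the `appid` and `context` parameter names, and the `Result` element name reflect my recollection of the service and should be treated as assumptions; `build_term_request` and `first_keywords` are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import Request
import xml.etree.ElementTree as ET

# Assumed endpoint; check the service documentation before relying on it.
TERM_EXTRACTION_URL = (
    "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction"
)


def build_term_request(entry_text, app_id):
    """Build a POST request that carries the entry text in the body.

    Putting the text in the body, rather than in the query string,
    avoids URL length limits for long posts.
    """
    # Parameter names "appid" and "context" are assumptions.
    body = urlencode({"appid": app_id, "context": entry_text}).encode("utf-8")
    return Request(
        TERM_EXTRACTION_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def first_keywords(response_xml, limit=5):
    """Return the first `limit` terms from an XML response.

    Assumes each term arrives in a <Result> element; any XML
    namespace on the element is ignored.
    """
    root = ET.fromstring(response_xml)
    terms = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Result" and element.text and element.text.strip():
            terms.append(element.text.strip())
    return terms[:limit]
```

Sending the request would be a matter of passing the result of `build_term_request` to `urllib.request.urlopen` and handing the response body to `first_keywords`. Since the service applies no apparent ranking, taking the first few terms is just a convenient cutoff, not a relevance filter.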

The second step is to feed these keywords to the Google search API. This
turned out to be a trivial step.
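
As a rough sketch of what "feeding keywords" can look like, the snippet below turns a keyword list into a single query string. How the keywords were actually combined is not stated here, so quoting multi-word terms and joining with spaces is an assumption, and `build_query` is a hypothetical helper. The call to the search API itself is left out.

```python
def build_query(keywords, limit=None):
    """Join keywords into one search query string.

    Multi-word terms are wrapped in double quotes so they are
    searched as phrases. `limit=None` uses every keyword.
    """
    chosen = keywords if limit is None else keywords[:limit]
    parts = []
    for term in chosen:
        term = term.strip()
        if not term:
            continue  # skip empty terms
        parts.append(f'"{term}"' if " " in term else term)
    return " ".join(parts)
```

For example, `build_query(["movable type", "google", "plugin"], limit=2)` yields `"movable type" google`.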

Update (04/01/2006): This feature is no longer experimental. I updated the scripts to feed as many keywords as Yahoo returned to Google.

Update (06/16/2006): After using this for almost three months, I decided to disable the MT-KeywordExtractor plugin in my blog. The terms extracted by Yahoo’s term extraction service are too vague and broad.
