How to provide ’search this site’ functionality?

July 17th, 2006 sebastian

We wanted to add a ’search this site’ function to a client’s website but did not have the time to study the 200+ existing ways of doing this. The “Microsoft Indexing Service” (or “Index Server”, IS) seemed an obvious candidate: it fits well with the software running the existing site (IIS), and it can be extended to search within MS Office and PDF documents.

But there is a problem with using IS for this: IS can only index files on a local or remote file system; it does not crawl a website. In our case that is not good enough, because the content lives in a database and is only reachable by following links like http://mysite.com?page=42. Moreover, we wanted to make sure that exactly the content exposed through HTTP is indexed, no more and no less.

The solution we came up with works like this:

  1. Use a standard webcrawler to download a copy of the site through HTTP and store it on the local filesystem of the server.
  2. Use Indexing Service to index the local copy of the site.
  3. Use a small hashtable for mapping the filenames returned by a query back into URLs.
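
As a rough illustration of steps 1 and 3, here is a minimal Python sketch, not the code of our prototype. Everything in it is an assumption made for the example: the function names (`crawl`, `local_name`, `http_fetch`), the choice of a SHA-1 hash of the URL as the local filename, the naive regular expression for finding links, and the JSON file holding the map. A real setup would use an off-the-shelf crawler for step 1.

```python
import hashlib
import json
import re
from collections import deque
from pathlib import Path
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

# Naive link extraction; good enough for a sketch, not for messy real-world HTML.
LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def http_fetch(url):
    """Download one page over HTTP and return its text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def local_name(url):
    """Derive a filesystem-safe filename from a URL (hypothetical naming scheme)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"


def crawl(start_url, out_dir, map_path, fetch=http_fetch, max_pages=1000):
    """Breadth-first crawl of one host; returns the filename -> URL hashtable."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    host = urlparse(start_url).netloc
    filename_to_url = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(filename_to_url) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # unreachable page: skip it, keep crawling
        name = local_name(url)
        (out / name).write_text(html, encoding="utf-8")
        filename_to_url[name] = url
        for href in LINK_RE.findall(html):
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    # The map is stored outside out_dir so the indexer does not pick it up.
    Path(map_path).write_text(json.dumps(filename_to_url, indent=2), encoding="utf-8")
    return filename_to_url
```

Hashing the URL sidesteps the question of how to turn something like `?page=42` into a legal filename; the price is that the filenames mean nothing without the map, which is exactly what step 3 is for.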

This cleanly separates the webcrawl from the indexing, and the search is entirely ignorant of the (possibly heterogeneous and complicated) software architecture of the site.
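
On the query side, the only site-specific step is translating the file paths that the index returns back into URLs. A minimal Python sketch of that lookup follows. It assumes the hashtable is keyed by bare filename and that the hits arrive as Windows-style paths; the function name `hits_to_urls` and the sample paths are made up for the example.

```python
from pathlib import PureWindowsPath


def hits_to_urls(hit_paths, filename_to_url):
    """Map file paths returned by the index back to the URLs they were crawled from."""
    urls = []
    for path in hit_paths:
        name = PureWindowsPath(path).name  # strip the directory, keep the filename
        url = filename_to_url.get(name)
        if url is not None and url not in urls:  # drop unknown files and duplicates
            urls.append(url)
    return urls


# Hypothetical data: one crawled page and two hits from the index.
mapping = {"a1b2.html": "http://mysite.com?page=42"}
hits = [r"C:\sitecopy\a1b2.html", r"C:\sitecopy\stray.tmp"]
print(hits_to_urls(hits, mapping))
```

Files that are not in the map are silently dropped, which fits the goal of returning only what the site actually serves over HTTP.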

So far it is just a prototype, but it seems to work fine.

Entry Filed under: Technology, Tools

2 Comments

  • 1. Bala Kondepudi  |  April 25th, 2007 at 3:02 pm

    Apache’s Lucene is also a good tool for indexing and effective for web searching.

  • 2. mikeb  |  April 25th, 2007 at 6:58 pm

    Well that’s kind of true — at least the last time I looked in-depth at Lucene, web-crawling and even extracting text from HTML were example code rather than in the core.

    Nutch, http://lucene.apache.org/nutch/ (confusingly, uses Lucene as a library but is a subproject of it), is a web-crawler — perhaps that’s what you were referring to, Bala.
