Namazu rocks

Namazu rocks. It’s a fast, feature-rich search engine that specializes in indexing text on the file system. Our main use for it is searching the various mailing lists the company maintains for projects and general announcements. We had used ht://Dig over the past year, but it had three problems. It was slow, taking some four hours to do an incremental index. It would fail occasionally for no obvious reason. And it would seem to choke the server, pushing load up to the point where the machine became unresponsive. Its index files were also huge compared to the size of the data being indexed.

Namazu is almost the polar opposite. It is fast: a full index creation took about 90 minutes, and one-day incrementals are on the order of five minutes. It is also relatively light: the indexing was run at nice 19, a priority level at which htdig probably wouldn’t have finished.
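For the curious, here is a rough sketch of what a low-priority indexing run like the one described above might look like when assembled from a script. The paths and the function name are made up for illustration, and the mknmz options shown (--mhonarc and -O) are the ones I believe the tool provides, so check them against mknmz --help on your own install before relying on this.

```python
import shlex


def niced_index_command(archive_dir, index_dir, niceness=19):
    """Build an argument list that runs mknmz at a low scheduling priority.

    archive_dir and index_dir are hypothetical paths; the mknmz flags are
    assumed from its documented options, not taken from our actual setup.
    """
    if not 0 <= niceness <= 19:
        raise ValueError("niceness must be between 0 and 19")
    return [
        "nice", "-n", str(niceness),   # 19 is the lowest priority
        "mknmz",
        "--mhonarc",                   # treat targets as MHonArc archives
        "-O", index_dir,               # where the NMZ.* index files go
        archive_dir,                   # directory tree to index
    ]


if __name__ == "__main__":
    cmd = niced_index_command("/var/www/lists/archive", "/var/lib/namazu/lists")
    # Print the command rather than running it, so this is safe to try.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if an archive path ever contains spaces.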

Search results are returned almost instantly. Searches can also be regexes, and can span multiple index databases (something we may set up in the future). Namazu also knows about MHonArc archive formats, so it correctly indexes by subject, author, etc. I remember some heartache with htdig, where we had to specify exclusion areas so that the headers and footers wouldn’t get indexed; they were a source of spurious results, as the “next message” subject lines in the footers would mark an unrelated thread as a hit. A MHonArc filter is therefore very welcome. There are also filters for MS Word docs, PowerPoint, PDFs, and a few others. I may enable these in the future, giving each binary document type its own index database; the user could then pick which indexes to search against.

Cron jobs have been set up for incremental updates and for garbage collection of the index databases; the cron entries could be tidied up somewhat, but they’ll work as is. The only thing left is perhaps some customization of namazu.cgi (which was a drop-in install, unlike htdig’s), so that the help document is tailored to the corpus we actually have rather than being a generic one.
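As an illustration of the kind of job cron could call each night, here is a minimal sketch, not our actual script. It assumes that re-running mknmz against an existing index directory updates it incrementally and that gcnmz takes the index directory as its argument; both are my understanding of the tools rather than something verified here, and the paths and function names are hypothetical.

```python
import subprocess


def nightly_steps(archive_dir, index_dir):
    """Return the commands for one nightly pass, in the order they should run."""
    return [
        # Re-running mknmz on an existing index is assumed to pick up only
        # new or changed messages (an incremental update).
        ["mknmz", "--mhonarc", "-O", index_dir, archive_dir],
        # gcnmz is assumed to compact the index after updates.
        ["gcnmz", index_dir],
    ]


def run_nightly(archive_dir, index_dir, runner=subprocess.run):
    """Run each step in turn; check=True stops at the first failing command."""
    for cmd in nightly_steps(archive_dir, index_dir):
        runner(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths; list the steps instead of executing them.
    for step in nightly_steps("/var/www/lists/archive", "/var/lib/namazu/lists"):
        print(" ".join(step))
```

Passing the runner in as a parameter makes it possible to dry-run the sequence without touching the real index. Stopping on the first failure means garbage collection never runs against an index that a failed update may have left half-written.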
