Re: Mining the data in the correspondence archives

From: Edmund R. Kennedy <ekennedyx_at_yahoo_dot_com>
Date: Tue Nov 30 2004 - 15:42:03 CST

Um. David, I mean preparing a separate index of the documents that people could consult to see what's there. I didn't mean 'indexing' in the search-engine sense, although that's a reasonable interpretation. That sort of keyword index is effectively invisible to the final user and isn't very useful when you don't know what keyword to use. When I turn to a paper index and don't know the search term, I can skim through the index and usually find what I'm looking for quickly.
I'm thinking of something like an index of threads and an index of authors, re-sortable by date, size, etc. Right now, the correspondence archives are effectively a knowledge swamp. The ultimate goal is to try to drain the swamp and systematically extract the information so that it ends up in the Wiki or even the FAQ. No, I'm not hip deep in alligators yet.
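
To make the idea concrete, here is a rough sketch in Python of the kind of browsable index I mean. It assumes the archive has already been parsed into simple records; the Message class, its field names, and the helper functions are all hypothetical, not anything hypermail provides.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Message:
    # Hypothetical record; these field names are assumptions for illustration.
    subject: str
    author: str
    date: datetime
    size: int  # message size in bytes


def thread_key(subject):
    """Strip leading 'Re:' prefixes so replies group with the original post."""
    s = subject.strip()
    while s.lower().startswith("re:"):
        s = s[3:].strip()
    return s


def build_index(messages, key):
    """Group messages under key(message); each group is sorted by date."""
    index = defaultdict(list)
    for m in messages:
        index[key(m)].append(m)
    for group in index.values():
        group.sort(key=lambda m: m.date)
    return dict(index)


def index_lines(index, by="name"):
    """Return one skimmable line per entry, ordered by name, date, or size."""
    orderings = {
        "name": lambda item: item[0].lower(),
        "date": lambda item: item[1][0].date,                 # earliest message
        "size": lambda item: -sum(m.size for m in item[1]),   # largest first
    }
    lines = []
    for name, group in sorted(index.items(), key=orderings[by]):
        total = sum(m.size for m in group)
        lines.append(
            f"{group[0].date:%Y-%m-%d}  {len(group):3d} msgs  {total:7d} bytes  {name}"
        )
    return lines
```

A thread index would be build_index(messages, lambda m: thread_key(m.subject)) and an author index would be build_index(messages, lambda m: m.author); either can then be listed with index_lines(index, by="date") or by="size". The point is that the reader sees every entry at once and can skim, rather than having to guess a search term.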

David Mertz wrote:
On Nov 30, 2004, at 3:35 PM, laird popkin wrote:
> The indexing part is well solved by a number of open source text
> indexing engines, such as Apache's Lucene, or Zilverline.
> KnowledgeTree also looks very interesting -- it's a full fledged
> document management system that includes text indexing. So that might
> be overkill.

Well, yeah. But it's even easier to solve using Google.

That's what the archive site already does; Google happily spiders our
email archive quite regularly. The search box there is just a
Google search with a "site:..." restriction (and I think a little
kludge where I add the term 'hypermail', the name of the archive-generating
program, which puts a little blurb on archived pages; that's just to
exclude other documents I may host at the same domain).
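
A minimal sketch of how such a search box could build its query, in Python. The function name is made up, example.org is a placeholder for the archive's domain, and the exact query the real search box sends is an assumption; only the "site:" restriction and the extra 'hypermail' term come from the description above.

```python
from urllib.parse import urlencode


def archive_search_url(terms, site, marker="hypermail"):
    """Build a Google search URL limited to one site.

    The extra marker word is assumed to appear on every archived page,
    which filters out other documents hosted on the same domain.
    """
    query = f"{terms} {marker} site:{site}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Placeholder domain; substitute the real archive host.
print(archive_search_url("paper ballot", "example.org"))
```

This finds pages that match the typed terms, but it still depends on the user knowing what to type.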

OVC discuss mailing lists
Send requests to subscribe or unsubscribe to

10777 Bendigo Cove
San Diego, CA 92126-2510
"We must all cultivate our gardens."  Candide-Voltaire

= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
Received on Tue Nov 30 23:17:43 2004

This archive was generated by hypermail 2.1.8 : Tue Nov 30 2004 - 23:17:44 CST