Re: Mining the data in the correspondence archives

From: Edmund R. Kennedy <ekennedyx_at_yahoo_dot_com>
Date: Wed Dec 01 2004 - 09:56:09 CST

Hello Keith,
 
That sounds fairly promising. I just looked up grep on Wikipedia, and I wonder if fgrep, which matches fixed strings rather than regular expressions, might be the most useful grep. As long as each result carries a common key of DATE that lets a user jump to the full source email, that could be helpful. I keep going back and forth in my mind about having an interactive grep available from the home archive page.
 
I'm still kind of attached to an alphabetical or date-ordered list of threads and authors as a supplemental tool. Yes, I know hypermail allows you to do that (on a month-by-month basis), but I'm looking for different ways to parse the archives for meaning.
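
A minimal sketch of such a thread and author listing, assuming the messages have already been parsed into records with a subject, an author, and a date. The record layout, the function names, and the sample data are all made up for illustration; nothing here reflects how hypermail stores or sorts messages.

```python
from collections import defaultdict
from datetime import date


def normalize_subject(subject):
    """Strip leading 'Re:' prefixes so replies group with the original thread."""
    s = subject.strip()
    while s.lower().startswith("re:"):
        s = s[3:].strip()
    return s


def build_index(messages, key):
    """Group messages under key(message); each group is sorted by date."""
    index = defaultdict(list)
    for msg in messages:
        index[key(msg)].append(msg)
    for group in index.values():
        group.sort(key=lambda m: m["date"])
    return dict(index)


def alphabetical(index):
    """Index entries in case-insensitive alphabetical order."""
    return sorted(index, key=str.lower)


def by_first_date(index):
    """Index entries ordered by the date of their earliest message."""
    return sorted(index, key=lambda k: index[k][0]["date"])


# Hypothetical sample records, not taken from the real archive.
messages = [
    {"subject": "Re: Mining the data", "author": "B. Author", "date": date(2004, 12, 1)},
    {"subject": "Mining the data", "author": "A. Author", "date": date(2004, 11, 30)},
    {"subject": "Ballot printing", "author": "B. Author", "date": date(2004, 12, 2)},
]

threads = build_index(messages, lambda m: normalize_subject(m["subject"]))
authors = build_index(messages, lambda m: m["author"])

print(alphabetical(threads))
print(by_first_date(threads))
```

The same build_index call serves both listings; only the grouping key changes, so a sort by size or any other field would be one more small ordering function.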
 
David Mertz, what do you think?

Keith Copenhagen <K@copetech.com> wrote:
Perhaps we could generate a (limited) list of key terms:

For example:
"Open Source"
Security
"Hardware Platform"
"Operating System"
"Readable Ballot"
Canvassing
"Voter Rolls"
Audit

The resulting list of emails could turn into wiki pages (maybe with a grep
+/-1 line kind of intro) and over time we could cull them by hand back to
the ones that capture the considerations and the consensus.
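
A rough sketch of the key-term search described above: a fixed-string, case-insensitive match (in the spirit of fgrep) that returns each hit with one line of context on either side, keyed by the message date so a reader can jump back to the full email. The function names, the date-keyed dictionary, and the sample messages are assumptions made for illustration only.

```python
KEY_TERMS = [
    "Open Source", "Security", "Hardware Platform", "Operating System",
    "Readable Ballot", "Canvassing", "Voter Rolls", "Audit",
]


def find_term(messages, term, context=1):
    """Return (date_key, snippet) pairs for each line containing term.

    messages maps a date key to the message body. Matching is a plain
    case-insensitive substring test, so "Audit" also matches "auditing".
    """
    needle = term.lower()
    hits = []
    for date_key, body in messages.items():
        lines = body.splitlines()
        for i, line in enumerate(lines):
            if needle in line.lower():
                start = max(0, i - context)
                snippet = "\n".join(lines[start:i + context + 1])
                hits.append((date_key, snippet))
    return hits


def build_term_index(messages, terms=KEY_TERMS):
    """Map every key term to its list of hits (possibly empty)."""
    return {term: find_term(messages, term) for term in terms}


# Hypothetical sample messages, keyed by a date string.
sample = {
    "2004-11-30 13:42": "Hello:\nWe need an audit trail.\nThanks",
    "2004-12-01 09:56": "The operating system matters.\nSo does security.",
}

term_index = build_term_index(sample)
for date_key, snippet in term_index["Audit"]:
    print(date_key)
    print(snippet)
```

Each term's hit list could seed one wiki page, with the snippets serving as the short intro and the date key as the link back to the source message.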

-Keith

On Tue, 30 Nov 2004 13:42:03 -0800 (PST), Edmund R. Kennedy
wrote:

> Hello:
>
> Um. David, I mean preparing a separate index of the documents that
> people could consult to see what's there. I didn't mean 'indexing'
> although that's a reasonable interpretation. That sort of key index is
> effectively invisible to the final user and isn't very useful when
> you don't know what key word to use. When I turn to a paper index and
> don't know the search term I can skim quickly through the index and
> usually find what I'm looking for quickly.
>
> I'm kind of thinking something like an index of threads, an index of
> authors, resorting those by date, or size, etc. Right now, the
> correspondence archives are effectively a knowledge swamp. The ultimate
> goal is to try to drain the swamp and systematically extract the
> information to end up in the Wiki or even the FAQ. No, I'm not hip deep
> in alligators yet.
>
> David Mertz wrote:
> On Nov 30, 2004, at 3:35 PM, laird popkin wrote:
>> The indexing part is well solved by a number of open source text
>> indexing engines, such as Apache's Lucene, or Zilverline.
>> KnowledgeTree also looks very interesting -- it's a full fledged
>> document management system that includes text indexing. So that might
>> be overkill.
>
> Well, yeah. But it's even easier to solve using Google.
>
> That's what the archive site already does; Google happily spiders our
> email archive with good regularity. The search box there is just a
> Google search with a "site:..." restriction (and I think a little
> kludge where I add the term 'hypermail' which is the archive generating
> program that puts a little blurb on archived pages; just to exclude
> other documents I may host at the same domain).
>
> _______________________________________________
> OVC discuss mailing lists
> Send requests to subscribe or unsubscribe to
> arthur@openvotingconsortium.org
>
>

-- 
Keith Copenhagen
-- 
10777 Bendigo Cove
San Diego, CA 92126-2510
"We must all cultivate our gardens."  Candide-Voltaire

==================================================================
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
==================================================================
Received on Fri Dec 31 23:17:01 2004

This archive was generated by hypermail 2.1.8 : Fri Dec 31 2004 - 23:17:22 CST