Re: Fwd: Spam help requested

From: Jan Karrman <jan_at_it_dot_uu_dot_se>
Date: Mon Apr 11 2005 - 11:27:51 CDT

On Mon, 11 Apr 2005, David Mertz wrote:

> On Apr 11, 2005, at 4:46 AM, Jan Karrman wrote:
> > I did just write this Perl one-liner to do this:
> > printf qw(&#%d; &#x%x;)[int(rand(2))], ord $& while $ARGV[0] =~ /./g;
>
> The problem isn't the munging algorithm, the simple thing we do for
> headers is fine: e.g. mertz_at_gnosis_dot_cx. We're not trying to make
> it a crypto puzzle, just moderately non-scannable by spam spiders.
>
> It's identifying ALL AND ONLY the email addresses that should be munged
> within message bodies that's the problem.
>
> (1) Find every genuine email address (grabbing the whole address, but
> none of the surrounding text).
> (2) Don't falsely identify anything that has an @ sign w/o being an
> email address
> (3) Don't identify anything that IS an email address, but that is
> desirable to leave in its original form (e.g. copies of press releases,
> contact info for politicians, spacing-sensitive table layouts,
> petitions, etc).
>
> (1) and (2) are tractable, but not trivial. (3) is a real bear.
>
> If anyone sends me a script to do the right thing, I'd be happy to run
> it over the whole collection of old archive HTML files (be sure your
> script doesn't get confused by HTML and only work on plain text). And
> also don't get messed up with attachments of various sorts.
>

It doesn't matter if the text containing an @ sign is an email
address or not - in an HTML document, writing '&#120;&#64;&#121;'
is *equivalent* to 'x@y'. One must of course be careful to not
mess up the HTML markup, but the @ character is not part of that,
so munging anything that matches '[a-zA-Z0-9\.]+@[a-zA-Z0-9\.]+'
will not mess up the markup.
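
To make that concrete, here is a rough Python sketch of the approach (my
one-liner above was Perl; this is only an illustration, not a tested tool
for the archive). The names ADDRESS_LIKE, encode_char and munge are made
up for this example. It rewrites every character of anything matching the
loose pattern above as a decimal or hex character reference, picked at
random, and leaves everything else, tags included, untouched.

```python
import html
import random
import re

# The loose pattern from the paragraph above. It is not a real address
# validator; it simply grabs anything of the form word@word.
ADDRESS_LIKE = re.compile(r"[a-zA-Z0-9\.]+@[a-zA-Z0-9\.]+")


def encode_char(ch, rng):
    """Return ch as a decimal or hex numeric character reference."""
    code = ord(ch)
    if rng.random() < 0.5:
        return f"&#{code};"
    return f"&#x{code:x};"


def munge(text, rng=None):
    """Encode every match of ADDRESS_LIKE; leave all other text as-is."""
    rng = rng or random.Random()

    def replace(match):
        return "".join(encode_char(ch, rng) for ch in match.group(0))

    return ADDRESS_LIKE.sub(replace, text)


if __name__ == "__main__":
    sample = 'Write to x@y or <a href="mailto:jan@example.org">Jan</a>.'
    munged = munge(sample)
    print(munged)
    # Decoding the references gives back the original text, which is
    # the sense in which the two forms are equivalent.
    print(html.unescape(munged) == sample)
```

Since the matched text contains only letters, digits, dots and the @ sign,
a match can never start or end inside a tag delimiter, so the surrounding
markup is left alone.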

Of course, this is only relevant for HTML documents, but surely,
the attachments are stored in separate files easily distinguished
from the HTML documents?

/Jan
_______________________________________________
OVC discuss mailing lists
Send requests to subscribe or unsubscribe to arthur@openvotingconsortium.org
==================================================================
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
==================================================================
Received on Sat Apr 30 23:17:05 2005
