Re: Fwd: Spam help requested

From: Jan Karrman <jan_at_it_dot_uu_dot_se>
Date: Mon Apr 11 2005 - 11:27:51 CDT

On Mon, 11 Apr 2005, David Mertz wrote:

> On Apr 11, 2005, at 4:46 AM, Jan Karrman wrote:
> > I did just write this Perl one-liner to do this:
> > printf qw(&#%d; &#x%x;)[int(rand(2))], ord $& while $ARGV[0] =~ /./g;
> The problem isn't the munging algorithm, the simple thing we do for
> headers is fine: e.g. mertz_at_gnosis_dot_cx. We're not trying to make
> it a crypto puzzle, just moderately non-scannable by spam spiders.
> It's identifying ALL AND ONLY the email addresses that should be munged
> within message bodies that's the problem.
> (1) Find every genuine email address (grabbing the whole address, but
> none of the surrounding text).
> (2) Don't falsely identify anything that has an @ sign w/o being an
> email address
> (3) Don't identify anything that IS an email address, but that is
> desirable to leave in its original form (e.g. copies of press releases,
> contact info for politicians, spacing-sensitive table layouts,
> petitions, etc).
> (1) and (2) are tractable, but not trivial. (3) is a real bear.
> If anyone sends me a script to do the right thing, I'd be happy to run
> it over the whole collection of old archive HTML files (be sure your
> script doesn't get confused by HTML and only works on plain text). And
> also don't get messed up with attachments of various sorts.

It doesn't matter if the text containing an @ sign is an email
address or not - in an HTML document, writing '&#120;&#64;&#121;'
is *equivalent* to 'x@y'. One must of course be careful to not
mess up the HTML markup, but the @ character is not part of that,
so munging anything that matches '[a-zA-Z0-9\.]+@[a-zA-Z0-9\.]+'
will not mess up the markup.
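
As a rough illustration of the point above, here is a minimal Python sketch
that rewrites every match of that pattern as numeric character references,
picking decimal or hexadecimal form at random in the same way as the Perl
one-liner. The function names (to_char_refs, munge_html) and the sample
string are made up for this example and are not part of any archive
tooling; a real run over hypermail output would need checking against
actual files.

```python
import html
import random
import re

# The pattern quoted in the message: letters, digits and dots on both
# sides of an @ sign. It is deliberately loose, not an address validator.
ADDRESS_LIKE = re.compile(r"[a-zA-Z0-9\.]+@[a-zA-Z0-9\.]+")


def to_char_refs(text, rng=random):
    """Encode each character as a decimal or hexadecimal character reference."""
    return "".join(rng.choice(("&#%d;", "&#x%x;")) % ord(ch) for ch in text)


def munge_html(document, rng=random):
    """Replace every pattern match in an HTML string with character references."""
    return ADDRESS_LIKE.sub(lambda m: to_char_refs(m.group(0), rng), document)


# Hypothetical sample: the same text appears in an attribute and in the body.
sample = '<a href="mailto:x@y">x@y</a>'
munged = munge_html(sample)

# No literal @ is left for a naive spider to scan for ...
assert "@" not in munged
# ... yet decoding the references gives back exactly the original markup.
assert html.unescape(munged) == sample
```

The two assertions are the whole argument in miniature: the tags are
untouched because the pattern cannot match '<', '>' or quotes, and a
browser decoding the references sees the same text as before.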

Of course, this is only relevant for HTML documents, but surely,
the attachments are stored in separate files easily distinguished
from the HTML documents?

OVC discuss mailing lists
Send requests to subscribe or unsubscribe to
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
Received on Sat Apr 30 23:17:05 2005

This archive was generated by hypermail 2.1.8 : Sat Apr 30 2005 - 23:17:22 CDT