(c) WestTech, 2002 -- may be freely distributed if unaltered Spam Filtering Techniques: Comparing a Half-Dozen Approaches to Eliminating Unwanted Email David Mertz, Ph.D. Noise Reducer, Gnosis Software, Inc. August 2002 The problem of unsolicited email has continued to increase every month for years. Unethical senders bear little or no cost for mass distribution of messages, but normal email users are forced to spend time and effort purging fradulent and otherwise unwanted mail from our mailboxes. This article discusses and compares several broad approaches to the automatic elimination of unwanted email; some specific popular tools following these approaches are mentioned and tested. INTRODUCTION ------------------------------------------------------------------------ This article is incomplete in a special sense. I will only write about ways that computer code can help eliminate unsolicited commercial email (UCE); computer viruses, trojans, and worms; frauds perpetrated electronically; and other undesired and troublesome email. But in some sense, the final and best solution will probably involve elements of -legal code- as well. However, a discussion of law is outside the scope of this article and of the developerWorks website. Considering matters technically--but also with common sense--what is generally called "SPAM" email is somewhat broader than the category -unsolicited commercial email-; basically it encompasses all the email that we do not want, and that is only very loosely directed at us. Such messages are not always commercial per se, and some stretch the outer limits of what it means to be solicited. For example, we do not want to get viruses (even from our unwary friends); nor do we generally want chain letters, even if they don't ask for money; nor proselytizing messages from strangers; nor outright attempts to defraud us. In any case, it is usually not very ambiguous whether a message is spam, and many, many people get the same such emails. The problem with spam is that it tends to swamp desirable email. In my own experience, a few years ago I occassionally received an inappropriate message or two during a day. Every day of this month, I received -many times- more spams than I did legitimate correspondences. On average, I probably get 10 spams for every appropriate email. In some ways I am unusual--as a public writer, I feel I need to maintain a widely published email address; moreover, I both welcome and receive numerous correspondences from strangers related to my published writing and to my software libraries. Unfortunately, a letter from a stranger--with who-knows-which email application, OS, native natural language, and so on, is not -immediately- obvious in its purpose; and spammers try to slip their messages underneath such ambiguities. My seconds are valuable to me, especially when they are claimed many times during every hour of a day. Hiding contact information ------------------------------------------------------------------------ For some email users, a reasonable, sufficient and very simple approach to avoiding spam is simply to remain reticent with email addresses. For these people, an email address is something to be guarded, and revealed only to selected, trusted parties. As extra precautions, email address can be chosen to avoid easily guessed names and dictionary words, and addresses can be disguised when posting to public fora. We have all seen email addresses cutely encoded in forms like "" or "echo zregm@tabfvf.pk | tr A-Za-z N-ZA-Mn-za-m". In addition to hiding addresses, a secretive emailer often uses one or more of the free email services for -throwaway- addresses. If you need to transact email with a semi-trusted party, a temporary address can be used for a few days, then abandoned along with any spam it might thereafter accumulate. The -real- "confidantes only" address is kept protected. I have found in my informal survey of discussions of spam on web-boards, mailing lists, the Usenet, and so on, that a category of email users gains sufficient protection from these basic precautions. However, for me--and for many other people--this approach is simply not possible. I have a publicly available email address, and have good reasons why it needs to remain so. I -do- utilize a variety of addresses within the domain I control to detect the source of spam leaks; but the unfortunate truth is that most spammers get my email address the same way my legitimate correspondents do: from the listing at the top of articles like this, and other public disclosures of my address. FILTERING SOFTWARE ------------------------------------------------------------------------ This article looks at filtering software from a particular perspective. I want to know how well different approaches work in correctly identifying spam as spam, and desirable messages as legitimate. For purposes of answering this question, I am not particularly interested in the details of configuring filter applications to work with various Mail Transfer Agents (MTAs). There is certainly a great deal of aracana surrounding the best configuration of MTAs like Sendmail, QMail, Procmail, Fetchmail, and others. As well, many email clients have their own filtering options and/or plug-in APIs. Fortunately, most of the filters I look at come with pretty good documentation covering configuring them with various MTAs. For purposes of my testing, I developed two corpora of messages: spam and legitimate. These corpora were both taken from mail I myself actually received. Most of the messages used were received in the last couple months, but I added a significant subset of messages up to several years old to broaden the test. I cannot know exactly what will be contained in next month's emails, so the past provides the best facsimile to the future's novelty. That sounds cryptic, but all I mean is that I do not want to limit the patterns to a few words, phrases, regular expressions, etc. that might characterize the very latest emails, but that fail to generalize to the two types. In addition to the corpora, I developed training message sets for those tools that -learn- about spam and non-spam messages. The training sets are both larger and partially disjoint from the testing corpora. The testing corpora consist of slightly fewer than 2000 spam messages, and about the same number of good messages. The training sets are about twice as large. A general comment on testing is worth emphasizing. False negatives in spam filters just mean that some unwanted messages make it to your inbox. Not a good thing, but not horrible in itself. False positives are cases where legitimate messages are misidentified as spam. These can potentially be -very- bad, some legitimate messages are important, even urgent, in nature; and even those messages that are merely conversational are ones we do not want to lose. Most filtering software allows you to save rejected messages in temporary folders pending review--but if you need to review a folder of spam, the usefulness of the software is thereby reduced. Basic structured text filters ------------------------------------------------------------------------ The email client I use has the capability to sort incoming email based on simple strings found in specific header fields, the header in general, and/or in the body. Its capability is very simple, and does not even include regular expression matching. Almost all email clients have this much filtering capability. Over the last few months, I have developed a fairly small number of text filters. These few simple filters correctly catch about 80% of the spam I receive. Unfortunately, they also have a relatively high false positive rate--enough that I need to manually examine -some- of the spam folders from time to time (I sort probable spam into several different folders; and save them all to develop message corpora). Although exact details will differ between users, a general pattern can be useful to most readers: SET 1: A few people or mailing lists that do funny things with their headers that get them flagged on other rules. I catch something in the header (usually the From:) and whitelist it (either to INBOX or somewhere else). SET 2: In no particular order, I run the following spam filters: - Identify a specific bad sender - Look for "<>" as the From: header - Look for "@<" in the header (lots of spam has this for some reason) - Look for "Content-Type: audio". Nothing I want has this, only virii (YMMV). - Look for "euc-kr" and "ks_c_5601-1987" in the headers. I can't read that language, but for some reason I get a *huge* volume of Korean spam (of course, for an actual readers of Korean, this is not a good rule). SET 3: Store messages to known legitimate addresses. I have several such rules, but they all just match a literal To: field. SET 4: Look for messages that have a legit address in the -header-, but that wasn't caught by the previous To: filters. I find that when I am only in the Bcc: field, it's almost always an unsolicited mailing to a list of alphabetically sequential addresses (mertz1@..., mertz37@..., etc). SET 5: Anything left at this point is probably spam (it probably has forged headers to avoid identification of the sender). Whitelist/verification filters ------------------------------------------------------------------------ An fairly aggressive technique for spam filtering is what I would call -whitelist plus automated verification- approach. There are several tools that implement a whitelist with verification: TDMA is a popular multi-platform open source tool; ChoiceMail is a commercial tool for Windows; most others seem more preliminary. A whitelist filter connects to an MTA, and only passes mail from explicitly approved recipients on to the inbox. Other messages generate a special challenge response to the sender. The whitelist filter's response contains some kind of unique code that identifies the original message, such as a hash or sequential ID. This challenge message contains instructions for the sender to reply in order to be added to the whitelist (the response message must contain the code generated by the whitelist filter). Almost all spam messages contain forged return address information, so the challenge usually does not even arrives anywhere; but even those spammers who provide usable return addresses are unlikely to respond to a challenge. When a legitimate sender answers a challenge, her/his address is added to the whitelist so that any future messages from the same address are passed through automatically. Although I have not used any of these tools more than experimentally myself, I would expect whitelist/verification filters to be very nearly 100% effective in blocking spam messages. It is conceivable that spammers will start adding challenge responses to their systems, but this could be countered by making challenges slightly more sophisticated (e.g. requiring small human modification to a code). Spammers who respond, moreover, make themselves more easily traceable for people seeking legal remedies against them. The problem with whitelist/verification filters is the extra burden they place on legitimate senders. Inasmuch as some correspondents may fail to respond to challenges--for any reason--this makes for a type of false positive. In the best case, a slight extra effort is required for legitimate senders. But senders who have unreliable ISPs, picky firewalls, multiple email addresses, non-native understanding of English (or whatever language the challenge is written in), or who simply overlook or cannot be bothered with challenges, may not have their legitimate messages delivered. Moreover, sometimes legitimate "correspondents" are not people at all, but automated response systems with no capability of challenge response. Whitelist/verification filters are likely to require extra efforts to deal with mailing-list signups, online purchases, website registrations, and other "robot correspondences." Distributed adaptive blacklists ------------------------------------------------------------------------ Spam is almost by definition delivered to a large number of recipients. And as a matter of practice, there is little if any customization of spam messages to individual recipients. Each recipient of a spam, however, absent prior filtering must press her own "delete" button to get rid of the message. Distributed blacklists are ways to let one user's delete button warn millions of other users as to the spamminess of the message. Tools like Razor and Pyzor operate around servers that store digests of known spams. When a message is received by a MTA, a distributed blacklist filter is called to determine whether the message is a known spam. These tools use clever statistical techniques for creating digests, so that spams with minor or automated mutations (or just different headers resulting from transport routes) do not prevent recognition of message identity. In addition, maintainers of distributed blacklist servers frequently create "honey-pot" addresses that are created specifically for the purpose of attracting spam (but never for any legitimate correspondences). In my testing, I found *zero* false positive spam categorizations by Pyzor. I would not expect any to occur using other similar tools, like Razor. There is some common sense to this. Even if someone were ill-intentioned in trying to taint legitimate messages, she would not have samples of -my- good messages to report to the servers--it is generally only the spam messages that are widely distributed. It is -conceivable- that a widely sent but legitimate message such as the developerWorks newsletter could be misreported, but the maintainers of distributed blacklist servers would almost certainly detect this and quickly correct such problems. As the below chart shows, however, false negatives are far more common using distributed blacklists than with any of the other techniques I tested. The authors of Pyzor recommend using the tool in -conjunction- with other techniques rather than as a lone filter. While this seems reasonable, it is not clear that such combined filtering will actually produce many more spam identifications than the other techniques by themselves. In addition, since distributed blacklists require talking to a server to perform verification, Pyzor performed far more slowly against my test corpora than did any other techniques. For testing a trickle of messages this is no big deal, for a high-volume ISP, it could be a problem. I also found that I consistently experienced a couple network timeouts for each thousand queries, so my results have a handful of "errors" in place of "spam" or "good" identifications. Rule based rankings ------------------------------------------------------------------------ The most popular tool for rule based spam filtering, by a good margin, is -SpamAssassin-. There are other tools, but they are not as widely used or actively maintained. SpamAssassin (and similar tools) evaluate a large number of patterns--mostly regular expressions--against a candidate message. Some matched patterns add to a message's score, while others subtract from it. If a message's score exceeds a certain threshold, it is filtered as spam; otherwise it is considered legitimate. Some ranking rules are fairly constant over time--forged headers and auto-executing JavaScript, for example, almost timelessly mark spam. Other rules need to be updated as the products and scams advanced by spammers evolves. Herbal Viagra and heirs of African dictators might be the rage today, but tomorrow they might be edged out by some brand new snake-oil drug or pornographic theme. As spam evolves, SpamAssassin must evolve to keep up with it. The README for SpamAssassin makes some very strong claims: In its most recent test, SpamAssassin differentiated between spam and non-spam mail correctly in 99.94% of cases. Since then, it's just been getting better and better! My testing showed nowhere near this level of success. Against my corpora, SpamAssassin had about 0.3% false positives and a whopping 19% false negatives. In fairness, this only evaluated the rule based filters, not the optional checks against distributed blacklists. Additionally, my spam corpus is not purely spam--it also includes a large collection of what are probably virus attachments (I do not open them to check for sure, but I know they are not messages I authorized). SpamAssassin's FAQ disclaims responsibility for finding viruses; on the other hand, the below techniques do much better in finding them, so the disclaimer is not all that compelling. SpamAssassin runs much quicker than distributed blacklists which need to query network servers. But it also runs much slower than even non-optimized versions of the below statistical models (written in interpreted Python using naive data structures). Baysian word distribution filters ------------------------------------------------------------------------ Paul Graham wrote a provocative essay this August 2002. In "A Plan for Spam," Graham suggested building Baysian probability models of spam and non-spam words. Graham's essay, or any general text on statistics and probability, can provide more mathematical background than I will here. The general idea here is that some words occur more frequently in known spam and other words occur more frequently in legitimate messages. Using well-known mathematics, it is possible to generate a "spam-indicative probability" for each word. Another simple mathematical formula can be used to determine the overall "spam probability" of a novel message, based on the collection of words it contains. Graham's idea has several noteworthy benefits: (1) It can generate a filter automatically from corpora of categorized messages rather than requiring human effort in rule development; (2) It can be customized to individual users' characteristic spam and legitimate messages; (3) It can be implemented in a very small number of lines of code; (4) It works surprisingly well. At first brush, it would be reasonable to suppose that a set of hand-tuned and laboriously developed rules like those in SpamAssassin would predict spam more accurately than a scattershod automated approach. It turns out that this supposition is dead wrong. A statistical model basically just works better than a rule based approach. As a side benefit, a Graham-style Bayesian filter is also simpler and faster than SpamAssassin. Within days--perhaps hours--of Graham's article being published, many people simulataneously started working on implementing the system. For purposes of my testing, I used a Python implementation created by a correspondent of mine named John Barham. I thank him for providing his implementation. However, the mathematics are simple enough that every other implementation is largely equivalent. There are some issues of data structures and storage techniques that will effect operating speed of different tools. But the actual predictive accuracy depends on very few factors--the main significant factor is probably the word-lexing technique used, and this matters mostly for eliminating spurious random strings. Barham's implementation simply looks for relatively short, disjoint sequences of characters from a small set (alphanumeric plus a few others). Baysian trigram filters ------------------------------------------------------------------------ Bayesian techniques built on a word model work rather well. One disadvantage the word model has is that the number of "words" one encounters in email is virtually unbounded. This fact may be counterintuitive--it seems reasonable to suppose that you would reach an asymptote once almost all the English words had been included. From my prior research into full text indexing I know that this is simply not true--the number of "word-like" character sequences one encounters is nearly unlimited, and new texts keep producing new sequences. This fact is particularly true of emails, which contain random strings in Message-ID's, content separators, UU and base64 encodings, and so on. There are various ways to throw out words from the model--the easiest is just to discard the sufficiently infrequent ones. I decided to look into how well a much more starkly limited model space would work for a Baysian spam filter. Specifically, I decided to use trigrams for my probability model rather than "words." This idea was not invented whole cloth, of course; there is a variety of research into language recognition/differentiation, cryptographic unicity distances of English, pattern frequencies, and related areas, that strongly suggest trigrams are a good unit. There were several decisions I made along the way. The biggest choice was deciding what a trigram is. While this is somewhat simpler than identifying a "word," the completely naive approach of looking at every (overlapping) sequence of three bytes is non-optimal. In particular, considering high-bit characters--although occurring relatively frequently in multi-byte character sets (i.e. CJK)--forces a much bigger trigram space on us than does looking only at the ASCII range. Limiting the trigram space even further than to low-bit characters produces a smaller space, but not better overall results. For my trigram analysis, I utilized only the most highly differentiating trigrams as message categorizers. But I arrived at the chosen numbers of "spam" and "good" trigrams only by trial and error. I also picked the cutoff probability for spam rather arbitrarily: I made an interesting discover that no message in the "good" corpus was assigned a spam probability above .0071, other than two false positives in the .99 range. Lowering my cutoff from an initial 0.9 to 0.1, however, allowed me to catch a few more message in the "spam" corpus. For purposes of speed, I select no more than 100 "interesting" trigrams from each candidate message--changing that 100 to something else can produce slight variations in the results (but not in an obvious direction). Summary and Chart ------------------------------------------------------------------------ Given the testing methodology described earlier, let us look at the concrete testing results. While I do not present any quantitative data on it, the chart is arranged in order of speed, from fastest to slowest. Trigrams are fast, Pyzor (network lookup) is slow. In evaluating techniques, as I stated, I consider false positives very bad, and false negatives only slightly bad. Good Corpus Spam Corpus "The Truth" 1851 / 0 0 / 1916 Trigram Model 1849 / 2 142 / 1774 Word Model 1847 / 4 97 / 1819 SpamAssassin 1846 / 5 358 / 1558 Pyzor 1847 / 0 (4 err) 971 / 943 (2 err) RESOURCES ------------------------------------------------------------------------ Lawrence Lessig has written a number of books and articles that particularly insightfully contrast what he metonymically calls "west-coast code" and "east-coast code"--i.e. the laws passed in Washington D.C. (and elsewhere) versus the software written in Silicon Valley (and elsewhere). I wrote a short review of Lessig's _Code and Other Laws of Cyberspace_ at: http://gnosis.cx/publish/reviews/lessig_review_rpr Lessig has his own website, of course. It provides a lot to think about: http://cyberlaw.stanford.edu/lessig/ Tagged Message Delivery Agent (TMDA): http://tmda.net/ DigitalPortal Software's ChoiceMail: http://www.digiportal.com/product4.html Pyzor is a Python-based distributed spam catalog/filter. Information is at: http://pyzor.sourceforge.net/ Vipul's Razor is a very popular distributed spam catalog/filter. Razor is optionally called by a number of other filter tools, such as SpamAssassin. Information on Razor is at: http://razor.sourceforge.net/ SpamAssassin is the most popular rule based spam filter. It lives at: http://spamassassin.taint.org/ Paul Graham's essay "A Plan for Spam" can be read at: http://www.paulgraham.com/spam.html Eric Raymond has created a fast implementation of Paul Graham's idea under the name "bogofilter." As well as using some efficient data representation and storage strategies, bogofiliter tries to be smart about identifying what makes a meaningful word: http://www.tuxedo.org/~esr/bogofilter/ My own trigram-based categorization tools are still at an early alpha or prototype level. However, you are welcome to use them as a basis for development. They are public domain, like all the tools I write for developerWorks articles: http://gnosis.cx/download/trigram-tools.zip ABOUT THE AUTHOR ------------------------------------------------------------------------ {Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi} David Mertz dislikes spam. He wishes to thank Andrew Blais for assistance in this article's testing, as well as for listening to David's peculiar fascination with trigrams and their distributions. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.