The Spam Secretary

The Spam Secretary (TeSS) is an anti-spam mail filter based on the article at http://www.paulgraham.com/spam.html. It is a single file written in Python and is easy to install. It is independent of your mail client, and the hope is that it will be MDA independent as well. As long as you can specify a program to run when mail is delivered, have Python installed, and store your mail in mbox format (maildir support is forthcoming), you're all set.

Please look here for files and documentation.

In brief, the filter works by keeping track of how often a word appears in spam mail vs. non-spam mail. If a word often appears in spam, but only seldom appears in non-spam, a message containing that word is more likely to be spam. In this way, the 15 "most interesting" (most and least spam-probable) words are found in an incoming message and the message is sorted accordingly. This sounds so obvious and easy that it's hard to believe it really works, but it does.
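As a rough illustration of the idea, here is a minimal sketch of the scoring described in Paul Graham's article, written for a current Python 3. It is not the TeSS source. The function names, the plain dictionaries used for the word counts, and the sample numbers in the usage note are all assumptions made for the example. The constants (doubling the good counts, ignoring words seen fewer than 5 times, the 0.4 default, and the 0.01/0.99 limits) are the ones given in the article.

```python
from math import prod


def word_probability(word, good, bad, ngood, nbad):
    """Estimate how spam-probable a single word is (0.01 .. 0.99).

    good/bad map words to occurrence counts; ngood/nbad are the numbers
    of non-spam and spam messages seen so far (both assumed > 0).
    """
    g = 2 * good.get(word, 0)   # good counts are doubled to avoid false positives
    b = bad.get(word, 0)
    if g + b < 5:
        return 0.4              # too rarely seen to say much either way
    spam_freq = min(1.0, b / nbad)
    good_freq = min(1.0, g / ngood)
    p = spam_freq / (good_freq + spam_freq)
    return max(0.01, min(0.99, p))


def spam_score(tokens, good, bad, ngood, nbad, n_interesting=15):
    """Combine the most interesting word probabilities into one score."""
    probs = [word_probability(w, good, bad, ngood, nbad) for w in set(tokens)]
    # "Interesting" means furthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:n_interesting]
    spam = prod(top)
    ham = prod(1 - p for p in top)
    return spam / (spam + ham)
```

With hypothetical counts such as good = {"meeting": 10} and bad = {"viagra": 10} over 10 messages of each kind, a message containing only "viagra" scores close to 1, and one containing only "meeting" scores close to 0.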

How well does it work? Once you have built up a reasonable database of words, I've seen it be at least 95% effective. What's more, I've NEVER had it produce a "false positive", that is, flag a legitimate message as spam.

The program is rather flexible and can be used in several different ways. The only real requirements are that you have Python (I use 2.2) and that you can invoke a Python script to deliver mail. That way the script can put each incoming message into one of two boxes: either "regular incoming" or "spam incoming". After that, the options vary a lot depending on how YOU read and handle your mail. What I do is save all verified spam to a "VerifiedSpam" mailbox: I check that the messages in the spam box really are spam, and I also move there any spam that was not correctly identified. Regular email that I'm done with, I just delete, which moves it to the "Deleted" mailbox. Then, every time I get a new message, the contents of "VerifiedSpam" and "Deleted" are consumed (parsed and deleted) by TeSS and added to the appropriate word database. All of this is handled by a single command in the .forward file.
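The deliver-then-consume cycle can be pictured with the following sketch, which uses the mailbox module from a current Python 3 standard library. It is an illustration under assumptions, not how TeSS itself is written. The function names, the dictionary-like counts argument, and the tokenize callable are made up for the example, and the 0.8 cutoff is the 80% figure from the fine details on this page. MIME decoding is left out, and the sketch empties the mailbox through the mailbox API rather than truncating the file directly.

```python
import mailbox

SPAM_THRESHOLD = 0.8  # the "> 80% probable" cutoff described on this page


def deliver(raw_message, score, inbox_path, spam_path):
    """Append one incoming message to the inbox or the spam mbox."""
    path = spam_path if score > SPAM_THRESHOLD else inbox_path
    box = mailbox.mbox(path)      # creates the file if it does not exist
    box.lock()
    try:
        box.add(raw_message)
        box.flush()
    finally:
        box.unlock()
        box.close()
    return path


def consume(mbox_path, counts, tokenize):
    """Count the tokens of every message in an mbox, then empty it.

    counts is any dict-like word database; tokenize is a callable that
    turns message text into a list of tokens. Returns the number of
    messages consumed.
    """
    box = mailbox.mbox(mbox_path)
    box.lock()
    try:
        consumed = 0
        for message in box:
            for token in tokenize(message.as_string()):
                counts[token] = counts.get(token, 0) + 1
            consumed += 1
        box.clear()               # delete every message...
        box.flush()               # ...and write the now-empty mailbox
    finally:
        box.unlock()
        box.close()
    return consumed
```

In a setup like mine, consume would be called once for "Deleted" with the good-word database and once for "VerifiedSpam" with the spam-word database, each time a new message arrives.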

The fine details... Assuming my configuration as described above: Incoming mail is [MIME decoded and then] broken into tokens. Each token is a lowercased run of the characters a-z, 0-9, "-", and "'", must contain at least one letter, and may be at most 20 characters long. These tokens are then looked up as described in Paul Graham's article. If the message is spam (> 80% probable), it is put in the spam box. If it is not, it is put in the Inbox. All messages in the Deleted box are parsed the same way, and each token is added to the "good words" file (some db file according to your platform: dbm, gdbm, whatever Python decides). Then that mail file is truncated to 0 length. The same thing happens to the messages in the VerifiedSpam mailbox, whose tokens go into the corresponding spam word database. That's all there is to it... I check my spam box every once in a while to make sure no good messages got dumped, and then I move the remaining spam to the VerifiedSpam box to be consumed. Regular email I delete as usual (which my mail program moves to a Deleted box). Eventually I may decide to dump Spam directly to VerifiedSpam without looking at it.
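The token rules can be sketched in a few lines of Python 3. This is an illustration, not the TeSS source: the names are made up for the example, and it assumes that tokens longer than 20 characters are dropped rather than cut down to 20, which the description above does not settle.

```python
import re

# A token is a run of lowercase letters, digits, "-" and "'".
TOKEN_RE = re.compile(r"[a-z0-9'-]+")
MAX_TOKEN_LENGTH = 20


def tokenize(text):
    """Split message text into tokens following the rules above."""
    tokens = []
    for candidate in TOKEN_RE.findall(text.lower()):
        if len(candidate) > MAX_TOKEN_LENGTH:
            continue  # assumption: over-long tokens are dropped, not truncated
        if not any("a" <= ch <= "z" for ch in candidate):
            continue  # every token needs at least one letter
        tokens.append(candidate)
    return tokens
```

For example, tokenize("Don't MISS this 100% FREE-offer 2002!") returns ["don't", "miss", "this", "free-offer"]: the purely numeric runs "100" and "2002" are rejected because they contain no letter.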

Why did I write this when there are so many other anti-spam programs out there? There were 2 compelling reasons: I use multiple clients, so I needed a server-based solution; and I don't like or want to mess with qmail or procmail, which virtually all other server solutions seem to want to use.

Written by Kurt Werle