Steph of LandLordMax fame recently posted about how to deal with email spam.  This just happens to be the focus of my day job (we’re busy exploiting the synergies of a new paradigm of… actually, no, I don’t work in Dilbert’s office).  I use spam filtering religiously on all my email accounts (4 at last count) and it saves me from roughly 2,000 spam messages a day (my college email address, which was widely circulated back when I was young and stupid, forwards to my business email address).

Here is everything you need to know about spam prevention in two sentences: If you are comfortable administering your own email server (and already do), use SpamAssassin.  If you are not, use POPFile.  There, you’re done and didn’t even have to pay me my company’s consulting rate.

SpamAssassin takes a multi-tiered approach to dealing with spam.  It has handwritten regular-expression rules which look for signs of spam (written by the administrator or, if you’re a good outsourcer like any uISV should be, simply downloaded from the Rules du Jour website), it has RBL (real-time blackhole list) checking, and it has a Bayes statistical engine.  POPFile just does Bayes.  You can find descriptions of exactly how Bayes works elsewhere on the Internet.  Paul Graham was one of the early pioneers of the idea, incidentally: see his seminal article A Plan for Spam, with the caveat that the field has improved significantly since then and that you should consider it more of a historical piece than a tech blueprint.
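If you just want the flavor of the Bayes part without the reading assignment, here is a toy sketch.  It is not how SpamAssassin or POPFile actually implement things; the class name, the tokenizer, and the add-one smoothing are all my own assumptions, chosen to keep it short.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters, digits, $ and '.
    return re.findall(r"[a-z0-9$']+", text.lower())


class TinyBayes:
    """Toy naive Bayes classifier. Buckets are created the first time you train them."""

    def __init__(self):
        self.words = defaultdict(Counter)  # bucket -> word counts
        self.docs = Counter()              # bucket -> number of training messages

    def train(self, bucket, text):
        self.words[bucket].update(tokenize(text))
        self.docs[bucket] += 1

    def classify(self, text):
        vocab = set()
        for counts in self.words.values():
            vocab.update(counts)
        total_docs = sum(self.docs.values())
        scores = {}
        for bucket, counts in self.words.items():
            # Log prior plus log likelihood of each word, with add-one smoothing
            # so that a word never seen in a bucket does not zero out the score.
            score = math.log(self.docs[bucket] / total_docs)
            denom = sum(counts.values()) + len(vocab)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            scores[bucket] = score
        return max(scores, key=scores.get)


nb = TinyBayes()
nb.train("spam", "cheap pills buy now")
nb.train("spam", "buy cheap watches now")
nb.train("ham", "lunch meeting tomorrow at noon")
nb.train("ham", "your Amazon receipt for order")
print(nb.classify("buy cheap pills"))
```

Nothing in the sketch is specific to two categories, so the same code works with as many buckets as you care to train.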

Both SpamAssassin and POPFile, when trained properly, will produce well over 99% accuracy (both in not improperly labeling good mail as spam, and in not missing spam).  Exactly how you do the training depends on your installation.  If you are a mail administrator, start reading about SpamAssassin and understand you will be spending much of your free time feeding the monster.  If you aren’t, can I say for the second time: use POPFile, it will save your life.

POPFile works as an email proxy, which means instead of having your email client (Outlook, Thunderbird, whatever) dial up your server when it wants mail, you have it dial up a port on your local computer instead.  There POPFile says “Hold on chap, just one second” and runs to fetch your mail from the actual server.  POPFile then scans the mail, uses statistical witchcraft to determine whether it is spammy or not, and marks it appropriately (by adding another message header to it, by changing the subject line to read “[spam] Do 1+ 4 yur l0v3d 0n3”, or both).  You then either just scan those subjects yourself or, if you’re smart, configure your mail client to automatically move everything tagged as spam into a special folder, which you periodically run through (look for misclassified emails and then purge).
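To make the marking step concrete, here is a rough sketch using Python’s standard email module.  This is not POPFile’s code: the function name, the tag_subject switch, and the header name are assumptions I picked for illustration, so check what your own install actually writes before building mail client filters around it.

```python
from email import message_from_string


def tag_message(raw, bucket, tag_subject=True):
    """Return the raw message with a classification header and, optionally, a tagged subject."""
    msg = message_from_string(raw)
    # Header name is an assumption for this sketch.
    msg["X-Text-Classification"] = bucket
    if tag_subject:
        subject = msg["Subject"] or ""
        # Delete first: assigning to an existing header would add a second copy.
        del msg["Subject"]
        msg["Subject"] = f"[{bucket}] {subject}"
    return msg.as_string()


raw = (
    "From: someone@example.com\n"
    "Subject: Do 1+ 4 yur l0v3d 0n3\n"
    "\n"
    "Click here.\n"
)
tagged = message_from_string(tag_message(raw, "spam"))
print(tagged["Subject"])
```

A mail client rule can then match on either the header or the “[spam]” prefix; the header is the less intrusive of the two, since it leaves the subject line readable.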

Reclassifying a misfiled email is a bit of a nuisance: you have to visit a special web page (on your local machine), which you can do with a click of an icon in your system tray.  Then you have to locate the email in question (easy with a search-by-subject or sender feature), pick the appropriate classification (POPFile calls it a “bucket”), and click reclassify.

POPFile isn’t just for spam versus ham: you can have as many buckets as you want.  For example, I use Thunderbird at home to download from both my ISP email account and my bingocardcreator.com account(s).  I have POPFile configured to automatically tag any emails which are autogenerated (Amazon sales receipts, bounce mails, subscription confirmations, and the like), from family, regarding Bingo Card Creator, spam, from 2 particular mailing lists, and the like.  After you classify a shockingly small number of emails (generally less than a day’s worth in my experience, your mileage WILL vary), POPFile gets the hang of things very quickly: it “knows” that a genuine message from PayPal is invariably about Bingo Card Creator (despite it not saying “bingo” anywhere) and it “knows” that a fake message from PayPal (which it is very good at sniffing out) is spam.  My accuracy hovers in the 99.5% range.

The main problem with ANY anti-spam system is dropping legitimate emails.  I lost 3 in the last week, although two were identical (my friend hit the send button twice, apparently).  On a scale of criticality from 1-10 the email that was duplicated was like a 9.2, which was unfortunate (luckily I noticed it in a periodic scan of my spam folder).  The other one was a subscription confirmation (no big deal: since I was expecting it, a search found it instantly).  For folks you absolutely want to get mails from, you can set up “magnets”, which are basically hand-written rules which override the statistics.  For example, I’ve got a magnet for my mother: anything she sends goes into family.  I used to have another magnet for any email which quotes one of my own receipts, because that is what most of my customer support emails look like.  I stopped using that magnet because after the first 3 times that sort of mail showed up POPFile was on it like white on rice.
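The “rules override the statistics” idea is simple enough to sketch in a few lines.  Again, this is my own illustration and not POPFile’s implementation: the function name, the substring matching on the sender, and the example addresses are all assumptions.

```python
def classify_with_magnets(sender, text, magnets, classifier):
    """Check hand-written magnets first; only ask the statistical classifier if none match."""
    for pattern, bucket in magnets:
        # Case-insensitive substring match on the sender (an assumption for this sketch).
        if pattern.lower() in sender.lower():
            return bucket
    return classifier(text)


def pessimist(text):
    # Stand-in for a trained classifier: one that thinks everything is spam.
    return "spam"


magnets = [("mom@example.com", "family")]

print(classify_with_magnets("Mom <mom@example.com>", "cheap pills", magnets, pessimist))
print(classify_with_magnets("stranger@example.net", "cheap pills", magnets, pessimist))
```

The point of checking magnets first is that no amount of statistical bad luck can send mail from those senders to the spam folder.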

OK, enough plugging POPFile (same story as always: opinions are my true feelings and totally my own, and nobody is sliding me money under the table from them, although you can reasonably expect that since my day job is directly related to this I am not an unbiased observer).

Steph mentioned another method in his post: challenge/response.  Basically, if someone who you don’t know sends you an email, you send them an automated message saying “Hey, prove you’re a human by responding to this CAPTCHA and I will deliver your message”.  DO NOT USE THESE.  The potential problems are numerous, and they get even more numerous the more people use them.  And since, if you’re reading this, you are a business, you should know that end-users HATE THEM with a burning passion.