
Take a look at the spam filtering research page on Spam Links. The impression we're getting is that there are two sets of people doing the research: one is the sysadmins, the people on the front line; the other is the researchers, who are trying out the latest information-theoretic concepts to push towards that 99.999% rate with few false positives.
Sysadmins tend to work with heuristics: tried-and-tested rules that discard mail if they trigger. Take a look at Spam Filtering for Mail Exchangers, for example, which is an excellent summary of ways to detect and terminate spam sessions coming in to a mail server. These are practical and effective, and based on the mechanics of receiving email. Rule-based scoring can be very effective, but it is vulnerable to spammers who adjust their mail against the known default scores, and it can fall behind new spammer tactics.
The researchers are much more focussed on email as content. Text Classification Spam Filtering is mostly about classifying the text of the email as spam or ham - and it can do very well.
Why not do true Bayesian filtering instead? Choose a set of the well-selected heuristics that the sysadmins rely on (listed in SPEWS, sent at 4am, not SPF authorised, already seen by DCC) and apply Bayesian statistics to those features. You get to use all of the prior knowledge that is available, taking advantage of the hard work of groups like SPEWS and Spamhaus, but you temper it with the other features, which makes false positives much less likely.
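To make the idea concrete, here is a minimal sketch of naive Bayes applied to heuristic outcomes rather than to words. It is only an illustration under stated assumptions: the feature names (in_spews, sent_at_4am, spf_fail, dcc_seen), the class name HeuristicNaiveBayes and the tiny training set are all made up for this example, the heuristics are treated as independent yes/no features, and Laplace smoothing is one choice among several.

```python
import math
from collections import defaultdict


class HeuristicNaiveBayes:
    """Naive Bayes over binary heuristic outcomes (hypothetical sketch)."""

    def __init__(self, features, smoothing=1.0):
        self.features = list(features)   # names of the heuristics we track
        self.smoothing = smoothing       # Laplace smoothing constant
        self.msg_counts = {"spam": 0, "ham": 0}
        # fired[label][name] = messages of that class on which the heuristic fired
        self.fired = {"spam": defaultdict(int), "ham": defaultdict(int)}

    def train(self, triggered, label):
        """Record one hand-classified message and the heuristics it triggered."""
        self.msg_counts[label] += 1
        for name in triggered:
            self.fired[label][name] += 1

    def _log_likelihood(self, triggered, label):
        """log P(heuristic outcomes | label), assuming independent heuristics."""
        n = self.msg_counts[label]
        total = 0.0
        for name in self.features:
            p_fire = (self.fired[label][name] + self.smoothing) / (n + 2 * self.smoothing)
            total += math.log(p_fire if name in triggered else 1.0 - p_fire)
        return total

    def spam_probability(self, triggered):
        """Posterior P(spam | heuristic outcomes) via Bayes' rule."""
        triggered = set(triggered)
        total_msgs = sum(self.msg_counts.values())
        log_post = {}
        for label in ("spam", "ham"):
            prior = (self.msg_counts[label] + self.smoothing) / (total_msgs + 2 * self.smoothing)
            log_post[label] = math.log(prior) + self._log_likelihood(triggered, label)
        # Normalise in log space to avoid underflow with many features.
        top = max(log_post.values())
        weights = {label: math.exp(value - top) for label, value in log_post.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


# Hypothetical heuristics and a made-up training set, for illustration only.
FEATURES = ["in_spews", "sent_at_4am", "spf_fail", "dcc_seen"]
model = HeuristicNaiveBayes(FEATURES)
training = [
    ({"in_spews", "spf_fail", "dcc_seen"}, "spam"),
    ({"in_spews", "sent_at_4am", "dcc_seen"}, "spam"),
    ({"spf_fail", "dcc_seen"}, "spam"),
    ({"sent_at_4am"}, "ham"),
    (set(), "ham"),
    ({"spf_fail"}, "ham"),
    ({"dcc_seen"}, "ham"),   # e.g. a legitimate mailing list
]
for triggered, label in training:
    model.train(triggered, label)

p_many = model.spam_probability({"in_spews", "spf_fail", "dcc_seen"})
p_one = model.spam_probability({"sent_at_4am"})
print(f"several heuristics fired: {p_many:.3f}")
print(f"only 'sent at 4am' fired: {p_one:.3f}")
```

The point of the sketch is that no single heuristic discards a message on its own: each one only shifts the probability, and the weight it carries is learned from your own mail rather than fixed in advance.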
If this is being done somewhere we'd love to have it pointed out, since it seems so obvious. If it isn't: why not?
[Well, it is. SpamAssassin works in just this way, using a back propagation neural network to adjust the scores of the heuristics they use. Quite why this isn't spoken about more frequently at spam conferences is anyone's guess.]
Posted by spamlinks at March 2, 2005 12:00 PM