Spam Links

March 02, 2005

Real Bayesian filtering?

Take a look at the spam filtering research page on Spam Links. The impression we're getting is that two sets of people are doing the research: one set is the sysadmins, the people on the front line; the other is the researchers, who are trying out the latest information-theoretic concepts to push for that 99.999% rate with few false positives.

Sysadmins tend to work with tried-and-tested rules that discard mail if they trigger: heuristics. Take a look at Spam Filtering for Mail Exchangers, for example, which is an excellent summary of ways to detect and terminate spam sessions coming in to a mail server. These are practical and effective, and based on the mechanics of receiving email. Rule-based scoring can be very effective, but it is vulnerable to spammers who tune their mail against the known default scores, and it can fall behind new spammer tactics.
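
To make the idea of rule-based scoring concrete, here is a minimal sketch in Python. The rule names, the scores and the threshold are all made up for illustration; they are not taken from any real filter's defaults.

```python
# Hypothetical rule names and scores, chosen only for illustration.
RULE_SCORES = {
    "listed_in_dnsbl": 3.0,
    "spf_fail": 1.5,
    "sent_at_4am": 0.5,
    "seen_by_dcc": 2.0,
}
THRESHOLD = 5.0  # assumed cut-off; real filters let the admin set this


def rule_score(triggered):
    """Sum the fixed scores of the rules that fired for one message."""
    return sum(RULE_SCORES[name] for name in triggered)


def is_spam(triggered):
    """Flag the message when the summed score reaches the threshold."""
    return rule_score(triggered) >= THRESHOLD
```

Because the scores are fixed and public, a spammer can test a message against them and trim it until it slips under the threshold, which is the weakness described above.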

The researchers are much more focussed on email as content. Text Classification Spam Filtering is mostly about classifying the text of the email as spam or ham - and it can do very well.

Why not do true Bayesian filtering instead? Choose a set of the well-selected heuristics that the sysadmins rely on (in SPEWS, sent at 4am, not SPF authorised, DCC seen it) and apply Bayesian statistics to those features. You get to use all of the prior knowledge that is available, taking advantage of the hard work of groups like SPEWS and Spamhaus, but you get to temper that with other features, which makes false positives much less likely.
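
One way the suggestion above could look in practice is a naive Bayes classifier whose features are the heuristics themselves rather than words. The sketch below is only an illustration under stated assumptions: the feature names are hypothetical labels for the heuristics mentioned, the features are treated as independent yes/no signals, and add-one smoothing is used so that a heuristic never seen in one class does not produce a probability of exactly zero.

```python
import math

# Hypothetical labels for the heuristics named in the text.
FEATURES = ["in_spews", "sent_at_4am", "spf_fail", "seen_by_dcc"]


def train(messages):
    """Count how often each heuristic fires in spam and in ham.

    messages: list of (set_of_fired_features, is_spam) pairs.
    """
    counts = {True: {f: 0 for f in FEATURES}, False: {f: 0 for f in FEATURES}}
    totals = {True: 0, False: 0}
    for fired, label in messages:
        totals[label] += 1
        for f in fired:
            if f in counts[label]:
                counts[label][f] += 1
    return counts, totals


def spam_probability(fired, counts, totals):
    """Naive Bayes posterior P(spam | heuristics), with add-one smoothing."""
    n = totals[True] + totals[False]
    # Smoothed prior odds of spam versus ham, in log space.
    log_odds = math.log((totals[True] + 1) / (n + 2)) - math.log(
        (totals[False] + 1) / (n + 2)
    )
    for f in FEATURES:
        p_spam = (counts[True][f] + 1) / (totals[True] + 2)
        p_ham = (counts[False][f] + 1) / (totals[False] + 2)
        if f in fired:
            log_odds += math.log(p_spam / p_ham)
        else:
            # A heuristic that did NOT fire is evidence too.
            log_odds += math.log((1 - p_spam) / (1 - p_ham))
    return 1 / (1 + math.exp(-log_odds))
```

With a model like this, a blocklist hit on its own raises the probability of spam but does not settle the matter; the other heuristics, including the ones that did not fire, pull the estimate up or down.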

If this is being done somewhere we'd love to have it pointed out, since it seems so obvious. If it isn't: why not?

[Well, it is. SpamAssassin works in just this way, using a back propagation neural network to adjust the scores of the heuristics they use. Quite why this isn't spoken about more frequently at spam conferences is anyone's guess.]
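
As a rough illustration of what "adjusting the scores" by training might mean, here is a single-neuron sketch in Python that learns one weight per rule from labelled mail by gradient descent. It is not SpamAssassin's code; the function name, the learning rate and the number of epochs are all assumptions made for this example.

```python
import math


def learn_scores(samples, n_rules, epochs=200, rate=0.5):
    """Learn one score per rule from labelled messages.

    samples: list of (hits, is_spam), where hits is a list of 0/1 flags,
    one per rule. Returns (weights, bias).
    """
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        for hits, is_spam in samples:
            activation = bias + sum(w * h for w, h in zip(weights, hits))
            output = 1 / (1 + math.exp(-activation))  # logistic output
            error = (1.0 if is_spam else 0.0) - output
            # Nudge the bias and each fired rule's score towards the label.
            bias += rate * error
            for i, h in enumerate(hits):
                weights[i] += rate * error * h
    return weights, bias
```

A rule that fires mostly on spam ends up with a large positive score, while a rule that fires equally often on spam and ham ends up with a score close to zero, so the scores reflect the mail actually seen rather than someone's guess.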

Posted by spamlinks at March 2, 2005 12:00 PM

This work is licensed under a Creative Commons License.
Hosted by spam.abuse.net. Domain registration by Gregg DesElms.

SPAM is a trademark of Hormel Foods.

Page last updated: 15-Nov-2004