ecchi: I suggest you reread his methods, specifically the parts on chained tokens and statistical analysis. His chi-square method is quite accurate.
Briefly, every piece of data in the header and body of a message your friend sends is grouped into tokens. Each of those tokens is assigned a statistical value, which is updated with each message received. When you receive email from your friend, it might have 90 tokens. Each spam will consist of a number of tokens as well. The statistical guess at the end decides whether the email has more spam tokens than non-spam tokens, plus a deviation for the weight/confidence that a given token is a spam indicator. You can also adjust the Noise Reduction setting to bias that determination.
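To make that concrete, here is a rough sketch of the general idea in Python. It is not dspam's actual code: the tokenizer, the class name TokenStats, and the smoothing constants are all my own assumptions, and the combining step is the Robinson-style chi-square (Fisher) approach as I understand it from other statistical filters.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase words and numbers only.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

class TokenStats:
    """Per-user token counts, learned from that user's own mail."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def train(self, text, is_spam):
        if is_spam:
            self.spam.update(tokenize(text))
            self.n_spam += 1
        else:
            self.ham.update(tokenize(text))
            self.n_ham += 1

    def token_prob(self, token, strength=1.0, prior=0.5):
        # Smoothed estimate: tokens seen rarely stay close to the 0.5 prior.
        n = self.spam[token] + self.ham[token]
        if n == 0:
            return prior
        sf = self.spam[token] / self.n_spam if self.n_spam else 0.0
        hf = self.ham[token] / self.n_ham if self.n_ham else 0.0
        p = sf / (sf + hf)
        return (strength * prior + n * p) / (strength + n)

def chi2_q(chi, df):
    # Survival function of the chi-square distribution for even df.
    m = chi / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def spam_score(stats, text):
    # Clamp so that log() never sees 0.
    probs = [min(max(stats.token_prob(t), 0.01), 0.99) for t in tokenize(text)]
    if not probs:
        return 0.5
    df = 2 * len(probs)
    s = 1.0 - chi2_q(-2.0 * sum(math.log(1.0 - p) for p in probs), df)
    h = 1.0 - chi2_q(-2.0 * sum(math.log(p) for p in probs), df)
    return (s - h + 1.0) / 2.0  # near 1 = spam, near 0 = not spam

stats = TokenStats()
stats.train("win money now claim your prize", True)
stats.train("cheap money transfer urgent reply", True)
stats.train("lunch meeting tomorrow at noon", False)
stats.train("project report attached see notes", False)
print(spam_score(stats, "urgent money prize"))
print(spam_score(stats, "lunch meeting notes"))
```

A message made up entirely of tokens the filter has never seen comes out at 0.5, which is the "no opinion" middle of the scale.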
So, if you get a lot of spam routed through tivoli.co.uk's open relay, a valid message from a friend working for IBM/Tivoli will have a higher spam probability. But because of the spam/non-spam score, those tokens won't be weighted as highly, which means the other portions of the content end up deciding the result.
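Here is a small illustration of why the relay token ends up mattering less. The counts are made up for the example, and the function names are mine, not dspam's. The idea is just that a token seen in both spam and good mail sits near 0.5, so it is less decisive than tokens that sit near 0 or 1.

```python
def token_probability(spam_hits, ham_hits, n_spam, n_ham):
    # Share of the token's (normalized) appearances that were in spam.
    sf = spam_hits / n_spam
    hf = ham_hits / n_ham
    if sf + hf == 0:
        return 0.5  # never seen: no opinion
    return sf / (sf + hf)

def most_decisive(probs, k):
    # Rank tokens by distance from the neutral 0.5 mark.
    return sorted(probs, key=lambda t: abs(probs[t] - 0.5), reverse=True)[:k]

# Made-up counts out of 100 spam and 100 non-spam messages.
probs = {
    "tivoli.co.uk": token_probability(40, 30, 100, 100),  # in both piles
    "lottery": token_probability(50, 0, 100, 100),        # spam only
    "meeting": token_probability(1, 40, 100, 100),        # almost all good mail
}
print(most_decisive(probs, 2))
```

With these numbers the relay token lands around 0.57, so it is the first one dropped when only the most decisive tokens are kept.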
Because of the way emails are written, no matter how many 419 emails I receive, it's the pattern of the 419 email that gives it away. It doesn't matter whether it mentions Nigeria, Sumatra, or any other place abroad; statistically, there are things in the email that give it away. Since training dspam, it has missed only two 419 emails in the last month. All of the rest were properly tagged.
As for false positives, I had one message that it tagged as spam that wasn't. Other than that, my emails go into the right folders every time, and I don't have to deal with setting up rules or anything. When I was using SpamAssassin, my Citibank statement, BellSouth statement, etc. were all considered spam. I was amazed at how much mail I was losing with SpamAssassin.
The main advantage of statistical filtering over pattern matching comes from the diversity of your email versus mine. Emails that you receive might be considered spam if I received them, and vice versa. With SpamAssassin, you are relying on a shared corpus, built by people who have culled spam from their own inboxes to feed it and generate the weights.
There are other more advanced analysis methods, covered in detail in their whitepapers.