Re: [Ifile-discuss] Improving classification of spams
From: Booker Bense
Date: Fri, 10 Jan 2003 17:13:26 -0800 (PST)

On Fri, 10 Jan 2003, Jack Bertram wrote:
> Hi all
>
> I use ifile to filter into about 30 different folders and it does a very
> good job on nearly all mail. However, it does a much less good job at
> correctly recognising spam email as spam. Now, I'm much happier with
> false negatives than false positives, so this isn't too much of a
> problem, but it does lead me to wonder why spam email in particular is a
> problem.
>
> My hypothesis is simple: my other folders are fairly homogenous, since
> they correspond to particular mailing lists, mail from particular people
> tending to talk about similar things, etc. But spam email falls into a
> number of different categories: Nigerian spam, porn, etc, yet I put it
> in one folder. Since ifile essentially computes an "average" for each
> folder, and compares an incoming email to that average, non-homogenous
> folders are harder to match correctly than homogenous ones.
>
> So, I'm asking two questions:
>
> 1. Is this hypothesis any good - does anyone else have the same
> experience as me, with non-spam categorised correctly but spam not
> recognised so well?
- So far I've had very good luck with the spam/non-spam issue.
However, I have a hybrid system where I index everything but only
use ifile as a last resort (i.e. I have a bunch of prefilters
based on the sender/headers; if none of those match, I see what
ifile suggests). Every message is indexed by ifile after it gets
filtered.
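
A rough sketch of that hybrid flow, in Python for illustration only (this
is not my actual setup). The names `choose_folder`, `file_message` and
`EXAMPLE_PREFILTERS` are made up for the example, and `classify` and
`index` are stand-ins for whatever asks ifile for a suggestion and feeds
the message back to it; no real ifile invocation is shown.

```python
from typing import Callable, Dict, List, Tuple

Headers = Dict[str, str]
# A prefilter is a (predicate, folder) pair. All names are illustrative.
Prefilter = Tuple[Callable[[Headers], bool], str]

# Hypothetical header-based rules; real ones depend on your own mail.
EXAMPLE_PREFILTERS: List[Prefilter] = [
    (lambda h: "ifile-discuss" in h.get("List-Id", ""), "ifile-discuss"),
    (lambda h: "@example.org" in h.get("From", ""), "work"),
]


def choose_folder(headers: Headers,
                  prefilters: List[Prefilter],
                  classify: Callable[[Headers], str]) -> str:
    """Return the first prefilter match, else fall back to the classifier."""
    for matches, folder in prefilters:
        if matches(headers):
            return folder
    # Last resort: ask the statistical classifier (stand-in for ifile).
    return classify(headers)


def file_message(headers: Headers,
                 prefilters: List[Prefilter],
                 classify: Callable[[Headers], str],
                 index: Callable[[Headers, str], None]) -> str:
    """Pick a folder, then index the message under it in every case."""
    folder = choose_folder(headers, prefilters, classify)
    index(headers, folder)  # indexed whether or not a prefilter matched
    return folder
```

The point of indexing after every filing decision is that the classifier
still learns from mail the prefilters caught, even though it was never
asked about it.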
>
> 2. How many different sorts of spam do I have to distinguish in order to
> make spam matching work better? Will a porn/non-porn distinction work
> well, or do I need to use more spam categories in order to get good
> matching. What do other people on this list do?
>
- What I've done is distinguish between "ispam" and "spam". ispam
is basically anything that isn't plain ASCII, and spam is for
ASCII-readable spam.
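
As an illustration only, the split could be approximated like this in
Python. The function name `spam_folder` is made up, and the test is an
assumption: it just checks the raw bytes for anything outside 7-bit
ASCII, so a message whose body is base64 or quoted-printable encoded
would still count as "spam" here unless it is decoded first.

```python
def spam_folder(raw_message: bytes) -> str:
    """Pick between the two spam folders from the raw message bytes.

    Assumption: "plain ASCII" means every byte is below 0x80.
    """
    # bytes.isascii() is available from Python 3.7 onwards.
    return "spam" if raw_message.isascii() else "ispam"
```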
- Also, I throw away the .idata file and reindex things every
couple of weeks. Not sure if this has any effect or not.
- Booker C. Bense
P.S. I have a Ruby module for using ifile w/the rmail package
if anybody is interested.