Re: Debbugs-submit post from address@hidden requires approval
From: Bob Proulx
Subject: Re: Debbugs-submit post from address@hidden requires approval
Date: Thu, 7 Feb 2013 16:12:55 -0700
User-agent: Mutt/1.5.21 (2010-09-15)
Glenn Morris wrote:
> BTW, this report was also discarded:
> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13633
Thank you for watching over things.
> I tried to forward that one to listhelper-discuss as well, but either
> I messed up or my forward was also discarded...
I haven't had a chance to look at that particular message yet, but I
thought I would say a few words about the general problem. If the
original message was flagged automatically and it is re-sent very
soon afterward, then the re-send will probably be flagged the same
way.
Perhaps we should set up a target address for sending false positive
hits such that it would be easier to train-on-error the Bayes engine.
The listhelpers do this routinely but with direct access to the
system. I acknowledge that for others there isn't currently a way to
do this without contacting us.
Here is the root problem of the current false positives. Just
recently, for some reason, the SpamAssassin Bayes engine has started
classifying more messages as BAYES_95 and BAYES_99, which has caused
an increase in false positives like these. Karl and I have seen it on
a number of messages just recently, although normally we go for a long
time without seeing even one.
At one level the problem is that there is so much spam that, to a
first approximation, all email is spam. Automated tools can therefore
drift into a mode of calling all email spam and still be right often
enough to avoid correction.
At another level is the problem of runaway positive feedback. We feed
most of the mailing lists through 'sa-learn --ham' so that the engine
learns what normally goes to the mailing lists. And we review the
'negative' caughtham queue and feed any false negatives to
'sa-learn --spam'. But SpamAssassin also learns automatically from
messages if they have a low enough or high enough spam score.
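The automatic learning step can be illustrated with a minimal Python
sketch. This is a simplified model, not SpamAssassin's actual code:
the function name and the two threshold values are assumptions chosen
for illustration, not the settings on this system, and the real engine
applies additional conditions before it learns from a message.

```python
# Simplified model of score-based automatic learning.
# Both thresholds are illustrative assumptions, not this system's settings.
HAM_THRESHOLD = 0.1    # at or below this score: learn the message as ham
SPAM_THRESHOLD = 12.0  # at or above this score: learn the message as spam


def auto_learn_action(score, ham_threshold=HAM_THRESHOLD,
                      spam_threshold=SPAM_THRESHOLD):
    """Return 'ham', 'spam', or None when the message is not learned."""
    if score <= ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return None  # scores in between are left for manual training


if __name__ == "__main__":
    for score in (-1.2, 4.0, 15.3):
        print(score, auto_learn_action(score))
```

A message that is wrongly pushed over the upper threshold is learned
as spam, which makes similar messages score higher the next time.
That is the feedback loop described above.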
There are two important scores for SpamAssassin Bayes spam training.
One is the pair of BAYES_95/BAYES_99 rule scores. The other is the
overall "required_hits" threshold. When things are working well,
cranking up BAYES_99 is a good way to reduce the human workload. When
the engine is classifying things correctly, that score matches reality
very closely. But when things start to get into a runaway feedback
mode these scores work against you. The engine has the potential, and
sometimes it really happens, to automatically learn all messages as
spam.
In order to combat the current Bayes classification training errors I
have reduced the values of BAYES_95 and BAYES_99 so that they will
have less impact on the overall score. I have also raised
required_hits a little bit. This means we will need to manually
process more false negatives (spam) but it will hopefully reduce the
pain from false positives like these.
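The effect of those two adjustments can be shown with a small Python
sketch. All numbers here are made up for illustration and are not the
values used on this system; OTHER_RULES is a hypothetical stand-in for
whatever non-Bayes rules a message happens to hit. The sketch assumes
only that a message is flagged when the sum of its rule scores reaches
the required_hits threshold.

```python
def classify(rule_hits, scores, required_hits):
    """Sum the scores of the rules that hit and compare to the threshold."""
    total = sum(scores[rule] for rule in rule_hits)
    return total, total >= required_hits


# A legitimate message that the Bayes engine wrongly rates as BAYES_99.
hits = ["BAYES_99", "OTHER_RULES"]

# Hypothetical scores before and after the tuning described above.
before = {"BAYES_99": 4.0, "OTHER_RULES": 1.5}
after = {"BAYES_99": 2.5, "OTHER_RULES": 1.5}

if __name__ == "__main__":
    print(classify(hits, before, required_hits=5.0))  # flagged as spam
    print(classify(hits, after, required_hits=6.0))   # passes through
```

The same change lets some real spam through as well, which is the
extra manual work mentioned above.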
Hopefully this will correct the SpamAssassin Bayes engine's learned
token set within a few days and then we can tune the scores back to
normal levels again.
Bob