Re: Debbugs-submit post from address@hidden requires approval
From: Bob Proulx
Subject: Re: Debbugs-submit post from address@hidden requires approval
Date: Tue, 18 Sep 2012 12:40:59 -0600
User-agent: Mutt/1.5.21 (2010-09-15)
Glenn Morris wrote:
> Bob Proulx wrote:
> > The listhelper spamassassin robot didn't discard it. Listhelper's
> > Spamassassin classified it as non-spam. I found the message in the
> > non-spam folder.
> >
> > However one of the humans, either myself or Karl, did mark it as spam
> > in the human review process! I don't know which of us. Sorry about
> > that. A slip of the fingers I am sure. I found it in the marked as
> > spam for Bayes training folder. So as Hal in the movie 2001 said, "It
> > can only be attributable to human error."
>
> No problem. I thought it was a mistake from the automatic system; I
> wouldn't have bothered to report a human error. Now I know you feed
> the results back for training I'll report it if I see it happen again
> (this was the first such mistake I ever saw).
It happens. Humans have a non-zero error rate. :-) But so far we
haven't found a way to replace us in the system.
We review messages in a mailbox and do nothing to messages that appear
to be correctly categorized. Legitimate messages that were wrongly
marked and discarded as spam we vivify by re-sending the message. This
is the false positive case and we have tuned things to avoid false
positives as much as possible. Messages that were passed as non-spam
but appear to be spam we mark as such and save in a marked-as-spam
mailbox for Bayes training. That is the false negative case.
We have tuned things so that an error is much more likely to be a
false negative than a false positive, meaning that we only very rarely
need to vivify a discarded message. (That is the more painful case too.) But
we usually must mark many messages as spam every day. It isn't
unusual to see a run of a new type of spam for 10-50 messages in a row
in the mailbox. I am using mutt to view the mailbox and I have 'S'
bound to the action to save it to the spam folder for Bayes training.
I hold down the 'S' key and zip down through a long list of spam.
This description may make it sound more difficult than it is in
reality. It only takes a moment. It is email and we are using a mail
user agent (mutt) to view the messages. One of the biggest features
and improvements for us is that we get to view the entire stream of
all of the mailing lists as one mailbox. Not the way you would want
to read a mailing list but it is the way you want to deal with spam.
This makes identifying spam quite easy, which is primarily the task:
to select out the spam messages and discard them from the combined
mail stream of all of the mailing lists.
We periodically run through the mailbox, handle all of the queued
messages, and then return to doing other things. The
training-on-error keeps the Bayes engine accurate and is a critical
part of the health of the classification system. The spam character
is always changing, always in motion. Being able to make training
corrections early will deflect a lot of spam. But if we are busy and
don't get to it until later that is okay too. It just means more 'S'
key action in the review later. Also it means that other humans
looking at the mailman web page interface to any particular mailing
list might see more spam there than they would if we were quicker on
the global side.
I have the review mailbox sorted by SpamAssassin score. This groups
non-spam at the top and spam at the bottom with the grey ones in the
middle. This makes sorting the messages relatively easy. But every
so often a mistake is made and a message gets filed into the spam
folder for training and is discarded by mistake.
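For illustration only, here is a minimal Python sketch of that
ordering. It is not the actual mutt setup. It assumes each message
carries SpamAssassin's X-Spam-Status header with the score in the
usual "score=N" form, and the function names are made up for this
example.

```python
import re
from email import message_from_string

# Assumption: the score appears as "score=<number>" in X-Spam-Status.
SCORE_RE = re.compile(r"score=(-?\d+(?:\.\d+)?)")


def spam_score(message):
    """Return the SpamAssassin score of a message, or 0.0 if none is found."""
    status = str(message.get("X-Spam-Status", ""))
    match = SCORE_RE.search(status)
    return float(match.group(1)) if match else 0.0


def sort_for_review(messages):
    """Order messages from lowest score (likely non-spam) to highest (likely spam)."""
    return sorted(messages, key=spam_score)


if __name__ == "__main__":
    raw = "X-Spam-Status: Yes, score=14.7 required=5.0\nSubject: buy now\n\nbody\n"
    print(spam_score(message_from_string(raw)))
```

With this ordering the borderline messages end up together in the
middle, which is where a reviewer needs to look most carefully.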
After a message is saved, a separate process runs periodically. It
reads the messages saved to the human-marked-as-spam folder, pipes
them through SpamAssassin's Bayes engine for training as spam, and
generates the discard message. When a discard message comes from
listhelper it isn't possible at that point to tell if it came from the
initial SpamAssassin run or if it came from the second pass where it
was moved by a human for training. Both generate the same message.
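As a rough sketch of the training half of such a periodic job, and not
the script listhelper really runs, something like the following Python
would do. It assumes the marked-as-spam folder is an mbox file and
that sa-learn is on the PATH. The function name and the injectable
"run" parameter are made up for this example, and generating the
discard message is deliberately left out.

```python
import mailbox
import subprocess


def train_as_spam(mbox_path, run=subprocess.run):
    """Train SpamAssassin's Bayes engine on every message in an mbox file.

    Returns the number of messages found. The `run` parameter exists only
    so the command can be replaced by a stub when testing.
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        count = len(box)
    finally:
        box.close()
    if count:
        # sa-learn reads the whole mbox and learns each message as spam.
        run(["sa-learn", "--spam", "--mbox", mbox_path], check=True)
    return count
```

A real job would also have to empty or rotate the folder after
training so the same messages are not processed on every run.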
Now that I know that they can be logged differently I might have the
second pass use a different From address. I don't set one now, so it
is just the underlying user and hostname. At other times it has run on
different machines and would have had different addresses.
> > Is there a log that would tell you if it were discarded by web page or
> > by email md5sum control message? That would be cool to know about.
> > This would have had an md5sum email generated, although if a human
> > gets to the web page interface first then both might happen.
>
> I just look at /var/log/mailman/vette, which says in this case:
>
> Sep 11 10:10:43 2012 (706) debbugs-submit: Discarded posting:
> From: address@hidden
> Subject: Issue of the cp on Ubuntu 10.10
> Sep 11 10:10:43 2012 (706) -request/hold autoresponse discarded for:
> address@hidden
>
> (which I thought was a bot.) For things discarded from the web
> interface it says something slightly different, eg:
>
> Sep 17 11:58:57 2012 (2487) debbugs-submit: Discarded posting:
> From: address@hidden
> Subject: Lower price sale | Dubai | JLT | Fortune Executive | Office
> Reason: Your message was deemed inappropriate by the
> moderator.
Cool! Thanks for educating me about this. I am sure it will be
useful for understanding more cases like this. I never knew about
this logging previously.
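To make the distinction concrete, here is a small Python sketch that
splits a vette log into entries and flags discards that carry a
"Reason:" line. It is based only on the two excerpts quoted above, so
treat the rule "Reason: present means web interface" as an assumption
drawn from those examples. The function names and labels are made up
for this example.

```python
import re

# An entry starts with a timestamp and PID, e.g. "Sep 11 10:10:43 2012 (706) ".
ENTRY_START = re.compile(r"^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4} \(\d+\) ")


def split_entries(log_text):
    """Group log lines into entries; lines without a timestamp continue the previous entry."""
    entries = []
    for line in log_text.splitlines():
        if ENTRY_START.match(line):
            entries.append([line])
        elif entries and line.strip():
            entries[-1].append(line.strip())
    return entries


def classify_discard(entry):
    """Return "web" or "no-reason" for a discarded posting, None for other entries."""
    if "Discarded posting" not in entry[0]:
        return None
    if any(line.startswith("Reason:") for line in entry[1:]):
        return "web"        # assumption: only web-interface discards log a Reason
    return "no-reason"      # e.g. the email control message case seen above
```

Run over /var/log/mailman/vette, this would give a quick count of how
many discards came through each path, within the limits of that
assumption.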
> The main way I tell the difference is that I patched the Mailman web
> interface to always "preserve for administrator", and this was not
> preserved. (I wish I knew how to do that for email discard messages
> too.)
I always thought that "preserve for administrator" meant keep it in
the hold queue. What does it really do when a message is preserved
for the administrator?
Bob