Re: [nmh-workers] INCing of email archives
From: Bakul Shah
Subject: Re: [nmh-workers] INCing of email archives
Date: Fri, 26 Jul 2019 07:42:02 -0700
On Jul 25, 2019, at 4:25 PM, Ken Hornstein <address@hidden> wrote:
>
>> Once in a while I download email archives of some mailing list
>> and unpack them using "inc -file <archive-file>". But more
>> than once I have seen that inc gets confused and doesn't
>> unpack the whole thing. The cause seems to be a line starting
>> with From in some message body. Ideally inc should check that
>> a "From ..." line is immediately followed by header lines.
>> And if this is not the case, assume it is in the message body.
>
> Ralph answered this, but let me expand a bit.
>
> The job of inc(1) is to incorporate messages from a 'mail drop' into your
> MH mailbox. Traditionally it handles mbox-style files and POP (it also
> does MMDF, but let us not speak of that).
>
> As you can see from the Wikipedia entry Ralph linked to, all of the
> various mbox formats use the same scheme: a line beginning with
> "From " is the mailbox delimiter (mboxcl and mboxcl2 use a
> Content-Length header; I believe they are officially dead at this
> point). The big
> differences are in quoting rules. Unfortunately since we're kind of
> locked in to the mbox format in inc(1) at least, changing that would
> have some nasty consequences (Ralph gave you an example of a message
> that it would break on but I am sure there are others). I think your
> best bet is to preprocess these mailing list archives so they are valid
> mbox files.

Thanks, Ralph & Ken. The site I downloaded the latest email archive
from uses Mailman, so I was a bit surprised. The method I suggested
would let inc handle a larger set of inputs.
While there can still be false positives, the number of body lines
matching the two-line pattern

    From ... [0-9]$
    <mail header>:

is likely to be much, much smaller than the number of body lines that
merely start with "From " and end in a digit. Still, I can understand
the reluctance to add this logic to inc.
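
For anyone who wants to try the preprocessing route instead, here is a rough sketch of the same check applied outside inc. It is only an illustration under my own assumptions, not anything nmh does: the function name and both regular expressions are hypothetical, and the header test simply looks for printable non-space characters followed by a colon. A "From " line is kept as a delimiter only if it ends in a digit and the next line looks like a header; any other "From " line is escaped as ">From ", the usual mbox quoting.

```python
import re

# Candidate delimiter: starts with "From " and ends in a digit (the year).
FROM_RE = re.compile(r"^From .*[0-9]$")
# Header-like line: printable ASCII other than space and colon, then a colon.
HEADER_RE = re.compile(r"^[!-9;-~]+:")


def quote_body_from_lines(lines):
    """Escape "From " lines that do not look like mbox delimiters.

    `lines` is a list of strings without trailing newlines.
    Returns a new list; the input is left untouched.
    """
    out = []
    for i, line in enumerate(lines):
        if line.startswith("From "):
            # Peek at the following line; end of input counts as "no header".
            nxt = lines[i + 1] if i + 1 < len(lines) else ""
            is_delimiter = bool(FROM_RE.match(line)) and bool(HEADER_RE.match(nxt))
            if not is_delimiter:
                line = ">" + line
        out.append(line)
    return out
```

Reading the archive with splitlines(), passing the result through this function, and writing the lines back out should give a file to hand to "inc -file". The false positive mentioned above remains: a body line such as "From the year 2019" directly followed by a line like "Note: ..." would still be taken for a delimiter.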