Re: [Pan-users] GNU javamail and article number
From: Duncan
Subject: Re: [Pan-users] GNU javamail and article number
Date: Tue, 19 Feb 2013 04:24:37 +0000 (UTC)
User-agent: Pan/0.140 (Chocolate Salty Balls; GIT 038526b /usr/src/portage/src/egit-src/pan2)
Thufir Hawat posted on Mon, 18 Feb 2013 01:38:03 +0000 as excerpted:
> [Developer question: NNTP app in Java]
FWIW, your message is a bit like some of mine, more a stream of consciousness
than well edited and organized. It does make a response a bit harder,
without quoting the whole long post, anyway, as finding the actual bit to
reply to and getting the proper context is difficult. I guess I'm seeing
a bit how hard it must be to reply to some of my posts.
Anyway, I've slightly edited and rearranged order, etc...
> I'm looking at the source for:
> http://developer.classpath.org/inet/doc/gnu/inet/nntp/GroupResponse-source.html
> /*
>  * The last article number in the group.
>  */
> public int last;
>
> which looks like last should be the number for the last article for a
> group. Now, when checking for new articles, what is that number
> compared to?
>
> http://cvs.savannah.gnu.org/viewvc/*checkout*/mail/source/gnu/mail/providers/nntp/NNTPFolder.java?root=classpathx&content-type=text%2Fplain
> GroupResponse response = ns.connection.group(name);
> if (response.last > last)
> {
>     hasNew = true;
> }
>
> I'm just trying to figure out how, when connecting to a new server, do you
> know what was the article number for the latest article? Is that kept,
> generally, in the .newsrc perhaps?
FWIW, the newsrc tracks read messages, not already seen (but possibly
unread) messages. The (multi-app-standard) newsrc file assumes only a
single server, so multiple newsrc files must be used when there's more
than a single server. It's the newsgroups.xov file that tracks already
seen messages -- the server highwater marks. AFAIK, unlike the newsrc
file format, newsgroups.xov isn't common to other news clients, and it
contains entries for multiple servers. (A comment in the file lists the
specific format.)
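
To make the newsrc point concrete, here's a minimal Java sketch that parses one newsrc line. It assumes the traditional layout: group name, then ':' (subscribed) or '!' (unsubscribed), then comma-separated ranges of read article numbers. The class name NewsrcLine and the sample group are made up for illustration. The newsgroups.xov format isn't covered, since its layout is only described in that file's own comment.

```java
import java.util.ArrayList;
import java.util.List;

/** One parsed newsrc line (hypothetical helper, not part of pan or javamail). */
final class NewsrcLine {
    final String group;
    final boolean subscribed;
    /** Each entry is {low, high}, both inclusive. */
    final List<long[]> readRanges = new ArrayList<>();

    NewsrcLine(String line) {
        // Find the first ':' or '!' that separates the group name from the ranges.
        int sep = -1;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ':' || c == '!') { sep = i; break; }
        }
        if (sep < 0) {
            throw new IllegalArgumentException("no ':' or '!' in: " + line);
        }
        group = line.substring(0, sep).trim();
        subscribed = line.charAt(sep) == ':';
        String rest = line.substring(sep + 1).trim();
        if (!rest.isEmpty()) {
            for (String part : rest.split(",")) {
                String[] ends = part.trim().split("-");
                long low = Long.parseLong(ends[0]);
                // A single number such as "125" is a range of one article.
                long high = ends.length > 1 ? Long.parseLong(ends[1]) : low;
                readRanges.add(new long[] {low, high});
            }
        }
    }

    /** True if the article number falls inside any read range. */
    boolean isRead(long articleNumber) {
        for (long[] r : readRanges) {
            if (articleNumber >= r[0] && articleNumber <= r[1]) return true;
        }
        return false;
    }
}
```

Usage would be along the lines of new NewsrcLine("misc.test: 1-120,125,130-140").isRead(125). Note that this only answers "was it read?", not "was it already seen?", which is the distinction made above.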
> My concern is that the "id" isn't reliable:
>
> http://docs.oracle.com/javaee/6/api/javax/mail/Message.html#getMessageNumber%28%29
> "Note that the message number for a particular Message can change during
> a session if other messages in the Folder are deleted and expunged."
>
> Because, when javax mail (which is utilized in this context) loads a
> folder, it simply *counts* the number of messages in a given folder.
>
> How does pan handle this? For simplicity, let's assume just one server
> is being accessed. Pan keeps the latest article number in a .newsrc
> file and then iterates up?
No... more later.
> What I'm after is not just a method, as above, to check for new articles
> but to return a range of articles which are new -- something along those
> lines.
>
> Or, maybe, another approach is to just keep the latest article number
> increment it, and request the article until errors are caught. However,
> that assumes there are no gaps in the article numbers. And, still, the
> article number *must* be stored somewhere.
>
> It all starts with *getting* the article number. Apparently
> NNTPFolder.java is using GroupResponse to handle the article number, so
> I should be also using GroupResponse and see what article numbers it
> gets?
I don't claim to be a coder, tho I can sort of follow along on many
coding discussions and do occasional limited patching, etc. And java...
wouldn't exactly be my choice were I to try to become a coder. After I
spent a couple hours last nite trying to make sense of the docs at the
various links you provided, I /think/ I have some sense of it, but I'm
more sure than ever that Java isn't my choice of coder's beverage, by a
LONG shot!
Anyway, it seems there's three sets of... article IDs... we're looking
at, two from the RFCs, and a third from the Java classes you're working
with. The classes ID is very similar in idea and function to one of the
two RFCs IDs, and *MAY* be identical to it in the FolderNNTP subclassing
of the general Folder class, but I'm not sure -- I couldn't find anything
that actually /said/ that, one way or the other.
But the similarity without knowing if they're identical makes things
extremely confusing, because reading the docs I had to keep reminding
myself that the article numbering they were talking about was the
local-client classes numbering, not the one from the server (RFC standard)...
unless they're identical in the case of FolderNNTP, which I never did
figure out.
> However,
> the GNU javamail NNTP API seems to have no provision for directly seeing
> those article numbers:
>
> http://www.gnu.org/software/classpathx/javamail/javadoc/gnu/mail/
> providers/nntp/NNTPFolder.html
>
> There's just no method listed for dealing with article numbers, they're
> encapsulated, which I guess is good. But they're encapsulated so well I
> don't see how to get *new* articles without re-fetching everything.
(In the below I think I use NNTPFolder and FolderNNTP interchangeably,
having forgotten which one was actually used. NNTPFolder is the real
name, so if you see a reference to FolderNNTP that I missed changing,
read it as NNTPFolder.)
There's references to article numbering in some of the methods. But as I
said, it's ANYTHING but clear whether the article numbering they refer to
is identical to the RFCs one the server's using, or if it's a local-only
classes version that works similarly, but is independent from the RFCs
article numbers the server is passing.
Were I working on a project using those classes, I'd now be hacking up
some experimental code to actually SEE the article numbers the classes
are using, and compare them to what I was seeing actually being passed
from the server, using ngrep or similar connection sniffing. That'd
answer once and for all whether they were identical, or not.
> As suggested here:
>
> I've been reading RFC's, but that doesn't help with determining what GNU
> javamail is actually doing, versus what it's supposed to do. (I really
> don't like the Apache API at all -- but if there's a Java API someone
> knows works for this, that would be interesting. The GNU API is very
> clean, just maybe **too** clean.)
I can't help but think about the various perl and python nntp handling
modules I've read about... I've never actually worked with them, but
I'd /hope/ they're easier to work with for people familiar with the RFCs.
And given that I believe there's actually several different nntp modules
to choose from, I expect I'd be more comfortable with at least ONE of
them, than these Java classes...
But be that as it may, you're working with what you're working with, so
let's try to deal with it...
FWIW, the RFC in question would appear to be rfc3977. The GROUP command
you referred to is covered in section 6.1.1, but it's worth reading about
the related LISTGROUP (6.1.2), LAST (6.1.3) and NEXT (6.1.4) commands in
section 6.1, Group and Article Selection, as well. That can be found
here:
http://tools.ietf.org/html/rfc3977#section-6.1
The GROUP command and its response codes (response codes are covered in
section 3.2) formats look like this (section 6.1.1.1):
Syntax
    GROUP group
Responses
    211 number low high group    Group successfully selected
    411                          No such newsgroup
Parameters
    group     Name of newsgroup
    number    Estimated number of articles in the group
    low       Reported low water mark
    high      Reported high water mark
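
A 211 reply line in that format could be pulled apart like this in Java. This is only a sketch: GroupReply is a hypothetical helper, not part of GNU inetlib or javamail, and it assumes the reply line has already been read off the connection as a string.

```java
/** Parsed "211 number low high group" reply (hypothetical helper class). */
final class GroupReply {
    final long count;   // estimated number of articles in the group
    final long low;     // reported low water mark
    final long high;    // reported high water mark
    final String group; // name of the newsgroup

    private GroupReply(long count, long low, long high, String group) {
        this.count = count;
        this.low = low;
        this.high = high;
        this.group = group;
    }

    /** Parses a single 211 status line; anything else is rejected. */
    static GroupReply parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length < 5 || !f[0].equals("211")) {
            throw new IllegalArgumentException("not a 211 GROUP reply: " + line);
        }
        return new GroupReply(Long.parseLong(f[1]), Long.parseLong(f[2]),
                              Long.parseLong(f[3]), f[4]);
    }
}
```

For example, GroupReply.parse("211 1234 3000234 3002322 misc.test") gives count 1234, low 3000234 and high 3002322. A 411 reply (no such newsgroup) ends up in the exception path here, which a real client would want to handle more gracefully.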
As background, it's worth explicitly noting here the three article ID
forms I mentioned earlier.
First, there's message-id, found as a header in the article, that's
designed to be a GUID, globally unique ID. Message-ID is covered in the
generic Internet Message RFCs covering both mail and news. Pan, BTW,
uses the fact that message-ids are GUIDs in its message caching -- pan's
message cache filenames are message-ids, with a bit of character
substitution where necessary in order to sanely manage filesystem
filename compatibility. That works out pretty well with multi-server as
well, since message-ids are supposed to be GUIDs and the same message
will have the same message-id regardless of which server you fetch it
from, so once the file is cached from one server, it's seen as already
there and the other server fetch threads simply skip on to the next
message.
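
As an illustration of the cache-filename idea, here's a small Java sketch that turns a message-id into a filesystem-safe name. The substitution scheme (percent-encoding every byte outside a small safe set) is an assumption made for the example and is NOT pan's actual scheme. The class and method names are hypothetical too.

```java
import java.nio.charset.StandardCharsets;

/** Maps a message-id to a cache file name (illustrative scheme, not pan's). */
final class MessageIdCacheName {
    static String toFileName(String messageId) {
        StringBuilder sb = new StringBuilder();
        for (byte b : messageId.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            boolean safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '.' || c == '-' || c == '_' || c == '@';
            if (safe) {
                sb.append(c);
            } else {
                // '%' itself is not in the safe set, so two different
                // message-ids can never map to the same file name.
                sb.append('%').append(String.format("%02X", b & 0xFF));
            }
        }
        return sb.toString();
    }
}
```

Since the name depends only on the message-id, the same article fetched from a second server maps to the same file, which is what makes the "already cached, skip it" check work across servers.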
The java class methods do appear to accept message-id as a parameter in a
number of cases, as do various RFC/NNTP commands.
Second, there's the RFC message numbers, per-server per-group sequential
message numbering. It is these numbers that the GROUP command reports
for the low and high watermarks as listed above -- that's the first and
last messages potentially available on the server at the time the
response was issued.
These RFC-standard article numbers are what pan tracks in its newsrcs and
newsgroups.xov, and are extremely commonly used in all sorts of news
clients, because they're (nominally, see below) sequential and rather
less free-form than message-ids tend to be, and thus comparatively easy
to track and to work with. The down side is that they're per-server,
thus the need to reset them if a user changes news server, or if the news
server itself gets rebuilt and didn't have backups allowing it to restart
the numbering sequences where it left off.
Additionally, article numbers are /nominally/ sequential, but as rfc3977
explicitly points out in a number of places, that does NOT mean that
there's no gaps, or that there's a consistent persistence of articles by
number during a particular nntp session. In particular, common server
implementations assign article numbers on an incoming-message server
before the articles have been locally processed, despammed, forwarded to
the front-ends the users (or rather their news clients) actually contact, etc.
Despamming and the like thus results in article sequence number gaps, and
additionally, there's no guarantee in terms of local server processing
order, so it's very common to see say 255346 come in and boost the
highwater mark from 255205, before numbers 255206 thru 255345 appear.
These late to transfer articles then "backfill" the sequence, and any
client which updated after the high number boost but before the backfill
that is NOT prepared for backfills, will simply miss those posts entirely!
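
A client that wants to cope with backfills has to compare what the server lists now against what it has already seen, rather than only looking above the old high water mark. A rough Java sketch of that comparison follows. BackfillCheck is a hypothetical helper, and it assumes the client keeps the full set of seen article numbers, which is one possible design among several.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Finds articles that appeared since the last check, including backfills. */
final class BackfillCheck {
    /** Numbers the server lists now that were not seen before, ascending. */
    static TreeSet<Long> unseen(Set<Long> seen, List<Long> listedNow) {
        TreeSet<Long> result = new TreeSet<>(listedNow);
        result.removeAll(seen);
        return result;
    }

    /** The unseen numbers at or below the old high water mark: the backfills. */
    static TreeSet<Long> backfilled(TreeSet<Long> unseen, long previousHigh) {
        return new TreeSet<>(unseen.headSet(previousHigh, true));
    }
}
```

With the numbers from the example above: if the client saw 255205 and 255346 and recorded 255346 as the high water mark, and the server later also lists 255206 and 255300, both show up in backfilled(...) even though they sit below the recorded mark. A client that only asks for "anything above 255346" would never see them.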
(FWIW, from what I've seen pan does middling well with this. It either
catches most of the backfills or the backfill case isn't as common as
I've been led to believe, but it can still be useful to manually "fetch
all headers", as opposed to just new headers, occasionally, as doing so
does seem to catch the occasional missed post. They weren't late to
server sequence numbering or they'd show up with a new headers fetch;
they were article sequence number backfills that pan didn't catch on its
own, that only show up with "fetch all headers". But pan does WAY better
at that than some other clients I've used, which would sometimes backfill
more messages than had been fetched the first time!)
The NNTP LISTGROUP command is similar to GROUP, returning the same 211
information, but in addition, it enumerates the articles actually
available within the range, listing them one per line in an extended
response after the initial 211 reply line.
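
Reading a LISTGROUP reply might look something like the following Java sketch. It assumes the multi-line reply is read from a BufferedReader, starts with the 211 status line, and ends with a line containing only ".". ListGroupReply is a hypothetical name and error handling is kept to the bare minimum.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Reads the article numbers out of a LISTGROUP reply (hypothetical helper). */
final class ListGroupReply {
    static List<Long> readArticleNumbers(BufferedReader in) throws IOException {
        String status = in.readLine();
        if (status == null || !status.startsWith("211")) {
            throw new IOException("unexpected reply: " + status);
        }
        List<Long> numbers = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.equals(".")) {
                return numbers; // terminating line of the multi-line reply
            }
            numbers.add(Long.parseLong(line.trim()));
        }
        throw new IOException("connection closed before the terminating '.' line");
    }
}
```

The returned list holds only the article numbers that actually exist at that moment, so gaps inside the low/high range simply don't appear in it.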
The NNTP NEXT command can be used to iterate thru actually available
posts, letting the server decide what the next one it has is, instead of
the client having to guess. Do however note the above caveat, that
individual article numbers can appear and disappear over time within a
session, so the NEXT ordering within a particular range as seen by two
different clients that time their NEXT requests differently, isn't
necessarily going to be consistent. At minimum, I'd suggest an
implementation using NEXT iterate repeatedly over a range, until no
further articles appear. The alternative of course, if the LISTGROUP
command is available on a particular server, would be to use that to check
the range again after the first run thru, to see if any further articles
have appeared.
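
The "iterate until nothing new turns up" idea can be sketched in Java independently of how a single pass is done. Here onePass is a hypothetical callback standing in for either one walk through the range with NEXT or one LISTGROUP call, and maxPasses is an assumed safety limit so a busy group can't keep the loop going forever.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.function.Supplier;

/** Repeats a scan of the group until a pass yields no new article numbers. */
final class StableScan {
    static TreeSet<Long> scanUntilStable(Supplier<List<Long>> onePass, int maxPasses) {
        TreeSet<Long> all = new TreeSet<>();
        for (int pass = 0; pass < maxPasses; pass++) {
            // addAll returns true only if at least one number was new.
            boolean grew = all.addAll(onePass.get());
            if (!grew) {
                break; // this pass found nothing new, treat the set as stable
            }
        }
        return all;
    }
}
```

This always costs at least one extra pass beyond the last one that found something, which is the price of noticing late arrivals.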
The NNTP LAST command, counterintuitively, fetches the PREVIOUS (not the
last) article in the newsgroup, quoting rfc3977: "that is, the highest
existing article number less than the current article number". Again,
the dynamic actually available article status caveat applies.
That covers the two RFC article id types, "article number" and "message-
id".
Now we get to the NNTPFolder class article numbers. As mentioned above,
these appear to be very similar in idea and function to the rfc "article
numbers", but it's not AT ALL clear to me whether they're actually
identical, or whether the java classes do their own independent
numbering, acting as if they're a server of their own, with their own
numbers, instead of using the server numbers used in the rfc NNTP
protocol.
There are some cross-session "stateless" nntp client implementations.
One example is lynx, the text-based browser, which DOES do nntp, but
apparently without any way to save state between sessions, so what's
actually available on the server when you connect is what you get. The
whole idea of cross-session stateless/cacheless nntp seems rather strange
to me, but when you think about it, it's the way people /normally/ use
the web, so it sort of makes sense for a browser net-news
implementation. (I had no idea lynx did news at all until I read about
it somewhere and had to give it a go. Sure enough! Could come in handy
some day when X isn't working so I can't use pan, but I remember seeing
the problem discussed in a recent newsgroup post, I just have to get to
it, in order to see the steps I need to do to get back into X!
Actually, since I have my pan text instance set to unexpiring and a multi-
gig/multi-year cache, I could probably grep it out of there as well, but
firing up lynx and heading for the newsgroup would likely be easier if it
was recent and I remember subject and/or author details well enough to
find it quickly.)
Reading the NNTPFolder docs, it occurs to me that if the "article
numbers" they refer to are NOT identical to the rfc's server-supplied
"article numbers", it may be that this javaclass implementation at least,
is designed to be just that, cross-session stateless, you get what the
server has available when you connect, no more, no memory of a previous
session to save or worry about.
That would certainly simplify the implementation! Unfortunately, it's
not particularly useful, in the way nntp is traditionally used, at
least. For a particular "browsing" session, sure, but forget about
saving state!
HOWEVER, it MAY be that the "article numbers" referred to are INDEED the
rfc-version, as supplied by the server, in which case saving and
recalling state DOES appear to be reasonable, since the current server
state as seen by the GROUP commands, etc, is then by definition matchable
against the previous session's state.
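
If the numbers are indeed the server's, then "which articles are new since last time" reduces to remembering the old high water mark per server and group. A minimal Java sketch under that assumption is below. HighWaterStore, the key layout, and the use of java.util.Properties as the store are all choices made for the example, not anything pan or javamail does.

```java
import java.util.Properties;

/** Remembers the last high water mark per server and group (sketch only). */
final class HighWaterStore {
    private final Properties marks = new Properties();

    /**
     * Returns {first, last} of the article numbers not yet covered, or null
     * if nothing is new, and records the new high water mark.
     */
    long[] newRange(String server, String group, long low, long high) {
        String key = server + "/" + group;
        String saved = marks.getProperty(key);
        long first = low; // no saved state: everything on the server is new
        if (saved != null) {
            long previousHigh = Long.parseLong(saved);
            if (high >= previousHigh) {
                first = Math.max(low, previousHigh + 1);
            }
            // Otherwise the numbering went backwards; assume the server was
            // reset or rebuilt and start over from its low water mark.
        }
        marks.setProperty(key, Long.toString(high));
        return first <= high ? new long[] {first, high} : null;
    }

    /** Exposes the marks so they can be saved between sessions. */
    Properties asProperties() {
        return marks;
    }
}
```

The marks could be written out between sessions with Properties.store and read back with Properties.load. Note that a plain high water mark like this does nothing about backfills below the mark, and the returned range can still contain gaps.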
Some bits of the documentation hint at this, tho as I said, I never could
find anything explicitly STATING it.
So for instance, in regard to your deleted/expunged concern, note that
the delete/expunge methods don't apply to NNTPFolder as it's read-only.
At first, the read-only bit appears to back the stateless single-session
thing, but it DOES mean that you don't have to worry about THAT
particular renumbering issue.
And the "open" method, while the method enumeration at the top says it
doesn't apply to NNTPFolder, down in the description, it actually says
something different, that the "open" method is used to issue the GROUP
command and to update current state. So that's how NNTPFolder exposes
the GROUP command...
Various other bits I was able to infer, by looking at the methods
inherited from the parent Folder class, altho their NNTPFolder subclass
usage and implementation differences aren't explicitly documented.
getMessages appears to be the method that exposes article numbers for use
in other commands. In the generic/mail folders case, there's the delete/
expunge and renumbering to worry about, but since delete/expunge doesn't
apply to NNTPFolder...
What I was looking for was something explicit that says that in NNTPFolder
subclass (as opposed to the Folder parent class), article numbers aren't
simply counted, but instead, the server sequence numbers are used. I
couldn't find it, but given the read-only nature of NNTPFolders and thus
the elimination of the delete/expunge, etc, issues, it would seem to be a
logical subclass extension.
And as best I can see, if the server article numbers ARE used, you'd then
have a chance to compare and update state in a new session against the
old one, since you'd have a means to measure current server status
against saved previous status, while if article numbers are local-only,
as you fear, I don't see any way to measure current session server state
against that of a previous session, so it may well be that this class
implementation at least, is "session state only", much like that of lynx
as I mentioned above, and what you see in that session is what you get,
no saved state between sessions at all.
So as I said, were I working on the project, I'd be tooling up some
experimental code right about now, to actually check out those article
numbers, comparing them against the server assigned article numbers as
seen in the actual net traffic sniffed with ngrep or the like.
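
For the server side of that comparison, a bare-bones Java probe that sends GROUP and prints the raw reply might look like this. It's a sketch under several assumptions: plain NNTP on port 119, no authentication, and a server that accepts GROUP right after the greeting (some want MODE READER or a login first). GroupProbe and its command-line arguments are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Sends GROUP to a server and shows the raw reply line (experiment only). */
final class GroupProbe {
    /** Writes "GROUP name" and returns the first line the server sends back. */
    static String group(BufferedReader in, Writer out, String name) throws IOException {
        out.write("GROUP " + name + "\r\n");
        out.flush();
        return in.readLine();
    }

    public static void main(String[] args) throws IOException {
        // args[0] = server host, args[1] = newsgroup name (both supplied by you)
        try (Socket s = new Socket(args[0], 119);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
             Writer out = new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8)) {
            System.out.println(in.readLine());            // greeting, 200 or 201
            System.out.println(group(in, out, args[1]));  // "211 number low high group"
        }
    }
}
```

The low and high values printed here could then be compared with whatever Message.getMessageNumber() reports for the first and last messages of the same group opened through the javamail classes. If they match, the classes are using the server's numbers. If the javamail numbers start at 1 regardless, they're a local count.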
Or, being you're the coder that I'm not, since the code is available, you
can actually take a look at the implementing class code to see what it
does, instead of doing the reverse engineering and experimentation that
I'd do.
I don't know how well that answers your questions. Certainly not as well
as someone with actual experience with this javaclass code could have.
But even in the few hours I've spent looking at it, I THINK (famous last
words) I understand a bit more about it than the "frustrated and at a
loss" you seemed to be expressing in your message. Thus, hopefully, it's
at least /some/ help. =:^/
But really, if you're not wedded to java for some reason, do consider
looking at a python implementation. I believe you'll find a reasonable
amount of existing code, with multiple nntp helper modules to choose
from, and that you'll find at least one of them rather saner than the java
classes seen here. Maybe you're more comfortable with java, I don't
know, but I'd almost certainly be more comfortable with python, even tho
I don't claim to be a python coder either. And of course the same goes
for perl; multiple helper modules should be available, as well as
implementing code using them that you can study. Except I personally
prefer python, and have thought for several years that if I eventually
progress beyond bash, python is my logical next step.
(I actually did look into learning perl, but decided it was a bit /too/
flexible for me; python's enforced formatting due to its use of
formatting for block indication, among other things when compared against
perl, appeals to me. Plus, there's more python available in my
environment to study, not a small factor considering that my practical
knowledge of bash scripting originated with my taking apart and putting
back together the various initscripts in my first Linux installations,
Mandrake 8.x, back then. That's actually why I took a look at perl first
as well, as the Mandrake package manager was perl-based. Now of course
I'm on gentoo, with its portage package manager (as well as a second
gentoo PM implementation, pkgcore; there's a third as well, paludis, but
it's C++ based) being python based.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman