
[Pan-users] Re: Dealing with huge numbers of messages in a newsgroup


From: Duncan
Subject: [Pan-users] Re: Dealing with huge numbers of messages in a newsgroup
Date: Thu, 17 Feb 2005 03:22:00 -0700
User-agent: Pan/0.14.2.91 (As She Crawled Across the Table)

Wilbur Pan posted <address@hidden>, excerpted
below,  on Wed, 16 Feb 2005 17:23:13 -0500:

> I'm using Pan and using Giganews as my news server.  There are some groups
> with huge numbers of messages -- close to 1,500,000, in fact.  I've
> determined that I can set up Pan to load the 450,000 newest messages
> without my computer locking up on me.  However, I can't seem to get to the
> older messages.
> 
> I tried downloading 450,000 messages, then marking them read, then asking
> for another 450,000 messages, but the program seems to just lock up or
> reloads the first batch of messages.
> 
> I've got 512 MB of RAM, so it's a decent amount of memory, and my computer
> dual boots into Windows and Linux.  I've tried running Pan in both Windows
> and Linux.

Scalability is a known issue with PAN, at this point.  Part of the problem
is that it's trying to stuff millions of messages into a GTK+ tree widget
that is designed as a GUI widget with handy data properties, not a data
widget that happens to display in a GUI (or not).  It simply isn't
designed to scale to the level PAN is forcing it to.  Another part of the
problem is that PAN currently attempts to keep all that data in memory at
once, and doesn't do much in the way of compression.  If there are 50
multipart messages each of 5 parts, all by the same poster with almost
exactly the same subject, PAN still keeps all 250 copies of the entire
subject and author headers, all in memory.
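
To make that duplication concrete, here is a rough sketch in Python.  It
is an illustration only, not PAN's actual code: the author address is made
up, and only the author header is modelled, since the subjects differ
slightly from part to part.  It compares 250 separately stored copies of
one header against a single copy that every article refers to.

```python
import sys

def make_author_headers(count=250):
    # "".join builds a fresh str object on every call, standing in for a
    # store that keeps its own copy of the header for each article.
    # The address is a made-up placeholder.
    return ["".join(["Some Poster <poster@", "example.invalid>"])
            for _ in range(count)]

def share_strings(strings):
    # Map each distinct text to one canonical object and reuse it.
    table = {}
    return [table.setdefault(s, s) for s in strings]

def distinct_objects(strings):
    # Number of separate string objects actually held in memory.
    return len({id(s) for s in strings})

def stored_bytes(strings):
    # Bytes used by the string objects, counting each object once.
    sizes = {id(s): sys.getsizeof(s) for s in strings}
    return sum(sizes.values())

copies = make_author_headers()   # 50 posts x 5 parts = 250 articles
shared = share_strings(copies)

print("separate copies:", distinct_objects(copies), stored_bytes(copies))
print("shared copy:    ", distinct_objects(shared), stored_bytes(shared))
```

The shared list still holds 250 references, so the saving is the string
bytes themselves; each article keeps only a pointer-sized reference.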

The problem is actually better than it was.  With 3/4 to 1 GB of RAM, PAN
can usually handle a million or two articles now, but that's about the
limit.  It used to grind to a halt at 100K-200K articles, so it's already
better by roughly an order of magnitude.

The next major project within PAN is to change all this.  PAN currently
uses its own backend; it's being rewritten to use the SQLite library, a
proper database library designed for better scaling.  At the same time, a
couple of other changes designed to dramatically improve scaling are being
incorporated into the new code.

Instead of those identical strings being stored 250 times as above,
they'll be stored only once, with an address pointer substituting for the
other 249 copies.  On 32-bit platforms, anyway, that pointer is 32 bits,
or four bytes: the storage of a four letter word.  This is a form of data
compression and will allow PAN to make much better use of the memory it
uses.

Also, a database windowing technique is being implemented.  Instead of
trying to fit /all/ the data from a group's article overviews in memory at
once, PAN will save it in an on-disk database (using SQLite for
management), and keep only a window of a few thousand articles in active
memory at a time.  This will make PAN /much/ more efficient in the way it
uses and manages memory.  Due to the windowing, it will allow nearly
unlimited numbers of article overviews in a group, scaling linearly past a
couple hundred thousand overviews instead of slowing down geometrically
somewhere between half a million and two million overviews.  At that
point, PAN should be able to handle groups of well over 10 million
articles, probably well over 100 million, given a decent amount of
decently fast memory and a high speed hard drive.
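
The windowing idea can be sketched with Python's built-in sqlite3 module.
This is only a minimal illustration under my own assumptions, not the
schema or code planned for PAN: the table name, column names and helper
functions are all hypothetical, and an in-memory database stands in for
the on-disk file so the example is self-contained.

```python
import sqlite3

def open_store(path=":memory:"):
    # Pass a file path instead of ":memory:" for a real on-disk store.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS overview ("
        " number INTEGER PRIMARY KEY,"   # article number within the group
        " subject TEXT NOT NULL,"
        " author TEXT NOT NULL)"
    )
    return conn

def add_overviews(conn, rows):
    # One transaction for the whole batch; rows may be any iterable.
    with conn:
        conn.executemany(
            "INSERT INTO overview (number, subject, author)"
            " VALUES (?, ?, ?)",
            rows,
        )

def load_window(conn, first_number, size):
    # Only `size` rows are brought into memory, however large the table.
    cur = conn.execute(
        "SELECT number, subject, author FROM overview"
        " WHERE number >= ? ORDER BY number LIMIT ?",
        (first_number, size),
    )
    return cur.fetchall()

conn = open_store()
add_overviews(
    conn,
    ((n, f"Subject {n}", "poster@example.invalid")
     for n in range(1, 100001)),
)

window = load_window(conn, 50000, 2000)
print(len(window), window[0][0], window[-1][0])
```

The window is located by article number rather than by a row offset, so
SQLite can seek straight to the first row through the primary key instead
of stepping over everything before it.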

Unfortunately, all this is a lot of work.  We are probably looking at late
Q2 at the earliest for the first betas, and late Q3 or Q4 before it
stabilizes to the general level PAN is at currently.

For that level of binary group access, try a different app, for now.  BNR2
or 3 is a Borland Delphi/Kylix based binary news harvester available for
both MSWormOS and Linux.  Because it's not fully libreware, though, I
won't run it.  There's also a fairly new KDE based product, klibido,
available for anything KDE runs on.  It's extremely fast and efficient in
its computer resource usage, while managing multiple connections to
multiple servers just as BNR2/3 do.  Because it's less than a year old,
it doesn't yet have anything like filtering, and is otherwise still a bit
rough around the edges.  I use it here, however, and was quite frankly
amazed at just how efficient it is, particularly after watching PAN
struggle with things for a number of years.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html





