Re: [Pan-users] size of newsrc-1 file
From: Duncan
Subject: Re: [Pan-users] size of newsrc-1 file
Date: Wed, 6 Jul 2016 06:22:11 +0000 (UTC)
User-agent: Pan/0.141 (Tarzan's Death; GIT 188fd4bb4)
Heinz Mezera posted on Tue, 05 Jul 2016 12:47:21 +0200 as excerpted:
> Hello pan-users,
>
> does the size of newsrc-1 influence pan's time to start, to quit or its
> performance?
>
> I use Ubuntu's 16.04 version of pan (0.139-5build1) and it takes rather
> long until pan appears on Ubuntu's desktop.
>
> Can I compact newsrc-1 or reduce its size somehow?
I suspect your problem isn't the newsrc file, but something else...
[discussed below, but first...]
To answer your question somewhat directly, however, the newsrc file(s,
one per configured server) can indeed be compacted some, and that /might/
affect startup time, tho in my own experience there's a far worse trigger
of startup delay that I suspect is the real problem.
These newsrc files follow a standard text-based format and can be edited
using a standard text editor. As always, making a backup of the
unaltered file before you begin is recommended, just in case you screw up
the edits.
Rather than describe in detail the format, I'll simply provide you a
google link...
https://www.google.com/search?q=newsrc+file+format
There is however one caveat about pan's usage. (Current) Pan doesn't use
the subscription info in the newsrc (tho old C-based pan, 0.14.x, did,
before the C++ rewrite), because a newsrc is inherently single-server,
and pan's subscriptions apply across all configured servers that carry
the group. So pan uses a different method to track group subscriptions.
What pan /does/ track in the newsrcs, however, is the per-server,
per-newsgroup article sequence numbers, so it knows which articles on
each server you've already seen and doesn't download those headers again.
It's this comma-separated list of article numbers and ranges that appears
at the end of the newsrc line for any group you've visited (or seen a
cross-posted message in).
And you can consolidate these article numbers lists by removing the gaps
and making the ranges continuous.
It's worth noting that news servers initially communicate what they
currently have using only a high-water and a low-water mark, plus an /
estimated/ count of the number of messages available, with that estimate
allowed to be /more/ than the number of currently available messages, but
never /less/. These are IOW the lowest numbered message still available
(unexpired), and the highest numbered message available (the latest
message to arrive), plus the estimate. Missing article numbers between
the high and low water marks are specifically allowed -- this lets
servers remove messages reported as spam or as copyright violations,
etc. Sometimes these missing messages will be filled in later (some
servers are infamous for doing this, infamous because it screws up some
news clients). Often they're not.
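To make the water marks concrete, here's a minimal Python sketch that
parses the one-line reply a server gives to the NNTP GROUP command
(response code 211, then the estimate, the low-water mark, the high-water
mark and the group name). The sample line and the names GroupStatus and
parse_group_response are illustrative only, not anything from pan itself.

```python
from typing import NamedTuple


class GroupStatus(NamedTuple):
    estimate: int  # estimated article count (may be too high, never too low)
    low: int       # low-water mark: lowest article number still available
    high: int      # high-water mark: highest article number available
    name: str


def parse_group_response(line: str) -> GroupStatus:
    """Parse an NNTP GROUP success line: '211 <estimate> <low> <high> <group>'."""
    code, estimate, low, high, name = line.split()[:5]
    if code != "211":
        raise ValueError(f"not a successful GROUP response: {line!r}")
    return GroupStatus(int(estimate), int(low), int(high), name)


# Sample reply, made up for illustration.
status = parse_group_response("211 1234 3000234 3002322 misc.test")

# Article numbers covered by the water marks, gaps included.
span = status.high - status.low + 1

# The estimate can only overstate the real count, so at least this many
# numbers between the marks have no article behind them.
print(f"{status.name}: at least {span - status.estimate} missing numbers")
```

Since the estimate is only an upper bound on what's really there, the
real number of gaps can be higher than what this prints, never lower.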
And it's these gaps in the server store, along with simply not visiting
the newsgroup for longer than its expiration period if your server does
expire messages (some dedicated news service providers effectively don't
expire messages, these days), that appear as gaps in pan's sequence
number lists -- because it never saw those messages.
Now, if you're reasonably sure your server doesn't fill in article
sequence numbers, only ever increasing them, or if you simply don't care
to see what are likely old messages if they are filled in, you can cut
out all the commas and make the list a single range, from 1 or whatever
the lowest number is in the existing list, to the highest number. If the
server does do fill-ins, you might still be able to make the oldest
messages a continuous range, while leaving the gaps in anything newer
than say a month old, just in case.
So, to take one example line from the linuxtopia google hit (the first
hit in the google search above, as I write this; note that this page is
from a book copyrighted in 2003, and its mention of pan as an exception
to the newsrc format is... dated, as pan does use the format now):
news.software.readers! 1-95504,137265,137274,140059,140091,140117
You can edit that to:
news.software.readers! 1-140117
Much shorter! =:^)
Unfortunately, if you follow a lot of groups, all that manual editing
could be a big chore (unless you can figure out a nice script to automate
the process, should be possible), with, I suspect, rather limited results
in terms of startup.
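For what it's worth, such a script might look something like the Python
sketch below. It's untested against a real pan newsrc, and it assumes
each line is a group name, then ':' or '!', then the comma-separated
article list, as in the example above. The function names and the
keep_gaps_from parameter are made up for illustration; keep_gaps_from
covers the "leave the gaps in anything newer" case, given an article
number you pick yourself.

```python
import re
import shutil
from pathlib import Path

# Group name, then ':' (subscribed) or '!' (unsubscribed), then the article list.
NEWSRC_LINE = re.compile(r"^(\S+?)([:!])\s*(.*)$")


def parse_ranges(spec):
    """Turn '1-5,9,12-14' into sorted (low, high) tuples."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if part:
            low, _, high = part.partition("-")
            ranges.append((int(low), int(high or low)))
    return sorted(ranges)


def compact_line(line, keep_gaps_from=None):
    """Collapse a newsrc line's article list into one continuous range.

    If keep_gaps_from is given, only entries starting below that article
    number are merged; newer entries keep their gaps.
    """
    match = NEWSRC_LINE.match(line.strip())
    if not match:
        return line  # not a line we understand: leave it alone
    group, flag, spec = match.groups()
    try:
        ranges = parse_ranges(spec)
    except ValueError:
        return line  # unexpected article list: leave it alone
    if keep_gaps_from is None:
        old, recent = ranges, []
    else:
        old = [r for r in ranges if r[0] < keep_gaps_from]
        recent = [r for r in ranges if r[0] >= keep_gaps_from]
    if old:
        merged = (old[0][0], max(high for _, high in old))
        ranges = [merged] + recent
    if not ranges:
        return line  # group with no article numbers yet
    spec = ",".join(str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in ranges)
    return f"{group}{flag} {spec}"


def compact_file(path, keep_gaps_from=None):
    """Rewrite a newsrc file in place, keeping a '.bak' copy of the original."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    lines = path.read_text().splitlines()
    compacted = [compact_line(line, keep_gaps_from) for line in lines]
    path.write_text("\n".join(compacted) + "\n")


example = "news.software.readers! 1-95504,137265,137274,140059,140091,140117"
print(compact_line(example))
print(compact_line(example, keep_gaps_from=140000))
```

The first print gives the same single range as the hand edit above; the
second merges only the entries below article 140000. To run it on a real
file you'd call compact_file() on the newsrc-1 path (presumably under
~/.pan2/ in a default setup), with pan closed so the two don't step on
each other.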
Instead, what I've found to take the real time, particularly on spinning
rust drives (I'm on SSD now and haven't had to worry about it since I
upgraded to SSD), is large message caches.
Note that pan's cache size is configurable, but defaults to 10 MB. A
cache that small shouldn't slow startup, but it also means pan will
rather quickly start dumping already downloaded articles to make room
for more, particularly if you do binaries. For a usage pattern that
saves off attachments directly, with no further use for the messages in
cache after that, 10 MB is fine. For
a usage pattern more like mine, however, where I tend to download a bunch
of stuff to cache so it's local, and then go thru it later, a cache size
of several GB may be more appropriate. Similarly, if you have groups
that you effectively archive, keeping all messages without expiring them
at all, as I do with my text groups, a cache of several gigs will likely
hold several years' worth of text-group messages. (I have text messages
going back to 2002 in some groups. My cache for my text-groups pan
instance[1] is, as of now, 1.4 GiB, so the average usage is about 100
MiB/year.)
Once that cache gets to a few hundred MiB, you'll start noticing pan
startup gets slower and slower on *first* startup, as the cache gets
bigger and bigger. (Pan will start up faster after the first start,
since everything's already cached. At least it will if you have enough
memory to cache into RAM the full pan message cache. If you're running 1
GiB or less of RAM... probably not so much.) This is because pan loads
those messages every time it starts, in order to rethread them -- it
keeps track of message threading in memory.
Back when I was on spinning rust, I found a few ways to deal with this.
One was, set pan to start with my X user session, so it could grind away
for several minutes loading stuff while I did other things. A few
minutes later when I had completed other tasks, pan would generally be up
(in the system tray) and ready to go. I'd normally keep pan running
constantly, in the system tray, until I was ready to end the user X
session.
Another I found quite by accident. I periodically do backups of the
multiple partitions on my system, and every few years, I'll boot to the
backup, wipe away the normal working partition, and copy things back from
the backup to the working copy, renewing it.
I found that at least with some filesystems (I was using reiserfs at the
time), pan evidently fragments the cache files rather heavily. I believe
this is most likely to happen when multiple threads are downloading files
at once, writing them in parallel and fragmenting them in the process.
By backing up the cache files, erasing the working cache copy, and
copying everything back into place, the new copy was defragmented due to
the copy process, and pan started up much faster after that, even tho it
still had the same size cache.
Of course over time it slowed down again as I added new messages to my
newsgroup archive, but now that I knew the trick, I could defrag the
cache any time the start time got too long, and pan would start up faster
again.
And of course as I mentioned, putting it on SSD sped things up
dramatically, because SSDs have effectively zero seek time, so
fragmentation doesn't affect them anything close to as badly (tho it can
still have some effect, since the number of I/O operations per file
increases with the number of fragments).
For me, that's definitely what took the load time: pan reading all those
files from cache into memory, so it could rethread them.
There's a simple way to confirm whether this is your problem or not.
With pan closed, simply rename the article-cache directory to something
else, so pan will recreate a new, empty cache, when it starts. If the
cache is your slowdown, pan should start much faster, likely nearly
instantly, with no cache to load.
Tho of course if you've never upped your cache size from the default 10
MB, the cache is unlikely to be the problem, and you probably won't
notice a difference with the above test.
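Before renaming anything, you can get a rough idea of whether the cache
is big enough to matter by adding up what's in it. A small Python sketch,
assuming the default ~/.pan2/article-cache location (adjust the path if
you use $PAN_HOME); directory_size is just an illustrative helper name:

```python
from pathlib import Path


def directory_size(path):
    """Return (file_count, total_bytes) for all regular files under path."""
    count = total = 0
    for entry in Path(path).rglob("*"):
        if entry.is_file():
            count += 1
            total += entry.stat().st_size
    return count, total


if __name__ == "__main__":
    # Assumed default location; not checked against every pan version.
    cache = Path.home() / ".pan2" / "article-cache"
    if cache.is_dir():
        files, size = directory_size(cache)
        print(f"{files} cached files, {size / 2**20:.1f} MiB")
    else:
        print(f"no cache directory at {cache}")
```

If that reports something near the 10 MB default, the cache is unlikely
to be what's slowing startup. If it's hundreds of MiB or more, the rename
test is worth doing.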
Finally, I should mention that a big scorefile will slow pan down at
startup. There are ways to dramatically optimize the scorefile, but
that's a different subject, that we can deal with later if you find it to
be the problem. Meanwhile, however, you can test it using the same
technique I suggested above for testing the cache. Simply rename the
scorefile and see if pan starts faster with an empty one. If the
scorefile turns out to be your problem, post back with the results and we
can deal with that, then.
---
[1] Text-groups pan instance: It is possible to have several separately
configured pan instances, each with their own configuration and cache.
~/.pan2/ is only the default location. If the $PAN_HOME variable is
found to be set in pan's environment as it starts, it will use the
location found in that variable as its configuration and cache home,
instead. I've taken advantage of this to set up a number of pan wrapper
scripts here, pan.text, pan.test, and pan.bin, that each point at a
different config and cache. This lets me manage my unexpiring text-group-
archive cache separately from my binaries cache, also unexpiring and set
rather large, but cleared manually from time to time.
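
A wrapper along those lines can be very small. Mine are not shown here;
the following is only a Python sketch of the idea, with a made-up
directory name (~/.pan2-text) and made-up function names. The only thing
taken from pan itself is the PAN_HOME variable, and it assumes the pan
binary is on your PATH.

```python
import os
import subprocess
from pathlib import Path


def pan_environment(home_dir, base_env=None):
    """Return a copy of the environment with PAN_HOME pointing at home_dir."""
    env = dict(os.environ if base_env is None else base_env)
    env["PAN_HOME"] = str(Path(home_dir).expanduser())
    return env


def launch_pan(home_dir):
    """Start pan with its config and cache under home_dir (assumes 'pan' is on PATH)."""
    return subprocess.Popen(["pan"], env=pan_environment(home_dir))


# Example use, with a hypothetical directory name:
#   launch_pan("~/.pan2-text")
```

One such call (or one tiny script) per instance, each with its own
directory, keeps the configs and caches fully separate.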
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman