Re: [Pan-users] size of newsrc-1 file

From:	Heinz Mezera
Subject:	Re: [Pan-users] size of newsrc-1 file
Date:	Wed, 06 Jul 2016 11:09:26 +0200

Hello Duncan,

a detailed answer as usual!

Forgive me if I do not follow the standard reply conventions and summarize
here:

- Pan takes approx. 10 minutes to start.
- ls -lh newsrc-1 --> 567M (a manual edit would be an enormous task, and
  I've _never_ done any scripting before).
- I renamed the article-cache directory and reduced the cache size to 10 MB
  (from 100).
- Only one news server.
- Fewer than a dozen groups subscribed (a few text, the others binaries).
- My working style: read new headers, decide which ones might be of
  interest and save them to disk; I never go back to yesterday's or older
  articles.
- I don't use scores (or at least I'm not aware of any).

Should I remove/rename the .pan2 directory and start from scratch?

kr Heinz

On Wednesday, 06.07.2016, at 06:22 +0000, Duncan wrote:
> Heinz Mezera posted on Tue, 05 Jul 2016 12:47:21 +0200 as excerpted:
> 
> > Hello pan-users,
> > 
> > does the size of newsrc-1 influence pan's time to start, to quit or its
> > performance?
> > 
> > I use Ubuntu's 16.04 version of pan (0.139-5build1) and it takes rather
> > long until pan appears on Ubuntu's desktop.
> > 
> > Can I compact newsrc-1 or reduce its size somehow?
> 
> I suspect your problem isn't the newsrc file, but something else...
> [discussed below, but first...]
> 
> To answer your question somewhat directly, however, the newsrc file(s,
> one per configured server) can indeed be compacted some, and that /might/
> affect startup time, tho in my own experience there's a far worse trigger
> of startup delay that I suspect is the real problem.  However, the newsrc
> files can be made more efficient.
> 
> These newsrc files follow a standard text-based format and can be edited
> using a standard text editor.  As always, making a backup of the
> unaltered file before you begin is recommended, just in case you screw up
> the edits.
> 
> Rather than describe in detail the format, I'll simply provide you a 
> google link...
> 
> https://www.google.com/search?q=newsrc+file+format
> 
> There is however one caveat about pan's usage.  (Current) Pan doesn't use
> the subscription info in the newsrc (tho old C-based pan, 0.14.x, did,
> before the C++ rewrite), because a newsrc is inherently single-server,
> and pan's subscriptions apply across all configured servers that carry
> the group.  So pan uses a different method to track group subscriptions.
> 
> What pan /does/ track in the newsrcs, however, is the per-server
> per-newsgroup article sequence numbers, so it knows which ones on each
> server you've already seen and doesn't download those headers again.
> 
> It's this sequence of comma-separated article numbers that appears at the
> end of the newsrc line for any group you've visited (or seen a
> cross-posted message in).
> 
> And you can consolidate these article number lists by removing the gaps
> and making the ranges continuous.
> 
> It's worth noting that news servers initially communicate what they
> currently have using only a high-water and a low-water mark, plus an
> /estimated/ count of the number of messages available, with that estimate
> allowed to be /more/ than the number of currently available messages, but
> never /less/.  These are IOW the lowest numbered message still available
> (unexpired), and the highest numbered message available (the latest
> message to arrive), plus the estimate.  Missing article numbers between
> the high and low water marks are specifically allowed -- this lets
> servers remove messages reported as spam or as copyright violations,
> etc.  Sometimes these missing messages will be filled in later (some
> servers are infamous for doing this, infamous because it screws up some
> news clients).  Often they're not.
> 
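For illustration: in NNTP (RFC 3977) these three numbers arrive in the
reply to the GROUP command, in the form "211 count low high group".  Below
is a minimal Python sketch that parses such a reply.  The names GroupStatus,
parse_group_response and numbers_in_span are made up for this example, and
the sample reply line is only a sample, not output from any real server.

```python
from typing import NamedTuple


class GroupStatus(NamedTuple):
    estimate: int  # estimated article count; may exceed the real count
    low: int       # low-water mark: lowest article number still available
    high: int      # high-water mark: highest article number available
    name: str


def parse_group_response(line: str) -> GroupStatus:
    """Parse a '211 count low high group' reply to the NNTP GROUP command."""
    parts = line.split()
    if len(parts) < 5 or parts[0] != "211":
        raise ValueError(f"not a successful GROUP reply: {line!r}")
    return GroupStatus(int(parts[1]), int(parts[2]), int(parts[3]), parts[4])


def numbers_in_span(status: GroupStatus) -> int:
    """Count of article numbers between the two marks, gaps included."""
    return status.high - status.low + 1


# Sample reply line (illustrative only).
status = parse_group_response("211 1234 3000234 3002322 misc.test")
print(status, numbers_in_span(status))
```

The span between the marks only bounds what the server may hold; numbers
inside it can be missing, which is where the gaps discussed next come from.
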
> And it's these gaps in the server store, along with simply not visiting
> the newsgroup for longer than its expiration period if your server does
> expire messages (some dedicated news service providers effectively don't
> expire messages, these days), that appear as gaps in pan's sequence
> number lists -- because it never saw those messages.
> 
> 
> Now, if you're reasonably sure your server doesn't fill in article
> sequence numbers, only ever increasing them, or if you simply don't care
> to see what are likely old messages if they are filled in, you can cut
> out all the commas and make the list a single range, from 1 or whatever
> the lowest number is in the existing list, to the highest number.  If the
> server does do fill-ins, you might still be able to make the oldest
> messages a continuous range, while leaving the gaps in anything newer
> than say a month old, just in case.
> 
> So, to take one example line from the linuxtopia google hit (the first
> hit in the google search above as I write this; note that this page is
> from a book copyrighted in 2003, and its mention of pan as an exception
> to the newsrc format is... dated, pan does use the format now):
> 
> news.software.readers! 1-95504,137265,137274,140059,140091,140117
> 
> You can edit that to:
> 
> news.software.readers! 1-140117
> 
> Much shorter! =:^)
> 
> Unfortunately, if you follow a lot of groups, all that manual editing
> could be a big chore (unless you can figure out a nice script to automate
> the process, which should be possible), with, I suspect, rather limited
> results in terms of startup.
> 
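A minimal Python sketch of the consolidation described above, for anyone
who would rather not edit a large newsrc by hand.  It assumes every line
has the form "group: ranges" or "group! ranges" as in the example, and it
merges the whole list into one range from the lowest to the highest number
seen, so any gaps are lost.  The function names and the output file name
are made up for this example; nothing here is part of pan.  It writes to a
new file and leaves the original untouched, and should only be used while
pan is closed.

```python
import re

# "group.name: 1-100,105,110-120"  (":" or "!" after the group name;
# the article list may be empty)
LINE_RE = re.compile(
    r"^(?P<group>[^:!\s]+)(?P<sep>[:!])\s*(?P<ranges>[\d,\-\s]*)$"
)


def collapse_line(line):
    """Return the line with its article list merged into a single range."""
    line = line.rstrip("\n")
    m = LINE_RE.match(line)
    if m is None or not m.group("ranges").strip():
        return line  # unrecognized or empty list: leave untouched
    numbers = [int(n) for n in re.findall(r"\d+", m.group("ranges"))]
    low, high = min(numbers), max(numbers)
    merged = str(low) if low == high else f"{low}-{high}"
    return f"{m.group('group')}{m.group('sep')} {merged}"


def collapse_file(src_path, dst_path):
    """Write a consolidated copy of src_path to dst_path, line by line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(collapse_line(line) + "\n")


# Example (hypothetical file names):
# collapse_file("newsrc-1", "newsrc-1.compact")
```

After checking the new file, it can be swapped in for the original, with
the original kept as a backup.
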
> 
> Instead, what I've found to take the real time, particularly on spinning
> rust drives (I'm on SSD now and haven't had to worry about it since I
> upgraded to SSD), is large message caches.
> 
> Note that pan's cache size is configurable, but defaults to 10 MB, which
> shouldn't be an issue, but also means pan will rather quickly start
> dumping already downloaded articles to make room for more, particularly
> if you do binaries.  For a usage pattern that saves off attachments
> directly, with no further use for the messages in cache after that, 10 MB
> is fine.  For a usage pattern more like mine, however, where I tend to
> download a bunch of stuff to cache so it's local, and then go thru it
> later, a cache size of several GB may be more appropriate.  Similarly, if
> you have groups that you effectively archive, keeping all messages
> without expiring them at all, as I do with my text groups, a cache of
> several gigs will likely hold several years worth of text-group messages.
> (I have text messages going back to 2002 in some groups.  My cache for my
> text-groups pan instance[1] is, as of now, 1.4 GiB, so the average usage
> is 100 MB/year.)
> 
> Once that cache gets to a few hundred MiB, you'll start noticing pan
> startup gets slower and slower on *first* startup, as the cache gets
> bigger and bigger.  (Pan will start up faster after the first start,
> since everything's already cached.  At least it will if you have enough
> memory to cache into RAM the full pan message cache.  If you're running 1
> GiB or less of RAM... probably not so much.)  This is because pan loads
> those messages every time it starts, in order to rethread them -- it
> keeps track of message threading in memory.
> 
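To see whether the cache is anywhere near the "few hundred MiB" mentioned
above, its size on disk can be added up.  A minimal Python sketch, assuming
the cache lives in ~/.pan2/article-cache (the default location implied in
this thread; it differs if $PAN_HOME is set).  The function names are made
up for this example.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def mib(size_bytes):
    """Convert a byte count to MiB."""
    return size_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Assumed default location; adjust if $PAN_HOME points elsewhere.
    cache = os.path.expanduser("~/.pan2/article-cache")
    print(f"{cache}: {mib(directory_size_bytes(cache)):.1f} MiB")
```
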
> Back when I was on spinning rust, I found a few ways to deal with this.
> 
> One was, set pan to start with my X user session, so it could grind away
> for several minutes loading stuff while I did other things.  A few
> minutes later when I had completed other tasks, pan would generally be up
> (in the system tray) and ready to go.  I'd normally keep pan running
> constantly, in the system tray, until I was ready to end the user X
> session.
> 
> Another I found quite by accident.  I periodically do backups of the
> multiple partitions on my system, and every few years, I'll boot to the
> backup, wipe away the normal working partition, and copy things back from
> the backup to the working copy, renewing it.
> 
> I found that at least with some filesystems (I was using reiserfs at the
> time), pan evidently fragments the cache files rather heavily.  I believe
> this is most likely to happen when multiple threads are downloading files
> at once, writing them in parallel and fragmenting them in the process.
> 
> By backing up the cache files, erasing the working cache copy, and
> copying everything back into place, the new copy was defragmented due to
> the copy process, and pan started up much faster after that, even tho it
> still had the same size cache.
> 
> Of course over time it slowed down again as I added new messages to my
> newsgroup archive, but now that I knew the trick, I could defrag the
> cache any time the start time got too long, and pan would start up faster
> again.
> 
> And of course as I mentioned, putting it on SSD sped things up
> dramatically, because SSDs have effectively zero seek time, so
> fragmentation doesn't affect them anything close to as badly (tho it can
> still have some effect due to I/O operations per file increasing with the
> number of fragments).
> 
> 
> That's what definitely took the load time for me, pan reading all those
> files from cache into memory, so it could rethread them.
> 
> There's a simple way to confirm whether this is your problem or not.
> With pan closed, simply rename the article-cache directory to something
> else, so pan will recreate a new, empty cache, when it starts.  If the
> cache is your slowdown, pan should start much faster, likely nearly
> instantly, with no cache to load.
> 
> Tho of course if you've never upped your cache size from the default 10
> MB, the cache is unlikely to be the problem, and you probably won't
> notice a difference with the above test.
> 
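The rename test above can also be done with a few lines of Python instead
of a file manager.  This is only a sketch: set_aside is a made-up helper
name, and ~/.pan2/article-cache is the assumed default location.  It
renames rather than deletes, so the old cache can be moved back afterwards,
and it should only be run while pan is closed.  The same helper would work
for the scorefile test, given the scorefile's path.

```python
import time
from pathlib import Path


def set_aside(path, suffix=None):
    """Rename path to '<name>.bak-<suffix>' beside it; return the new path."""
    path = Path(path).expanduser()
    if suffix is None:
        suffix = time.strftime("%Y%m%d-%H%M%S")  # timestamp keeps names unique
    target = path.with_name(f"{path.name}.bak-{suffix}")
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    path.rename(target)
    return target


# Example (assumed default location):
# set_aside("~/.pan2/article-cache")
```
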
> 
> Finally, I should mention that a big scorefile will slow pan down at
> startup.  There are ways to dramatically optimize the scorefile, but
> that's a different subject, that we can deal with later if you find it to
> be the problem.  Meanwhile, however, you can test it using the same
> technique I suggested above for testing the cache.  Simply rename the
> scorefile and see if pan starts faster with an empty one.  If the
> scorefile turns out to be your problem, post back with the results and we
> can deal with that, then.
> 
> ---
> [1] Text-groups pan instance:  It is possible to have several separately
> configured pan instances, each with their own configuration and cache.
> ~/.pan2/ is only the default location.  If the $PAN_HOME variable is
> found to be set in pan's environment as it starts, it will use the
> location found in that variable as its configuration and cache home,
> instead.  I've taken advantage of this to set up a number of pan wrapper
> scripts here, pan.text, pan.test, and pan.bin, that each point at a
> different config and cache.  This lets me manage my unexpiring
> text-group-archive cache separately from my binaries cache, also
> unexpiring and set rather large, but cleared manually from time to time.
>
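A wrapper along the lines of footnote [1] could look roughly like the
Python sketch below.  The actual wrapper scripts are not shown in this
thread, so this is a guess at the idea only: it copies the current
environment, sets PAN_HOME, and starts pan.  The directory name
~/.pan2-text is hypothetical, and pan is assumed to be on the PATH.

```python
import os
import subprocess


def pan_environment(pan_home, base_env=None):
    """Return a copy of the environment with PAN_HOME set to pan_home."""
    env = dict(os.environ if base_env is None else base_env)
    env["PAN_HOME"] = os.path.expanduser(pan_home)
    return env


def launch_pan(pan_home):
    """Start pan (assumed to be on PATH) with its own config and cache home."""
    return subprocess.Popen(["pan"], env=pan_environment(pan_home))


# Example (hypothetical directory for a text-groups instance):
# launch_pan("~/.pan2-text")
```
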