From: wp mirror
Subject: Re: [Wp-mirror-list] wpmirror optimizations (was: Re: [Xmldatadumps-l] [WP-MIRROR] Questions regarding Metalink and SPDY)
Date: Wed, 2 Jan 2013 13:00:46 -0500

Dear Ariel,

Happy New Year.  Thank you for your email of 2012-12-31.

1) MEDIA TARBALLS

I looked at the media tarballs on <http://ftpmirror.your.org/> and am
quite impressed.  I also walked the directory tree
/pub/wikimedia/images/wikipedia/[language-code]/[0-9a-f]/.
WP-MIRROR 0.5 and prior place images under
$wgScriptPath/images/[0-9a-f].  I think that I should insert two more
directory layers to better match what you have done:
$wgScriptPath/images/wikipedia/[language-code]/[0-9a-f]/.

Action Item:  WP-MIRROR 0.6 will make use of media tarballs and
reorganize the images directory tree to match your.org.
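
A minimal sketch of the proposed path change, assuming the only
difference between the two layouts is the two extra directory layers
after "images".  The function name and its parameters are hypothetical
and are not part of WP-MIRROR.

```python
from pathlib import PurePosixPath


def new_image_path(old_path: str, language_code: str,
                   project: str = "wikipedia") -> str:
    """Insert <project>/<language-code> right after the 'images' component.

    Hypothetical helper: maps the WP-MIRROR 0.5 layout
    (.../images/[0-9a-f]/...) onto the your.org-style layout
    (.../images/wikipedia/[language-code]/[0-9a-f]/...).
    """
    parts = PurePosixPath(old_path).parts
    i = parts.index("images")  # raises ValueError if there is no 'images' layer
    new_parts = parts[: i + 1] + (project, language_code) + parts[i + 1:]
    return str(PurePosixPath(*new_parts))
```

For example, new_image_path("w/images/a/Example.png", "en") gives
"w/images/wikipedia/en/a/Example.png".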

2) ACCESSIBILITY

Browsing <http://ftpmirror.your.org/> turns out to be almost
impossible for me.  The contrast is too low (cyan on white).  I have
to bring up the source code (Ctrl-U) and browse that.  Please ask
someone at your.org to edit the style sheet
/pub/misc/lighttpd-white-dir.css.  Black on white works for me.

3) MULTISTREAM BZ2

Very interesting.  For WP-MIRROR 0.5 and prior the first steps in
processing a dump are: download, verify, decompress, and then split
into xchunks (of 1000 pages each).   xchunks are then scraped for
image file names, are fed into importDump.php, etc.  I introduced
xchunks for reasons of robustness.  Every aspect of mirror building
has failure modes that prevent processing enwiki in one pass.  Failure
to process a few xchunks is however quite tolerable.
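
The splitting step can be sketched roughly as follows.  This is an
illustration only, not the WP-MIRROR implementation; it assumes that
each <page> and </page> tag sits on its own line of the decompressed
dump, and the function name is hypothetical.

```python
from typing import Iterable, Iterator, List


def split_into_xchunks(lines: Iterable[str],
                       pages_per_chunk: int = 1000) -> Iterator[List[str]]:
    """Group the <page>...</page> blocks of a dump into fixed-size chunks.

    Assumption: <page> and </page> each appear on their own line.
    Lines outside of pages (e.g. <mediawiki>, <siteinfo>) are skipped.
    """
    chunk: List[str] = []   # pages collected for the current xchunk
    page: List[str] = []    # lines of the page being read
    in_page = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<page>"):
            in_page = True
            page = []
        if in_page:
            page.append(line)
        if in_page and stripped.endswith("</page>"):
            in_page = False
            chunk.append("".join(page))
            if len(chunk) == pages_per_chunk:
                yield chunk
                chunk = []
    if chunk:               # final, possibly short, xchunk
        yield chunk
```

Because each xchunk is yielded independently, a failure while processing
one of them need not stop the remaining ones.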

I am not yet clear as to the ramifications of using your tool.  At one
end of the scale it might obviate decompressing the dump.  At the
other end, it might obviate the use of xchunks entirely.

Action Item:  Study multistream bz2 for possible use.
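
As a starting point for that study, here is a hedged sketch of reading
one stream from a multistream file without decompressing the whole
dump.  It assumes the file is a concatenation of independent bz2
streams and that each index line has the form offset:pageid:title; both
function names are hypothetical.

```python
import bz2
from typing import Optional, Tuple


def parse_index_line(line: str) -> Tuple[int, int, str]:
    """Parse one index line, assumed to be 'offset:pageid:title'.

    Titles may themselves contain colons, so split at most twice.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(path: str, offset: int,
                next_offset: Optional[int] = None) -> str:
    """Decompress the single bz2 stream that starts at byte `offset`.

    If `next_offset` is given, only the bytes up to it are read;
    otherwise the rest of the file is read.  Either way the decompressor
    stops at the end of the first stream it sees.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read() if next_offset is None else f.read(next_offset - offset)
    return bz2.BZ2Decompressor().decompress(data).decode("utf-8")
```

Each stream can be handled on its own, which is what would allow
concurrent workers to share one compressed file.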

4) MWDUMPER

I have run experiments with MWdumper.jar, and cannot conclude that
MWdumper.jar is usable.  My lab notes can be found in the WP-MIRROR
0.5 Reference Manual
<http://www.nongnu.org/wp-mirror/manual/wp-mirror-0.5.pdf>.  Several
sections may be of interest to mirror builders:

Appendix E.7 Experiments with InnoDB (especially Figure E.1)
Appendix E.9 Experiments with MWdumper.jar -- Round 2
Appendix E.11 Experiments with wikix
Appendix E.12 Experiments with Downloading Images
Appendix E.14 Experiments with Corrupt Images
Appendix E.19 Messages (this is a collection of error messages that I have seen)

No part of mirror building is easy.  The vision for WP-MIRROR is to
automate all the steps so that anyone with enough disk space can build
his own mirror.

5) MWIMPORT

Thanks for bringing this to my attention.

Action Item:  I will study mwimport for possible use.

6) POTY

I noticed, on this list, several e-mails regarding POTY collections.
I downloaded them all.  Very sweet.  Many thanks to whoever is making
this happen.  I look forward to seeing the 2012 collection.

I know that producing dump files is a big task.  Thanks for all the hard work.

Sincerely Yours,
Kent

On 12/31/12, Ariel T. Glenn <address@hidden> wrote:
> Hello wp mirror dev,
>
> If you are not already downloading media tarball bundles for the initial
> mirror setup, you should be. See
>
> http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current_Mirrors
>
> for these, available once a month and I hope to make incrementals
> regularly available mid-month as well.
>
> Additionally, you can use the 'multistream bz2' files with their indexes
> to manage concurrency, rather than needing to write out a pile of
> separate xml files.
>
> Lastly, if you are using importDump.php, I strongly recommend you use
> mwdumper or mwimport to create an sql file which you can feed to MySQL,
> and then stuff in the various link tables as well.  This will cut down
> the setup time immensely.
>
> Ariel
>
>


