Re: [Bug-wget] wget
Mon, 30 Apr 2012 10:05:25 +1000
The Sharepoint view offers a "next 100" page button. Upon further reflection -
as always, soon after posting - it became apparent that it was a pretty tall
order to expect wget to be able to discern such a thing in the HTML it received
from the site. So of course it was only ever going to be able to see what the
HTML linked to and no more.
The site _does_ offer a so-called "Explorer View" which does indeed show _all_
the directories/files in the traditional scrolled rather than paged view, but
when I fed the URL displayed by IE (which had the form http://server/top
directory/Forms/WebFldr.aspx?RootFolder=top_directory) to wget, all I got was a
mess of HTML/JS files. Oh well.
<< You could make a similar mount in the Unix server (if it's e.g. available
through SMB) >>
Alas, HP CIFS knows nought about the wonders of Sharepoint; it only deals with
plain SMB shares.
In the end I just used my desktop to trawl the site for the filenames (dir /s
/b "\\server\top directory\*.pdf" - the quotes are needed because of the space
in the path) and, with a bit of massaging, presented that file list to wget,
with no directory tree walking. It was all a pretty tacky kludge but it got the
job done in the end.
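For anyone facing the same thing, the massaging step can be sketched roughly
like this. The server name, share path, and URL prefix below are stand-ins, not
the real site, and the UNC-to-URL mapping is a guess that would need adjusting:

```shell
# Sample of what "dir /s /b" emits on the desktop (stand-in data;
# the real listing came from the Sharepoint share):
cat > files.txt <<'EOF'
\\server\top directory\sub\a.pdf
\\server\top directory\sub\b.pdf
EOF

# Massage UNC paths into URLs: swap the \\server\ prefix for
# http://server/, flip the backslashes, and encode the spaces.
sed -e 's|^\\\\server\\|http://server/|' \
    -e 's|\\|/|g' \
    -e 's| |%20|g' files.txt > urls.txt

# Then hand the flat list to wget -- no recursion, no tree walking:
#   wget --input-file=urls.txt --force-directories
```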
Rocket J. Squirrel: "... we're going to have to think!"
Bullwinkle J. Moose: "There must be an easier way than that."
Senior Unix Administrator
Information and Communication Systems
Corporate Support Division
Department of Community Safety
PHONE: 07 3635 3087
POSTAL: GPO Box 1425, Brisbane, QLD 4001 | EMAIL: address@hidden
Please consider the environment before printing this email - then print it
From: Ángel González [mailto:address@hidden]
Sent: Sunday, 29 April 2012 12:02 AM
To: Howard Bryden
Subject: Re: [Bug-wget] wget
On 27/04/12 06:25, Howard Bryden wrote:
> I'm using wget 1.13.4 to attempt to recursively download a Sharepoint site.
> The commandline is just the wget command verb; the contents of ~/.wgetrc are:
> Initially all appeared to work as expected, yet it turns out I'm
> receiving only a subset of the filespace, namely
> a) only the first 100 directories are visited, and
> b) only the first 100 files from each directory are actually downloaded.
> This pretty much corresponds to the Internet Explorer view, which presents
> the site in pages of 100 items (directories and files within directories).
How are the next pages accessed?
I think the problem lies in the way those next pages are linked, so such a page
would be more helpful than the full list of files.
Also, if you can view the full site as mounted on the computer, do you really
need to crawl it with wget?
You could make a similar mount in the Unix server (if it's e.g. available
through SMB) or simply zip everything locally and transfer that to the HP
server.
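A rough sketch of that route, with placeholder server and share names, and with
tar standing in for zip as is usual on the Unix side:

```shell
# If the Unix box can reach the share directly (needs cifs-utils):
#   mount -t cifs '//server/top directory' /mnt/share -o username=USER
# Failing that, bundle the tree on a machine that can see it and
# transfer a single archive instead of crawling thousands of files.
mkdir -p toptree/sub
printf 'dummy' > toptree/sub/a.pdf   # stand-in for the real tree
tar -czf site.tar.gz toptree
tar -tzf site.tar.gz                 # list the archived paths to verify
```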