Re: [Bug-wget] wget

From:	Howard Bryden
Subject:	Re: [Bug-wget] wget
Date:	Mon, 30 Apr 2012 10:05:25 +1000

Ángel,

The Sharepoint view offers a "next 100 page" button.  Upon further reflection - 
as always, soon after posting - it became apparent that it was a pretty tall 
order to expect wget to be able to discern such a thing in the HTML it received 
from the site.  So of course it was only ever going to be able to see what the 
HTML linked to and no more.

The site _does_ offer a so-called "Explorer View" which does indeed show _all_ 
the directories/files in the traditional scrolled rather than paged view, but 
when I fed the URL displayed by IE (which had the form http://server/top 
directory/Forms/WebFldr.aspx?RootFolder=top_directory) all I got was a mess of 
HTML/JS files.  Oh well.

<< You could make a similar mount in the Unix server (if it's eg. available 
through smb) >>

Alas, HP CIFS knows nought about the wonders of Sharepoint, it only deals with 
Windows shares.

In the end I just used my desktop to trawl the site for the filenames (dir /s 
/b \\server\top directory\*.pdf) and, with a bit of massaging, presented that 
file list to wget, with no directory tree walking.  It was all a pretty tacky 
kludge but it got the job done in the end.
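
For anyone finding this in the archive, the "massaging" step could look roughly like the sketch below. It is not the script actually used here; it assumes that each UNC path under the share maps directly onto the same relative path under an HTTP base URL, which may not hold for every Sharepoint setup. The function names, the share root and the base URL are all placeholders.

```python
from urllib.parse import quote


def unc_to_url(unc_path: str, unc_root: str, base_url: str) -> str:
    """Map one UNC path (as printed by `dir /s /b`) onto a URL under base_url."""
    if not unc_path.lower().startswith(unc_root.lower()):
        raise ValueError(f"{unc_path!r} is not under {unc_root!r}")
    # Keep only the part below the share root and switch to forward slashes.
    relative = unc_path[len(unc_root):].replace("\\", "/").lstrip("/")
    # Percent-encode spaces and other unsafe characters, but keep the slashes.
    return base_url.rstrip("/") + "/" + quote(relative, safe="/")


def convert_listing(lines, unc_root, base_url):
    """Convert every non-blank line of a `dir /s /b` listing to a URL."""
    return [unc_to_url(line.strip(), unc_root, base_url)
            for line in lines if line.strip()]


if __name__ == "__main__":
    # Placeholder values: substitute the real share root and site URL.
    root = r"\\server\top directory"
    base = "http://server/top%20directory"
    sample = [
        r"\\server\top directory\reports\annual report.pdf",
        r"\\server\top directory\index.pdf",
    ]
    for url in convert_listing(sample, root, base):
        print(url)
```

Saving the printed URLs to a file and running `wget -i` on that file fetches them one by one, with no recursion involved.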

Thanks anyway.

Rocket J. Squirrel: "... we're going to have to think!"
Bullwinkle J. Moose: "There must be an easier way than that."

HOWARD BRYDEN

Senior Unix Administrator
Data Centre
Information and Communication Systems 
Corporate Support Division
Department of Community Safety

PHONE: 07 3635 3087 
POSTAL: GPO Box 1425, Brisbane, QLD 4001 | EMAIL: address@hidden
Please consider the environment before printing this email - then print it

-----Original Message-----
From: Ángel González [mailto:address@hidden]
Sent: Sunday, 29 April 2012 12:02 AM
To: Howard Bryden
Cc: bug-wget
Subject: Re: [Bug-wget] wget

On 27/04/12 06:25, Howard Bryden wrote:
> Folks,
>
> I'm using wget 1.13.4 to attempt to recursively download a Sharepoint site.  
> The commandline is just the wget command verb; the contents of ~/.wgetrc are:
>
>
>
> Initially all appeared to work as expected yet it turns out I'm 
> receiving only a subset of the filespace, namely
>
> a) only the first 100 directories are visited, and
> b) only the first 100 files from each directory are actually downloaded.
>
> This pretty much corresponds to the Internet Explorer view, which presents 
> the site in pages of 100 items (directories and files within directories).

How are the next pages accessed?
Can you view those "next pages" if you disable javascript in your browser? 
(wget doesn't parse javascript)

I think the problem lies in the way those next pages are linked, so such a page 
would be more helpful than the full list of files.
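
One rough way to check this on a saved copy of the page is sketched below. It splits anchors into those with a plain `href` that a crawler can follow and those that only work through script. The sample markup in the test-style usage is made up for illustration and is not real Sharepoint output, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Collect <a> tags, split by whether the link needs javascript."""

    def __init__(self):
        super().__init__()
        self.followable = []   # plain hrefs a non-javascript crawler can fetch
        self.script_only = []  # links that only do something via script

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = (attr.get("href") or "").strip()
        if href and href != "#" and not href.lower().startswith("javascript:"):
            self.followable.append(href)
        else:
            # Record the handler (or the javascript: pseudo-URL) for inspection.
            self.script_only.append(attr.get("onclick") or href)


def audit(html: str):
    """Return (followable, script_only) link lists for one HTML document."""
    parser = LinkAudit()
    parser.feed(html)
    return parser.followable, parser.script_only
```

If the "next" control ends up in the second list, that would be consistent with the paging being driven by javascript rather than by an ordinary link.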

Also, if you can view the full site as mounted on the computer, do you really 
need to crawl it with wget?
You could make a similar mount in the Unix server (if it's eg. available 
through smb) or simply zip everything locally and transfer that to the HP 
server.

