bug-wget
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bug-wget] [PATCH] Make wget capable of starting download from a spe


From: Yousong Zhou
Subject: Re: [Bug-wget] [PATCH] Make wget capable of starting download from a specified position.
Date: Sat, 21 Dec 2013 17:24:47 +0800
User-agent: Mutt/1.5.20 (2009-06-14)

On Sat, Dec 21, 2013 at 01:51:04PM +0530, Darshit Shah wrote:
> I have a few comments on the patch. Commenting inline.

Thank you.

> 
> On Sat, Dec 21, 2013 at 12:32 PM, Yousong Zhou <address@hidden>wrote:
> 
> > This patch adds an option `--start-pos' for specifying starting position
> > of a download, both for HTTP and FTP.  When specified, the newly added
> > option would override `--continue'.  Apart from that, no existing code
> > should be affected.
> >
> > Signed-off-by: Yousong Zhou <address@hidden>
> > ---
> > Hi,
> >
> > I found myself needed this feature when I was trying to tunnel the
> > download of
> > big file (several gigabytes) from a remote machine back to local through a
> > somewhat flaky connection.  It's a pain both for the server and local
> > network
> > users if we have to repeat the previously already downloaded part in case
> > that
> > the connection hangs or breaks.  Specifying 'Range: ' header is not an
> > option
> > for wget (integrity check in the code would fail), and curl is not fast
> > enough.
> > So I decided to make this patch in hope that this can also be useful to
> > someone
> > else.
> >
> > What integrity check would fail on using the Range Header? And if you
> already have a partially downloaded file why is using the --continue switch
> on an option?

`--continue` only works if there is already a partially downloaded file
on disk.  Otherwise, specifying `-c' will only tell wget to start from
scratch.

By 'Range: ' header I mean headers specified by `--header'.  If the
server sends back a 'Content-Range: ' header in the response, wget would
think that it's unexpected or not matching what's already on the disk
(would be zero if there is no file on disk).  If I get the code right,
the check is at `http.c:gethttp()':

        2744   if ((contrange != 0 && contrange != hs->restval)
        2745       || (H_PARTIAL (statcode) && !contrange))
        2746     {
        2747       /* The Range request was somehow misunderstood by the server.
        2748          Bail out.  */
        2749       xfree_null (type);
        2750       CLOSE_INVALIDATE (sock);
        2751       xfree (head);
        2752       return RANGEERR;
        2753     }

In my situation, wget was trigger on the remote machine like the
following:

    wget -O - --start-pos "$OFFSET" "$URL" | nc -lp 7193

Then on local machine, I would download with:

    nc localhost 7193

Before these, a local forwarding tunnel has been setup with ssh to make
this possible.  So in this case, there was no local file on the machine
where wget was triggerred and `--continue' will not work.  I am sure
there are other cases `--start-pos' would be useful and that
`--start-pos' would make wget more complete.

> 
>         yousong
> >
> >  doc/ChangeLog |    4 ++++
> >  doc/wget.texi |   14 ++++++++++++++
> >  src/ChangeLog |    9 +++++++++
> >  src/ftp.c     |    2 ++
> >  src/http.c    |    2 ++
> >  src/init.c    |    1 +
> >  src/main.c    |    1 +
> >  src/options.h |    1 +
> >  8 files changed, 34 insertions(+), 0 deletions(-)
> >
> > diff --git a/doc/ChangeLog b/doc/ChangeLog
> > index 3b05756..df103c8 100644
> > --- a/doc/ChangeLog
> > +++ b/doc/ChangeLog
> > @@ -1,3 +1,7 @@
> > +2013-12-21  Yousong Zhou  <address@hidden>
> > +
> > +       * wget.texi: Add documentation for --start-pos.
> > +
> >  2013-10-06  Tim Ruehsen  <address@hidden>
> >
> >         * wget.texi: add/explain quoting of wildcard patterns
> > diff --git a/doc/wget.texi b/doc/wget.texi
> > index 4a1f7f1..166ea08 100644
> > --- a/doc/wget.texi
> > +++ b/doc/wget.texi
> > @@ -701,6 +701,20 @@ Another instance where you'll get a garbled file if
> > you try to use
> >  Note that @samp{-c} only works with @sc{ftp} servers and with @sc{http}
> >  servers that support the @code{Range} header.
> >
> > address@hidden offset
> > address@hidden continue retrieval
> > address@hidden incomplete downloads
> > address@hidden resume download
> > address@hidden start position
> > address@hidden address@hidden
> > +Start the download at position @var{OFFSET}.  Offset may be expressed in
> > bytes,
> > +kilobytes with the `k' suffix, or megabytes with the `m' suffix.
> > +
> > +When specified, it would override the behavior of @samp{--continue}.  When
> > +using this option, you may also want to explicitly specify an output
> > filename
> > +with @samp{-O FILE} in order to not overwrite an existing partially
> > downloaded
> > +file.
> > +
> >  @cindex progress indicator
> >  @cindex dot style
> >  @item address@hidden
> > diff --git a/src/ChangeLog b/src/ChangeLog
> > index 42ce3e4..ab8a496 100644
> > --- a/src/ChangeLog
> > +++ b/src/ChangeLog
> > @@ -1,3 +1,12 @@
> > +2013-12-21  Yousong Zhou  <address@hidden>
> > +
> > +       * options.h: Add option --start-pos to specify start position of
> > +         a download.
> > +       * main.c: Same purpose as above.
> > +       * init.c: Same purpose as above.
> > +       * http.c: Utilize opt.start_pos for HTTP download.
> > +       * ftp.c: Utilize opt.start_pos for FTP retrieval.
> > +
> >  2013-11-02  Giuseppe Scrivano  <address@hidden>
> >
> >         * http.c (gethttp): Increase max header value length to 512.
> > diff --git a/src/ftp.c b/src/ftp.c
> > index c2522ca..c7ab6ef 100644
> > --- a/src/ftp.c
> > +++ b/src/ftp.c
> > @@ -1632,6 +1632,8 @@ ftp_loop_internal (struct url *u, struct fileinfo
> > *f, ccon *con, char **local_fi
> >        /* Decide whether or not to restart.  */
> >        if (con->cmd & DO_LIST)
> >          restval = 0;
> > +      else if (opt.start_pos)
> > +        restval = opt.start_pos;
> >        else if (opt.always_rest
> >            && stat (locf, &st) == 0
> >            && S_ISREG (st.st_mode))
> > diff --git a/src/http.c b/src/http.c
> > index 754b7ec..a354c6b 100644
> > --- a/src/http.c
> > +++ b/src/http.c
> > @@ -3098,6 +3098,8 @@ Spider mode enabled. Check if remote file
> > exists.\n"));
> >        /* Decide whether or not to restart.  */
> >        if (force_full_retrieve)
> >          hstat.restval = hstat.len;
> > +      else if (opt.start_pos)
> > +        hstat.restval = opt.start_pos;
> >        else if (opt.always_rest
> >            && got_name
> >            && stat (hstat.local_file, &st) == 0
> > diff --git a/src/init.c b/src/init.c
> > index 84ae654..7f7a34e 100644
> > --- a/src/init.c
> > +++ b/src/init.c
> > @@ -271,6 +271,7 @@ static const struct {
> >    { "showalldnsentries", &opt.show_all_dns_entries, cmd_boolean },
> >    { "spanhosts",        &opt.spanhost,          cmd_boolean },
> >    { "spider",           &opt.spider,            cmd_boolean },
> > +  { "startpos",         &opt.start_pos,         cmd_bytes },
> >    { "strictcomments",   &opt.strict_comments,   cmd_boolean },
> >    { "timeout",          NULL,                   cmd_spec_timeout },
> >    { "timestamping",     &opt.timestamping,      cmd_boolean },
> > diff --git a/src/main.c b/src/main.c
> > index 19d7253..4fbfaee 100644
> > --- a/src/main.c
> > +++ b/src/main.c
> > @@ -281,6 +281,7 @@ static struct cmdline_option option_data[] =
> >      { "server-response", 'S', OPT_BOOLEAN, "serverresponse", -1 },
> >      { "span-hosts", 'H', OPT_BOOLEAN, "spanhosts", -1 },
> >      { "spider", 0, OPT_BOOLEAN, "spider", -1 },
> > +    { "start-pos", 0, OPT_VALUE, "startpos", -1 },
> >      { "strict-comments", 0, OPT_BOOLEAN, "strictcomments", -1 },
> >      { "timeout", 'T', OPT_VALUE, "timeout", -1 },
> >      { "timestamping", 'N', OPT_BOOLEAN, "timestamping", -1 },
> > diff --git a/src/options.h b/src/options.h
> > index ad89627..9ba1760 100644
> > --- a/src/options.h
> > +++ b/src/options.h
> > @@ -115,6 +115,7 @@ struct options
> >    bool ask_passwd;              /* Ask for password? */
> >
> >    bool always_rest;            /* Always use REST. */
> > +  wgint start_pos;             /* Start position of a download. */
> >    char *ftp_user;              /* FTP username */
> >    char *ftp_passwd;            /* FTP password */
> >    bool netrc;                  /* Whether to read .netrc. */
> 
> 
> Simply setting the startpos and restval values in Wget will not help you if
> the server will keep sending the file from scratch on every request. The
> only way to partially download a file is by using the Range header. Because
> then, the server will send only the requested parts of the file. If a

Setting the value will let wget send an appropriate 'Range: ' header or 'REST'
command to the server.  Below is two demo for this patch.

        address@hidden:~/wget/src$ ./wget  -O - --start-pos=1213055 
ftp://ftp.gnu.org/gnu/wget/wget-1.10.2.tar.gz | wc -c
        --2013-12-21 17:15:05--  ftp://ftp.gnu.org/gnu/wget/wget-1.10.2.tar.gz
                   => '-'
        Resolving ftp.gnu.org... 2001:4830:134:3::b, 208.118.235.20
        Connecting to ftp.gnu.org|2001:4830:134:3::b|:21... connected.
        Logging in as anonymous ... Logged in!
        ==> SYST ... done.    ==> PWD ... done.
        ==> TYPE I ... done.  ==> CWD (1) /gnu/wget ... done.
        ==> SIZE wget-1.10.2.tar.gz ... 1213056
        ==> EPSV ... done.    ==> REST 1213055 ... done.
        ==> RETR wget-1.10.2.tar.gz ... done.
        Length: 1213056 (1.2M), 1 remaining (unauthoritative)

        
100%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>]
 1,213,056   --.-K/s   in 0s

        2013-12-21 17:15:10 (136 KB/s) - written to stdout [1213056]

        1
        address@hidden:~/wget/src$ ./wget  -O - --start-pos=1213055 
http://ftp.gnu.org/gnu/wget/wget-1.10.2.tar.gz | wc -c
        --2013-12-21 17:18:45--  http://ftp.gnu.org/gnu/wget/wget-1.10.2.tar.gz
        Resolving ftp.gnu.org... 2001:4830:134:3::b, 208.118.235.20
        Connecting to ftp.gnu.org|2001:4830:134:3::b|:80... connected.
        HTTP request sent, awaiting response... 206 Partial Content
        Length: 1213056 (1.2M), 1 remaining [application/x-gzip]
        Saving to: 'STDOUT'

        
100%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>]
 1,213,056   --.-K/s   in 0s

        2013-12-21 17:18:47 (112 KB/s) - written to stdout [1213056/1213056]

        1
        address@hidden:~/wget/src$


> server does not support the range header, you really have no other option
> but need to resume downloading the file from scratch. This is a limitation
> of the HTTP protocol and no client can help you get around it.

Yes.  I will add this to the doc.

> 
> 
> Also, please send patches as git format-patch attachments. Email clients
> often mangle with spaces and newlines and cause problems when applying them.

The patch was formatted with git-format-patch then sent with git-send-email.
I will send them as attachments if they are preferred here.

Thank you for reviewing this.  More suggestions?

                yousong

> 
> -- 
> Thanking You,
> Darshit Shah



reply via email to

[Prev in Thread] Current Thread [Next in Thread]