bug-wget

Re: wget and --reject-regex


From: Tim Rühsen
Subject: Re: wget and --reject-regex
Date: Mon, 28 Dec 2020 19:58:25 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0

I wrote a fix (https://gitlab.com/gnuwget/wget/-/merge_requests/14) that you can review and/or test. It will likely be included in wget 1.21 soon.

Regards, Tim

On 26.12.20 21:18, Frans de Boer wrote:
Ok, top-posting it is.

The explanation in the manual speaks about filtering "on the complete URL". Since it speaks of the URL, the pattern must match a complete URL, and "(artworks)" is just the last part of the URL. But you're right: for FTP it only works if enclosed in (). Since this does not discriminate between files and directories, it is far from ideal. I can only conclude, for now, that it works for HTTP(S) connections but not for the FTP protocol. Looking at the manual and the intended use of wget, I can't imagine that this is a "feature". So it must be a bug.

Pointer: a listing made by wget over HTTP(S) marks directories with a terminating forward slash as discriminator, as opposed to an FTP listing, where the discriminator is in the attributes only.
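To illustrate the difference described above, here is a minimal sketch of the two ways a directory can be told apart from a file. It assumes a Unix-style FTP listing, where the first character of the attribute field is "d" for a directory; the sample listing lines and the file name smc-1.9.tar.bz2 are made up for illustration, and this is not how wget itself parses listings.

```shell
#!/bin/sh
# Classify an entry taken from an HTTP(S) index page:
# directories are recognisable by their trailing "/".
classify_http() {
    case "$1" in
        */) echo "directory" ;;
        *)  echo "file" ;;
    esac
}

# Classify one line of a Unix-style FTP listing (assumed format):
# the type is only visible in the first character of the attributes.
classify_ftp() {
    case "$1" in
        d*) echo "directory" ;;
        l*) echo "symlink" ;;
        *)  echo "file" ;;
    esac
}

# Hypothetical sample entries.
classify_http "artworks/"
classify_ftp  "drwxr-xr-x    2 ftp      ftp          4096 Jan 01  2020 artworks"
classify_ftp  "-rw-r--r--    1 ftp      ftp         12345 Jan 01  2020 smc-1.9.tar.bz2"
```

In the FTP case the name itself ("artworks") carries no hint that it is a directory, which is why a regex applied to the bare name cannot distinguish the two.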

OK, assuming the latter, I will just wait until I get either confirmation or a workable solution (1.20.4?). In the meantime I can use the current workaround, with a modified string "(amd64|binaries|Debug|debug|deleted|OLD|old|Patches|patches|prev|previous|rpm|RPM|rpms|RPMS|temp|tmp|w32 |win32|.*(rc|RC|a|b|p)[[:digit:]]{1}.*)", and continue testing until this issue has been resolved.

--- Frans


On 26-12-2020 19:20, Tim Rühsen wrote:
Hey,

more info on this... the regex (and other filters) are only applied to the (file or directory) names as they are read from the FTP listing.

E.g. when using --reject-regex="(artworks)", you'll see in the logs (with -d):

artworks is excluded/not-included through regex.

I think this is not optimal and definitely worth improving :-)
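The effect can be reproduced outside wget. The sketch below uses grep -E as a stand-in for a POSIX extended-regex matcher (an assumption; it is not wget's own code) and compares the same patterns against a complete URL and against the bare name as it would appear in an FTP listing. The helper name filter is made up for this example.

```shell
#!/bin/sh
# Print "rejected" if the extended regex $1 matches the string $2,
# otherwise "kept". grep -E searches unanchored, anywhere in the line.
filter() {
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then
        echo "rejected"
    else
        echo "kept"
    fi
}

url='ftp://mirror.netcologne.de/savannah/smc/artworks/'   # complete URL
name='artworks'                                            # bare listing entry

filter '.*(/artworks/.*)' "$url"     # matches the complete URL
filter '.*(/artworks/.*)' "$name"    # bare name has no slashes, so no match
filter '(artworks)'       "$name"    # matches the bare name
```

A pattern written for the complete URL therefore never fires when only the name is tested, while a pattern without slashes does.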

Regards, Tim

On 26.12.20 17:57, Tim Rühsen wrote:
Hi Frans,

my apologies, maybe I stopped the download too fast.

The command line with the artworks regex indeed has no effect.
In fact, after looking into the code, I can confirm that I hardly see any of the filtering applied to FTP URLs that we apply to HTTP.

I am currently not sure if that is a regression or if that possibly never worked. Maybe that was intended / planned by the original authors. Sorry, this also puzzles me a bit... have to test with older versions when time allows.

Regards, Tim

On 26.12.20 15:12, Frans de Boer wrote:
On 25-12-2020 18:42, Tim Rühsen wrote:
Hello Frans,

I tried with wget 1.20.3 and both of these commands work:

#1 Do not download smc/artworks/ directory:
wget -d -4 --mirror -nH -np --retr-symlinks=no --passive-ftp --no-verbose --cut-dirs=1 ftp://mirror.netcologne.de/savannah/smc/ --reject-regex=".*(/artworks/.*)"

#2 Do not download .bz2 and .rpm files
wget -d -4 --mirror -nH -np --retr-symlinks=no --passive-ftp --no-verbose --cut-dirs=1 ftp://mirror.netcologne.de/savannah/smc/ --reject-regex=".*(\.bz2|\.rpm)$"

(--regex-type=posix is default)
(the order of URL and options doesn't matter)
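As a quick way to check what the two patterns would match if they were applied to complete URLs, they can be run through grep -E over a list of candidates. This is only a sketch under that assumption; the URLs below are hypothetical paths under the mirror root, and the helper names urls and keep_unrejected are made up.

```shell
#!/bin/sh
# Hypothetical URLs under the mirror root named in the commands above.
urls() {
    base='ftp://mirror.netcologne.de/savannah/smc'
    printf '%s\n' \
        "$base/artworks/logo.png" \
        "$base/Screensaver/readme.txt" \
        "$base/smc-1.9.tar.bz2" \
        "$base/smc-1.9.rpm"
}

# Pass through only the lines that do NOT match the extended regex in $1,
# i.e. the URLs that a reject filter would leave alone.
keep_unrejected() {
    grep -Ev -- "$1"
}

echo "kept with pattern #1:"
urls | keep_unrejected '.*(/artworks/.*)'

echo "kept with pattern #2:"
urls | keep_unrejected '.*(\.bz2|\.rpm)$'
```

Pattern #1 drops only the entry below artworks/, and pattern #2 drops the two archive files because of the trailing $ anchor.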

Regards, Tim

On 23.12.20 13:48, Frans de Boer wrote:
LS,

I found that wget 1.20 and later do support some basic regular expressions. I had good results with --accept-regex but the reject part is more troublesome. I can't use EREs since only BREs are supported, with the notion that the whole URL should be included.

I use wget to mirror some sites, but I do not want certain sub directories included in the download. You can think of sub directories named rpm, debug, temp etc.

Example:

wget -4 --mirror -nH -np --retr-symlinks=no --passive-ftp --no-verbose --cut-dirs=1 --regex-type posix --reject-regex "ftp\:\/\/mirror\.netcologne\.de\/savannah\/smc\/Screensaver\/" -P ./debugdir/nongnu ftp://mirror.netcologne.de/savannah/smc/

I tried this example with and without partial backslashes, but neither works. I also tried this with a single file, to no avail. I understand that one can add multiple reject statements, but I would rather use the ERE .*(dir1|dir2|dir3|...|dirx|(..ERE..)), since specifying them all by hand is rather cumbersome. I already have an ERE string ready and would like to use that instead. Breaking this string down again into multiple reject statements might also not work if I can't even reject one file or sub directory.

Is there a way to accomplish the above without having to resort to loops and sed as the filtering tool?

Regards, Frans

Hello Tim,

Alas, using wget version 1.20.3 under openSUSE 15.2, the line excluding the artworks directory is not working. The whole artworks sub directory is downloaded. To be sure, I also copied your line exactly to see if that makes a difference. By the way, I tried this under openSUSE Tumbleweed as well. The -d option does not indicate anything about the regex used.

The strange thing is that when I use a similar approach for python, I am able to use the following arguments to the reject statement: ".*/(amd64|binaries|Debug|debug|deleted|OLD|old|Patches|patches|prev|previous|rpm|RPM|rpms|RPMS|temp|tmp|w32 |win32|.*(rc|RC|a|b|p)[[:digit:]]{1}.*)/.*" - my universal string for all other projects too.
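The last alternative in that string can be checked in isolation. The sketch below runs it through grep -E (assumed here to behave like a POSIX extended-regex matcher) against a few made-up names; the helper name check is hypothetical.

```shell
#!/bin/sh
# The version-suffix alternative from the string above, on its own.
suffix_re='.*(rc|RC|a|b|p)[[:digit:]]{1}.*'

# Report whether the name in $1 matches the suffix pattern.
check() {
    if printf '%s\n' "$1" | grep -Eq -- "$suffix_re"; then
        echo "$1: match"
    else
        echo "$1: no match"
    fi
}

# Hypothetical directory or file names.
check '3.10.0rc1'   # "rc" followed by a digit
check '3.9.0a4'     # "a" followed by a digit
check '3.9.1'       # no letter before a digit
check 'bzip2'       # "p" followed by a digit
```

Because the pattern is unanchored, any name containing a, b or p directly followed by a digit matches, which may be broader than intended for names such as the last one.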

With this I have to add that I also use an --accept-regex for python and no such addition for nongnu.

So I wonder why it seems to work on your side and not on mine.

--- Frans




