bug-wget

Re: wget and --reject-regex


From: Tim Rühsen
Subject: Re: wget and --reject-regex
Date: Sat, 26 Dec 2020 17:57:56 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0

Hi Frans,

my apologies, maybe I stopped the download too fast.

The command line with the artworks regex indeed has no effect.
In fact, after looking into the code, I can confirm that I hardly see any of the filtering applied to FTP URLs that we apply to HTTP.

I am currently not sure if that is a regression or if that possibly never worked. Maybe that was intended / planned by the original authors. Sorry, this also puzzles me a bit... have to test with older versions when time allows.

Regards, Tim

On 26.12.20 15:12, Frans de Boer wrote:
On 25-12-2020 18:42, Tim Rühsen wrote:
Hello Frans,

I tried with wget 1.20.3, and both of these commands work:

#1 Do not download smc/artworks/ directory:
wget -d -4 --mirror -nH -np --retr-symlinks=no --passive-ftp --no-verbose --cut-dirs=1 ftp://mirror.netcologne.de/savannah/smc/ --reject-regex=".*(/artworks/.*)"

#2 Do not download .bz2 and .rpm files
wget -d -4 --mirror -nH -np --retr-symlinks=no --passive-ftp --no-verbose --cut-dirs=1 ftp://mirror.netcologne.de/savannah/smc/ --reject-regex=".*(\.bz2|\.rpm)$"

(--regex-type=posix is default)
(the order of URL and options doesn't matter)
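The two patterns above can be checked in isolation, independently of wget. The following is a minimal sketch that runs them through `grep -E` (POSIX extended syntax, assumed here to be close to what `--regex-type=posix` uses) against a few sample URLs. Only the host and the smc/ prefix come from this thread; the file names are hypothetical. It shows what the patterns match as strings, not whether wget actually applies them to FTP downloads.

```shell
#!/bin/sh
# Hypothetical sample URLs; only the host and the smc/ prefix come from the thread.
urls='ftp://mirror.netcologne.de/savannah/smc/artworks/logo.png
ftp://mirror.netcologne.de/savannah/smc/game/smc-1.9.tar.bz2
ftp://mirror.netcologne.de/savannah/smc/game/smc-1.9.rpm
ftp://mirror.netcologne.de/savannah/smc/game/README'

# Print the URLs that a reject pattern does NOT match, i.e. the ones kept.
kept_urls() {
    printf '%s\n' "$urls" | grep -Ev -- "$1"
}

kept_artworks=$(kept_urls '.*(/artworks/.*)')   # pattern from command #1
kept_archives=$(kept_urls '.*(\.bz2|\.rpm)$')   # pattern from command #2

printf 'pattern 1 keeps:\n%s\n' "$kept_artworks"
printf 'pattern 2 keeps:\n%s\n' "$kept_archives"
```

If a pattern behaves as expected here but wget still downloads the files, the regex itself is probably not the problem.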

Regards, Tim

On 23.12.20 13:48, Frans de Boer wrote:
LS,

I found that wget 1.20 and later support some basic regular expressions. I had good results with --accept-regex, but the reject part is more troublesome. I can't use EREs, since only BREs seem to be supported, with the caveat that the pattern has to cover the whole URL.

I use wget to mirror some sites, but I do not want certain subdirectories included in the download. Think of subdirectories named rpm, debug, temp, etc.

Example:

wget -4 --mirror -nH -np --retr-symlinks=no --passive-ftp --no-verbose --cut-dirs=1 --regex-type posix --reject-regex "ftp\:\/\/mirror\.netcologne\.de\/savannah\/smc\/Screensaver\/" -P ./debugdir/nongnu ftp://mirror.netcologne.de/savannah/smc/

I tried this example with and without some of the backslashes, but none of the variants works. I also tried it with a single file, again to no avail. I understand that one can add multiple reject statements, but that is rather cumbersome when I have to specify them by hand; I would rather use a single ERE such as .*(dir1|dir2|dir3|...|dirx|(..ERE..)). I already have an ERE string ready and would like to use that instead. Breaking this string down into multiple reject statements might not work anyway if I can't even reject one file or subdirectory.

Is there a way to accomplish the above without having to resort to loops and sed as the filtering tool?
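As an illustration of the single-ERE approach described above, here is a minimal sketch that builds the alternation from a list of directory names instead of typing it by hand. The names rpm, debug and temp are the examples from this message; the variable names are hypothetical. Whether wget then honours the resulting pattern for FTP URLs is a separate question.

```shell
#!/bin/sh
# Directory names taken from the examples in the message above.
dirs='rpm debug temp'

# Join the names with "|" to form an ERE alternation.
# $dirs is intentionally unquoted so the shell splits it into words.
alternation=$(printf '%s\n' $dirs | paste -sd '|' -)
reject_regex=".*/(${alternation})/.*"

printf '%s\n' "$reject_regex"

# Hypothetical usage; <URL> is a placeholder:
#   wget --mirror --reject-regex="$reject_regex" <URL>
```

This prints `.*/(rpm|debug|temp)/.*`, which can be passed as one --reject-regex argument.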

Regards, Frans

Hello Tim,

Alas, using wget version 1.20.3 under openSUSE 15.2, the line that excludes the artworks directory is not working: the whole artworks subdirectory is downloaded. To be sure, I also copied your line exactly to see if that makes a difference. By the way, I tried this under openSUSE Tumbleweed as well. The -d output does not mention the regex at all.

The strange thing is that when I use a similar approach for Python, I am able to pass the following argument to the reject statement: ".*/(amd64|binaries|Debug|debug|deleted|OLD|old|Patches|patches|prev|previous|rpm|RPM|rpms|RPMS|temp|tmp|w32|win32|.*(rc|RC|a|b|p)[[:digit:]]{1}.*)/.*" - my universal string for all other projects too.

I should add that I also use an --accept-regex for Python, and no such option for nongnu.

So I wonder why it seems to work on your side and not on mine.

--- Frans



