bug-wget

Re: wget and --reject-regex


From: Frans de Boer
Subject: Re: wget and --reject-regex
Date: Sat, 26 Dec 2020 15:12:27 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0

On 25-12-2020 18:42, Tim Rühsen wrote:
Hello Franz,

I tried with wget 1.20.3, and both of these commands work:

#1 Do not download smc/artworks/ directory:
wget -d -4 --mirror -nH -np --retr-symlinks=no --passive-ftp --no-verbose --cut-dirs=1 ftp://mirror.netcologne.de/savannah/smc/ --reject-regex=".*(/artworks/.*)"

#2 Do not download .bz2 and .rpm files
wget -d -4 --mirror -nH -np --retr-symlinks=no --passive-ftp --no-verbose --cut-dirs=1 ftp://mirror.netcologne.de/savannah/smc/ --reject-regex=".*(\.bz2|\.rpm)$"

(--regex-type=posix is default)
(the order of URL and options doesn't matter)
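
The two patterns above can be checked offline before starting a mirror run. The following is only a sketch: it assumes that `grep -E` (POSIX extended regex) behaves closely enough to wget's posix regex type, and the helper names and the sample file paths are made up for illustration. Only the host and the smc/ prefix come from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print the URLs read from stdin that a reject pattern matches.
# Assumption: grep -E approximates wget's --regex-type=posix matching.
would_reject() {
    grep -E -- "$1" || true
}

# Made-up sample URLs under the mirror path used in the thread.
sample_urls() {
    cat <<'EOF'
ftp://mirror.netcologne.de/savannah/smc/artworks/logo.png
ftp://mirror.netcologne.de/savannah/smc/game/smc-1.9.tar.bz2
ftp://mirror.netcologne.de/savannah/smc/game/smc-1.9.rpm
ftp://mirror.netcologne.de/savannah/smc/game/README
EOF
}

echo "Pattern #1 would reject:"
sample_urls | would_reject '.*(/artworks/.*)'

echo "Pattern #2 would reject:"
sample_urls | would_reject '.*(\.bz2|\.rpm)$'
```

If a URL that should be skipped does not show up in the output, the pattern is the likely culprit. If it does show up but wget still downloads it, the problem is more likely in how wget applies the pattern.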

Regards, Tim

On 23.12.20 13:48, Frans de Boer wrote:
LS,

I found that wget 1.20 and later support some basic regular expressions. I had good results with --accept-regex, but the reject part is more troublesome. I can't use EREs, since only BREs are supported, with the caveat that the pattern has to cover the whole URL.

I use wget to mirror some sites, but I do not want certain subdirectories included in the download. Think of subdirectories named rpm, debug, temp, etc.

Example:

wget -4 --mirror -nH -np --retr-symlinks=no --passive-ftp --no-verbose --cut-dirs=1 --regex-type posix --reject-regex "ftp\:\/\/mirror\.netcologne\.de\/savannah\/smc\/Screensaver\/" -P ./debugdir/nongnu ftp://mirror.netcologne.de/savannah/smc/

I tried this example with and without (partial) backslash escaping, but none of the variants works. I also tried it with a single file, to no avail. I understand that one can add multiple reject statements, but that is rather cumbersome when I have to specify them by hand; I would rather use an ERE of the form .*(dir1|dir2|dir3|...|dirx|(..ERE..)). I already have an ERE string ready and would like to use that instead. Breaking this string down into multiple reject statements might not work either, if I can't even reject one file or subdirectory.

Is there a way to accomplish the above without having to resort to loops and sed as the filtering tool?
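
For the alternation itself, one possible way to avoid typing it by hand is to join a list of directory names with `printf`. This is a minimal sketch, not a tested recipe for wget: the function name `build_reject_regex` is hypothetical, and it assumes the directory names contain no regex metacharacters.

```shell
#!/bin/sh
# Hypothetical helper: join directory names into a single ERE alternation.
# Assumption: the names contain no regex metacharacters that need escaping.
build_reject_regex() {
    alternation=$(printf '%s|' "$@")      # "rpm|debug|temp|"
    alternation=${alternation%"|"}        # drop the trailing '|'
    printf '.*/(%s)/.*\n' "$alternation"
}

pattern=$(build_reject_regex rpm debug temp)
echo "$pattern"
```

The resulting string could then be passed as --reject-regex="$pattern", provided the regex engine in use accepts ERE alternation.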

Regards, Frans

Hello Tim,

Alas, using wget version 1.20.3 under openSUSE 15.2, the line that excludes the artworks directory does not work: the whole artworks subdirectory is downloaded. To be sure, I also copied your line exactly to see if that makes a difference. By the way, I also tried this under openSUSE Tumbleweed. The -d option does not report anything about the regex used.

The strange thing is that when I use a similar approach for python, I am able to pass the following argument to the reject statement: ".*/(amd64|binaries|Debug|debug|deleted|OLD|old|Patches|patches|prev|previous|rpm|RPM|rpms|RPMS|temp|tmp|w32|win32|.*(rc|RC|a|b|p)[[:digit:]]{1}.*)/.*". This is my universal string for all other projects too.
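
To see what a string of this shape actually matches, a shortened form of it can be tried against a few URLs with `grep -E`. This is a sketch under assumptions: the host and paths are invented, `check_url` is a hypothetical helper, and `grep -E` is assumed to be a fair stand-in for the regex engine wget uses.

```shell
#!/bin/sh
# Shortened form of the pattern quoted above (fewer directory names).
pattern='.*/(debug|old|rpm|tmp|.*(rc|RC|a|b|p)[[:digit:]]{1}.*)/.*'

# Hypothetical helper: report whether the pattern matches a URL.
check_url() {
    if printf '%s\n' "$1" | grep -E -q -- "$pattern"; then
        echo "reject  $1"
    else
        echo "keep    $1"
    fi
}

# Invented URLs, for illustration only.
check_url 'ftp://example.org/python/3.9.1/Python-3.9.1.tar.xz'
check_url 'ftp://example.org/python/3.9.1/debug/notes.txt'
check_url 'ftp://example.org/python/3.10.0rc1/Python-3.10.0rc1.tgz'
```

Note that the inner `.*` can also span slashes, so the last alternative matches any URL in which rc, RC, a, b or p is directly followed by a digit somewhere between two slashes, not only inside a single directory name.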

I should add that for python I also use an --accept-regex, whereas for nongnu I do not.

So I wonder why it seems to work on your side and not on mine.

--- Frans

--
A: Yes, just like that                            A: Ja, net zo
Q: Oh, Just like reading a book backwards         Q: Oh, net als een boek achterstevoren lezen
A: Because it upsets the natural flow of a story  A: Omdat het de natuurlijke gang uit het verhaal haalt
Q: Why is top-posting annoying?                   Q: Waarom is Top-posting zo irritant?



