[Bug-wget] [bug #45803] More URI filters (regex etc) from commandline, file, and program
Fri, 21 Aug 2015 06:14:16 +0000
Summary: More URI filters (regex etc) from commandline, file, and program
Project: GNU Wget
Submitted by: grarpamp
Submitted on: Fri 21 Aug 2015 06:14:15 AM GMT
Category: Feature Request
Severity: 3 - Normal
Priority: 5 - Normal
Assigned to: None
Originator Name: grarpamp
Discussion Lock: Any
Operating System: None
Fixed Release: None
Planned Release: None
Work Required: None
Patch Included: None
Adding the regex accept / reject URI filter was cool and very useful;
however, it now needs to handle multiple regexes.
The current options support only one expression each. And even though
that expression can be compacted by factoring out common elements,
it still becomes very long very fast, and it is also contextually
and programmatically unmanageable... and thus it is not as powerful
as it should be.
So while this is nice and typical ...
This is also typical ... and the limitation of a single regex clearly
causes it to grow out of control into a visually useless and
unmanageable string:
--reject-regex='... ad nauseam ...'
The current semantics ...
- if both accept and reject are specified, in any order, the URI
"must fall through all" to be fetched.
- only the last of multiple --accept-regex are consulted.
- only the last of multiple --reject-regex are consulted.
... and the inability to do anything other than "POSIX" regexp are
simply too limiting for complex requirements.
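For contrast, the current single-regex behaviour can be sketched in a few
lines of Python (an illustrative model only; wget itself implements this in
C, and the sample patterns below are mine, not from wget):

```python
import re

def fetch_allowed(uri, accept=None, reject=None):
    """Model of wget's current --accept-regex / --reject-regex check:
    the URI "must fall through all" filters that were specified."""
    if accept is not None and not re.search(accept, uri):
        return False   # an accept regex was given but did not match
    if reject is not None and re.search(reject, uri):
        return False   # the reject regex matched
    return True        # the URI fell through all filters

# Note: with multiple --accept-regex options, only the last one would
# ever reach this check -- which is exactly the limitation above.
```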
Therefore wget needs to support multiple regex expressions, new
sources for those expressions, and general filter capability.
The easy-to-implement enhancements below will allow whatever script
or human is calling wget to effectively program and toggle (on or
off) various regexes, and to add all sorts of external intelligence
to fetching decisions. I have listed them in order of implementation
priority... 1, 2, 3.
1) Call an external program which returns 0 or 1 to signal acceptance
or rejection of each proposed URI. Wget shall wait for the program
to return. Any other exit status shall cause wget to terminate.
Since this is the most powerful and abstract method, yet potentially
slower than the others, it should be processed last, after all the
other filters, before passing the URI to the network.
The full URI, including protocol, FQDN or IP, path, and parameters,
shall be passed to the program in the environment variable
WGET_FILTER_URI. This variable shall contain exactly the string that
is passed to the current commandline regexes today.
However, to support future flexibility, the following optional set
of variables should also be passed to the program if readily
implementable today (each of them can result in different serving
hierarchy contexts, and smart filter programs will utilize them).
The full path to the directory into which wget will begin writing
its output shall be passed in WGET_FILTER_BASEDIR, typically the
current directory ("pwd") or the value of --directory-prefix.
The protocol shall be passed in WGET_FILTER_PROTO, without "://".
Any URI-specified username and password strings shall be passed in
WGET_FILTER_USER and WGET_FILTER_PASS; these two must be set but
may be empty.
The FQDN or IP shall be passed in WGET_FILTER_HOST, without "".
The port shall be passed in WGET_FILTER_PORT, without ":". The value
shall be all numeric unless wget was unable to convert using
/etc/services, in which case it will remain as found in the original
URI or commandline.
The URI path and params shall be passed in WGET_FILTER_URIPATH. Any
leading slashes (/) shall be as found in the original URI
or commandline and shall not be added or removed (examples of zero,
one, and multiple slashes do exist in the wild).
If wget intends to write a pathname that does not match the original
URI or commandline (such as the "index.html" in /pathname 302 to
/pathname/, or /pathname/, or generated iteratives, or backup files,
etc), that pathname shall be passed in WGET_FILTER_FAKEPATH; this
must be set but may be empty.
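Such a filter program could be as simple as the following Python sketch
(hypothetical: the WGET_FILTER_URI variable and the exit-code protocol are
the ones proposed above, and the sample patterns plus the decide() helper
are illustrative, not any existing wget interface):

```python
import os
import re
import sys

# Illustrative rules; a real filter could load these from anywhere.
REJECT = [re.compile(r"\.(?:iso|dmg)$"), re.compile(r"/tracking/")]
ACCEPT = [re.compile(r"^https?://")]

def decide(uri):
    """Return 0 to accept the URI, 1 to reject it ("must fall through all")."""
    if any(rx.search(uri) for rx in REJECT):
        return 1
    if ACCEPT and not any(rx.search(uri) for rx in ACCEPT):
        return 1
    return 0

if __name__ == "__main__" and "WGET_FILTER_URI" in os.environ:
    # Under this proposal, wget would set WGET_FILTER_URI and wait for
    # our exit status: 0 = fetch, 1 = skip, anything else = abort wget.
    sys.exit(decide(os.environ["WGET_FILTER_URI"]))
```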
2) Implement multiple regex files via the commandline, as in the
notion of "egrep -f FILE". They shall be read for each proposed URI,
to permit dynamic editing on the fly as may be needed during long
spidering / infinite recursion operation, but may be preloaded into
wget for performance. Follows the existing "must fall through all"
semantic.
--regex-file-mode=(dynamic|preload), default dynamic.
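A reject file for this mode might look like the following (a sketch; the
one-regex-per-line syntax is an assumption drawn from the "egrep -f FILE"
analogy):

```
# bulk binaries -- this "comment" is itself just a regex no URI matches
\.(iso|dmg|zip)$
# ad and tracking hosts / paths
/cgi-bin/ads/
^https?://ads\.
```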
3) Implement multiple regex strings on the commandline. Follows the
existing "must fall through all" semantic.
 Since a single program can implement anything, it is not necessary
to support multiple programs and logic such as:
--filter-prog-op=(or|and), default logical OR of all returns.
--filter-prog-spawn=(serial|parallel), default serial.
 Future features may specify the order in which each filter
method is applied:
--filter-order=regex:regex-file:filter-prog, default as shown.
Or to skip "must fall through all" handling:
--filter-fast-accept=regex:regex-file, default as shown.
 Since the user can place nonmatching "comments" in such files,
only one accept and one reject file are needed; these other files
are not necessary.
Message sent via/by Savannah