From: grarpamp
Subject: [Bug-wget] [bug #45803] More URI filters (regex etc) from commandline, file, and program
Date: Fri, 21 Aug 2015 06:14:16 +0000
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0

URL:
  <http://savannah.gnu.org/bugs/?45803>

                 Summary: More URI filters (regex etc) from commandline, file,
and program
                 Project: GNU Wget
            Submitted by: grarpamp
            Submitted on: Fri 21 Aug 2015 06:14:15 AM GMT
                Category: Feature Request
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: grarpamp
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any
                 Release: 1.16.3
        Operating System: None
         Reproducibility: None
           Fixed Release: None
         Planned Release: None
              Regression: None
           Work Required: None
          Patch Included: None

    _______________________________________________________

Details:

Adding the regex accept / reject URI filter was cool and very useful,
however it now needs to handle multiple regexes.

The current regex only supports one expression. And even though
that expression can be compacted by factoring out common elements,
it still becomes very long very fast, and it's also contextually
and programmatically unmanageable... and thus it's not as powerful
as it should be.

So while this is nice and typical ...

--accept-regex='^http://www\.gnu\.org/(foo|bar)/.*$'

This is also typical ... and the limitation of a single regex clearly
causes it to grow out of control into a visually useless and
unmanageable blob.

--accept-regex='^http://www\.gnu\.org/(foo/do(g|t)/rea(l|d/...|ping(\.jpg)?/...)|bar/(mon/[tmbp]ent?/...|red/[a-z0-9][^-]+-q))/.*$|^ftp://ftp\.gnu\.org/....*$'
--reject-regex='... ad nauseam ...'

The current semantics ...
 - if both accept and reject are specified, in any order, the URI
   "must fall through all" to be fetched.
 - only the last of multiple --accept-regex options is consulted.
 - only the last of multiple --reject-regex options is consulted.
... and the inability to do anything other than "POSIX" regexp, are
simply too limiting for complex requirements.
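
The semantics listed above can be modelled in a few lines. This is only
a sketch of the behaviour as described in this report, not wget's actual
code: the function name is hypothetical, and Python's "re" module stands
in for the POSIX regex engine.

```python
import re


def current_wget_filter(uri, accept_regexes, reject_regexes):
    """Model of the behaviour described above (not wget's real code).

    Only the LAST --accept-regex and the LAST --reject-regex count;
    earlier ones are silently ignored.
    """
    accept = accept_regexes[-1] if accept_regexes else None
    reject = reject_regexes[-1] if reject_regexes else None

    # "Must fall through all": failing either test drops the URI.
    if accept is not None and not re.search(accept, uri):
        return False
    if reject is not None and re.search(reject, uri):
        return False
    return True
```

With two accept patterns given, a URI matching only the first one is
dropped, which is exactly the limitation this request is about.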

Therefore wget needs to support multiple regex expressions, new
sources for those expressions, and general filter capability.

The easy-to-implement enhancements below will allow whatever script
or human is calling wget to effectively program and toggle (on or
off) various regexes, and to add all sorts of external intelligence
to fetching decisions. I have listed them in order of implementation
priority... 1, 2, 3.



1) Call an external program which returns 0 or 1 to signal acceptance
or rejection of each proposed URI. Wget shall wait for the program
to return. Any other exit status shall cause wget to terminate.
Since this is the most powerful and abstract method, yet potentially
slower than the others, it should be processed last, after all the
other filters and before passing the URI to the network. [1] [2]

--uri-filter-prog=prog1

The full URI including protocol, FQDN or IP, path, and parameters
shall be passed to the program in the environment variable
WGET_FILTER_URI. This variable shall contain exactly what is passed
to the current commandline regexes today, ie:

 https://www.example.com/foo/bar?a=b&c=d#123
 ftp://[::1]/foo/bar
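
To illustrate, a filter program for the proposed option could look
roughly like the sketch below. Assumptions: exit status 0 means accept
and 1 means reject (following the order in which the two are named
above), and the host allow-list and the ".iso" rule are an invented
policy used only for the example.

```python
import os
from urllib.parse import urlsplit

# Assumed mapping: 0 = accept, 1 = reject; anything else makes wget stop.
ACCEPT, REJECT, FATAL = 0, 1, 2

# Hypothetical policy, for illustration only.
ALLOWED_HOSTS = {"www.example.com"}


def decide(uri):
    """Return the exit status for one proposed URI."""
    parts = urlsplit(uri)
    if parts.hostname not in ALLOWED_HOSTS:
        return REJECT
    if parts.path.endswith(".iso"):
        return REJECT
    return ACCEPT


def main(environ=None):
    """Read WGET_FILTER_URI from the environment and decide."""
    environ = os.environ if environ is None else environ
    uri = environ.get("WGET_FILTER_URI")
    if not uri:
        # No URI was passed: signal a hard error rather than guess.
        return FATAL
    return decide(uri)

# As a standalone script, finish with:  sys.exit(main())
```

Because the decision lives in an ordinary program, it can consult
databases, timestamps, or anything else a regex cannot express.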


However, to support future flexibility, the following optional set
of variables should also be passed to the program if readily
implementable today (each of them can result in different serving
hierarchy contexts, and smart filter programs will utilize them
accordingly):

The full path to the directory into which wget will begin writing
 its output shall be passed in WGET_FILTER_BASEDIR, typically "pwd"
 or option --directory-prefix.
The protocol shall be passed in WGET_FILTER_PROTO, without "://".
Any URI specified username and password strings shall be passed in
 WGET_FILTER_USER and WGET_FILTER_PASS, these two must be set but
 may be empty.
The FQDN or IP shall be passed in WGET_FILTER_HOST, without "[]".
The port shall be passed in WGET_FILTER_PORT, without ":", the value
 shall be all numeric unless wget was unable to convert using
 /etc/services, in which case it will remain as found in the original
 URI or commandline.
The URI path and params shall be passed in WGET_FILTER_URIPATH, any
 leading slashes (/) shall be as found in the original URI
 or commandline and shall not be added or removed (examples of zero,
 one, and multiple slashes do exist in the wild).
If wget intends to write a pathname that does not match the original
 URI or commandline (such as the "index.html" in /pathname 302 to
 /pathname/, or /pathname/, or generated iteratives, or backup files,
 etc), that pathname shall be passed in WGET_FILTER_FAKEPATH, this
 must be set but may be empty.
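
The calling side of item 1 might be sketched as follows. This is not
wget code: both function names and the DEFAULT_PORTS table are
hypothetical, the /etc/services lookup for named ports is left out,
"params" is taken to mean the query string, and 0 = accept / 1 = reject
is an assumed mapping.

```python
import os
import subprocess
from urllib.parse import urlsplit

# Hypothetical stand-in for a real /etc/services lookup.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def filter_environment(uri, basedir, fakepath=""):
    """Build the WGET_FILTER_* variables for one proposed URI."""
    parts = urlsplit(uri)
    port = parts.port
    if port is None:
        port = DEFAULT_PORTS.get(parts.scheme, "")
    # Assumption: "path and params" = path plus query string;
    # leading slashes are kept exactly as urlsplit() found them.
    uripath = parts.path + ("?" + parts.query if parts.query else "")
    return {
        "WGET_FILTER_URI": uri,
        "WGET_FILTER_BASEDIR": basedir,
        "WGET_FILTER_PROTO": parts.scheme,        # without "://"
        "WGET_FILTER_USER": parts.username or "",  # set, may be empty
        "WGET_FILTER_PASS": parts.password or "",  # set, may be empty
        "WGET_FILTER_HOST": parts.hostname or "",  # without "[]"
        "WGET_FILTER_PORT": str(port),             # without ":"
        "WGET_FILTER_URIPATH": uripath,
        "WGET_FILTER_FAKEPATH": fakepath,          # set, may be empty
    }


def run_uri_filter(argv, uri, basedir):
    """Run the filter program, wait for it, and map its exit status."""
    env = {**os.environ, **filter_environment(uri, basedir)}
    status = subprocess.run(argv, env=env).returncode
    if status == 0:
        return True   # assumed: 0 = accept
    if status == 1:
        return False  # assumed: 1 = reject
    # Any other status: wget itself would terminate here.
    raise RuntimeError(f"filter program exited with status {status}")
```

The variables are added on top of the inherited environment so the
filter program still sees PATH, HOME, and so on.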



2) Implement multiple regex files via the commandline, as in the
notion of "egrep -f FILE". They shall be read for each proposed URI
to permit dynamic editing on the fly as may be needed during long
spidering / infinite recursion operation, but may be preloaded into
wget for performance. Follows the existing "must fall through all"
semantic. [3]

--accept-regex-file=file1 \
--reject-regex-file=file2 \
--regex-file-mode=(dynamic|preload), default dynamic.
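
The difference between the two modes can be sketched like this. The
class name is hypothetical, Python's "re" stands in for the regex
engine, and skipping blank lines is an assumption (a blank pattern
would otherwise match every URI).

```python
import re


class RegexFile:
    """One regex per line, as in "egrep -f FILE"."""

    def __init__(self, path, mode="dynamic"):
        if mode not in ("dynamic", "preload"):
            raise ValueError(f"unknown regex file mode: {mode!r}")
        self.path = path
        self.mode = mode
        # preload: read once at startup; dynamic: read on every lookup.
        self._cached = self._load() if mode == "preload" else None

    def _load(self):
        with open(self.path, encoding="utf-8") as fh:
            # Assumption: blank lines are skipped, not treated as patterns.
            return [re.compile(line.rstrip("\n")) for line in fh if line.strip()]

    def matches(self, uri):
        """True if any pattern in the file matches the URI."""
        patterns = self._cached if self.mode == "preload" else self._load()
        return any(p.search(uri) for p in patterns)
```

In dynamic mode an edit to the file takes effect on the very next URI,
at the cost of re-reading and recompiling the file each time.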



3) Implement multiple regex strings on the commandline. Follows the
existing "must fall through all" semantic.

--accept-regex=regex1 \
--accept-regex=regex2 \
--reject-regex=regex3 \
--reject-regex=regex4 \
--......-regex=regexN [...]
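
One possible reading of item 3 is sketched below. It assumes that a
URI must match at least one accept regex (when any are given) and must
match none of the reject regexes; if every accept regex is meant to
match instead, replace the first any() with all(). The function name is
hypothetical and Python's "re" stands in for the regex engine.

```python
import re


def uri_passes(uri, accept_regexes=(), reject_regexes=()):
    """Check one URI against several accept and reject regexes."""
    # Assumption: the accept patterns behave like one big alternation.
    if accept_regexes and not any(re.search(p, uri) for p in accept_regexes):
        return False
    # Any single reject match is enough to drop the URI.
    if any(re.search(p, uri) for p in reject_regexes):
        return False
    return True
```

Under this reading the unmanageable alternation shown earlier can be
split into one short --accept-regex per site or directory.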



Notes:

[1] Since a single program can implement anything, it is not necessary
to support multiple programs and logic such as:

--uri-filter-prog=progN [...]
--filter-prog-op=(or|and), default logical OR of all returns.
--filter-prog-spawn=(serial|parallel), default serial.

[2] Future features may specify the order in which each filter
method is applied:

--filter-order=regex:regex-file:filter-prog, default as shown.

Or to skip "must fall through all" handling:

--filter-fast-accept=regex:regex-file, default as shown.
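
As a rough sketch of how a --filter-order value might be handled (the
fast-accept variant is not covered): the function names are
hypothetical, and each stage is assumed to be a callable that takes
the URI and returns True to let it through.

```python
VALID_STAGES = ("regex", "regex-file", "filter-prog")


def parse_filter_order(value="regex:regex-file:filter-prog"):
    """Split a colon-separated stage list and validate it."""
    stages = value.split(":")
    unknown = [s for s in stages if s not in VALID_STAGES]
    if unknown:
        raise ValueError(f"unknown filter stage(s): {unknown}")
    if len(set(stages)) != len(stages):
        raise ValueError("a filter stage may be listed only once")
    return stages


def apply_filters(uri, order, stages):
    """Run the stages in the given order; stop at the first rejection.

    `stages` maps a stage name to a callable(uri) -> bool.
    Stages that were not configured are skipped.
    """
    for name in order:
        check = stages.get(name)
        if check is not None and not check(uri):
            return False
    return True
```

Keeping the external program last by default means the cheap regex
stages reject most URIs before a process has to be spawned.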

[3] Since the user can place nonmatching "comments" in such files,
only one accept and one reject file are needed; additional files
such as these are not necessary:

--accept-regex-file=file3 \
--reject-regex-file=file4 \
--......-regex-file=fileN [...]





    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?45803>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



