RE: Exclusion failures
From: Roger Brooks
Subject: RE: Exclusion failures
Date: Tue, 13 Jul 2021 15:12:25 +0200
Thanks for the further tips. Adding
--regex-type=pcre
resolved the problem with
"event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html", even
though I am using wget 1.19.1.
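
A quick way to see why the regex type matters: "\d" is a Perl-style
shorthand that POSIX extended regular expressions do not define, so under
the default POSIX type the event term cannot be relied on to match digits.
The following is a minimal sketch, not part of the backup script, that
checks a portable spelling of the term with "grep -E". The variable names
are illustrative, and it assumes that grep's extended regex behaves like
wget's POSIX matcher for this pattern.

```shell
#!/bin/sh
# File name from this thread. With --restrict-file-names=windows the '?'
# of the original URL is stored as '@', so the class [@?] covers both.
name='event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html'

# POSIX ERE has no \d shorthand, so the digit class is spelled out.
portable_re='event-[0-9]+[@?].*'

if printf '%s\n' "$name" | grep -Eq "$portable_re"; then
    result='match'
else
    result='no match'
fi
echo "$result"
```

On a wget build without PCRE support, writing the term as
"event-[0-9]+[@\?].*" in --reject-regex should be an equivalent workaround,
though that variant was not tested in this thread.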
I am running wget on a Synology NAS, so the newest Windows build won't help.
I am using --restrict-file-names=windows to allow the resulting mirrored
website to be viewed on a Windows client.
The advice in this forum post:
https://serverfault.com/questions/324555/how-to-exclude-certain-directories-while-using-wget
made me realize that --exclude-directories probably didn't work for "fonts"
and "Fonts" because they are subdirectories.
The workaround suggested there of using --reject-regex instead is working
satisfactorily for me. That said, I am still curious as to why directories
of the form "Fonts_ADMIN_<date>_Conflict" are being created at all.
Their parent directory is being recreated with a new GUID more often than I
anticipated, so I will pursue that question under a different title.
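
The effect of the "/[Ff]onts" term can be checked outside wget before
starting a crawl. This is a rough sketch under stated assumptions: the URLs
are hypothetical, modelled on the directory names reported in this thread
(the file names are invented), and "grep -E" stands in for wget's matcher,
which is assumed to search for the pattern anywhere in the URL.

```shell
#!/bin/sh
# The fonts term from the --reject-regex used in this thread.
fonts_re='/[Ff]onts'

# Hypothetical URLs; icons.woff and index.html are made up for illustration.
rejected=0
kept=0
for url in \
    'https://imcz.club/BuiltTheme/whiteboard_maya_blue.v3.0/10e4279e/fonts/icons.woff' \
    'https://imcz.club/BuiltTheme/whiteboard_maya_blue.v3.0/10e4279e/Fonts/icons.woff' \
    'https://imcz.club/index.html'
do
    # An unanchored search: the term may appear at any depth in the path.
    if printf '%s\n' "$url" | grep -Eq "$fonts_re"; then
        echo "reject: $url"
        rejected=$((rejected + 1))
    else
        echo "keep:   $url"
        kept=$((kept + 1))
    fi
done
echo "rejected=$rejected kept=$kept"
```

Because the term is not anchored, it matches the fonts directory at any
depth, which is the property the plain --exclude-directories entries lacked.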
Here is the script with the working exclusions:
>>
wget -EkKrNpH \
--output-file=wget.log \
--domains=imcz.club,sf.wildapricot.org \
--exclude-domains=webmail.imcz.club \
--exclude-directories=calendar,Club-Events,External-Events,Fonts,fonts,Sys \
--ignore-case \
--level=2 \
--no-parent \
--no-proxy \
--random-wait \
--regex-type=pcre \
--reject=ashx,"overlay*" \
--reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*|/[Ff]onts" \
--rejected-log=wget-rejected.log \
--restrict-file-names=windows \
--wait=1 \
https://imcz.club/
<<
Thanks for your help!
Regards, Roger
-----Original Message-----
From: Tim Rühsen <tim.ruehsen@gmx.de>
Sent: Thursday, July 8, 2021 7:54 PM
To: Roger Brooks <r.s.brooks@ieee.org>
Cc: bug-wget@gnu.org
Subject: Re: Exclusion failures
I think I don't understand your font/ problem correctly, sorry.
The regex issue seems to be that wget is using POSIX regex by default.
Please try to use --regex-type=pcre for PCRE regex.
You can get the latest version of wget built for Windows (incl. PCRE
support) at https://eternallybored.org/misc/wget/.
Regards, Tim
On 08.07.21 16:26, Roger Brooks wrote:
> Thanks for the explanations. Unfortunately, I don't find them convincing:
>
>>>
> So the fonts/ directory is not automatically deleted by wget when it
> is empty. It was used for temporary files during the download.
> <<
> Actually, the "fonts" directory is *not* empty, nor are the
> "Fonts_*_Conflict" directories.
>
>>>
> Why should '@CalendarView' match 'calendar[@/?]' ?
> <<
> The component of the regex which should match is not "calendar[@\?].*"
> (the first term in the regex). It is "event-\d+[@\?].*" (the fourth
> and last term in the regex).
> Once again, https://regex101.com/ confirms that
> "event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html"
> matches this term.
>
> Thanks for your support.
>
> -----Original Message-----
> From: Tim Rühsen <tim.ruehsen@gmx.de>
> Sent: Monday, July 5, 2021 4:09 PM
> To: Roger Brooks <r.s.brooks@ieee.org>; bug-wget@gnu.org
> Subject: Re: Exclusion failures
>
> On 28.06.21 19:36, Roger Brooks wrote:
>> I am trying to use wget 1.19.1 to back up a club website. Here is a
>> reduced version of my wget command, which only accesses the public
>> parts of the website:
>>>>
>> cd /volume1/Backup/
>> wget -EkKrNpH \
>> --output-file=wget.log \
>> --domains=imcz.club,sf.wildapricot.org \
>> --exclude-domains=webmail.imcz.club \
>> --exclude-directories=calendar,Club-Events,External-Events,Sys,Fonts,fonts \
>> --ignore-case \
>> --level=2 \
>> --no-parent \
>> --no-proxy \
>> --random-wait \
>> --reject=ashx,"overlay*" \
>> --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*" \
>> --rejected-log=wget-rejected.log \
>> --restrict-file-names=windows \
>> --wait=1 \
>> https://imcz.club/
>> <<
>>
>> Two of the exclusions in the command are failing:
>>
>> 1. --exclude-directories=Fonts,fonts
>> This is a workaround for wget’s creation of spurious font directories.
>> The server has only one such directory, but the website’s backend
>> platform (over which I have no control) sometimes addresses it as
>> “fonts” and sometimes as “Fonts”.
>> I expected that the option "--ignore-case" in the absence of "--no-clobber"
>> would take care of this problem, but since the contents are static, I
>> don’t need to back it up regularly. Despite the exclusion, wget
>> still insists on creating the following directories:
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\fonts"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230456-2021_Conflict"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230459-2021_Conflict"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230501-2021_Conflict"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230504-2021_Conflict"
>> The resulting backup website does not find the fonts in the "_Conflict"
>> directories; they have to be copied into the "fonts" directory for
>> the pages in the mirrored site to display properly.
>
> So the fonts/ directory is not automatically deleted by wget when it
> is empty. It was used for temporary files during the download.
> This is a known "issue", but since an empty directory doesn't take up
> much space on a disk, it hasn't been fixed yet (maybe nobody thought it
> was relevant).
> Wget2 doesn't have this issue.
>
> I don't know where the *_Conflict/ directories are from. Seems like a
> server thing.
>
>
>> 2. --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*" \
>> This is an attempt to prevent duplicate downloading of files. The
>> following file is downloaded, even though https://regex101.com says
>> that it matches my regex:
>> "W:\imcz.club\event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html"
>> It is effectively a duplicate of:
>> "W:\imcz.club\event-4193082.html"
>> Increasing "--level" produces additional examples.
>
> Why should '@CalendarView' match 'calendar[@/?]' ?
> Maybe your regex should be '[@\?]calendar.*' !?
>
> Regards, Tim
>