bug-wget

RE: Exclusion failures


From: Roger Brooks
Subject: RE: Exclusion failures
Date: Tue, 13 Jul 2021 15:12:25 +0200

Thanks for the further tips.  Adding
--regex-type=pcre
resolved the problem with
"event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html", even
though I am using wget 1.19.1.
I am running wget on a Synology NAS, so the newest Windows build won't help.
I am using --restrict-file-names=windows to allow the resulting mirrored
website to be viewed on a Windows client.
The advice in this forum post:
https://serverfault.com/questions/324555/how-to-exclude-certain-directories-while-using-wget
made me realize that --exclude-directories probably didn't work for "fonts"
and "Fonts" because they are subdirectories.
The workaround suggested there of using --reject-regex instead is working
satisfactorily for me.  That said, I am still curious as to why directories
of the form "Fonts_ADMIN_<date>_Conflict" are being created at all.
Their parent directory is being recreated with a new GUID more often than I
anticipated, so I will pursue that question under a different title.
Here is the script with the working exclusions:
>>
wget -EkKrNpH \
     --output-file=wget.log \
     --domains=imcz.club,sf.wildapricot.org \
     --exclude-domains=webmail.imcz.club \
     --exclude-directories=calendar,Club-Events,External-Events,Fonts,fonts,Sys \
     --ignore-case \
     --level=2 \
     --no-parent \
     --no-proxy \
     --random-wait \
     --regex-type=pcre \
     --reject=ashx,"overlay*" \
     --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*|/[Ff]onts" \
     --rejected-log=wget-rejected.log \
     --restrict-file-names=windows \
     --wait=1 \
     https://imcz.club/
<<
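The "/[Ff]onts" term in the script above can be checked outside wget. The sketch below assumes a POSIX shell and a grep that supports -E and -q. The helper name is_rejected and the file name a.woff are made up for illustration; the directory names are modelled on the ones reported in this thread. Because --reject-regex is applied to the complete URL and the pattern is not anchored, it should catch a fonts directory at any depth.

```shell
#!/bin/sh
# The last alternative of the --reject-regex used in the script above.
pattern='/[Ff]onts'

# Hypothetical helper: succeeds if the URL contains a match for the pattern.
is_rejected() {
    printf '%s\n' "$1" | grep -Eq -- "$pattern"
}

# Hypothetical URLs; "a.woff" is an invented file name.
for url in \
    'https://imcz.club/BuiltTheme/whiteboard_maya_blue.v3.0/10e4279e/fonts/a.woff' \
    'https://imcz.club/BuiltTheme/whiteboard_maya_blue.v3.0/10e4279e/Fonts/a.woff' \
    'https://imcz.club/event-4193082.html'
do
    if is_rejected "$url"; then
        echo "reject: $url"
    else
        echo "keep:   $url"
    fi
done
```

This only exercises the pattern itself; it does not reproduce how wget decides which directories to create.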
Thanks for your help!
Regards, Roger

-----Original Message-----
From: Tim Rühsen <tim.ruehsen@gmx.de>
Sent: Thursday, July 8, 2021 7:54 PM
To: Roger Brooks <r.s.brooks@ieee.org>
Cc: bug-wget@gnu.org
Subject: Re: Exclusion failures

I think I don't understand your font/ problem correctly, sorry.

The regex issue seems to be that wget is using POSIX regex by default.
Please try to use --regex-type=pcre for PCRE regex.
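
One way to see the difference outside wget is to test the pattern with a POSIX extended-regex tool. The sketch below assumes a POSIX shell and a grep that supports -E and -q; the helper name matches_ere is made up for illustration. \d is not defined in POSIX extended regular expressions, so the result for the original pattern depends on the regex library, whereas [0-9] is portable.

```shell
#!/bin/sh
# Sample file name taken from this thread.
name='event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html'

# Hypothetical helper: succeeds if the sample name matches the given
# POSIX extended regular expression.
matches_ere() {
    printf '%s\n' "$name" | grep -Eq -- "$1"
}

# Portable form: [0-9] is defined in POSIX ERE.
if matches_ere 'event-[0-9]+[@?].*'; then
    echo "portable pattern: match"
else
    echo "portable pattern: no match"
fi

# Original form: \d is undefined in POSIX ERE, so the outcome depends on
# the regex implementation. Warnings, if any, are discarded.
if matches_ere 'event-\d+[@\?].*' 2>/dev/null; then
    echo "original pattern: match"
else
    echo "original pattern: no match"
fi
```

If the portable pattern matches where the original does not, replacing \d with [0-9] in --reject-regex may be an alternative on builds without PCRE support.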

You can get the latest version of wget built for Windows (incl. PCRE
support) at https://eternallybored.org/misc/wget/.

Regards, Tim

On 08.07.21 16:26, Roger Brooks wrote:
> Thanks for the explanations. Unfortunately, I don't find them convincing:
>
>>>
> So the fonts/ directory is not automatically deleted by wget when it
> is empty. It was used for temporary files during the download.
> <<
> Actually, the "fonts" directory is *not* empty, nor are the
> "Fonts_*_Conflict" directories.
>
>>>
> Why should '@CalendarView' match 'calendar[@/?]' ?
> <<
> The component of the regex which should match is not "calendar[@\?].*"
> (the first term in the regex). It is "event-\d+[@\?].*" (the fourth
> and last term in the regex).
> Once again, https://regex101.com/ confirms that
> "event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html"
> matches this term.
>
> Thanks for your support.
>
> -----Original Message-----
> From: Tim Rühsen <tim.ruehsen@gmx.de>
> Sent: Monday, July 5, 2021 4:09 PM
> To: Roger Brooks <r.s.brooks@ieee.org>; bug-wget@gnu.org
> Subject: Re: Exclusion failures
>
> On 28.06.21 19:36, Roger Brooks wrote:
>> I am trying to use wget 1.19.1 to back up a club website.  Here is a
>> reduced version of my wget command, which only accesses the public
>> parts of the website:
>>>>
>> cd /volume1/Backup/
>> wget -EkKrNpH \
>>        --output-file=wget.log \
>>        --domains=imcz.club,sf.wildapricot.org \
>>        --exclude-domains=webmail.imcz.club \
>>        --exclude-directories=calendar,Club-Events,External-Events,Sys,Fonts,fonts \
>>        --ignore-case \
>>        --level=2 \
>>        --no-parent \
>>        --no-proxy \
>>        --random-wait \
>>        --reject=ashx,"overlay*" \
>>        --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*" \
>>        --rejected-log=wget-rejected.log \
>>        --restrict-file-names=windows \
>>        --wait=1 \
>>        https://imcz.club/
>> <<
>>
>> Two of the exclusions in the command are failing:
>>
>> 1. --exclude-directories=Fonts,fonts
>> This is a workaround for wget’s creation of spurious font directories.
>> The server has only one such directory, but the website’s backend
>> platform (over which I have no control) sometimes addresses it as
>> “fonts” and sometimes as “Fonts”.
>> I expected that the option "--ignore-case" in the absence of
>> "--no-clobber" would take care of this problem, but since the
>> contents are static, I don’t need to back it up regularly.  Despite
>> the exclusion, wget still insists on creating the following directories:
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\fonts"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230456-2021_Conflict"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230459-2021_Conflict"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230501-2021_Conflict"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230504-2021_Conflict"
>> The resulting backup website does not find the fonts in the "_Conflict"
>> directories; they have to be copied into the "fonts" directory for
>> the pages in the mirrored site to display properly.
>
> So the fonts/ directory is not automatically deleted by wget when it
> is empty. It was used for temporary files during the download.
> This is a known "issue", but since an empty directory doesn't take up
> much space on a disk, it hasn't been fixed yet (maybe nobody thought it
> was relevant).
> Wget2 doesn't have this issue.
>
> I don't know where the *_Conflict/ directories are from. Seems like a
> server thing.
>
>
>> 2. --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*" \
>> This is an attempt to prevent duplicate downloading of files. The
>> following file is downloaded, even though https://regex101.com says
>> that it matches my regex:
>> "W:\imcz.club\event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html"
>> It is effectively a duplicate of:
>> "W:\imcz.club\event-4193082.html"
>> Increasing "--level" produces additional examples.
>
> Why should '@CalendarView' match 'calendar[@/?]' ?
> Maybe your regex should be '[@\?]calendar.*' !?
>
> Regards, Tim
>


