From: Ángel González
Subject: Re: [Bug-wget] wget mirror site failing due to file / directory name clashes
Date: Sat, 13 Oct 2012 15:44:46 +0200
User-agent: Thunderbird

On 12/10/12 15:38, Paul Beckett (ITCS) wrote:
> I am attempting to use wget to create a mirrored copy of a CMS (Liferay) 
> website. I want to be able to failover to this static copy in case the 
> application server goes offline. I therefore need the URLs to remain 
> absolutely identical. The problem I have is that I cannot figure out how I 
> can configure wget in a way that will cope with:
> http://www.example.com/about
> http://www.example.com/about/something
>
> In this case either the file or directory 'about' already exists and prevents 
> the second from being created.
>
> Initially I thought the most obvious solution was to rely on Apache's 
> DirectoryIndex, and save the files as:
> /about/index.html
> /about/something/index.html
>
> But currently I can't figure out how to do this in a way that doesn't break 
> the relative paths to other pages, or that doesn't create links pointing to 
> index.html rather than the original location. I need the links (a href etc.) 
> to still go to /about and not explicitly reference /index.html, as otherwise 
> people may bookmark URLs that won't exist when the CMS comes back.
>
> If anyone can offer me any advice on how I can achieve this (either the 
> correct options, or how I could patch the source code to achieve it), I 
> would be extremely grateful.
>
> Thanks,
> Paul
>
>
>
> /usr/local/bin/wget --background --append-output=/tmp/wget-log --no-verbose 
> --tries=20 --waitretry=10 --retry-connrefused --limit-rate=100m 
> --quota=10000m --timestamping 
> --directory-prefix=/usr/local/apache2/content/uk.ac.uea.www_flat2 
> --protocol-directories --user-agent="UEA WebSite Flattener" 
> --backup-converted -e robots=off --page-requisites --convert-links 
> --recursive --level=inf --trust-server-names --domains example.com 
> www.example.com
Download with --adjust-extension
This way, you will get:

/about.html
/about/something.html


Then configure the root of the static copy:
RewriteEngine On
RewriteCond  %{SCRIPT_FILENAME} !\.html$
RewriteRule ^(.*[^/])/?$ $1.html

so that the .html extension is appended to the requested URLs.
If your CMS returns non-HTML content on some URLs, you will need
to adjust this to exclude them from the rewrite.
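
For instance, a minimal sketch (assuming the rules live in the static copy's
.htaccess or vhost, and that non-HTML files were mirrored under their original
names) is to rewrite only when wget actually produced a matching .html file:

RewriteEngine On
# Only rewrite when the mirrored .html file exists on disk;
# images, CSS and other assets fall through unchanged
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(.*[^/])/?$ $1.html

This works because mod_rewrite matches the RewriteRule pattern first and only
then evaluates the RewriteConds, so $1 is already available in the condition.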

Also, I'd remove --convert-links from the command line, since you want the
mirrored pages to contain exactly the same links (and contents) as the live site.
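
Putting both changes together, the command would look roughly like this (a
sketch based on your original invocation, with your paths and user-agent kept
as-is):

/usr/local/bin/wget --background --append-output=/tmp/wget-log --no-verbose \
    --tries=20 --waitretry=10 --retry-connrefused --limit-rate=100m \
    --quota=10000m --timestamping \
    --directory-prefix=/usr/local/apache2/content/uk.ac.uea.www_flat2 \
    --protocol-directories --user-agent="UEA WebSite Flattener" \
    --backup-converted -e robots=off --page-requisites --adjust-extension \
    --recursive --level=inf --trust-server-names --domains example.com \
    www.example.com

(--backup-converted is probably redundant once --convert-links is gone, since
it only backs up files that are about to have their links converted.)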
