bug-wget

Re: WARC-Target-URI issue


From: Darshit Shah
Subject: Re: WARC-Target-URI issue
Date: Fri, 15 Nov 2024 22:47:10 +0100

Hi,

Thanks for the bug report.

Wget's WARC support is rudimentary and, as of now, only implements the older
WARC/1.0 standard.
Under the WARC/1.0 specification, the URI is written between `<` and `>`
characters. The WARC/1.1 specification changed this, so WARC-Target-URI no
longer carries the angle brackets.

It looks like the Wayback machine does not like WARC/1.0-style archives.
Unfortunately, I cannot apply your patch as-is, since it would break
compatibility with WARC/1.0.
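
To illustrate the compatibility concern, here is a minimal sketch (not Wget's actual code) of how a URI header could be formatted for either version. The `warc_version` enum and the `format_header_uri` function are hypothetical names introduced for this example; Wget currently has no such version switch, and this sketch writes into a caller-supplied buffer rather than using `warc_write_string`.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical version selector; Wget has no such setting today. */
enum warc_version
{
  WARC_VERSION_1_0,
  WARC_VERSION_1_1
};

/* Format one URI-valued header line into BUF.
   WARC/1.0 style: "Name: <value>\r\n"
   WARC/1.1 style: "Name: value\r\n"
   Returns false if NAME or VALUE is NULL, or if BUF is too small. */
static bool
format_header_uri (char *buf, size_t size, enum warc_version version,
                   const char *name, const char *value)
{
  int n;

  if (!name || !value)
    return false;

  if (version == WARC_VERSION_1_0)
    n = snprintf (buf, size, "%s: <%s>\r\n", name, value);
  else
    n = snprintf (buf, size, "%s: %s\r\n", name, value);

  /* snprintf returns the length it wanted to write; reject truncation. */
  return n >= 0 && (size_t) n < size;
}
```

Called with `WARC_VERSION_1_0`, the name `WARC-Target-URI`, and the value `http://example.com/`, this produces `WARC-Target-URI: <http://example.com/>`; with `WARC_VERSION_1_1` the angle brackets are dropped. A real change would also need to write a matching `WARC/1.1` version line and review the other header fields.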

Sadly, while we've wanted to update the implementation to WARC/1.1, nobody
has shown much interest in contributing that code.

On Thu, Oct 31, 2024, at 16:04, ferencz.marton@icore.ro wrote:
> Good afternoon,
>
> We had an issue creating correct WARC files with Wget (even with the
> latest release, 1.24.5). The issue was caused by Wget saving the
> WARC-Target-URI header with a leading < and a trailing > character. The
> wayback machine could not process this on replay.
>
> Reading the wiki, I noticed that WARC-Target-URI should not contain <>
> characters:
>
> https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
>
> So I've updated the source of the warc.c file with:
>
> // after
> static bool
> warc_write_header_uri (const char *name, const char *value)
> {
>   if (value)
>     {
>       warc_write_string (name);
>       warc_write_string (": <");
>       warc_write_string (value);
>       warc_write_string (">\r\n");
>     }
>   return warc_write_ok;
> }
>
> // added
> static bool
> warc_write_header_url (const char *name, const char *value)
> {
>   if (value)
>     {
>       warc_write_string (name);
>       warc_write_string (": ");
>       warc_write_string (value);
>       warc_write_string ("\r\n");
>     }
>   return warc_write_ok;
> }
>
> Wherever I found WARC-Target-URI being written, like:
>
>   warc_write_header_uri ("WARC-Target-URI", url);
>
> I changed it to:
>
>   warc_write_header_url ("WARC-Target-URI", url);
>
> This way the newly compiled wget created good WARC files.
>
> Maybe this could be included in the upcoming release.
>
> Thank you.
>
> Best regards,
>
> Ferencz Marton
> CEO, iCore Outsourcing SRL
> Mobile: +40721275853
> Phone: +40368426655
> Email: ferencz.marton@icore.ro
> Website: https://www.icore.ro
> Address: Str. Dr. Victor Babes Nr. 36 Birou 1.10
> 500073 Brasov Romania