Re: WARC-Target-URI issue
From: Darshit Shah
Subject: Re: WARC-Target-URI issue
Date: Fri, 15 Nov 2024 22:47:10 +0100
Hi,
Thanks for the bug report.
Wget's WARC support is rudimentary, and as of now it only supports the older
WARC/1.0 standard.
Under the WARC/1.0 specification, the URI should be written enclosed in `<` and
`>` characters. This was changed in the WARC/1.1 specification.
It looks like the Wayback machine does not like WARC/1.0-style archives.
Unfortunately, I cannot apply your patch as-is, since it would break
compatibility with WARC/1.0.
Sadly, while we've wanted to update the implementation to WARC/1.1, there
hasn't been much interest from people in contributing that code.
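
For anyone who wants to pick this up, here is a minimal sketch of how a
version-aware writer could keep WARC/1.0 output unchanged while producing the
WARC/1.1 form on request. It is not Wget code: the `warc_version` enum and the
`format_uri_header` function are hypothetical names invented for illustration,
and the sketch formats into a caller-supplied buffer instead of using Wget's
own `warc_write_string`.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical version selector; this enum is an assumption for the
   sketch and does not exist in Wget. */
enum warc_version { WARC_1_0, WARC_1_1 };

/* Format a "Name: value\r\n" URI header line into BUF.
   WARC/1.0 encloses the URI in angle brackets; WARC/1.1 writes it bare.
   Returns false if VALUE is NULL or the line does not fit in SIZE bytes. */
bool
format_uri_header (char *buf, size_t size, enum warc_version version,
                   const char *name, const char *value)
{
  int n;

  if (!value)
    return false;

  if (version == WARC_1_0)
    n = snprintf (buf, size, "%s: <%s>\r\n", name, value);
  else
    n = snprintf (buf, size, "%s: %s\r\n", name, value);

  /* snprintf returns the length it wanted to write; reject truncation. */
  return n >= 0 && (size_t) n < size;
}
```

Calling it with `WARC_1_0`, `"WARC-Target-URI"` and `"http://example.com/"`
yields the bracketed line that Wget writes today; `WARC_1_1` yields the bare
form from the patch. A real change would also have to write `WARC/1.1` as the
record version line, so this only covers the URI formatting part.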
On Thu, Oct 31, 2024, at 16:04, ferencz.marton@icore.ro wrote:
> Good afternoon,
>
> We had an issue creating correct WARC files with Wget (even with the
> latest version, 1.24.5). The issue was caused by Wget writing the
> WARC-Target-URI field with a leading < and a trailing > character. This
> could not be processed by the wayback machine on replay.
>
> Reading the wiki, I noticed that WARC-Target-URI should not contain <>
> characters:
>
> https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
>
> So, I've updated the source of warc.c file with:
>
> //after
> static bool
> warc_write_header_uri (const char *name, const char *value)
> {
>   if (value)
>     {
>       warc_write_string (name);
>       warc_write_string (": <");
>       warc_write_string (value);
>       warc_write_string (">\r\n");
>     }
>   return warc_write_ok;
> }
>
> //added
> static bool
> warc_write_header_url (const char *name, const char *value)
> {
>   if (value)
>     {
>       warc_write_string (name);
>       warc_write_string (": ");
>       warc_write_string (value);
>       warc_write_string ("\r\n");
>     }
>   return warc_write_ok;
> }
>
> Where I found WARC-Target-URI
> Like: warc_write_header_uri ("WARC-Target-URI", url);
>
> I've changed it to:
>
> warc_write_header_url ("WARC-Target-URI", url);
>
> This way, the newly compiled Wget created good WARC files.
>
> Maybe this could be included in an upcoming release.
>
> Thank you.
>
> Best regards,
>
> Ferencz Marton
>
> CEO, iCore Outsourcing SRL
> Mobile: +40721275853
> Phone: +40368426655
> Email: ferencz.marton@icore.ro
> Website: https://www.icore.ro
> Address: Str. Dr. Victor Babes Nr. 36 Birou 1.10
> 500073 Brasov Romania