WARC-Target-URI issue
From: ferencz.marton
Subject: WARC-Target-URI issue
Date: Thu, 31 Oct 2024 15:04:46 -0000
Good afternoon,
We had a problem creating valid WARC files with Wget (even with the
latest release, 1.24.5). Wget writes the WARC-Target-URI header value
wrapped in a leading < and a trailing > character, and the Wayback Machine
could not process such records on replay.
Reading the specification, I noticed that the WARC-Target-URI value should
not contain the < > characters:
https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
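To check whether an existing WARC file is affected, a header line can be tested for a bracketed value. The following is only a minimal sketch: header_value_is_bracketed is a hypothetical helper, not part of Wget, and it assumes the header line has already been read into a NUL-terminated string.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of Wget: returns true when a header line
   such as "WARC-Target-URI: <http://example.com/>" carries its value
   wrapped in angle brackets.  */
bool
header_value_is_bracketed (const char *line)
{
  const char *value;
  size_t len;

  assert (line != NULL);

  value = strchr (line, ':');
  if (!value)
    return false;               /* not a "name: value" line */

  value++;                      /* skip the colon */
  while (*value == ' ' || *value == '\t')
    value++;                    /* skip whitespace before the value */

  len = strcspn (value, "\r\n");        /* ignore the trailing CRLF */
  return len >= 2 && value[0] == '<' && value[len - 1] == '>';
}
```

A line for which this returns true has the form that caused the replay problem described above.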
So I updated the warc.c source file. After the existing function:
static bool
warc_write_header_uri (const char *name, const char *value)
{
  if (value)
    {
      warc_write_string (name);
      warc_write_string (": <");
      warc_write_string (value);
      warc_write_string (">\r\n");
    }
  return warc_write_ok;
}
I added a variant that writes the value without the angle brackets:
static bool
warc_write_header_url (const char *name, const char *value)
{
  if (value)
    {
      warc_write_string (name);
      warc_write_string (": ");
      warc_write_string (value);
      warc_write_string ("\r\n");
    }
  return warc_write_ok;
}
Wherever WARC-Target-URI was written, for example:
  warc_write_header_uri ("WARC-Target-URI", url);
I changed the call to:
  warc_write_header_url ("WARC-Target-URI", url);
With this change, the newly compiled Wget created valid WARC files.
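The difference between the two header forms can be seen in isolation with a small standalone sketch. It is illustrative only: format_uri_header is a hypothetical name, and it formats into a caller-supplied buffer instead of writing to the WARC file through warc_write_string as warc.c does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the two writers in warc.c.  Formats
   "name: <value>\r\n" when BRACKETS is true and "name: value\r\n"
   otherwise.  Returns the number of characters written, or -1 if the
   buffer is too small.  */
int
format_uri_header (char *out, size_t out_size, const char *name,
                   const char *value, bool brackets)
{
  int n;

  assert (out != NULL && name != NULL && value != NULL);

  if (brackets)
    n = snprintf (out, out_size, "%s: <%s>\r\n", name, value);
  else
    n = snprintf (out, out_size, "%s: %s\r\n", name, value);

  if (n < 0 || (size_t) n >= out_size)
    return -1;                  /* output error or truncated */
  return n;
}
```

Calling it with brackets set to false for "WARC-Target-URI" and "http://example.com/" yields the line "WARC-Target-URI: http://example.com/" followed by CRLF, which is the form the patched Wget writes.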
Perhaps this change to warc.c could be included in an upcoming release.
Thank you.
Best regards,
Ferencz Marton
CEO, iCore Outsourcing SRL
Mobile: +40721275853
Phone: +40368426655
Email: ferencz.marton@icore.ro
Website: https://www.icore.ro
Address: Str. Dr. Victor Babes Nr. 36 Birou 1.10
500073 Brasov Romania