bug-wget

WARC-Target-URI issue


From: ferencz.marton
Subject: WARC-Target-URI issue
Date: Thu, 31 Oct 2024 15:04:46 -0000

Good afternoon,

We ran into a problem creating valid WARC files with Wget, including the
latest release, 1.24.5. Wget writes the WARC-Target-URI header with the URI
wrapped in < and > characters, and the Wayback Machine could not process
these records on replay.

Reading the WARC 1.1 specification, I noticed that the WARC-Target-URI value
should not be enclosed in <> characters:

https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
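
As far as I can tell from the WARC 1.1 grammar, WARC-Target-URI takes a bare
URI, while WARC-Record-ID is still written inside angle brackets. A minimal
sketch of the two forms follows. format_warc_header is a hypothetical helper
made up for illustration and is not a function in Wget; the CR/LF check is
my own assumption about what a careful writer would want.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper for illustration only (not part of wget).
   Formats "name: value\r\n" or "name: <value>\r\n" into buf.
   Returns the snprintf result, or -1 if value contains CR or LF
   (assumption: such a value would break the header line). */
static int
format_warc_header (char *buf, size_t size, const char *name,
                    const char *value, bool angle_brackets)
{
  assert (buf != NULL && name != NULL && value != NULL);

  if (strpbrk (value, "\r\n") != NULL)
    return -1;

  if (angle_brackets)
    return snprintf (buf, size, "%s: <%s>\r\n", name, value);

  return snprintf (buf, size, "%s: %s\r\n", name, value);
}
```

Called with "WARC-Target-URI", a URL, and false, this yields the bare form;
called with "WARC-Record-ID", a URN, and true, it yields the bracketed form.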

So I updated warc.c as follows:

// after this existing function:
static bool
warc_write_header_uri (const char *name, const char *value)
{
  if (value)
    {
      warc_write_string (name);
      warc_write_string (": <");
      warc_write_string (value);
      warc_write_string (">\r\n");
    }
  return warc_write_ok;
}

// added this new function, which writes the value without < and >:
static bool
warc_write_header_url (const char *name, const char *value)
{
  if (value)
    {
      warc_write_string (name);
      warc_write_string (": ");
      warc_write_string (value);
      warc_write_string ("\r\n");
    }
  return warc_write_ok;
}

Then, wherever WARC-Target-URI was written, for example:

warc_write_header_uri ("WARC-Target-URI", url);

I changed the call to:

warc_write_header_url ("WARC-Target-URI", url);

With this change, the newly compiled Wget produced valid WARC files.
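
To check a header line from a generated file, something like the following
could be used. This is only a sketch: target_uri_is_bare is a hypothetical
helper, not part of Wget, and it assumes the header name is spelled exactly
"WARC-Target-URI" (it does no case-insensitive matching).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of wget: returns true when a
   "WARC-Target-URI" header line carries a non-empty value that does
   not start with '<'. Lines for any other header return false. */
static bool
target_uri_is_bare (const char *line)
{
  static const char prefix[] = "WARC-Target-URI:";
  size_t n = sizeof prefix - 1;

  assert (line != NULL);
  if (strncmp (line, prefix, n) != 0)
    return false;

  /* Skip optional whitespace after the colon. */
  const char *p = line + n;
  while (*p == ' ' || *p == '\t')
    p++;

  return *p != '\0' && *p != '\r' && *p != '\n' && *p != '<';
}
```

A line written by the original warc_write_header_uri would fail this check,
and a line written by the added warc_write_header_url would pass it.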

Perhaps this fix could be included in an upcoming release.

Thank you.

Best regards,


Ferencz Marton

CEO, iCore Outsourcing SRL
Mobile: +40721275853
Phone: +40368426655
Email: ferencz.marton@icore.ro
Website: https://www.icore.ro
Address: Str. Dr. Victor Babes Nr. 36 Birou 1.10
500073 Brasov Romania
