[RFC] Make WARC output closer to WARC 1.1 standard
From: Darshit Shah
Subject: [RFC] Make WARC output closer to WARC 1.1 standard
Date: Fri, 15 Nov 2024 23:31:29 +0100

I'd like to apply the following patch to the GNU Wget master branch.
I'm looking for comments, especially from those who care about Wget's
WARC implementation. Do you think this is okay to apply in terms of
the broader WARC ecosystem?

With this patch, Wget will continue to generate WARC 1.0 files, but
with the angle brackets around the WARC-Target-URI value removed.
Should we change the version to WARC 1.1 now? While not feature
complete, I do believe that the files Wget generates are WARC 1.1
compliant.
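
To make the difference in output concrete, here is a minimal sketch of the two header styles. It is not code from Wget: `format_warc_header` and its `brackets` flag are hypothetical names used only for illustration, and the helper writes into a caller-supplied buffer rather than into a WARC file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Wget: formats one WARC header line
   into OUT.  With BRACKETS set, the value is wrapped in "<" and ">",
   which is the old style that warc_write_header_uri produced.  Without
   it, the value is written bare, as plain warc_write_header does.
   Returns false if the formatted line does not fit in OUT. */
bool
format_warc_header (char *out, size_t out_size, const char *name,
                    const char *value, bool brackets)
{
  int n;

  assert (out != NULL && name != NULL && value != NULL);

  if (brackets)
    n = snprintf (out, out_size, "%s: <%s>\r\n", name, value);
  else
    n = snprintf (out, out_size, "%s: %s\r\n", name, value);

  /* snprintf returns the length it wanted to write; anything that is
     negative or not smaller than the buffer size means failure or
     truncation. */
  return n >= 0 && (size_t) n < out_size;
}
```

Called with "WARC-Target-URI" and a URL, the bracketed form yields the old-style line with the URL inside `<` and `>`, and the unbracketed form yields the line this patch would make Wget emit.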
===
Wget has historically been one of the few implementations of the
WARC 1.0 standard that actually printed the URI enclosed in angle
brackets. This was identified as an erratum and removed from the
WARC 1.1 specification. However, since Wget hasn't updated its
implementation, it has continued to create old-style WARC files with
the angle brackets. Let's remove them and start generating WARC files
without the angle brackets. This does mean that Wget is now no longer
completely compliant with either the WARC 1.0 or WARC 1.1 standards.
But since most WARC libraries support the reading of such files, it
should not be a problem.

* src/warc.c: Remove `warc_write_header_uri` and replace all usages
with `warc_write_header`
---
src/warc.c | 24 ++++--------------------
1 file changed, 4 insertions(+), 20 deletions(-)
diff --git a/src/warc.c b/src/warc.c
index 230bd36f..bbc825f7 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -270,22 +270,6 @@ warc_write_header (const char *name, const char *value)
return warc_write_ok;
}
-/* Writes a WARC header with a URI as value to the current WARC record.
- This method may be run after warc_write_start_record and
- before warc_write_block_from_file. */
-static bool
-warc_write_header_uri (const char *name, const char *value)
-{
- if (value)
- {
- warc_write_string (name);
- warc_write_string (": <");
- warc_write_string (value);
- warc_write_string (">\r\n");
- }
- return warc_write_ok;
-}
-
/* Copies the contents of DATA_IN to the WARC record.
Adds a Content-Length header to the WARC record.
Run this method after warc_write_header,
@@ -1339,7 +1323,7 @@ warc_write_request_record (const char *url, const char *timestamp_str,
{
warc_write_start_record ();
warc_write_header ("WARC-Type", "request");
- warc_write_header_uri ("WARC-Target-URI", url);
+ warc_write_header ("WARC-Target-URI", url);
warc_write_header ("Content-Type", "application/http;msgtype=request");
warc_write_date_header (timestamp_str);
warc_write_header ("WARC-Record-ID", record_uuid);
@@ -1448,7 +1432,7 @@ warc_write_revisit_record (const char *url, const char *timestamp_str,
warc_write_header ("WARC-Refers-To", refers_to);
warc_write_header ("WARC-Profile",
"http://netpreserve.org/warc/1.0/revisit/identical-payload-digest");
warc_write_header ("WARC-Truncated", "length");
- warc_write_header_uri ("WARC-Target-URI", url);
+ warc_write_header ("WARC-Target-URI", url);
warc_write_date_header (timestamp_str);
warc_write_ip_header (ip);
warc_write_header ("Content-Type", "application/http;msgtype=response");
@@ -1540,7 +1524,7 @@ warc_write_response_record (const char *url, const char *timestamp_str,
warc_write_header ("WARC-Record-ID", response_uuid);
warc_write_header ("WARC-Warcinfo-ID", warc_current_warcinfo_uuid_str);
warc_write_header ("WARC-Concurrent-To", concurrent_to_uuid);
- warc_write_header_uri ("WARC-Target-URI", url);
+ warc_write_header ("WARC-Target-URI", url);
warc_write_date_header (timestamp_str);
warc_write_ip_header (ip);
warc_write_header ("WARC-Block-Digest", block_digest);
@@ -1597,7 +1581,7 @@ warc_write_record (const char *record_type, const char *resource_uuid,
warc_write_header ("WARC-Record-ID", resource_uuid);
warc_write_header ("WARC-Warcinfo-ID", warc_current_warcinfo_uuid_str);
warc_write_header ("WARC-Concurrent-To", concurrent_to_uuid);
- warc_write_header_uri ("WARC-Target-URI", url);
+ warc_write_header ("WARC-Target-URI", url);
warc_write_date_header (timestamp_str);
warc_write_ip_header (ip);
warc_write_digest_headers (body, payload_offset);
--
2.47.0