branch master updated: Encode more file names for epub
From: Patrice Dumas
Subject: branch master updated: Encode more file names for epub
Date: Sat, 05 Mar 2022 11:11:15 -0500
This is an automated email from the git hooks/post-receive script.
pertusus pushed a commit to branch master
in repository texinfo.
The following commit(s) were added to refs/heads/master by this push:
new 2973fdcd7f Encode more file names for epub
2973fdcd7f is described below
commit 2973fdcd7f8c6d7a090f3d452dcfbc384561c47d
Author: Patrice Dumas <pertusus@free.fr>
AuthorDate: Sat Mar 5 17:11:02 2022 +0100
Encode more file names for epub
* tp/init/epub3.pm: set OUTPUT_ENCODING_NAME and
LOCALE_OUTPUT_FILE_NAME_ENCODING to utf-8 as mandated by the
specification.
* tp/init/epub3.pm (epub_setup, epub_finish): encode file and
directory names. Check more errors from Archive::Zip.
---
ChangeLog | 11 ++++++++
tp/TODO | 80 ++++++++++++++++++++++++++++++++++++++------------------
tp/init/epub3.pm | 74 +++++++++++++++++++++++++++++++++++++++++++--------
3 files changed, 128 insertions(+), 37 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index d271cc4570..4b4dcc1e0c 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,14 @@
+2022-03-05 Patrice Dumas <pertusus@free.fr>
+
+ Encode more file names for epub
+
+ * tp/init/epub3.pm: set OUTPUT_ENCODING_NAME and
+ LOCALE_OUTPUT_FILE_NAME_ENCODING to utf-8 as mandated by the
+ specification.
+
+ * tp/init/epub3.pm (epub_setup, epub_finish): encode file and
+ directory names. Check more errors from Archive::Zip.
+
2022-03-04 Gavin Smith <gavinsmith0123@gmail.com>
File name encoding variables for XS parser
diff --git a/tp/TODO b/tp/TODO
index 69bfe26085..2ea2715a3a 100644
--- a/tp/TODO
+++ b/tp/TODO
@@ -22,6 +22,10 @@ for @example args, use *-user as class?
Test of init/highlight_syntax.pm with non ascii characters in highlighted
@example contents.
+Convert/Converter.pm
+move create_destination_directory error message out of create_destination_directory
+or pass directory as a character string, or encoding?
+
bytes. To check that they can never be upgraded + document
* texi2any.pl
@@ -34,7 +38,7 @@ bytes. To check that they can never be upgraded + document
Tests in non utf8 locales
-Tests ok
+Tests with correct or acceptable results
t/formats_encodings.t manual_simple_utf8_with_error
utf8 manual with errors involving non ascii strings
@@ -72,38 +76,65 @@ Issue to add 'tests/included_lat'$'\356''n1.texi' in make dist
tests/other manual_include_accented_file_name_latin1
./texi2any.pl --force -I tests/ tests/other/manual_include_accented_file_name_latin1.texi
+latin1 encoded and latex2html in latin1 locale
+./texi2any.pl --html --init init/latex2html.pm tests/tex_html/tex_encode_latin1.texi
+
+latin1 encoded and tex4ht in latin1 locale
+./texi2any.pl --html --init init/tex4ht.pm tests/tex_html/tex_encode_latin1.texi
+
+./texi2any.pl --html --init init/tex4ht.pm tex_encodé_latin1.texi
+Firefox can't find tex_encod%uFFFD_latin1_html/Chapter.html (?)
+Opened from within the directory, still can't find the image file:
+tex_encod%E9_latin1_html/tex_encod%C3%A9_latin1_tex4ht_tex0x.png
+The file names and file contents looks right, though, with latin1 only
+encoded characters.
+
+epub for utf8 encoded manual in latin1 locale
+./texi2any.pl --force -I tests/ --init init/epub3.pm tests/formatting/os*.texi
-rename with utf8
-#./texi2any.pl tests/tex_html/tex_encodé.texi
+epub for latin1 encoded manual in latin1 locale
+cp tests/tex_html/tex_encode_latin1.texi tex_encodé_latin1.texi
+./texi2any.pl --init init/epub3.pm tex_encodé_latin1.texi
-Not ok:
-./texi2any.pl --html --init init/latex2html.pm tests/tex_html/tex_encod??.texi
+
+Tests with incorrect results, though not bugs
+
+utf8 encoded manual name and latex2html in latin1 locale
+./texi2any.pl --html --init init/latex2html.pm tests/tex_html/tex_encod*_utf8.texi
-> the input file name is decoded using DATA_INPUT_ENCODING_NAME
which is set to ISO-8859-1, although the file name is in utf8.
-./texi2any.pl -c 'DATA_INPUT_ENCODING_NAME utf-8' --html --init init/latex2html.pm tests/tex_html/tex_encod??.texi
+./texi2any.pl -c 'DATA_INPUT_ENCODING_NAME utf-8' -c 'L2H_CLEAN 0' --html --init init/latex2html.pm tests/tex_html/tex_encod*_utf8.texi
+Errors out with
+utf8 "\xE9" does not map to Unicode at ./init/latex2html.pm line 515, <L2H_HTML> line 6.
+which corresponds to tex_encodé_utf8_html/tex_encodé_utf8_l2h.html
+<TITLE>tex_encodé_utf8_l2h</TITLE>, with the é encoded in latin1.
+Probably latex2html outputs the latin1 file names as it gets from the command
+line binary encoded. There are also utf8 encoded characters in the file.
+
+latin1 encoded manual name and latex2html in latin1 locale
+cp tests/tex_html/tex_encode_latin1.texi tex_encodé_latin1.texi
+./texi2any.pl -c 'L2H_CLEAN 0' --html --init init/latex2html.pm tex_encodé_latin1.texi
Errors out with
-# l2h: use tex_encodé_html/tex_encodé_l2h.html as html file
utf8 "\xE9" does not map to Unicode at ./init/latex2html.pm line 515, <L2H_HTML> line 6.
-which corresponds to <TITLE>tex_encodé_l2h</TITLE>, with the é encoded in latin1.
+which corresponds to tex_encodé_latin1_html/tex_encodé_latin1_l2h.html
+<TITLE>tex_encodé_latin1_l2h</TITLE> with the é encoded in latin1.
Probably latex2html outputs the latin1 file names as it gets from the command
-line.
+line binary encoded. There are also utf8 encoded characters in the file.
+
+utf8 encoded manual name and tex4ht in latin1 locale
+./texi2any.pl --html --init init/tex4ht.pm tests/tex_html/tex_encod*_utf8.texi
+html file generated by tex4ht with content="text/html; charset=iso-8859-1">,
+with character encoded in utf8 <img src="tex_encodé_utf8_tex4ht_tex0x.png" ...>
+firefox opens tex_encodé_utf8_html/Chapter.html but does not find the image
+and shows a path like tex_encodé_utf8_html/tex_encodé_utf8_tex4ht_tex0x.png
+mixing latin1 and utf8.
-additional tests TODO
-+ need to have file names/file content encoded in the non utf8 locale
-+ the test needs also to be checked in non utf8 locale
-tests/many_input_files/tex_l2h_output_dir_non_ascii.sh
-tests/many_input_files/tex_t4ht_output_dir_non_ascii.sh
-test l2h and tex4ht with non ascii file names in non utf8 locale.
-corresponds to tex_l2h_output_dir_non_ascii.sh
-tex_t4ht_output_dir_non_ascii.sh.
-Maybe with non utf8 input files too?
+Tests in utf8 locales. The archive epub file is not tested in the automated tests.
+
+epub for utf8 encoded manual in utf8 locale
+./texi2any.pl --force -I tests/ --init init/epub3.pm tests/formatting/os*.texi
-epub tests would also be interesting in 8 byte locale, with
-8 byte documents. In the EPUB specification it is said
-6.1.3 File Paths and File Names
- File Names and Paths MUST be UTF-8 [Unicode] encoded.
-First step would be to implement that.
Test more interesting in non utf8 locale
Add tests even if not as interesting in UTF8 locale as in non UTF8?
@@ -115,9 +146,6 @@ Texinfo/Convert/Text.pm output()
checks decoded/encoded and fix. Need to verify input available information
-Check Archive::Zip/EPUB_CREATE_CONTAINER in epub with non ascii
-
-
Associated code to check, requires bytes in input both for directory and
file name and return bytes
locate_init_file
diff --git a/tp/init/epub3.pm b/tp/init/epub3.pm
index 9ac691743c..65a390bbee 100644
--- a/tp/init/epub3.pm
+++ b/tp/init/epub3.pm
@@ -37,6 +37,9 @@
use strict;
+# for accented character in a comment
+use utf8;
+
use File::Path;
use File::Spec;
use File::Copy;
@@ -63,7 +66,9 @@ texinfo_set_from_init_file('EPUB_CREATE_CONTAINER', 1);
texinfo_set_format_from_init_file('html');
-# output valid XHTML
+# output valid XHTML as per the specification
+# Any Publication Resource that is an XML-Based Media Type MUST
+# be a conformant XML 1.0 Document ... MUST be encoded in UTF-8 or UTF-16.
texinfo_set_from_init_file('HTML_ROOT_ELEMENT_ATTRIBUTES',
'xmlns="http://www.w3.org/1999/xhtml"');
texinfo_set_from_init_file('NO_CUSTOM_HTML_ATTRIBUTE', 1);
@@ -71,6 +76,16 @@ texinfo_set_from_init_file('USE_XML_SYNTAX', 1);
texinfo_set_from_init_file('DOCTYPE', '<?xml version="1.0" encoding="UTF-8"?>'."\n"
.'<!DOCTYPE html>');
texinfo_set_from_init_file('USE_NUMERIC_ENTITY', 1);
+texinfo_set_from_init_file('OUTPUT_ENCODING_NAME', 'utf-8');
+
+# this is actually the default
+texinfo_set_from_init_file('DOC_ENCODING_FOR_OUTPUT_FILE_NAME', 0);
+# the specification says "File Names and Paths MUST be UTF-8 [Unicode] encoded."
+# This is also needed for Archive::Zip in case there are non ascii
+# file name.
+# As a conséquence, the epub file file name is also always utf-8 encoded.
+texinfo_set_from_init_file('LOCALE_OUTPUT_FILE_NAME_ENCODING', 'utf-8');
+
# the copiable anchor paragraph sign is always present and no link is
# shown in the calibre epub reader. Since it looks strange, unset.
@@ -172,7 +187,7 @@ sub epub_convert_image_command($$$$)
if (! -d $encoded_images_destination_dir) {
if (!mkdir($encoded_images_destination_dir, oct(755))) {
$self->document_error($self, sprintf(__(
- "could not create directory `%s': %s"),
+ "could not create images directory `%s': %s"),
$images_destination_dir, $!));
return $result;
}
@@ -294,7 +309,9 @@ sub epub_setup($)
}
my $err_remove_tree;
- File::Path::remove_tree($epub_destination_directory,
+ my ($encoded_epub_destination_directory, $epub_destination_dir_encoding)
+ = $self->encoded_output_file_name($epub_destination_directory);
+ File::Path::remove_tree($encoded_epub_destination_directory,
{'error' => $err_remove_tree});
if ($err_remove_tree and scalar(@$err_remove_tree)) {
for my $diag (@$err_remove_tree) {
@@ -313,7 +330,9 @@ sub epub_setup($)
return 0;
}
my $err_make_path;
- File::Path::make_path($epub_document_destination_directory,
+ my ($encoded_epub_document_destination_directory, $epub_doc_dest_dir_encoding)
+ = $self->encoded_output_file_name($epub_document_destination_directory);
+ File::Path::make_path($encoded_epub_document_destination_directory,
{'mode' => 0755, 'error' => $err_make_path});
if ($err_make_path and scalar(@$err_make_path)) {
for my $diag (@$err_make_path) {
@@ -358,7 +377,7 @@ sub epub_finish($$)
= $self->encoded_output_file_name($meta_inf_directory);
if (!mkdir($encoded_meta_inf_directory, oct(755))) {
$self->document_error($self, sprintf(__(
- "could not create directory `%s': %s"),
+ "could not create meta informations directory `%s': %s"),
$meta_inf_directory, $!));
return 0;
}
@@ -691,14 +710,47 @@ EOT
if ($self->get_conf('EPUB_CREATE_CONTAINER')) {
require Archive::Zip;
+ # this is needed if there are non ascii file names, otherwise, for instance
+ # with calibre the files cannot be read, one get
+ # "There is no item named 'EPUB/osé.opf' in the archive"
+ # even though unzip -l lists the file well. More testing is probably
+ # needed on other plaforms.
+ local $Archive::Zip::UNICODE = 1;
my $zip = Archive::Zip->new();
- $zip->addFile($mimetype_file_path_name, $mimetype_filename);
- $zip->addTree($meta_inf_directory, $meta_inf_directory_name);
- $zip->addTree(File::Spec->catdir($epub_destination_directory,
- $epub_document_dir_name),
- $epub_document_dir_name);
+ my $mimetype_added
+ = $zip->addFile($encoded_mimetype_file_path_name, $mimetype_filename);
+ if (not(defined($mimetype_added))) {
+ $self->document_error($self,
+ sprintf(__("epub3.pm: error adding %s to archive"),
+ $mimetype_file_path_name));
+ return 0;
+ }
+
+ my $meta_inf_directory_ret_code
+ = $zip->addTree($encoded_meta_inf_directory, $meta_inf_directory_name);
+ if ($meta_inf_directory_ret_code != Archive::Zip->AZ_OK) {
+ $self->document_error($self,
+ sprintf(__("epub3.pm: error adding %s to archive"),
+ $meta_inf_directory));
+ return 0;
+ }
+
+ my $epub_document_dir_path = File::Spec->catdir($epub_destination_directory,
+ $epub_document_dir_name);
+ my ($encoded_epub_document_dir_path, $epub_document_dir_path_encoding)
+ = $self->encoded_output_file_name($epub_document_dir_path);
+ my $epub_document_dir_name_ret_code
+ = $zip->addTree($encoded_epub_document_dir_path, $epub_document_dir_name);
+ if ($epub_document_dir_name_ret_code != Archive::Zip->AZ_OK) {
+ $self->document_error($self,
+ sprintf(__("epub3.pm: error adding %s to archive"),
+ $epub_document_dir_path));
+ return 0;
+ }
- unless ($zip->writeToFileNamed($epub_outfile) == Archive::Zip->AZ_OK) {
+ my ($encoded_epub_outfile, $epub_outfile_encoding)
+ = $self->encoded_output_file_name($epub_outfile);
+ unless ($zip->writeToFileNamed($encoded_epub_outfile) == Archive::Zip->AZ_OK) {
$self->document_error($self,
sprintf(__("epub3.pm: error writing archive %s"),
$epub_outfile));
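
Note on the pattern used throughout this commit: file system and archive calls receive encoded bytes, while error messages keep the character string. A minimal standalone sketch of that idea follows, assuming only the core Encode, File::Temp and File::Spec modules. The helper encode_file_name is hypothetical and merely stands in for the converter's encoded_output_file_name method; it is not the Texinfo implementation.

```perl
use strict;
use warnings;
use utf8;                      # the source below contains a literal é
use Encode qw(encode);
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper (not the Texinfo method): turn a character
# string into the bytes that are handed to the file system.
sub encode_file_name {
  my ($name, $encoding) = @_;
  # encode() may modify its argument when a CHECK value is given,
  # so work on a copy.  FB_CROAK dies on unrepresentable characters.
  my $copy = $name;
  return encode($encoding, $copy, Encode::FB_CROAK);
}

my $name = 'osé.opf';
my $utf8_bytes   = encode_file_name($name, 'utf-8');
my $latin1_bytes = encode_file_name($name, 'iso-8859-1');

# é takes two bytes in UTF-8 and one byte in ISO-8859-1.
printf "%d %d\n", length($utf8_bytes), length($latin1_bytes);  # prints "8 7"

# The file system call gets the encoded bytes; the message would use
# the character string.
my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, $utf8_bytes);
open(my $fh, '>', $path) or die "could not create $name: $!";
close($fh);
```

With LOCALE_OUTPUT_FILE_NAME_ENCODING forced to utf-8 as in the diff above, the UTF-8 branch would be the one that applies to EPUB output regardless of the locale.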