branch master updated: Encode more file names for epub


From:	Patrice Dumas
Subject:	branch master updated: Encode more file names for epub
Date:	Sat, 05 Mar 2022 11:11:15 -0500
This is an automated email from the git hooks/post-receive script.

pertusus pushed a commit to branch master
in repository texinfo.

The following commit(s) were added to refs/heads/master by this push:
     new 2973fdcd7f Encode more file names for epub
2973fdcd7f is described below

commit 2973fdcd7f8c6d7a090f3d452dcfbc384561c47d
Author: Patrice Dumas <pertusus@free.fr>
AuthorDate: Sat Mar 5 17:11:02 2022 +0100

    Encode more file names for epub
    
    * tp/init/epub3.pm: set OUTPUT_ENCODING_NAME and
    LOCALE_OUTPUT_FILE_NAME_ENCODING to utf-8 as mandated by the
    specification.
    
    * tp/init/epub3.pm (epub_setup, epub_finish): encode file and
    directory names.  Check more errors from Archive::Zip.
---
 ChangeLog        | 11 ++++++++
 tp/TODO          | 80 ++++++++++++++++++++++++++++++++++++++------------------
 tp/init/epub3.pm | 74 +++++++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 128 insertions(+), 37 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index d271cc4570..4b4dcc1e0c 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,14 @@
+2022-03-05  Patrice Dumas  <pertusus@free.fr>
+
+       Encode more file names for epub
+
+       * tp/init/epub3.pm: set OUTPUT_ENCODING_NAME and
+       LOCALE_OUTPUT_FILE_NAME_ENCODING to utf-8 as mandated by the
+       specification.
+
+       * tp/init/epub3.pm (epub_setup, epub_finish): encode file and
+       directory names.  Check more errors from Archive::Zip.
+
 2022-03-04  Gavin Smith  <gavinsmith0123@gmail.com>
 
        File name encoding variables for XS parser
diff --git a/tp/TODO b/tp/TODO
index 69bfe26085..2ea2715a3a 100644
--- a/tp/TODO
+++ b/tp/TODO
@@ -22,6 +22,10 @@ for @example args, use *-user as class?
 Test of init/highlight_syntax.pm with non ascii characters in highlighted
 @example contents.
 
+Convert/Converter.pm
+move create_destination_directory error message out of create_destination_directory
+or pass directory as a character string, or encoding?
+
 
 bytes.  To check that they can never be upgraded + document
 * texi2any.pl
@@ -34,7 +38,7 @@ bytes.  To check that they can never be upgraded + document
 
 Tests in non utf8 locales
 
-Tests ok
+Tests with correct or acceptable results
 
 t/formats_encodings.t manual_simple_utf8_with_error
 utf8 manual with errors involving non ascii strings
@@ -72,38 +76,65 @@ Issue to add 'tests/included_lat'$'\356''n1.texi' in make dist
 tests/other manual_include_accented_file_name_latin1
 ./texi2any.pl --force -I tests/ tests/other/manual_include_accented_file_name_latin1.texi
 
+latin1 encoded and latex2html in latin1 locale
+./texi2any.pl --html --init init/latex2html.pm tests/tex_html/tex_encode_latin1.texi
+
+latin1 encoded and tex4ht in latin1 locale
+./texi2any.pl --html --init init/tex4ht.pm tests/tex_html/tex_encode_latin1.texi
+
+./texi2any.pl --html --init init/tex4ht.pm tex_encodé_latin1.texi
+Firefox can't find tex_encod%uFFFD_latin1_html/Chapter.html (?)
+Opened from within the directory, still can't find the image file:
+tex_encod%E9_latin1_html/tex_encod%C3%A9_latin1_tex4ht_tex0x.png
+The file names and file contents looks right, though, with latin1 only
+encoded characters.
+
+epub for utf8 encoded manual in latin1 locale
+./texi2any.pl --force -I tests/ --init init/epub3.pm tests/formatting/os*.texi
 
-rename with utf8
-#./texi2any.pl tests/tex_html/tex_encodé.texi
+epub for latin1 encoded manual in latin1 locale
+cp tests/tex_html/tex_encode_latin1.texi tex_encodé_latin1.texi
+./texi2any.pl --init init/epub3.pm tex_encodé_latin1.texi
 
-Not ok:
-./texi2any.pl --html --init init/latex2html.pm tests/tex_html/tex_encod??.texi
+
+Tests with incorrect results, though not bugs
+
+utf8 encoded manual name and latex2html in latin1 locale
+./texi2any.pl --html --init init/latex2html.pm tests/tex_html/tex_encod*_utf8.texi
  -> the input file name is decoded using DATA_INPUT_ENCODING_NAME
     which is set to ISO-8859-1, although the file name is in utf8.
-./texi2any.pl -c 'DATA_INPUT_ENCODING_NAME utf-8' --html --init init/latex2html.pm tests/tex_html/tex_encod??.texi
+./texi2any.pl -c 'DATA_INPUT_ENCODING_NAME utf-8' -c 'L2H_CLEAN 0' --html --init init/latex2html.pm tests/tex_html/tex_encod*_utf8.texi
+Errors out with
+utf8 "\xE9" does not map to Unicode at ./init/latex2html.pm line 515, <L2H_HTML> line 6.
+which corresponds to tex_encodé_utf8_html/tex_encodé_utf8_l2h.html
+<TITLE>tex_encodé_utf8_l2h</TITLE>, with the é encoded in latin1.
+Probably latex2html outputs the latin1 file names as it gets from the command
+line binary encoded.  There are also utf8 encoded characters in the file.
+
+latin1 encoded manual name and latex2html in latin1 locale
+cp tests/tex_html/tex_encode_latin1.texi tex_encodé_latin1.texi
+./texi2any.pl -c 'L2H_CLEAN 0' --html --init init/latex2html.pm tex_encodé_latin1.texi
 Errors out with
-# l2h: use tex_encodé_html/tex_encodé_l2h.html as html file
 utf8 "\xE9" does not map to Unicode at ./init/latex2html.pm line 515, <L2H_HTML> line 6.
-which corresponds to <TITLE>tex_encodé_l2h</TITLE>, with the é encoded in latin1.
+which corresponds to tex_encodé_latin1_html/tex_encodé_latin1_l2h.html
+<TITLE>tex_encodé_latin1_l2h</TITLE> with the é encoded in latin1.
 Probably latex2html outputs the latin1 file names as it gets from the command
-line.
+line binary encoded.  There are also utf8 encoded characters in the file.
+
+utf8 encoded manual name and tex4ht in latin1 locale
+./texi2any.pl --html --init init/tex4ht.pm tests/tex_html/tex_encod*_utf8.texi
+html file generated by tex4ht with content="text/html; charset=iso-8859-1">,
+with character encoded in utf8 <img src="tex_encodÃ©_utf8_tex4ht_tex0x.png" ...>
+firefox opens tex_encodÃ©_utf8_html/Chapter.html but does not find the image
+and shows a path like tex_encodé_utf8_html/tex_encodÃ©_utf8_tex4ht_tex0x.png
+mixing latin1 and utf8.
 
-additional tests TODO
-+ need to have file names/file content encoded in the non utf8 locale
-+ the test needs also to be checked in non utf8 locale
 
-tests/many_input_files/tex_l2h_output_dir_non_ascii.sh
-tests/many_input_files/tex_t4ht_output_dir_non_ascii.sh
-test l2h and tex4ht with non ascii file names in non utf8 locale.
-corresponds to tex_l2h_output_dir_non_ascii.sh
-tex_t4ht_output_dir_non_ascii.sh.
-Maybe with non utf8 input files too?
+Tests in utf8 locales.  The archive epub file is not tested in the automated tests.
+
+epub for utf8 encoded manual in utf8 locale
+./texi2any.pl --force -I tests/ --init init/epub3.pm tests/formatting/os*.texi
 
-epub tests would also be interesting in 8 byte locale, with
-8 byte documents.  In the EPUB specification it is said
-6.1.3 File Paths and File Names
- File Names and Paths MUST be UTF-8 [Unicode] encoded.
-First step would be to implement that.
 
 Test more interesting in non utf8 locale
 Add tests even if not as interesting in UTF8 locale as in non UTF8?
@@ -115,9 +146,6 @@ Texinfo/Convert/Text.pm output()
 checks decoded/encoded and fix.  Need to verify input available information
 
 
-Check Archive::Zip/EPUB_CREATE_CONTAINER in epub with non ascii
-
-
 Associated code to check, requires bytes in input both for directory and
 file name and return bytes
 locate_init_file
diff --git a/tp/init/epub3.pm b/tp/init/epub3.pm
index 9ac691743c..65a390bbee 100644
--- a/tp/init/epub3.pm
+++ b/tp/init/epub3.pm
@@ -37,6 +37,9 @@
 
 use strict;
 
+# for accented character in a comment
+use utf8;
+
 use File::Path;
 use File::Spec;
 use File::Copy;
@@ -63,7 +66,9 @@ texinfo_set_from_init_file('EPUB_CREATE_CONTAINER', 1);
 
 texinfo_set_format_from_init_file('html');
 
-# output valid XHTML
+# output valid XHTML as per the specification
+# Any Publication Resource that is an XML-Based Media Type MUST
+# be a conformant XML 1.0 Document ... MUST be encoded in UTF-8 or UTF-16.
 texinfo_set_from_init_file('HTML_ROOT_ELEMENT_ATTRIBUTES',
                            'xmlns="http://www.w3.org/1999/xhtml"');
 texinfo_set_from_init_file('NO_CUSTOM_HTML_ATTRIBUTE', 1);
@@ -71,6 +76,16 @@ texinfo_set_from_init_file('USE_XML_SYNTAX', 1);
 texinfo_set_from_init_file('DOCTYPE', '<?xml version="1.0" encoding="UTF-8"?>'."\n"
                                       .'<!DOCTYPE html>');
 texinfo_set_from_init_file('USE_NUMERIC_ENTITY', 1);
+texinfo_set_from_init_file('OUTPUT_ENCODING_NAME', 'utf-8');
+
+# this is actually the default
+texinfo_set_from_init_file('DOC_ENCODING_FOR_OUTPUT_FILE_NAME', 0);
+# the specification says "File Names and Paths MUST be UTF-8 [Unicode] encoded."
+# This is also needed for Archive::Zip in case there are non ascii
+# file name.
+# As a conséquence, the epub file file name is also always utf-8 encoded.
+texinfo_set_from_init_file('LOCALE_OUTPUT_FILE_NAME_ENCODING', 'utf-8');
+
 
 # the copiable anchor paragraph sign is always present and no link is
 # shown in the calibre epub reader.  Since it looks strange, unset.
@@ -172,7 +187,7 @@ sub epub_convert_image_command($$$$)
       if (! -d $encoded_images_destination_dir) {
         if (!mkdir($encoded_images_destination_dir, oct(755))) {
           $self->document_error($self, sprintf(__(
-                                 "could not create directory `%s': %s"),
+                             "could not create images directory `%s': %s"),
                                          $images_destination_dir, $!));
           return $result;
         }
@@ -294,7 +309,9 @@ sub epub_setup($)
   }
 
   my $err_remove_tree;
-  File::Path::remove_tree($epub_destination_directory,
+  my ($encoded_epub_destination_directory, $epub_destination_dir_encoding)
+    = $self->encoded_output_file_name($epub_destination_directory);
+  File::Path::remove_tree($encoded_epub_destination_directory,
                           {'error' => $err_remove_tree});
   if ($err_remove_tree and scalar(@$err_remove_tree)) {
     for my $diag (@$err_remove_tree) {
@@ -313,7 +330,9 @@ sub epub_setup($)
     return 0;
   }
   my $err_make_path;
-  File::Path::make_path($epub_document_destination_directory,
+  my ($encoded_epub_document_destination_directory, $epub_doc_dest_dir_encoding)
+    = $self->encoded_output_file_name($epub_document_destination_directory);
+  File::Path::make_path($encoded_epub_document_destination_directory,
                         {'mode' => 0755, 'error' => $err_make_path});
   if ($err_make_path and scalar(@$err_make_path)) {
     for my $diag (@$err_make_path) {
@@ -358,7 +377,7 @@ sub epub_finish($$)
     = $self->encoded_output_file_name($meta_inf_directory);
   if (!mkdir($encoded_meta_inf_directory, oct(755))) {
     $self->document_error($self, sprintf(__(
-                                 "could not create directory `%s': %s"),
+                   "could not create meta informations directory `%s': %s"),
                                          $meta_inf_directory, $!));
     return 0;
   }
@@ -691,14 +710,47 @@ EOT
   if ($self->get_conf('EPUB_CREATE_CONTAINER')) {
     require Archive::Zip;
 
+    # this is needed if there are non ascii file names, otherwise, for instance
+    # with calibre the files cannot be read, one get
+    # "There is no item named 'EPUB/osé.opf' in the archive"
+    # even though unzip -l lists the file well.  More testing is probably
+    # needed on other plaforms.
+    local $Archive::Zip::UNICODE = 1;
     my $zip = Archive::Zip->new();
-    $zip->addFile($mimetype_file_path_name, $mimetype_filename);
-    $zip->addTree($meta_inf_directory, $meta_inf_directory_name);
-    $zip->addTree(File::Spec->catdir($epub_destination_directory,
-                                     $epub_document_dir_name),
-                  $epub_document_dir_name);
+    my $mimetype_added
+      = $zip->addFile($encoded_mimetype_file_path_name, $mimetype_filename);
+    if (not(defined($mimetype_added))) {
+      $self->document_error($self,
+        sprintf(__("epub3.pm: error adding %s to archive"),
+               $mimetype_file_path_name));
+      return 0;
+    }
+
+    my $meta_inf_directory_ret_code
+      = $zip->addTree($encoded_meta_inf_directory, $meta_inf_directory_name);
+    if ($meta_inf_directory_ret_code != Archive::Zip->AZ_OK) {
+      $self->document_error($self,
+        sprintf(__("epub3.pm: error adding %s to archive"),
+               $meta_inf_directory));
+      return 0;
+    }
+
+    my $epub_document_dir_path = File::Spec->catdir($epub_destination_directory,
+                                                    $epub_document_dir_name);
+    my ($encoded_epub_document_dir_path, $epub_document_dir_path_encoding)
+      = $self->encoded_output_file_name($epub_document_dir_path);
+    my $epub_document_dir_name_ret_code
+      = $zip->addTree($encoded_epub_document_dir_path, $epub_document_dir_name);
+    if ($epub_document_dir_name_ret_code != Archive::Zip->AZ_OK) {
+      $self->document_error($self,
+        sprintf(__("epub3.pm: error adding %s to archive"),
+               $epub_document_dir_path));
+      return 0;
+    }
 
-    unless ($zip->writeToFileNamed($epub_outfile) == Archive::Zip->AZ_OK) {
+    my ($encoded_epub_outfile, $epub_outfile_encoding)
+      = $self->encoded_output_file_name($epub_outfile);
+    unless ($zip->writeToFileNamed($encoded_epub_outfile) == Archive::Zip->AZ_OK) {
       $self->document_error($self,
            sprintf(__("epub3.pm: error writing archive %s"),
                    $epub_outfile));
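
Illustration (not part of the commit): the pattern applied throughout
this change is to hand the encoded bytes of a file name to file system
calls while keeping the character string for error messages.  The
following is only a minimal sketch of that idea, assuming UTF-8 as the
target encoding.  The helper encode_file_name is hypothetical and is not
the converter's encoded_output_file_name method, whose implementation is
not shown in this diff.

```perl
use strict;
use warnings;
use Encode ();
use File::Spec;
use File::Temp ();

# Hypothetical helper (an assumption for illustration only): return the
# bytes to pass to the file system and the encoding name that was used.
sub encode_file_name {
  my ($character_name, $encoding) = @_;
  $encoding = 'utf-8' unless defined($encoding);
  my $bytes = Encode::encode($encoding, $character_name);
  return ($bytes, $encoding);
}

# Character string with an accented letter, written with an escape so
# that the example does not depend on the encoding of this source text.
my $name = "os\x{e9}.opf";
my ($encoded_name, $encoding) = encode_file_name($name);

# Use the encoded bytes for file system calls ...
my $tmpdir = File::Temp::tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($tmpdir, $encoded_name);
open(my $fh, '>', $path) or die "could not create file: $!";
close($fh);

# ... and keep the character string for counting or for messages.
my $created = (-e $path) ? 1 : 0;
my $char_length = length($name);          # length in characters
my $byte_length = length($encoded_name);  # length in UTF-8 bytes
```

The two lengths differ because the accented letter takes two bytes in
UTF-8, which is why the same name cannot be used unchanged both for
display and for file system access.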