[groff] 01/01: Partially revert previous preconv change.

From:	G. Branden Robinson
Subject:	[groff] 01/01: Partially revert previous preconv change.
Date:	Fri, 8 May 2020 17:57:46 -0400 (EDT)
gbranden pushed a commit to branch master
in repository groff.

commit 7add969faba8c1d91385bb74a1fd99554201b57d
Author: G. Branden Robinson <address@hidden>
AuthorDate: Fri May 8 18:05:30 2020 +1000

    Partially revert previous preconv change.
    
    The implementation was not completely baked, and some objected to the
    feature on principle.
    
    '...there is a saying, "If a thing is not worth doing, it is not worth
    doing well."' -- Carol J. Loomis
    
    * src/preproc/preconv/preconv.cpp: Revert logic changes.
    
    * src/preproc/preconv/tests/smoke-test.sh: Test each of the steps in the
      detection algorithm.
    
    * src/preproc/preconv/preconv.1.man:
      + Note which detection methods don't work on unseekable input (pipes).
      + Offer recommendations for those struggling with encoding detection.
      + Fix which/that usage problems.
      + Add cross-references to iconv(3) and locale(7) man pages.
---
 ChangeLog                                          |  44 +++-----
 NEWS                                               |  12 --
 src/preproc/preconv/preconv.1.man                  | 115 +++++++++----------
 src/preproc/preconv/preconv.am                     |   2 +-
 src/preproc/preconv/preconv.cpp                    | 124 +--------------------
 src/preproc/preconv/tests/late_coding_tags_work.sh |  44 --------
 src/preproc/preconv/tests/smoke-test.sh            |  68 +++++++++++
 7 files changed, 151 insertions(+), 258 deletions(-)
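
The commit message notes that some detection methods do not work on unseekable input such as pipes. As a minimal sketch of how a program can tell the two cases apart on a POSIX system, the helper below asks the kernel whether the descriptor supports repositioning. The name `is_seekable` is hypothetical and does not appear in preconv.cpp; this patch does not show how preconv itself makes the decision.

```cpp
#include <cstdio>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper, not taken from preconv: report whether the file
// descriptor behind `fp` supports repositioning.  On POSIX systems
// lseek() fails (with errno set to ESPIPE) for pipes, FIFOs, and
// sockets, and succeeds for regular files.
bool is_seekable(FILE *fp)
{
  int fd = fileno(fp);
  if (fd < 0)
    return false;
  // Ask for the current offset without moving it.
  return lseek(fd, 0, SEEK_CUR) != (off_t) -1;
}
```

A caller could use such a check to skip any detection step that needs to rewind the stream and go straight to the fallbacks that do not read ahead.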

diff --git a/ChangeLog b/ChangeLog
index f0a77bd..93827d8 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -23,38 +23,28 @@
 
 2020-05-06  G. Branden Robinson <address@hidden>
 
-       preconv: Support Emacs local variable lists at ends of files.
-
-       * src/preproc/preconv/preconv.cpp (get_tag_lines): Rename to...
-       (get_early_tag_lines): ...this.
-       (get_late_coding_tag): Add new function.  Search last 3000 bytes
-       {or region after last form-feed control} of file for "coding:"
-       within a region bracketed by "Local Variables:" and "End:".
-       Give up on seek, read, or memory allocation failures.
-       (check_coding_tag): Rename to...
-       (check_early_coding_tag): ...this.  Call newly-named
-       get_early_tag_lines().  Update comments.
-       (check_coding_tag): Add new function.  Try get_late_coding_tag()
-       first, then fall back to check_early_coding_tag().
+       Undocument plans to support end-of-file GNU Emacs coding tags.
+
+       * src/preproc/preconv/preconv.cpp (check_coding_tag):  Update
+       comments.
        (detect_file_encoding): Alter debugging output so it's easier to
        grep and verify Emacs coding tag detection.
 
        * src/preproc/preconv/preconv.1.man (Bugs): Delete; its sole
        concern was the absence of this feature.
-       (Usage): Document alterations to algorithm.
-       (Usage/Coding Tags): Add discussion of "late" (in the file)
-       coding tags.  Restyle early tag example.  Stop manipulating
-       adjustment.  Use hyphen-minus (\- escape) characters in coding
-       tag names, since they are literals that one might paste into an
-       editor window.
-
-       Stop referencing XEmacs, whose development is moribund as far as
-       I know.
-
-       Add "us-ascii" coding tag to page; while not strictly necessary,
-       it facilitates testing (see below).
-
-       * src/preproc/preconv/tests/late_coding_tags_work.sh: Test.
+       (Usage): Document detection algorithm in more detail.  Note
+       which detection methods don't work on unseekable input (pipes).
+       Offer recommendations for those struggling with encoding
+       detection.
+       (Usage/Coding Tags): Stop manipulating line adjustment.  Use
+       hyphen-minus (\- escape) characters in coding tag names, since
+       they are literals that one might copy and paste.  Stop
+       referencing XEmacs, whose development appears moribund.
+       (See Also): Add cross-references to iconv(3) and locale(7) man
+       pages.
+
+       * src/preproc/preconv/tests/smoke-test.sh: Test each of the
+       steps in the detection algorithm.
        * src/preproc/preconv/preconv.am: Run test.  Wrap long lines.
 
 2020-05-05  G. Branden Robinson <address@hidden>
diff --git a/NEWS b/NEWS
index 1c0715e..db4ddc6 100644
--- a/NEWS
+++ b/NEWS
@@ -86,18 +86,6 @@ o The new option -V emits the constructed groff command that nroff would
   prompt; this is a historical deficiency of the Bourne shell family not
   yet corrected by the POSIX standard.
 
-Preconv
--------
-
-o preconv now supports coding tgs in "late" GNU Emacs file-local
-  variable regions, that is, those which appear at ends of files.  If a
-  valid coding tag is found, one in the "early" style is not consulted.
-  Example:
-    .\" Local Variables:
-    .\" coding: utf-8
-    .\" mode: nroff
-    .\" End:
-
 Macro Packages
 --------------
 
diff --git a/src/preproc/preconv/preconv.1.man b/src/preproc/preconv/preconv.1.man
index 2f4c261..357ecaf 100644
--- a/src/preproc/preconv/preconv.1.man
+++ b/src/preproc/preconv/preconv.1.man
@@ -145,47 +145,41 @@ check whether the input starts with a Unicode Byte Order Mark
 (BOM,
 see below).
 .
-If found, use it.
+If found,
+use it.
 .
 .
 .IP 3.
 Otherwise,
-check whether there is a recognized Emacs coding tag
+if the input stream is seekable,
+check whether there is a recognized GNU\~Emacs coding tag
 (see below)
-in a file-local variables region at the end of the file.
+in either the first or second input line.
 .
-If found, use it.
+If found,
+use it.
 .
 .
 .IP 4.
 Otherwise,
-check whether there is a recognized Emacs coding tag in either the first
-or second input line.
-.
-If found, use it.
-.
-.
-.IP 5.
-Otherwise,
+if the input stream is seekable,
 if the
 .I uchardet
-library
-(a character-encoding detector library available on most major
-distributions)
-is available on the system,
+library is available on the system,
 use it to try to infer the encoding of the file.
 .
 .
-.IP 6.
+.IP 5.
 If
 .I uchardet
 fails,
-use the encoding specified by the
+and the
 .B \-D
-option.
+option specifies an encoding,
+use it.
 .
 .
-.IP 7.
+.IP 6.
 Use the encoding specified by the current locale
 .RI ( LC_CTYPE ),
 unless the locale is
@@ -198,11 +192,29 @@ as the input file encoding.
 .
 .
 .PP
-Note that the
-.B groff
-program supports a
+Note that the coding tag and
+.I uchardet
+methods in the above procedure rely upon a seekable input stream;
+when
+.I preconv
+reads from a pipe,
+the stream is not seekable,
+and these detection methods are skipped.
+.
+If character encoding detection of your input files is unreliable,
+arrange for one of the other methods to succeed by using
+.IR preconv 's
+.B \-D
+or
+.B \-e
+options,
+or by configuring your locale appropriately.
+.
+Furthermore,
+.I groff
+supports a
 .I \%GROFF_ENCODING
-environment variable which is eventually expanded to option
+environment variable which is equivalent to its option
 .BR \-k .
 .
 .
@@ -241,53 +253,36 @@ space\[cq] character \[en] something not needed normally in
 .SS "Coding tags"
 .\" ====================================================================
 .
-Text editors which support more than a single character encoding need
+Text editors that support more than a single character encoding need
 tags within the input files to mark the file's encoding.
 .
 While it is possible to guess the right input encoding with the help of
-heuristics which are reliable for a preponderance of natural language
+heuristics that are reliable for a preponderance of natural language
 texts,
-it is still just a guess.
+they are not absolutely reliable.
 .
-Additionally,
-heuristics can fail on inputs that are too short or don't represent a
+Heuristics can fail on inputs that are too short or don't represent a
 natural language.
 .
 .
 .PP
-For these reasons,
+Consequently,
 .I preconv
 supports the coding tag convention
 (with some restrictions)
 used by GNU\~Emacs.
 .
+Coding tags in GNU\~Emacs are indicated in specially-marked regions of
+an input file designated for \[lq]file-local variables\[rq].
 .
-.PP
-Coding tags in GNU Emacs are indicated in specially-marked regions of an
-input file designated for \[lq]file-local variables\[rq].
 .
+.PP
 .I preconv
-recognizes two syntax forms which should be put into
+recognizes the following syntax form if it occurs in a
 .I roff
-comments.
+comment
+in the first or second line of the input file.
 .
-The fist must be placed within the last 3,000 bytes of the file,
-and must come after the last
-(if any)
-form-feed control character.
-.
-.RS
-.EX
-\&.\[rs]" Local Variables:
-\&.\[rs]" coding: \c
-.I encoding
-\&.\[rs]" End:
-.EE
-.RE
-.
-.
-.PP
-The other form must occur within the first two lines of the file.
 .
 .RS
 .EX
@@ -302,7 +297,14 @@ The other form must occur within the first two lines of the file.
 .
 .
 .PP
-The following list gives all MIME coding tags
+The only tag
+.I preconv
+interprets is \[lq]coding\[rq],
+which can take the values listed below.
+.
+.
+.PP
+The following list comprises all MIME \[lq]charset\[rq] tags
 (either lowercase or uppercase)
 supported by
 .IR preconv .
@@ -346,7 +348,7 @@ Trailing
 and
 \[lq]\-mac\[rq]
 suffixes on coding tags
-(which give the end-of-line convention used in the file)
+(which indicate the end-of-line convention used in the file)
 are disregarded for the purpose of comparison with the above tags.
 .
 .
@@ -371,7 +373,9 @@ is used.
 .SH "See Also"
 .\" ====================================================================
 .
-.IR groff (@MAN1EXT@)
+.IR groff (@MAN1EXT@),
+.IR iconv (3),
+.IR locale (7)
 .
 .
 .\" Restore compatibility mode (for, e.g., Solaris 10/11).
@@ -379,7 +383,6 @@ is used.
 .
 .
 .\" Local Variables:
-.\" coding: us-ascii
 .\" mode: nroff
 .\" End:
 .\" vim: set filetype=groff:
diff --git a/src/preproc/preconv/preconv.am b/src/preproc/preconv/preconv.am
index 7fd7046..02313f8 100644
--- a/src/preproc/preconv/preconv.am
+++ b/src/preproc/preconv/preconv.am
@@ -24,7 +24,7 @@ man1_MANS += src/preproc/preconv/preconv.1
 EXTRA_DIST += src/preproc/preconv/preconv.1.man
 
 preconv_TESTS = \
-  src/preproc/preconv/tests/late_coding_tags_work.sh
+  src/preproc/preconv/tests/smoke-test.sh
 TESTS += $(preconv_TESTS)
 
 
diff --git a/src/preproc/preconv/preconv.cpp b/src/preproc/preconv/preconv.cpp
index a6e1b00..62c0f4d 100644
--- a/src/preproc/preconv/preconv.cpp
+++ b/src/preproc/preconv/preconv.cpp
@@ -813,8 +813,8 @@ get_BOM(FILE *fp, string &BOM, string &data)
 // or NULL in case no coding tag can occur in the data
 // (which is stored unmodified in 'data').
 // ---------------------------------------------------------
-static char *
-get_early_tag_lines(FILE *fp, string &data)
+char *
+get_tag_lines(FILE *fp, string &data)
 {
   int newline_count = 0;
   int c, prev = -1;
@@ -934,111 +934,8 @@ get_variable_value_pair(char *d1, char **variable, char **value)
   return NULL;
 }
 
-// Get coding tag from Emacs local variables list at end of file.
-//
-// The region looks like this:
-//
-// Local Variables:
-// coding: latin-2
-// mode: nroff
-// End:
-//
-// Like Emacs, we search at most 3000 bytes from the end of the file, or
-// from the last form-feed control (^L) that occurs.
-//
-// Our string class doesn't support reverse searches so just use C
-// strings.
-static char *
-get_late_coding_tag(FILE *fp)
-{
-  char *coding_tag = NULL;
-  const int limit = 3000;
-  if (fseek(fp, 0, SEEK_END) != 0)
-    return NULL;
-  // Seek to `limit` bytes from the end of the buffer, or the beginning.
-  if (fseek(fp, -limit, SEEK_END) != 0)
-    if (errno == EINVAL)
-      rewind(fp);
-    else
-      return NULL;
-  char *tmpbuf = (char *) calloc(1, limit + 1 /* trailing '\0' */);
-  if (!tmpbuf) {
-    error("unable to allocate memory");
-    rewind(fp);
-    return NULL;
-  }
-  (void) fread(tmpbuf, 1, limit, fp);
-  if (ferror(fp)) {
-    error("file read error");
-    free(tmpbuf);
-    rewind(fp);
-    return NULL;
-  }
-  char *start = tmpbuf;
-  char *end = tmpbuf + strlen(tmpbuf);
-  char *ff = strrchr(tmpbuf, '\f');
-  if (ff)
-    start = ff;
-  // Find the _last_ occurrence of a local-variables section in the
-  // buffer, because the document might have Emacs file-local variables
-  // as a discussion topic, as our roff(7) man page does.
-  //
-  // strcasestr() is a GNU extension we're not using.  TODO: Gnulib has
-  // it, so we can have it, too.
-  char *lv = NULL, *nextlv = NULL;
-  const char lvstr[] = "Local Variables:";
-  // Declare these now because GCC 8 doesn't like `goto`s crossing them.
-  const char codingstr[] = "coding:";
-  // From here we must 'goto cleanup' to free our buffer and rewind the
-  // file position instead of returning early.
-  lv = strstr(start, lvstr);
-  if (!lv)
-    goto cleanup;
-  else
-    do {
-      start += strlen(lvstr);
-      nextlv = strstr(start, lvstr);
-      if (nextlv) {
-       lv = nextlv;
-       start = lv;
-      }
-    } while(nextlv);
-  end = strstr(start, "End:");
-  if (!end)
-    end = strstr(start, "end:");
-  if (!end)
-    goto cleanup;
-  // Tighten [start, end) bracket until only the coding string remains.
-  // Locate "coding:".
-  start = strstr(start, codingstr);
-  if (!start)
-    goto cleanup;
-  // Move past it.
-  start += strlen(codingstr);
-  // Skip horizontal whitespace.
-  while (strchr(" \t", *start))
-    start++;
-  // Find the next newline and advance the end pointer to it.
-  end = strchr(start, '\n');
-  if (!end)
-    end = strchr(start, '\r');
-  if (!end)
-    goto cleanup;
-  // Back up over any trailing whitespace.
-  do {
-    *end = '\0';
-    end--;
-  } while ((end > start) && strchr(" \t", *end));
-  if (start < end)
-    coding_tag = start;
-cleanup:
-  free(tmpbuf);
-  rewind(fp);
-  return coding_tag;
-}
-
 // ---------------------------------------------------------
-// Check for coding tag near the beginning of the read buffer.
+// Check coding tag in the read buffer.
 //
 // We search for the following line:
 //
@@ -1069,10 +966,10 @@ cleanup:
 // the algorithm.  This should work even with files encoded as
 // UTF-16 or UTF-32 (or its siblings) in most cases.
 // ---------------------------------------------------------
-static char *
-check_early_coding_tag(FILE *fp, string &data)
+char *
+check_coding_tag(FILE *fp, string &data)
 {
-  char *inbuf = get_early_tag_lines(fp, data);
+  char *inbuf = get_tag_lines(fp, data);
   char *lineend;
   for (char *p = inbuf; is_comment_line(p); p = lineend + 1) {
     if ((lineend = strchr(p, '\n')) == NULL)
@@ -1102,15 +999,6 @@ check_early_coding_tag(FILE *fp, string &data)
   return NULL;
 }
 
-static char *
-check_coding_tag(FILE *fp, string &data)
-{
-  char *tag = get_late_coding_tag(fp);
-  if (!tag)
-    tag = check_early_coding_tag(fp, data);
-  return tag;
-}
-
 char *
 detect_file_encoding(FILE *fp)
 {
diff --git a/src/preproc/preconv/tests/late_coding_tags_work.sh b/src/preproc/preconv/tests/late_coding_tags_work.sh
deleted file mode 100755
index d6020b9..0000000
--- a/src/preproc/preconv/tests/late_coding_tags_work.sh
+++ /dev/null
@@ -1,44 +0,0 @@
-#!/bin/sh
-#
-# Copyright (C) 2020 Free Software Foundation, Inc.
-#
-# This file is part of groff.
-#
-# groff is free software; you can redistribute it and/or modify it under
-# the terms of the GNU General Public License as published by the Free
-# Software Foundation, either version 3 of the License, or (at your
-# option) any later version.
-#
-# groff is distributed in the hope that it will be useful, but WITHOUT
-# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
-# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
-# for more details.
-#
-# You should have received a copy of the GNU General Public License
-# along with this program. If not, see <http://www.gnu.org/licenses/>.
-#
-
-# Ensure a predictable character encoding.
-export LC_ALL=C
-
-set -e
-
-preconv="${abs_top_builddir:-.}/preconv"
-
-# We do not find a coding tag on piped input because it isn't seekable.
-echo "testing preconv on document read from pipe" >&2
-"$preconv" -d 2>&1 > /dev/null <<EOF | grep "no coding tag"
-abc
-EOF
-
-# Instead of using temporary files, which in all fastidiousness means
-# cleaning them up even if we're interrupted, which in turn means
-# setting up signal handlers, we use files in the build tree.
-
-doc=contrib/mm/mmroff.1
-echo "testing preconv on Latin-1 document $doc" >&2
-"$preconv" -d 2>&1 > /dev/null $doc | grep "coding tag: 'latin-1'"
-
-doc=src/preproc/preconv/preconv.1
-echo "testing preconv on US-ASCII document $doc" >&2
-"$preconv" -d 2>&1 > /dev/null $doc | grep "coding tag: 'us-ascii'"
diff --git a/src/preproc/preconv/tests/smoke-test.sh b/src/preproc/preconv/tests/smoke-test.sh
new file mode 100755
index 0000000..bd9343f
--- /dev/null
+++ b/src/preproc/preconv/tests/smoke-test.sh
@@ -0,0 +1,68 @@
+#!/bin/sh
+#
+# Copyright (C) 2020 Free Software Foundation, Inc.
+#
+# This file is part of groff.
+#
+# groff is free software; you can redistribute it and/or modify it under
+# the terms of the GNU General Public License as published by the Free
+# Software Foundation, either version 3 of the License, or (at your
+# option) any later version.
+#
+# groff is distributed in the hope that it will be useful, but WITHOUT
+# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+# for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+#
+
+# Ensure a predictable character encoding.
+export LC_ALL=C
+
+set -e
+
+preconv="${abs_top_builddir:-.}/preconv"
+
+echo "testing -e flag override of BOM detection" >&2
+printf '\376\377foobar\n' \
+    | "$preconv" -d -e euc-kr 2>&1 > /dev/null \
+    | grep -q "no search for coding tag"
+
+echo "testing detection of (big-endian) BOM" >&2
+printf '\376\377foobar\n' \
+    | "$preconv" -d 2>&1 > /dev/null \
+    | grep -q "found BOM"
+
+# We do not find a coding tag on piped input because it isn't seekable.
+echo "testing detection of Emacs coding tag in piped input" >&2
+printf '.\\" -*- coding: euc-kr; -*-\\n' \
+    | "$preconv" -d 2>&1 >/dev/null \
+    | grep -q "no coding tag"
+
+# We need uchardet to work to get past this point.
+echo "testing uchardet detection of encoding" >&2
+"$preconv" -v | grep -q 'with uchardet support' || exit 77
+
+# Instead of using temporary files, which in all fastidiousness means
+# cleaning them up even if we're interrupted, which in turn means
+# setting up signal handlers, we use files in the build tree.
+
+doc=contrib/mm/groff_mmse.7
+echo "testing uchardet detection on Latin-1 document $doc" >&2
+"$preconv" -d -D us-ascii 2>&1 >/dev/null $doc \
+    | grep -q 'charset: ISO-8859-1'
+
+# uchardet can't seek on a pipe either.
+echo "testing uchardet detection on pipe (expect fallback to -D)" >&2
+printf 'Eat at the caf\351.\n' \
+    | "$preconv" -d -D euc-kr 2>&1 > /dev/null \
+    | grep -q "encoding used: 'EUC-KR'"
+
+# Fall back to the locale.  preconv assumes Latin-1 for C instead of
+# US-ASCII.
+echo "testing fallback to locale setting in environment" >&2
+printf 'Eat at the caf\351.\n' \
+    | "$preconv" -d 2>&1 > /dev/null \
+    | grep -q "encoding used: 'ISO-8859-1'"
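
The first two checks that the revised man page describes and the smoke test exercises (a byte order mark, then a coding tag in the first or second input line) can be sketched on an in-memory buffer as follows. This is an illustration under stated assumptions, not the code in preconv.cpp: the names `bom_encoding`, `tag_in_line`, and `coding_tag` are hypothetical, the encoding labels returned for each BOM are chosen for readability, and the sketch does not verify that the tag sits inside a roff comment as the man page requires.

```cpp
#include <cstddef>
#include <string>

// Byte order marks and the encodings they imply.  The 4-byte UTF-32
// marks come first because the UTF-32LE mark begins with the same two
// bytes as the UTF-16LE mark.
struct Bom { const char *bytes; std::size_t len; const char *name; };
static const Bom boms[] = {
  {"\x00\x00\xFE\xFF", 4, "UTF-32BE"},
  {"\xFF\xFE\x00\x00", 4, "UTF-32LE"},
  {"\xEF\xBB\xBF", 3, "UTF-8"},
  {"\xFE\xFF", 2, "UTF-16BE"},
  {"\xFF\xFE", 2, "UTF-16LE"},
};

// Hypothetical helper: return the encoding implied by a leading BOM,
// or an empty string if the data does not start with one.
std::string bom_encoding(const std::string &data)
{
  for (const Bom &b : boms)
    if (data.size() >= b.len && data.compare(0, b.len, b.bytes, b.len) == 0)
      return b.name;
  return "";
}

static bool is_separator(char c)
{
  return c == ' ' || c == '\t' || c == ';';
}

// Hypothetical helper: extract the value of "coding:" from a single
// line of the form "... -*- var: value; coding: NAME -*- ...".
static std::string tag_in_line(const std::string &line)
{
  const std::string marker = "-*-";
  std::string::size_type open = line.find(marker);
  if (open == std::string::npos)
    return "";
  std::string::size_type close = line.find(marker, open + marker.size());
  if (close == std::string::npos)
    return "";
  std::string body = line.substr(open + marker.size(),
                                 close - open - marker.size());
  const std::string key = "coding:";
  std::string::size_type k = body.find(key);
  if (k == std::string::npos)
    return "";
  // Reject matches inside a longer variable name such as "encoding:".
  if (k > 0 && !is_separator(body[k - 1]))
    return "";
  std::string::size_type v = k + key.size();
  while (v < body.size() && (body[v] == ' ' || body[v] == '\t'))
    v++;
  std::string::size_type e = v;
  while (e < body.size() && !is_separator(body[e]))
    e++;
  return body.substr(v, e - v);
}

// Look for a coding tag in the first or second line only.
std::string coding_tag(const std::string &data)
{
  std::string::size_type pos = 0;
  for (int line = 0; line < 2 && pos < data.size(); line++) {
    std::string::size_type eol = data.find('\n', pos);
    std::string text = (eol == std::string::npos)
      ? data.substr(pos) : data.substr(pos, eol - pos);
    std::string tag = tag_in_line(text);
    if (!tag.empty())
      return tag;
    if (eol == std::string::npos)
      break;
    pos = eol + 1;
  }
  return "";
}
```

With these helpers, a buffer beginning with the bytes 0xFE 0xFF yields `UTF-16BE` from `bom_encoding`, and a first line such as `.\" -*- coding: euc-kr; -*-` yields `euc-kr` from `coding_tag`; a tag on the third line or later is ignored.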