[groff] 01/01: Partially revert previous preconv change.
From: G. Branden Robinson
Subject: [groff] 01/01: Partially revert previous preconv change.
Date: Fri, 8 May 2020 17:57:46 -0400 (EDT)
gbranden pushed a commit to branch master
in repository groff.
commit 7add969faba8c1d91385bb74a1fd99554201b57d
Author: G. Branden Robinson <address@hidden>
AuthorDate: Fri May 8 18:05:30 2020 +1000
Partially revert previous preconv change.
The implementation was not completely baked, and some objected to the
feature on principle.
'...there is a saying, "If a thing is not worth doing, it is not worth
doing well."' -- Carol J. Loomis
* src/preproc/preconv/preconv.cpp: Revert logic changes.
* src/preproc/preconv/tests/smoke-test.sh: Test each of the steps in the
detection algorithm.
* src/preproc/preconv/preconv.1.man:
+ Note which detection methods don't work on unseekable input (pipes).
+ Offer recommendations for those struggling with encoding detection.
+ Fix which/that usage problems.
+ Add cross-references to iconv(3) and locale(7) man pages.
---
ChangeLog | 44 +++-----
NEWS | 12 --
src/preproc/preconv/preconv.1.man | 115 +++++++++----------
src/preproc/preconv/preconv.am | 2 +-
src/preproc/preconv/preconv.cpp | 124 +--------------------
src/preproc/preconv/tests/late_coding_tags_work.sh | 44 --------
src/preproc/preconv/tests/smoke-test.sh | 68 +++++++++++
7 files changed, 151 insertions(+), 258 deletions(-)
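Reader's note: the commit message and the revised man page state that the coding tag and uchardet steps are skipped when the input stream is not seekable, as with a pipe. The following is a minimal shell sketch of that distinction, not preconv's own logic. `stdin_kind` is a hypothetical helper that approximates seekability by the file type of standard input, and it assumes a system that provides /dev/stdin (as Linux does).

```shell
#!/bin/sh
# Hypothetical helper (not part of groff): report what kind of object
# standard input is.  File type is only a proxy for seekability; preconv
# itself works on the open stream, not on /dev/stdin.
stdin_kind() {
  if [ -p /dev/stdin ]; then
    echo "pipe"            # FIFO or pipe: not seekable
  elif [ -f /dev/stdin ]; then
    echo "regular file"    # seekable
  else
    echo "other"           # terminal, socket, device, ...
  fi
}

# A document with a coding tag, delivered through a pipe.
printf '.\\" -*- coding: euc-kr; -*-\n' | stdin_kind
```

When the helper reports a pipe, the man page's advice applies: name the encoding with -e, supply a default with -D, or save the input to a file before running preconv on it.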
diff --git a/ChangeLog b/ChangeLog
index f0a77bd..93827d8 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -23,38 +23,28 @@
2020-05-06 G. Branden Robinson <address@hidden>
- preconv: Support Emacs local variable lists at ends of files.
-
- * src/preproc/preconv/preconv.cpp (get_tag_lines): Rename to...
- (get_early_tag_lines): ...this.
- (get_late_coding_tag): Add new function. Search last 3000 bytes
- {or region after last form-feed control} of file for "coding:"
- within a region bracketed by "Local Variables:" and "End:".
- Give up on seek, read, or memory allocation failures.
- (check_coding_tag): Rename to...
- (check_early_coding_tag): ...this. Call newly-named
- get_early_tag_lines(). Update comments.
- (check_coding_tag): Add new function. Try get_late_coding_tag()
- first, then fall back to check_early_coding_tag().
+ Undocument plans to support end-of-file GNU Emacs coding tags.
+
+ * src/preproc/preconv/preconv.cpp (check_coding_tag): Update
+ comments.
(detect_file_encoding): Alter debugging output so it's easier to
grep and verify Emacs coding tag detection.
* src/preproc/preconv/preconv.1.man (Bugs): Delete; its sole
concern was the absence of this feature.
- (Usage): Document alterations to algorithm.
- (Usage/Coding Tags): Add discussion of "late" (in the file)
- coding tags. Restyle early tag example. Stop manipulating
- adjustment. Use hyphen-minus (\- escape) characters in coding
- tag names, since they are literals that one might paste into an
- editor window.
-
- Stop referencing XEmacs, whose development is moribund as far as
- I know.
-
- Add "us-ascii" coding tag to page; while not strictly necessary,
- it facilitates testing (see below).
-
- * src/preproc/preconv/tests/late_coding_tags_work.sh: Test.
+ (Usage): Document detection algorithm in more detail. Note
+ which detection methods don't work on unseekable input (pipes).
+ Offer recommendations for those struggling with encoding
+ detection.
+ (Usage/Coding Tags): Stop manipulating line adjustment. Use
+ hyphen-minus (\- escape) characters in coding tag names, since
+ they are literals that one might copy and paste. Stop
+ referencing XEmacs, whose development appears moribund.
+ (See Also): Add cross-references to iconv(3) and locale(7) man
+ pages.
+
+ * src/preproc/preconv/tests/smoke-test.sh: Test each of the
+ steps in the detection algorithm.
* src/preproc/preconv/preconv.am: Run test. Wrap long lines.
2020-05-05 G. Branden Robinson <address@hidden>
diff --git a/NEWS b/NEWS
index 1c0715e..db4ddc6 100644
--- a/NEWS
+++ b/NEWS
@@ -86,18 +86,6 @@ o The new option -V emits the constructed groff command that nroff would
prompt; this is a historical deficiency of the Bourne shell family not
yet corrected by the POSIX standard.
-Preconv
--------
-
-o preconv now supports coding tgs in "late" GNU Emacs file-local
- variable regions, that is, those which appear at ends of files. If a
- valid coding tag is found, one in the "early" style is not consulted.
- Example:
- .\" Local Variables:
- .\" coding: utf-8
- .\" mode: nroff
- .\" End:
-
Macro Packages
--------------
diff --git a/src/preproc/preconv/preconv.1.man b/src/preproc/preconv/preconv.1.man
index 2f4c261..357ecaf 100644
--- a/src/preproc/preconv/preconv.1.man
+++ b/src/preproc/preconv/preconv.1.man
@@ -145,47 +145,41 @@ check whether the input starts with a Unicode Byte Order Mark
(BOM,
see below).
.
-If found, use it.
+If found,
+use it.
.
.
.IP 3.
Otherwise,
-check whether there is a recognized Emacs coding tag
+if the input stream is seekable,
+check whether there is a recognized GNU\~Emacs coding tag
(see below)
-in a file-local variables region at the end of the file.
+in either the first or second input line.
.
-If found, use it.
+If found,
+use it.
.
.
.IP 4.
Otherwise,
-check whether there is a recognized Emacs coding tag in either the first
-or second input line.
-.
-If found, use it.
-.
-.
-.IP 5.
-Otherwise,
+if the input stream is seekable,
if the
.I uchardet
-library
-(a character-encoding detector library available on most major
-distributions)
-is available on the system,
+library is available on the system,
use it to try to infer the encoding of the file.
.
.
-.IP 6.
+.IP 5.
If
.I uchardet
fails,
-use the encoding specified by the
+and the
.B \-D
-option.
+option specifies an encoding,
+use it.
.
.
-.IP 7.
+.IP 6.
Use the encoding specified by the current locale
.RI ( LC_CTYPE ),
unless the locale is
@@ -198,11 +192,29 @@ as the input file encoding.
.
.
.PP
-Note that the
-.B groff
-program supports a
+Note that the coding tag and
+.I uchardet
+methods in the above procedure rely upon a seekable input stream;
+when
+.I preconv
+reads from a pipe,
+the stream is not seekable,
+and these detection methods are skipped.
+.
+If character encoding detection of your input files is unreliable,
+arrange for one of the other methods to succeed by using
+.IR preconv 's
+.B \-D
+or
+.B \-e
+options,
+or by configuring your locale appropriately.
+.
+Furthermore,
+.I groff
+supports a
.I \%GROFF_ENCODING
-environment variable which is eventually expanded to option
+environment variable which is equivalent to its option
.BR \-k .
.
.
@@ -241,53 +253,36 @@ space\[cq] character \[en] something not needed normally in
.SS "Coding tags"
.\" ====================================================================
.
-Text editors which support more than a single character encoding need
+Text editors that support more than a single character encoding need
tags within the input files to mark the file's encoding.
.
While it is possible to guess the right input encoding with the help of
-heuristics which are reliable for a preponderance of natural language
+heuristics that are reliable for a preponderance of natural language
texts,
-it is still just a guess.
+they are not absolutely reliable.
.
-Additionally,
-heuristics can fail on inputs that are too short or don't represent a
+Heuristics can fail on inputs that are too short or don't represent a
natural language.
.
.
.PP
-For these reasons,
+Consequently,
.I preconv
supports the coding tag convention
(with some restrictions)
used by GNU\~Emacs.
.
+Coding tags in GNU\~Emacs are indicated in specially-marked regions of
+an input file designated for \[lq]file-local variables\[rq].
.
-.PP
-Coding tags in GNU Emacs are indicated in specially-marked regions of an
-input file designated for \[lq]file-local variables\[rq].
.
+.PP
.I preconv
-recognizes two syntax forms which should be put into
+recognizes the following syntax form if it occurs in a
.I roff
-comments.
+comment
+in the first or second line of the input file.
.
-The fist must be placed within the last 3,000 bytes of the file,
-and must come after the last
-(if any)
-form-feed control character.
-.
-.RS
-.EX
-\&.\[rs]" Local Variables:
-\&.\[rs]" coding: \c
-.I encoding
-\&.\[rs]" End:
-.EE
-.RE
-.
-.
-.PP
-The other form must occur within the first two lines of the file.
.
.RS
.EX
@@ -302,7 +297,14 @@ The other form must occur within the first two lines of the file.
.
.
.PP
-The following list gives all MIME coding tags
+The only tag
+.I preconv
+interprets is \[lq]coding\[rq],
+which can take the values listed below.
+.
+.
+.PP
+The following list comprises all MIME \[lq]charset\[rq] tags
(either lowercase or uppercase)
supported by
.IR preconv .
@@ -346,7 +348,7 @@ Trailing
and
\[lq]\-mac\[rq]
suffixes on coding tags
-(which give the end-of-line convention used in the file)
+(which indicate the end-of-line convention used in the file)
are disregarded for the purpose of comparison with the above tags.
.
.
@@ -371,7 +373,9 @@ is used.
.SH "See Also"
.\" ====================================================================
.
-.IR groff (@MAN1EXT@)
+.IR groff (@MAN1EXT@),
+.IR iconv (3),
+.IR locale (7)
.
.
.\" Restore compatibility mode (for, e.g., Solaris 10/11).
@@ -379,7 +383,6 @@ is used.
.
.
.\" Local Variables:
-.\" coding: us-ascii
.\" mode: nroff
.\" End:
.\" vim: set filetype=groff:
diff --git a/src/preproc/preconv/preconv.am b/src/preproc/preconv/preconv.am
index 7fd7046..02313f8 100644
--- a/src/preproc/preconv/preconv.am
+++ b/src/preproc/preconv/preconv.am
@@ -24,7 +24,7 @@ man1_MANS += src/preproc/preconv/preconv.1
EXTRA_DIST += src/preproc/preconv/preconv.1.man
preconv_TESTS = \
- src/preproc/preconv/tests/late_coding_tags_work.sh
+ src/preproc/preconv/tests/smoke-test.sh
TESTS += $(preconv_TESTS)
diff --git a/src/preproc/preconv/preconv.cpp b/src/preproc/preconv/preconv.cpp
index a6e1b00..62c0f4d 100644
--- a/src/preproc/preconv/preconv.cpp
+++ b/src/preproc/preconv/preconv.cpp
@@ -813,8 +813,8 @@ get_BOM(FILE *fp, string &BOM, string &data)
// or NULL in case no coding tag can occur in the data
// (which is stored unmodified in 'data').
// ---------------------------------------------------------
-static char *
-get_early_tag_lines(FILE *fp, string &data)
+char *
+get_tag_lines(FILE *fp, string &data)
{
int newline_count = 0;
int c, prev = -1;
@@ -934,111 +934,8 @@ get_variable_value_pair(char *d1, char **variable, char **value)
return NULL;
}
-// Get coding tag from Emacs local variables list at end of file.
-//
-// The region looks like this:
-//
-// Local Variables:
-// coding: latin-2
-// mode: nroff
-// End:
-//
-// Like Emacs, we search at most 3000 bytes from the end of the file, or
-// from the last form-feed control (^L) that occurs.
-//
-// Our string class doesn't support reverse searches so just use C
-// strings.
-static char *
-get_late_coding_tag(FILE *fp)
-{
- char *coding_tag = NULL;
- const int limit = 3000;
- if (fseek(fp, 0, SEEK_END) != 0)
- return NULL;
- // Seek to `limit` bytes from the end of the buffer, or the beginning.
- if (fseek(fp, -limit, SEEK_END) != 0)
- if (errno == EINVAL)
- rewind(fp);
- else
- return NULL;
- char *tmpbuf = (char *) calloc(1, limit + 1 /* trailing '\0' */);
- if (!tmpbuf) {
- error("unable to allocate memory");
- rewind(fp);
- return NULL;
- }
- (void) fread(tmpbuf, 1, limit, fp);
- if (ferror(fp)) {
- error("file read error");
- free(tmpbuf);
- rewind(fp);
- return NULL;
- }
- char *start = tmpbuf;
- char *end = tmpbuf + strlen(tmpbuf);
- char *ff = strrchr(tmpbuf, '\f');
- if (ff)
- start = ff;
- // Find the _last_ occurrence of a local-variables section in the
- // buffer, because the document might have Emacs file-local variables
- // as a discussion topic, as our roff(7) man page does.
- //
- // strcasestr() is a GNU extension we're not using. TODO: Gnulib has
- // it, so we can have it, too.
- char *lv = NULL, *nextlv = NULL;
- const char lvstr[] = "Local Variables:";
- // Declare these now because GCC 8 doesn't like `goto`s crossing them.
- const char codingstr[] = "coding:";
- // From here we must 'goto cleanup' to free our buffer and rewind the
- // file position instead of returning early.
- lv = strstr(start, lvstr);
- if (!lv)
- goto cleanup;
- else
- do {
- start += strlen(lvstr);
- nextlv = strstr(start, lvstr);
- if (nextlv) {
- lv = nextlv;
- start = lv;
- }
- } while(nextlv);
- end = strstr(start, "End:");
- if (!end)
- end = strstr(start, "end:");
- if (!end)
- goto cleanup;
- // Tighten [start, end) bracket until only the coding string remains.
- // Locate "coding:".
- start = strstr(start, codingstr);
- if (!start)
- goto cleanup;
- // Move past it.
- start += strlen(codingstr);
- // Skip horizontal whitespace.
- while (strchr(" \t", *start))
- start++;
- // Find the next newline and advance the end pointer to it.
- end = strchr(start, '\n');
- if (!end)
- end = strchr(start, '\r');
- if (!end)
- goto cleanup;
- // Back up over any trailing whitespace.
- do {
- *end = '\0';
- end--;
- } while ((end > start) && strchr(" \t", *end));
- if (start < end)
- coding_tag = start;
-cleanup:
- free(tmpbuf);
- rewind(fp);
- return coding_tag;
-}
-
// ---------------------------------------------------------
-// Check for coding tag near the beginning of the read buffer.
+// Check coding tag in the read buffer.
//
// We search for the following line:
//
@@ -1069,10 +966,10 @@ cleanup:
// the algorithm. This should work even with files encoded as
// UTF-16 or UTF-32 (or its siblings) in most cases.
// ---------------------------------------------------------
-static char *
-check_early_coding_tag(FILE *fp, string &data)
+char *
+check_coding_tag(FILE *fp, string &data)
{
- char *inbuf = get_early_tag_lines(fp, data);
+ char *inbuf = get_tag_lines(fp, data);
char *lineend;
for (char *p = inbuf; is_comment_line(p); p = lineend + 1) {
if ((lineend = strchr(p, '\n')) == NULL)
@@ -1102,15 +999,6 @@ check_early_coding_tag(FILE *fp, string &data)
return NULL;
}
-static char *
-check_coding_tag(FILE *fp, string &data)
-{
- char *tag = get_late_coding_tag(fp);
- if (!tag)
- tag = check_early_coding_tag(fp, data);
- return tag;
-}
-
char *
detect_file_encoding(FILE *fp)
{
diff --git a/src/preproc/preconv/tests/late_coding_tags_work.sh b/src/preproc/preconv/tests/late_coding_tags_work.sh
deleted file mode 100755
index d6020b9..0000000
--- a/src/preproc/preconv/tests/late_coding_tags_work.sh
+++ /dev/null
@@ -1,44 +0,0 @@
-#!/bin/sh
-#
-# Copyright (C) 2020 Free Software Foundation, Inc.
-#
-# This file is part of groff.
-#
-# groff is free software; you can redistribute it and/or modify it under
-# the terms of the GNU General Public License as published by the Free
-# Software Foundation, either version 3 of the License, or (at your
-# option) any later version.
-#
-# groff is distributed in the hope that it will be useful, but WITHOUT
-# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
-# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
-# for more details.
-#
-# You should have received a copy of the GNU General Public License
-# along with this program. If not, see <http://www.gnu.org/licenses/>.
-#
-
-# Ensure a predictable character encoding.
-export LC_ALL=C
-
-set -e
-
-preconv="${abs_top_builddir:-.}/preconv"
-
-# We do not find a coding tag on piped input because it isn't seekable.
-echo "testing preconv on document read from pipe" >&2
-"$preconv" -d 2>&1 > /dev/null <<EOF | grep "no coding tag"
-abc
-EOF
-
-# Instead of using temporary files, which in all fastidiousness means
-# cleaning them up even if we're interrupted, which in turn means
-# setting up signal handlers, we use files in the build tree.
-
-doc=contrib/mm/mmroff.1
-echo "testing preconv on Latin-1 document $doc" >&2
-"$preconv" -d 2>&1 > /dev/null $doc | grep "coding tag: 'latin-1'"
-
-doc=src/preproc/preconv/preconv.1
-echo "testing preconv on US-ASCII document $doc" >&2
-"$preconv" -d 2>&1 > /dev/null $doc | grep "coding tag: 'us-ascii'"
diff --git a/src/preproc/preconv/tests/smoke-test.sh b/src/preproc/preconv/tests/smoke-test.sh
new file mode 100755
index 0000000..bd9343f
--- /dev/null
+++ b/src/preproc/preconv/tests/smoke-test.sh
@@ -0,0 +1,68 @@
+#!/bin/sh
+#
+# Copyright (C) 2020 Free Software Foundation, Inc.
+#
+# This file is part of groff.
+#
+# groff is free software; you can redistribute it and/or modify it under
+# the terms of the GNU General Public License as published by the Free
+# Software Foundation, either version 3 of the License, or (at your
+# option) any later version.
+#
+# groff is distributed in the hope that it will be useful, but WITHOUT
+# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+# for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+#
+
+# Ensure a predictable character encoding.
+export LC_ALL=C
+
+set -e
+
+preconv="${abs_top_builddir:-.}/preconv"
+
+echo "testing -e flag override of BOM detection" >&2
+printf '\376\377foobar\n' \
+ | "$preconv" -d -e euc-kr 2>&1 > /dev/null \
+ | grep -q "no search for coding tag"
+
+echo "testing detection of (big-endian) BOM" >&2
+printf '\376\377foobar\n' \
+ | "$preconv" -d 2>&1 > /dev/null \
+ | grep -q "found BOM"
+
+# We do not find a coding tag on piped input because it isn't seekable.
+echo "testing detection of Emacs coding tag in piped input" >&2
+printf '.\\" -*- coding: euc-kr; -*-\\n' \
+ | "$preconv" -d 2>&1 >/dev/null \
+ | grep -q "no coding tag"
+
+# We need uchardet to work to get past this point.
+echo "testing uchardet detection of encoding" >&2
+"$preconv" -v | grep -q 'with uchardet support' || exit 77
+
+# Instead of using temporary files, which in all fastidiousness means
+# cleaning them up even if we're interrupted, which in turn means
+# setting up signal handlers, we use files in the build tree.
+
+doc=contrib/mm/groff_mmse.7
+echo "testing uchardet detection on Latin-1 document $doc" >&2
+"$preconv" -d -D us-ascii 2>&1 >/dev/null $doc \
+ | grep -q 'charset: ISO-8859-1'
+
+# uchardet can't seek on a pipe either.
+echo "testing uchardet detection on pipe (expect fallback to -D)" >&2
+printf 'Eat at the caf\351.\n' \
+ | "$preconv" -d -D euc-kr 2>&1 > /dev/null \
+ | grep -q "encoding used: 'EUC-KR'"
+
+# Fall back to the locale. preconv assumes Latin-1 for C instead of
+# US-ASCII.
+echo "testing fallback to locale setting in environment" >&2
+printf 'Eat at the caf\351.\n' \
+ | "$preconv" -d 2>&1 > /dev/null \
+ | grep -q "encoding used: 'ISO-8859-1'"
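Reader's note: the first two checks in the smoke test feed preconv the bytes \376\377 (0xFE 0xFF), the big-endian UTF-16 byte order mark. As a rough illustration of what a BOM check involves, here is a sketch that classifies the leading bytes of standard input with od(1). `bom_name` and the labels it prints are illustrative assumptions; they are not preconv's function names or its debugging messages.

```shell
#!/bin/sh
# Hypothetical helper (not part of groff): name the Unicode byte order
# mark, if any, at the start of standard input.
bom_name() {
  # Dump up to four leading bytes as hex digits, without offsets or
  # whitespace.
  head_hex=$(od -An -tx1 -N4 | tr -d ' \t\n')
  case $head_hex in
    0000feff)  echo "UTF-32BE" ;;
    fffe0000)  echo "UTF-32LE" ;;   # must be tested before UTF-16LE
    efbbbf*)   echo "UTF-8" ;;
    feff*)     echo "UTF-16BE" ;;
    fffe*)     echo "UTF-16LE" ;;
    *)         echo "none" ;;
  esac
}

# The byte sequence used by the smoke test.
printf '\376\377foobar\n' | bom_name
```

The four-byte patterns are tested first because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian mark.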