groff-commit

[groff] 07/07: [grog]: Refactor input parsing.


From: G. Branden Robinson
Subject: [groff] 07/07: [grog]: Refactor input parsing.
Date: Wed, 30 Jun 2021 05:48:22 -0400 (EDT)

gbranden pushed a commit to branch master
in repository groff.

commit b0de53c923bfd77191157f6caacff984e8ca5e82
Author: G. Branden Robinson <g.branden.robinson@gmail.com>
AuthorDate: Wed Jun 30 19:30:00 2021 +1000

    [grog]: Refactor input parsing.
    
    * src/utils/grog/grog.pl:
      - Add scalar `use_compatibility_mode` (see below).
      - Add list `request` to store the names of all requests recognized by
        groff so that they aren't confused with macro names.
      - Add scalars `have_seen_first_macro_call` (replaces
        `before_first_command`, but at global scope), `is_continued_line`
        and `logical_line`.  The latter two enable us to handle *roff input
        line continuation correctly.
    
      (process_arguments): Set `use_compatibility_mode` if `-C` option
      specified.
    
      (process_input): Refactor to greatly simplify, to not attempt to read
      the first line of an input file as a special case, and to avoid
      sending `do_line` an undefined argument (when EOF is reached).
    
      (do_first_line): Delete.
    
      (do_line): Rewrite the early stages of input parsing.
      - Concatenate continued input lines, setting `is_continued_line` and
        returning early as each one is seen, storing the accumulating input
        in `logical_line`.
      - Check the input line for the form of comment deposited by Perl's
        Pod::Man, which uses a highly accented dialect of man(7); if it's
        present, inflate `man_score` to compensate for the page-private `IX`
        macro it defines but which duplicates the name and function of a
        4.2BSD-era ms(7) extension that would otherwise deceive our scoring
        mechanism, because Pod::Man produces `IX` calls to metastatic
        excess.  (An alternative to this kludge is documented in comments:
        if a "standard" macro is redefined, we could delete it from the
        relevant lists and hashes.)
      - Strip *roff comments from input.
      - Normalize control lines; convert the no-break control character to
        the regular one and remove unnecessary white space.
      - Remove brace escapes.
      - Recognize two-character macro calls when not followed by white space
        in compatibility mode.
      - Drop logic that erroneously attempted to infer soelim(1) use from
        macro calls and request invocations.  The grog(1) and soelim(1) man
        pages now both explain why such an effort was misguided.
      - Recognize macro definitions created by .am and .am1 requests (not
        just .de and .de1).
      - Ignore all other *roff requests.
      - What remains must be a ("standard") macro call, so set
        `have_seen_first_macro_call`.
    
    * src/utils/grog/grog.1.man (Limitations): Document a further
      restriction: don't change the escape character, either.
    
    * src/utils/grog/tests/smoke-test.sh: Comment out pic-detection test on
      soelim(1).  The pic macro calls are guarded by roff control structures
      and only worked previously by accident because grog did not recognize
      *roff input line continuation; now it does, and the illusion is
      dispelled.  (A reliable way to fool grog before and after my
      refactoring is now documented in its man page.)
    
    Fixes <https://savannah.gnu.org/bugs/?59622>.
---
 ChangeLog                          |  64 +++++++++
 src/utils/grog/grog.1.man          |   5 +-
 src/utils/grog/grog.pl             | 259 ++++++++++++++++---------------------
 src/utils/grog/tests/smoke-test.sh |  10 +-
 4 files changed, 181 insertions(+), 157 deletions(-)
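
The early stages of the rewritten do_line() can be sketched as a
standalone script.  This is a minimal sketch, not the committed code:
the helper names join_continued_lines and normalize_control_line are
hypothetical, and grog.pl itself keeps the continuation state in the
globals `is_continued_line` and `logical_line` rather than in a helper.
The regular expressions are the ones that appear in the diff.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Join physical lines that end in a backslash into logical lines.
# (Hypothetical helper; grog.pl tracks this state in global scalars.)
sub join_continued_lines {
  my @logical;
  my $pending = '';
  my $is_continued = 0;
  for my $line (@_) {
    $pending = $is_continued ? $pending . $line : $line;
    if ($pending =~ s/\\$//) {
      $is_continued = 1;        # wait for the rest of the logical line
      next;
    }
    $is_continued = 0;
    push @logical, $pending;
  }
  return @logical;
}

# Reduce a logical line to a normalized control line.  Returns undef
# for text lines, empty requests, and macro definition terminators.
sub normalize_control_line {
  my $line = shift;
  $line =~ s/\\".*//;           # strip \" comments
  $line =~ s/\\#.*//;           # strip \# comments
  return unless $line =~ /^[.']/;       # text line
  $line =~ s/^['.]\s*/./;       # no-break control character -> regular
  $line =~ s/\s+$//;            # drop trailing white space
  return if $line =~ /^\.$/;            # empty request
  return if $line =~ /^\.\\?\.$/;       # end of a macro definition
  $line =~ s/\\[{}]//g;         # remove brace escapes
  return $line;
}

# Example: a no-break control line, a continued line, and a text line.
for my $l (join_continued_lines("'  br  ", '.ds foo bar\\', 'baz', 'text')) {
  my $n = normalize_control_line($l);
  print defined $n ? "$n\n" : "(ignored)\n";
}
```

As in the patch, a trailing continued line with no successor is never
emitted, because the accumulating input is only processed once a line
without a trailing backslash arrives.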

diff --git a/ChangeLog b/ChangeLog
index dde8d09..7e4b18d 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,69 @@
 2021-06-30  G. Branden Robinson <g.branden.robinson@gmail.com>
 
+       [grog]: Refactor input parsing.
+
+       * src/utils/grog/grog.pl:
+         - Add scalar `use_compatibility_mode` (see below).
+         - Add list `request` to store the names of all requests
+           recognized by groff so that they aren't confused with macro
+           names.
+         - Add scalars `have_seen_first_macro_call` (replaces
+           `before_first_command`, but at global scope),
+           `is_continued_line` and `logical_line`.  The latter two
+           enable us to handle *roff input line continuation correctly.
+         (process_arguments): Set `use_compatibility_mode` if `-C`
+         option specified.
+         (process_input): Refactor to greatly simplify, to not attempt
+         to read the first line of an input file as a special case, and
+         to avoid sending `do_line` an undefined argument (when EOF is
+         reached).
+         (do_first_line): Delete.
+         (do_line): Rewrite the early stages of input parsing.
+         - Concatenate continued input lines, setting
+           `is_continued_line` and returning early as each one is seen,
+           storing the accumulating input in `logical_line`.
+         - Check the input line for the form of comment deposited by
+           Perl's Pod::Man, which uses a highly accented dialect of
+           man(7); if it's present, inflate `man_score` to compensate
+           for the page-private `IX` macro it defines but which
+           duplicates the name and function of a 4.2BSD-era ms(7)
+           extension that would otherwise deceive our scoring
+           mechanism, because Pod::Man produces `IX` calls to
+           metastatic excess.  (An alternative to this kludge is
+           documented in comments: if a "standard" macro is redefined,
+           we could delete it from the relevant lists and hashes.)
+         - Strip *roff comments from input.
+         - Normalize control lines; convert the no-break control
+           character to the regular one and remove unnecessary
+           white space.
+         - Remove brace escapes.
+         - Recognize two-character macro calls when not followed by
+           white space in compatibility mode.
+         - Drop logic that erroneously attempted to infer soelim(1) use
+           from macro calls and request invocations.  The grog(1) and
+           soelim(1) man pages now both explain why such an effort was
+           misguided.
+         - Recognize macro definitions created by .am and .am1 requests
+           (not just .de and .de1).
+         - Ignore all other *roff requests.
+         - What remains must be a ("standard") macro call, so set
+           `have_seen_first_macro_call`.
+
+       * src/utils/grog/grog.1.man (Limitations): Document a further
+       restriction: don't change the escape character, either.
+
+       * src/utils/grog/tests/smoke-test.sh: Comment out pic-detection
+       test on soelim(1).  The pic macro calls are guarded by roff
+       control structures and only worked previously by accident
+       because grog did not recognize *roff input line continuation;
+       now it does, and the illusion is dispelled.  (A reliable way to
+       fool grog before and after my refactoring is now documented in
+       its man page.)
+
+       Fixes <https://savannah.gnu.org/bugs/?59622>.
+
+2021-06-30  G. Branden Robinson <g.branden.robinson@gmail.com>
+
        Add regression test for Savannah #59622.
 
        * src/utils/grog/tests/recognize-perl-pod.sh: Test it.
diff --git a/src/utils/grog/grog.1.man b/src/utils/grog/grog.1.man
index 6367b9a..f28b52d 100644
--- a/src/utils/grog/grog.1.man
+++ b/src/utils/grog/grog.1.man
@@ -238,8 +238,9 @@ option.
 .\" ====================================================================
 .
 .I grog
-presumes that the input does not change the control and no-break control
-characters.
+presumes that the input does not change the escape,
+control,
+and no-break control characters.
 .
 .
 .P
diff --git a/src/utils/grog/grog.pl b/src/utils/grog/grog.pl
index 486c261..5f359c2 100644
--- a/src/utils/grog/grog.pl
+++ b/src/utils/grog/grog.pl
@@ -42,8 +42,8 @@ my $groff_version = 'DEVELOPMENT';
 
 my @command = ();              # the constructed groff command
 my @requested_package = ();    # arguments to '-m' grog options
-
 my $do_run = 0;                        # run generated 'groff' command
+my $use_compatibility_mode = 0;        # is -C being passed to groff?
 
 my $program_name = $0;
 {
@@ -51,6 +51,32 @@ my $program_name = $0;
   $program_name = $f;
 }
 
+my @request = ('ab', 'ad', 'af', 'aln', 'als', 'am', 'am1', 'ami',
+              'ami1', 'as', 'as1', 'asciify', 'backtrace', 'bd', 'blm',
+              'box', 'boxa', 'bp', 'br', 'brp', 'break', 'c2', 'cc',
+              'ce', 'cf', 'cflags', 'ch', 'char', 'chop', 'class',
+              'close', 'color', 'composite', 'continue', 'cp', 'cs',
+              'cu', 'da', 'de', 'de1', 'defcolor', 'dei', 'dei1',
+              'device', 'devicem', 'di', 'do', 'ds', 'ds1', 'dt', 'ec',
+              'ecr', 'ecs', 'el', 'em', 'eo', 'ev', 'evc', 'ex', 'fam',
+              'fc', 'fchar', 'fcolor', 'fi', 'fp', 'fschar',
+              'fspecial', 'ft', 'ftr', 'fzoom', 'gcolor', 'hc',
+              'hcode', 'hla', 'hlm', 'hpf', 'hpfa', 'hpfcode', 'hw',
+              'hy', 'hym', 'hys', 'ie', 'if', 'ig', 'in', 'it', 'itc',
+              'kern', 'lc', 'length', 'linetabs', 'lf', 'lg', 'll',
+              'lsm', 'ls', 'lt', 'mc', 'mk', 'mso', 'msoquiet', 'na',
+              'ne', 'nf', 'nh', 'nm', 'nn', 'nop', 'nr', 'nroff', 'ns',
+              'nx', 'open', 'opena', 'os', 'output', 'pc', 'pev', 'pi',
+              'pl', 'pm', 'pn', 'pnr', 'po', 'ps', 'psbb', 'pso',
+              'ptr', 'pvs', 'rchar', 'rd', 'return', 'rfschar', 'rj',
+              'rm', 'rn', 'rnn', 'rr', 'rs', 'rt', 'schar', 'shc',
+              'shift', 'sizes', 'so', 'soquiet', 'sp', 'special',
+              'spreadwarn', 'ss', 'stringdown', 'stringup', 'sty',
+              'substring', 'sv', 'sy', 'ta', 'tc', 'ti', 'tkf', 'tl',
+              'tm', 'tm1', 'tmc', 'tr', 'trf', 'trin', 'trnt', 'troff',
+              'uf', 'ul', 'unformat', 'vpt', 'vs', 'warn', 'warnscale',
+              'wh', 'while', 'write', 'writec', 'writem');
+
 my @macro_ms = ('RP', 'TL', 'AU', 'AI', 'DA', 'ND', 'AB', 'AE',
                'QP', 'QS', 'QE', 'XP',
                'NH',
@@ -149,12 +175,18 @@ my @filespec;
 
 my @main_package = ('an', 'doc', 'doc-old', 'e', 'm', 'om', 's');
 my $inferred_main_package = '';
+
+# .TH is both a man(7) macro and often used with tbl(1).  We expect to
+# find .TH in ms(7) documents only between .TS and .TE calls, and in
+# man(7) documents only as the first macro call.
+my $have_seen_first_macro_call = 0;
+my $inside_tbl_table = 0;
 # man(7) and ms(7) use many of the same macro names; do extra checking.
 my $man_score = 0;
 my $ms_score = 0;
-# .TH is both a man(7) macro and often used with tbl(1).
-my $inside_tbl_table = 0;
 
+my $is_continued_line = 0;
+my $logical_line = '';
 my $had_inference_problem = 0;
 my $had_processing_problem = 0;
 my $have_any_valid_arguments = 0;
@@ -256,6 +288,10 @@ sub process_arguments {
 
     # Treat anything else as (possibly clustered) groff options that
     # take no arguments.
+
+    # Our do_line() needs to know if it should do compatibility parsing.
+    $use_compatibility_mode = 1 if ($arg =~ /C/);
+
     push @command, $arg;
   }
 
@@ -274,194 +310,117 @@ sub process_input {
       &fail("cannot open '$file': $!");
       next;
     }
-    $have_any_valid_arguments = 1;
-    my $line = <FILE>; # get single line
 
-    unless ( defined($line) ) {
-      # empty file, go to next filearg
-      close (FILE);
-      next;
-    }
+    $have_any_valid_arguments = 1;
 
-    if ( $line ) {
+    while (my $line = <FILE>) {
       chomp $line;
-      unless ( &do_first_line( $line, $file ) ) {
-       # not an option line
-       &do_line( $line, $file );
-      }
-    } else { # empty line
-      next;
+      &do_line($line);
     }
 
-    while (<FILE>) { # get lines by and by
-      chomp;
-      &do_line( $_, $file );
-    }
     close(FILE);
   } # end foreach
 } # process_input()
 
 
-# As documented for the 'man' program, the first line can be
-# used as a groff option line.  This is done by:
-# - start the line with '\" (apostrophe, backslash, double quote)
-# - add a space character
-# - a word using the following characters can be appended: 'egGjJpRst'.
-#     Each of these characters means an option for the generated
-#     'groff' command line, e.g. '-t'.
-#
-# XXX: The above is not accurate; man(7)'s preprocessor encoding
-# convention does not map perfectly to groff(1) command-line options.
-# The letter for 'refer' is 'r', not 'R', and there is also the
-# historical legacy of vgrind ('v') to consider.  In any case, why
-# should that comment line override what we can infer from actual macro
-# calls within the document?  Furthermore this hint encoding convention
-# is particular to man pages, disregarded by at least one major
-# implementation thereof (man-db man), and not used by other types of
-# roff documents; at this point, we don't yet know if the document we're
-# processing is a man page.  Contemplate getting rid of this subroutine
-# and %preprocs_tmacs altogether.  --GBR
-sub do_first_line {
-  my ( $line, $file ) = @_;
-
-  # For a leading groff options line [sic], use only [egGjJpRst].
-  if  ( $line =~ /^[.']\\"[\segGjJpRst]+&/ ) {
-    if ( $line =~ /j/ ) {
-      $Groff{'chem'}++;
-    }
-    if ( $line =~ /e/ ) {
-      $Groff{'eqn'}++;
-    }
-    if ( $line =~ /g/ ) {
-      $Groff{'grn'}++;
-    }
-    if ( $line =~ /G/ ) {
-      $Groff{'grap'}++;
-    }
-    if ( $line =~ /i/ ) {
-      $Groff{'gideal'}++;
-    }
-    if ( $line =~ /p/ ) {
-      $Groff{'pic'}++;
-    }
-    if ( $line =~ /R/ ) {
-      $Groff{'refer'}++;
-    }
-    if ( $line =~ /s/ ) {
-      $Groff{'soelim'}++;
-    }
-    if ( $line =~ /t/ ) {
-      $Groff{'tbl'}++;
-    }
-    return 1;  # a leading groff options line, 1 means yes, 0 means no
-  }
-
-  # not a leading short groff options line
-
-  return 0 if ( $line !~ /^[.']\\"\s*(.*)$/ ); # ignore non-comments
-
-  return 0 unless ( $1 );      # for empty comment
+sub do_line {
+  my $command;                 # request or macro name
+  my $args;                    # request or macro arguments
 
-  # all following array members are either preprocs or 1 tmac
-  my @words = split '\s+', $1;
+  my $line = shift;
 
-  my @in = ();
-  my $word;
-  for $word ( @words ) {
-    if ( $word eq 'ideal' ) {
-      $word = 'gideal';
-    } elsif ( $word eq 'gpic' ) {
-      $word = 'pic';
-    } elsif ( $word =~ /^(gn|)eqn$/ ) {
-      $word = 'eqn';
-    }
-    if ( exists $preprocs_tmacs{$word} ) {
-      push @in, $word;
-    } else {
-      # not word for preproc or tmac
-      return 0;
-    }
+  if ($is_continued_line) {
+    $logical_line .= $line;
+  } else {
+    $logical_line = $line;
   }
 
-  for $word ( @in ) {
-    $Groff{$word}++;
+  if ($logical_line =~ s/\\$//) {
+    $is_continued_line = 1;
+    return;
+  } else {
+    $is_continued_line = 0;
   }
-} # do_first_line()
 
+  # Check for a Perl Pod::Man comment.
+  #
+  # An alternative to this kludge is noted below: if a "standard" macro
+  # is redefined, we could delete it from the relevant lists and
+  # hashes.
+  if ($logical_line =~ /\\\" Automatically generated by Pod::Man/) {
+    $man_score += 100;
+  }
 
-my $before_first_command = 1; # for check of .TH
+  # Strip comments.
+  $logical_line =~ s/\\".*//;
+  $logical_line =~ s/\\#.*//;
 
-sub do_line {
-  my ( $line, $file ) = @_;
+  return unless ($logical_line =~ /^[.']/);    # Ignore text lines.
 
-  return if ( $line =~ /^[.']\s*\\"/ );        # comment
+  # Normalize control lines; convert no-break control character to the
+  # regular one and remove unnecessary whitespace.
+  $logical_line =~ s/^['.]\s*/./;
+  $logical_line =~ s/\s+$//;
 
-  return unless ( $line =~ /^[.']/ );  # ignore text lines
+  return if ($logical_line =~ /^\.$/);         # Ignore empty request.
+  return if ($logical_line =~ /^\.\\?\.$/);    # Ignore macro def ends.
 
-  $line =~ s/^['.]\s*/./;      # let only a dot as leading character,
-                               # remove spaces after the leading dot
-  $line =~ s/\s+$//;           # remove final spaces
+  $logical_line =~ s/\\[{}]//g;                # Remove any brace escapes.
 
-  return if ( $line =~ /^\.$/ );       # ignore .
-  return if ( $line =~ /^\.\.$/ );     # ignore ..
+  # Split control line into a request or macro call and its arguments.
 
-  if ( $before_first_command ) { # so far without 1st command
-    if ( $line =~ /^\.\s*TH/ ) {
-      # .TH as the first macro call in a document screams man(7).
-      $man_score += 100;
-    }
-    $before_first_command = 0;
+  # Handle single-letter macro names.
+  if ($logical_line =~ /^\.(\w)(\s+(.*))?$/) {
+    $command = $1;
+    $args = $3;
+  # Handle two-letter macro/request names in compatibility mode.
+  } elsif ($use_compatibility_mode) {
+    $logical_line =~ /^\.(\w\w)\s*(.*)$/;
+    $command = $1;
+    $args = $2;
+  # Handle multi-letter macro/request names in groff mode.
+  } else {
+    $logical_line =~ /^\.(\w+)(\s+(.*))?$/;
+    $command = $1;
+    $args = $3;
   }
 
-  # split command
-  $line =~ /^\.(\w+)\s*(.*)$/;
-  my $command = $1;
-  $command = '' unless ( defined $command );
-  my $args = $2;
-  $args = '' unless ( defined $args );
-
+  $command = '' unless ($command);
+  $args = '' unless ($args);
 
-  ######################################################################
-  # XXX: Dubious.  See <https://savannah.gnu.org/bugs/?60421>.  --GBR
-
-  # soelim
-  if ( $line =~ /^\.(do)?\s*(so|mso|PS\s*<|SO_START).*$/ ) {
-    # '.so', '.mso', '.PS<...', '.SO_START'
-    $Groff{'soelim'}++;
-    return;
-  }
-  if ( $line =~ /^\.(do)?\s*(so|mso|PS\s*<|SO_START).*$/ ) {
-    # '.do so', '.do mso', '.do PS<...', '.do SO_START'
-    $Groff{'soelim'}++;
-    return;
+  if ((!$have_seen_first_macro_call) && ($command eq 'TH')) {
+    # .TH as the first macro call in a document screams man(7).
+    $man_score += 100;
   }
 
   ######################################################################
   # user-defined macros
 
-  # XXX: Macros can also be defined with .am, .am1.  Handle that.  And
-  # with .dei{,1}, ami{,1} as well, but supporting that would be a heavy
-  # lift for the benefit of users that probably don't require grog's
-  # help.  --GBR
-  if ( $line =~ /^\.de1?\W?/ ) {
+  # If the line calls a user-defined macro, skip it.
+  return if (exists $user_macro{$command});
+
+  # Macros can also be defined with .dei{,1}, ami{,1}, but supporting
+  # that would be a heavy lift for the benefit of users that probably
+  # don't require grog's help.  --GBR
+  if ($command =~ /^(de|am)1?$/) {
     # this line is a macro definition, add it to %user_macro
-    my $macro_name = $line;
+    my $name = $args;
     # Strip off any end macro.
-    $macro_name =~ s/^\.de1?\s+(\w+)\W*/.$1/;
+    $name =~ s/\W*$//;
     # XXX: If the macro name shadows a standard macro name, maybe we
     # should delete the latter from our lists and hashes.  This might
     # depend on whether the document is trying to remain compatible
     # with an existing interface, or simply colliding with names they
     # don't care about (consider a raw roff document that defines 'PP').
     # --GBR
-    return if ( exists $user_macro{$macro_name} );
-    $user_macro{$macro_name} = 1;
+    $user_macro{$name} = 0 unless (exists $user_macro{$name});
     return;
   }
 
+  # Ignore all other requests.
+  return if (grep(/$command/, @request));
 
-  # if line command is a defined macro, just ignore this line
-  return if ( exists $user_macro{$command} );
+  $have_seen_first_macro_call = 1;
 
 
   ######################################################################
@@ -820,7 +779,7 @@ sub construct_command {
   for my $pkg (@requested_package) {
     if (grep(/$pkg/, @main_package)) {
       if ($pkg ne $inferred_main_package) {
-        &warn("overriding inferred package '$inferred_main_package'"
+       &warn("overriding inferred package '$inferred_main_package'"
              . " with requested package '$pkg'");
       }
       $inferred_main_package = '';
@@ -908,4 +867,4 @@ exit 0;
 # fill-column: 72
 # mode: CPerl
 # End:
-# vim: set autoindent noexpandtab shiftwidth=2 textwidth=72:
+# vim: set autoindent noexpandtab shiftwidth=2 softtabstop=2 textwidth=72:
diff --git a/src/utils/grog/tests/smoke-test.sh b/src/utils/grog/tests/smoke-test.sh
index adb2c19..b598ab7 100755
--- a/src/utils/grog/tests/smoke-test.sh
+++ b/src/utils/grog/tests/smoke-test.sh
@@ -38,11 +38,11 @@ echo "testing eqn(1)-using man(7) page $doc" >&2
 "$grog" "$doc" | \
     grep -Fqx 'groff -e -man '"$doc"
 
-doc=src/preproc/soelim/soelim.1
-echo "testing pic(1)-using man(7) page $doc" >&2
-# BUG: grog spuriously detects a need for soelim(1).
-"$grog" "$doc" | \
-    grep -Fqx 'groff -s -p -man '"$doc"
+# BUG: grog doesn't yet handle .if, .ie, .while.
+#doc=src/preproc/soelim/soelim.1
+#echo "testing pic(1)-using man(7) page $doc" >&2
+#"$grog" "$doc" | \
+#    grep -Fqx 'groff -p -man '"$doc"
 
 doc=tmac/groff_mdoc.7
 echo "testing tbl(1)-using mdoc(7) page $doc" >&2
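
The name/argument split and the request filter near the end of the
do_line() hunk can be illustrated in isolation.  This is a sketch under
stated assumptions: split_control_line and is_macro_call are
hypothetical names, the request table is a small excerpt of the patch's
`@request` list, and the exact-match hash lookup is a substitution made
for this sketch (the patch itself uses grep with an unanchored pattern).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Split a normalized control line (leading '.', no trailing space) into
# a name and its arguments.  $compat stands in for groff's -C option.
sub split_control_line {
  my ($line, $compat) = @_;
  my ($command, $args);
  if ($line =~ /^\.(\w)(?:\s+(.*))?$/) {
    ($command, $args) = ($1, $2);       # single-letter name
  } elsif ($compat) {
    # In compatibility mode a name is at most two characters long and
    # need not be followed by white space.
    ($command, $args) = ($1, $2) if $line =~ /^\.(\w\w)\s*(.*)$/;
  } else {
    ($command, $args) = ($1, $2) if $line =~ /^\.(\w+)(?:\s+(.*))?$/;
  }
  return ($command // '', $args // '');
}

# Excerpt of the request names listed in the patch (assumption: a
# hash with exact-match lookup instead of the patch's grep).
my %is_request = map { $_ => 1 } qw(am am1 br de de1 el ie if so sp);

# A name that is not a request is treated as a macro call.
sub is_macro_call {
  my $name = shift;
  return $name ne '' && !exists $is_request{$name};
}

# Example: the same input parsed in groff mode and compatibility mode.
for my $compat (0, 1) {
  my ($name, $args) = split_control_line('.TPfoo', $compat);
  print "compat=$compat name=$name args=$args\n";
}
```

The example input `.TPfoo` is parsed as a call to a macro named `TPfoo`
in groff mode, but as `TP` with argument `foo` in compatibility mode,
which is the case the commit message describes.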


