groff
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

input.cpp, read_size() error reporting


From: Ingo Schwarze
Subject: input.cpp, read_size() error reporting
Date: Wed, 1 Apr 2020 18:13:50 +0200
User-agent: Mutt/1.12.2 (2019-09-21)

Hi,

G. Branden Robinson wrote in commit c63baf5c, which is a revert:
(I'm changing the order a bit while quoting.)

> I reverted my recent changes to this file because Ingo hates them
> and emailed me to tell me so.

Please note that i'm not trying to bash Branden.  Quite to the
contrary, i very much appreaciate that he is working on improved
error reporting: in my presentations at international BSD conferences,
i often said that error reporting is important and needs to be
accurate and as clear as possible without being excessively noisy,
for the benefit of users.

Also note that i'm not blaming him for committing a patch that had
a number of problems.  It does sometimes happen that soemthing looks
like an obvious improvement to a developer so they just commit it,
and it only turns out later that it isn't quite right yet.  That
can easily be fixed by reverting, then reconsidering, which is
exactly what Branden did here: no lasting damage done.  Of course,
if you are unsure whether something is right, asking for code review
before committing is better; but in particular in a small project
lacking manpower, sometimes being bold and just moving ahead with
improvements may be the right thing to do.

> IMO read_size() needs a refactor

I just scrutinized the code and i must say i find that particular
function quite clear and straightforward.  To me, it feels like
James Clark did a decent job here in 1991, which also stood the
test of time and needed few fixes over the years.  The few that
were needed came from Werner Lemberg, as far as i looked.

Then again, if there are reasonable ideas for making it even simpler
and even more readable, no objection to that, but so far, i don't
see how, or in which way the design of the function might be
defective.

Here is a very compact summary of what the function does:

 * initial '-' or '+': skip it, set inc      
 * inspect the following token:
    * if it is '(':
      * if inc is not yet set, following '-' or '+' is skipped, setting inc
      * two valid digits: use them
      * one valid digit: set BAD, consume the digit and the invalid token
      * no valid digit: set BAD, consume one invalid token
    * else if it is a digit:
       * if it is 1, 2, or 3 and inc is not set, 
         handle a second digit, or set BAD and consume the invalid token
       * else just use the digit
    * else if it can be used as a delimiter:
       * if inc is not yet set, following '-' or '+' is skipped, setting inc
       * try get_number(), which does the following:
          * skip whitespace
          * if newline, tab, or \\}: warn and fail
          * parse_expr() -- this is likely getting complicated,
            so i'm skipping inspection for now; it may succeed or fail,
            and it is apparently supposed to report errors on stderr
       * if get_number() succeeded:
          * check closing delimiter or report error and BAIL OUT
       * else BAIL OUT, supposing get_number() reported the error
         (get_number() may have consumed one invalid token)
    * else BAIL OUT, token::delimiter() already reported the error
      (and consumed the token that is not a valid delimiter)
 * if BAD, which may result from \s(1x, \s(x, \s1x:
   report error and BAIL OUT (consuming the invalid token)
 * finally, handle the valid size

In particular, note that when encountering a token that is invalid in
the given context, as far as i can see, the function consistently
discards that token and consistently reports the error.

>     For fun, throw the following at nroff:
>
>     \s
>     \s+
>     \s-

Results in "cannot use newline as a starting delimiter".
This might profit from saying that we are talking about \s.

>     \sA
>     \s++
>     \s--
>     \s[+
>     \s[-

Results in "missing number".
This might profit from saying that the invalid token is 'A', '+',
'-', newline, or newline, respectively.

>     \s(+
>     \s(-
>     \s(++
>     \s(--
>     \s1
>     \s1x

Results in "bad digit in point size".
This might profit from saying that the bad digit is newline,
newline, '+', '-', newline, or 'x', respectively.

>     \s[++]
>     \s[--]

Results in "numeric expression expected (got ']')"
That sounds quite good already, i think.

Here four others:

      \s[1

Results in "missing ']'", which sounds good to me.

      \s#1

Results in "missing closing delimiter".
This might profit from saying that the missing delimiter is '#'.

      \s++1

Results in "cannot use character '+' as a starting delimiter",
which is very good.

      \s\%

Results in "missing number", which is hard to improve:
how do you describe a token that is not TOKEN_CHAR?
There would be a very large number of possibilities to handle.
I not sure it would be wise to try to tackle that at this point.

> Every one of these is malformed but the current code (in its
> now-reverted state) misses a few,

I'm not sure i follow: the current code - after your patch got
reverted - appears to report all errors, as far as i can see.
What exactly do you consider "missed"?  I admit though that in
some cases, there error messages could probably provide more detail.

> and sometimes characters get eaten (instead of sent to output)
> because of the lack of error-checking in read_size().

If a token is invalid at some point, it (almost?) always gets eaten,
not just in this function.  I think that is a basic design principle
of how roff(7) parsing is supposed to work.

Also, even after scrutinizing the code and doing some testing,
i fail to see any lack of error-checking.  Can you say which
error exactly isn't checked for, i.e. which exact input results
in which faulty output or deficient error reporting, and what
exactly you would like to see instead?

Below is an example of a patch showing how error messages might be
improved, in one specific respect.  My testing is limited so far:

   $ printf '\\s(\x02\n' | ./test-groff -T ascii       
  troff: <standard input>:1: error: bad digit in point size: character code 2
   $ printf '\\s(x\n' | ./test-groff -T ascii
  troff: <standard input>:1: error: bad digit in point size: 'x'
   $ printf '\\s(\\%%\n' | ./test-groff -T ascii 
  troff: <standard input>:1: error: bad digit in point size
   $ printf '\\s(\n' | ./test-groff -T ascii     
  troff: <standard input>:1: error: bad digit in point size

Of course, it would be possible to add more clauses like if (tok.eof())
else if (tok.newline()) and so on, but given the wide variety of
tokens that exist, i have a feeling this could easily be overdone.

Oh, is there a better function to use here than input_char_description(),
that is maybe able to describe more tokens in a human-readable way?
I don't know the groff parsing code well enough to judge that.

Yours,
  Ingo

P.S.
If you like this patch, if it survives full testing, and if it gets
committed, one could reduce the indentation of the final code in
the function, afterwards.  In general, it feels clearer to do error
handling up front, then let normal processing proceed once errors
have been dealt with, rather than handling errors in an else clause
as if they were an afterthought, indenting all the normal processing
one level for each kind of error that needs to be handled.


diff --git a/src/roff/troff/input.cpp b/src/roff/troff/input.cpp
index fb990bdf..6ecd2d52 100644
--- a/src/roff/troff/input.cpp
+++ b/src/roff/troff/input.cpp
@@ -5106,7 +5106,13 @@ static int read_size(int *x)
       return 0;
     }
   }
-  if (!bad) {
+  if (bad) {
+    if (c)
+      error("bad digit in point size: %1", input_char_description(c));
+    else
+      error("bad digit in point size");
+    return 0;
+  }
     switch (inc) {
     case 0:
       if (val == 0) {
@@ -5131,11 +5137,6 @@ static int read_size(int *x)
       *x = 1;
     }
     return 1;
-  }
-  else {
-    error("bad digit in point size");
-    return 0;
-  }
 }
 
 static symbol get_delim_name()



reply via email to

[Prev in Thread] Current Thread [Next in Thread]