Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)

From: G. Branden Robinson
Subject: Re: gropdf-ng merge status (was: PDF outline not capturing Cyrillic text)
Date: Tue, 6 Feb 2024 19:30:58 -0600
Hi Deri,
At 2024-02-06T21:35:05+0000, Deri wrote:
> Many thanks for your thoughts on my code. I shall reply in general
> terms since your grasp of some of the issues is rather hazy, as you
> admit.
I generally don't feel I grasp code of nontrivial complexity until I've
documented it and written tests for it, and often not even then. I'm a
bear of very little brain!
> Huge AGL lookup table
>
> My least favourite solution, but you made me do it! The most elegant
> and efficient solution was to make a one line amendment to afmtodit
> which added an extra column to the groff font files which would have
> the UTF-16 code for that glyph. This would only affect devpdf and
> devps and I checked the library code groff uses to read its font files
> was not affected by an extra column. I also checked the buffer used
> would not cause an overflow. Despite this, you didn't like this
> solution, without giving a cogent reason, but suggesting a lookup
> table!
I remember expressing unease with the "new column" approach, not
rejection. The main reason is the documented format of the lines in
question.
groff_font(5):
The directive charset starts the character set subsection. (On
typesetters, this directive is misnamed since it starts a list of
glyphs, not characters.) It precedes a series of glyph
descriptions, one per line. Each such glyph description comprises
a set of fields separated by spaces or tabs and organized as
follows.
name metrics type code [entity‐name] [-- comment]
[...]
The entity‐name field defines an identifier for the glyph that the
postprocessor uses to print the troff glyph name. This field is
optional; it was introduced so that the grohtml output driver could
encode its character set. For example, the glyph \[Po] is
represented by “&pound;” in HTML 4.0. For efficiency, these data
are now compiled directly into grohtml. grops uses the field to
build sub‐encoding arrays for PostScript fonts containing more than
256 glyphs. Anything on the line after the entity‐name field or
“--” is ignored.
The presence of two adjacent optional fields seems to me fairly close
to making the glyph descriptions formally ambiguous. In practice,
they're not, until and unless someone decides to name their "entity"
"--"... (We don't actually tell anyone they're not allowed to do that.)
As I understand it, this feature is largely a consequence of the
implementation of grohtml 20-25 years ago, where an "entity" in HTML 4
and XHTML 1 was a well-defined thing. We might do well to tighten the
semantics and format of this optional fifth field a bit more.
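To make the hazard concrete, here is a minimal sketch (Python, with a hypothetical glyph line; not groff's actual parser) of how such a line has to be disambiguated, and why an "entity" literally named "--" would break it:

```python
# Sketch of parsing a groff_font(5) glyph description line:
#   name metrics type code [entity-name] [-- comment]
# The two adjacent optional fields are only decidable because we
# assume no entity is ever literally named "--".

def parse_glyph_line(line):
    fields = line.split()
    name, metrics, gtype, code = fields[:4]
    rest = fields[4:]
    entity = None
    comment = None
    if rest:
        if rest[0] != "--":          # heuristic: "--" starts a comment
            entity = rest[0]
            rest = rest[1:]
        if rest and rest[0] == "--":
            comment = " ".join(rest[1:])
    return name, metrics, gtype, code, entity, comment

print(parse_glyph_line("Po 500,682 2 0x50 &pound; -- pound sign"))
# → ('Po', '500,682', '2', '0x50', '&pound;', 'pound sign')

# An entity actually named "--" would be misread as a comment marker:
print(parse_glyph_line("Po 500,682 2 0x50 -- pound sign"))
# → ('Po', '500,682', '2', '0x50', None, 'pound sign')
```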
More esteemed *roffers than I have stumbled over our documentation's
unfortunate tendency to sometimes toss the term "entity" around,
unmoored from any formal definition in the *roff language.
https://lists.gnu.org/archive/html/groff/2023-04/msg00002.html
While I'm complaining about hazy terminology that exacerbates my hazy
understanding of things, I'll observe that I don't understand what the
verb "to sub-encode" means. I suspect there are better words to express
what this is trying to get across. If I understood what grops was
actually doing here, I'd try to find those words.
> As to whether I should embed the table, or read it in, I deferred to
> the more efficient method used by afmtodit, embed it as part of make. I
> still would prefer the extra column solution, then there is no lookup
> at all.
I don't object to the idea, but I think our design decisions should be
documented, and it frequently seems to fall to me to undertake the
documentation. That means I have to ask a lot of questions, which
programmers sometimes interpret as critique. (And, to be fair,
sometimes I actually _do_ have critiques of an implementation.)
> use_charnames_in_special
>
> Probably unnecessary once you complete the work to return .device to
> its 1.23.0 condition, as you have stated.
That seems like a fair prediction. Almost all of the logic _on the
formatter side_ that employs this parameter seems to be in one function,
`encode_char()`.
https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n5427
(Last month, I renamed that to `encode_char_for_troff_output()` and I'm
thinking it can be further improved, more like
`encode_char_for_device_control()`...
...there's just one more thing.
There's one other occurrence, in a constructor.
https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n293
I look forward to someday understanding what that's doing there.)
> pdfmomclean
>
> Not quite sure how your work on #64484 will affect this, we will have
> to wait and see.
Fair enough.
> Stringhex
>
> Clearly you are still misunderstanding the issue, because there are
> some incorrect statements.
Okay.
> In any lookup there is a key/value pair.
I'm with ya so far.
> If dealing with a document written in Japanese, both the key and the
> value will arrive as unicode. No problem for the value, but the key
> will be invalid if used as part of a register name.
Yes. I was trying to say the same thing in the mail to which you're
replying.
> There are two obvious solutions. One is to encode the key into
> something, easily decoded, which is acceptable to be used as part of a
> register name, or do a loop traversal over two arrays, one holding the
> keys and one the values.
Yes.
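For onlookers, the two strategies might be sketched like this (Python standing in for *roff, which has no associative arrays; all names here are hypothetical, and the hex scheme is BMP-only for brevity):

```python
# Strategy 1: encode the Unicode key so it is safe inside a *roff
# identifier, then do an O(1) lookup.  Here the key's UTF-16 code
# units are written as 4-digit hex, as stringhex appears to do.
def hexkey(s):
    # BMP-only sketch: one 4-hex-digit code unit per character
    return "".join(f"{ord(c):04x}" for c in s)

table = {}
table["pdf:look(" + hexkey("München") + ")"] = 41

# Strategy 2: parallel arrays with a linear O(n) scan; the raw
# Unicode key never has to appear inside an identifier at all.
keys, values = ["München"], [41]

def lookup(k):
    for i, candidate in enumerate(keys):
        if candidate == k:
            return values[i]
    return None
```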
> I'm pretty sure my 9yr old grandson would come up with a looping
> solution.
In my opinion, your grandson has good instincts if he avoids
implementing things like "an:cln", "sanitize.tmac", or my own
"an*abbreviate-page-topic".
> I really don't understand your opposition to the encoding solution,
If I'm debugging using troff and dump the string/macro list, then I
envision it being disheartening to see something like this.
.pm
PDFLB 9
pdfswitchtopage 32
pdfnote 380
pdf:note-T 57
pdfpause 29
PDFBOOKMARK.VIEW 21
pdf:look(0073007500700065007200630061006c006900660072006100670069006c0069007300740069006300650078007000690061006c00690064006f00630069006f007500732602) 41
pdfmark 31
pdftransition 58
pdfbackground 40
pdfpagenumbering 37
pdfbookmark 1677
Whereas with the nine-year-old's solution, I get something more like
this.
.pnr
pdf:bm.nl 0
PDFOUTLINE.FOLDLEVEL 10000
PDFNOTE.WIDTH 252000
PDFNOTE.HEIGHT 144000
PDFHREF.VIEW.LEADING 5000
PDFHREF.LEADING 2000
pdf:look.id!1 1
.pm
PDFLB 9
pdfswitchtopage 32
pdfnote 380
pdf:note-T 57
pdfpause 29
PDFBOOKMARK.VIEW 21
pdf:look.content!1 41
pdfmark 31
pdftransition 58
pdfbackground 40
pdfpagenumbering 37
pdfbookmark 1677
.tm \*[pdf:look.content!1]
supercalifragilisticexpialidocious\[u2602]
I think there are advantages to not having to read, copy and paste,
or--God forbid, type--something like
pdf:look(0073007500700065007200630061006c006900660072006100670069006c0069007300740069006300650078007000690061006c00690064006f00630069006f007500732602).
Not to mention having to come up with an answer when someone asks me how
to decode that at a shell prompt.
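(For the record, one plausible answer, assuming the key really is UTF-16BE code units written as 4-digit hex — the trailing "2602", U+2602 UMBRELLA, suggests it is — would be a one-liner like this:)

```python
# Decode a stringhex-style key: every 4 hex digits are one UTF-16BE
# code unit.  (Assumption: the encoding is in fact UTF-16BE hex.)
s = ("0073007500700065007200630061006c006900660072006100670069006c"
     "0069007300740069006300650078007000690061006c00690064006f0063"
     "0069006f007500732602")
print(bytes.fromhex(s).decode("utf-16-be"))
# → supercalifragilisticexpialidocious☂
```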
> Ok, I accept you would have done it the child's way with the
> performance hit, but I prefer the more elegant encoding solution.
I agree that O(1) is generally better than O(n). But there are
important things to measure besides the asymptotic runtime behavior of
this one aspect of the system. In my experience, systems that are
easier to troubleshoot are more pleasant to use than ones that aren't.
Further, by not measuring the performance impact of encoding and
decoding all those character-string to hexadecimal-character-string
conversions, we're not telling the full story.
> Uniqueness of keys is an issue for either strategy. In mom, a user
> supplied key name is only possible by using the NAMED parameter, and
> if a user uses the same name twice in the document nothing nasty will
> happen, the overview panel will be correct, since each of those is
> tagged with a safe generated name, and if they have used the same name
> for two different places in the document, when they are checking all
> the intra-document links they will find one of them will go to the
> wrong place. Of course this could be improved by warning when the same
> name is provided for a different destination. The man/mdoc macros
> currently have no named destinations, all generated, but this will
> change if the mdoc section referencing is implemented.
Yes, though that's down the road a little way. First I just want to get
hyperlinks working for `Mt`, `Lk`, and `Xr`.
But it's an obvious concern for man pages. I'll _have_ to solve the
problem because man page section heading titles are so rigidly
prescribed. Format multiple man(7) (and mdoc(7)) documents and you'll
have guaranteed collisions. Prefixing the anchors/bookmarks with the
page identifier seems tractable--
<a name="groff(1)/Options">
--for instance. Even that's not perfect--with a large enough
agglomeration of pages you could have more than one groff(1) page--but
it might be close enough. For the pathological cases we can support a
user-specified prefix string, maybe.
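A prefix-plus-disambiguation scheme along those lines might look like this (an entirely hypothetical naming scheme, not existing groff code; the "!" suffix convention is invented for the sketch):

```python
# Hypothetical anchor-name generator: prefix each bookmark with its
# page identifier, and append a counter when the same (page, section)
# pair recurs within one agglomeration of pages.
from collections import Counter

seen = Counter()

def anchor(page, section):
    base = f"{page}/{section}"          # e.g. "groff(1)/Options"
    seen[base] += 1
    n = seen[base]
    return base if n == 1 else f"{base}!{n}"

print(anchor("groff(1)", "Options"))    # → groff(1)/Options
print(anchor("troff(1)", "Options"))    # → troff(1)/Options
print(anchor("groff(1)", "Options"))    # → groff(1)/Options!2  (a second groff(1) page)
```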
> You mention a possible issue if a diversion is passed to stringhex,
> since this is 95% your own code for stringup/down, I'm pretty sure
> that whatever you do to solve the issue in your own code can be
> equally applied to stringhex, so this is not an argument you can use to
> prevent its inclusion.
I wasn't making an argument to "prevent its inclusion". I was and am
studying and making observations, and if those point out oversights in
my own work as well as others', that's a net benefit because it _gets
the information out into the world_.
The point of collaborative software development is to get synergetic
benefits from multiple brains working on the same problem. I hope you
have found my little brain to have contributed something from time to
time.
> As regards your point 2, this is a non-issue, in 1.23.0 it works fine
> with .device. You ask what does:-
>
> \X'pdf: bizzarecmd \[u1234]'
>
> Mean? Well, assuming you are writing in the ethiopic language and
> wrote:-
>
> \X'pdf: bizzarecmd ሴ'
>
> And gropdf would do a "bizzarecmd" using the CHARACTER given (ETHIOPIC
> SYLLABLE SEE). Which could be setting a window title in the pdf
> viewer, I'm not sure, I have not written a handler for bizzarecmd.
That is the point I am trying to make. It is totally up to the
postprocessor (or output driver) how to interpret "\[u1234]".
> As you can see not "misleading to a novice" at all, the fact that
> preconv changed it to be a different form and gropdf changed it back
> to a character to use in pdf metadata is completely transparent to
> the user.
I guess this is a matter of perspective, but the fact that preconv,
troff, and gropdf are all separate programs is significant to me. In my
opinion, that fact means that we need to document the interfaces between
these components. The "\[uXXXX]" that a person puts in an input
document can undergo transformation. (1) preconv might apply a Unicode
normalization form to it; (2) the formatter might honor user
instructions to translate or replace the special character, and, in the
future, it might participate in ligature replacement; (3) the output
device will of course do whatever it does to make a visible grapheme.
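Point (1) is easy to demonstrate; here is a sketch using Python's unicodedata as a stand-in (I have not verified which normalization form, if any, preconv actually applies):

```python
# Unicode normalization can change what the formatter sees: "é" may
# arrive precomposed (U+00E9) or as "e" plus COMBINING ACUTE ACCENT
# (U+0065 U+0301).  NFC folds the decomposed form into the former.
import unicodedata

decomposed = "e\u0301"                  # e + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
print([hex(ord(c)) for c in composed])  # → ['0xe9']
```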
> Your work on \X and .device is to put .device back to how it was in
> 1.23.0 and alter \X to be the same, this is what you said would
> happen.
Yup. That's still my plan. It's a bit of a slog to get through all of
the corner cases.
In case it is of interest, I'm attaching two pieces of absolutely not
fully baked code from my Git stash and a working copy, respectively, to
illustrate my progress. Perhaps you would like the opportunity to repay
me in code review coin. I present my underbelly at its softest! :-O
Having gotten through all of that, I find myself unable to locate where
you explicitly identified any of my statements about stringhex as
incorrect, which is an expectation you set up. We agree that one
approach is O(1) and another is O(n). We agree that the key uniqueness
problem is unsolved by either one. You think the solution that I lean
toward is childish, and I think the solution you lean toward is
less helpful for people using GNU troff(1) to debug documents or macro
files. But those are matters of opinion, not correctness of claims.
Please more explicitly flag my mistaken/false statements so that I can
either rebut your characterization, or learn something from you.
> The purpose of my patch was intended to give Robin a robust solution
> to what he wanted to do.
Yes. That's fine--I simply wanted to take the opportunity to try and
keep things moving forward with the gropdf-ng merge, a long and
difficult process. And if he's happy, I'm happy--see below regarding my
trust in you to get the problem solved.
I hope you haven't overlooked that I was thrilled with your recent
commits that make the PDF-hyperlinkification of the collected groff man
pages possible. That was so compelling that I put other things on hold
to get Savannah #61434 done. <https://savannah.gnu.org/bugs/?61434>
> You wrote in another email:-
>
> "But tparm(const char *str, long, long, long, long, long, long, long,
> long, long) is one of the worst things I've ever seen in C code.
>
> As I just got done saying (more or less) to Deri, when you have to
> obfuscate your inputs to cram them into the data structure you're
> using, that's a sign that you're using the wrong data structure."
>
> I don't appreciate being conflated with "the worst things".
I wasn't attempting to liken anything you've written to the tparm thing.
But even if I were, that would be an assessment of _code_ rather than of
your quality as a person or as a programmer.
I probably sound like a stereotypical political liberal when I claim
that "bad" code comes from bad environments, not bad programmers. In my
opinion that really is the first order factor. People are trainable;
programmers improve as they practice. "Bad" code is often the result of
a poor fit for the tool with the problem, and it's sadly typical that a
programmer does not have the breadth of choice in tools that they would
prefer. They might not have learned of a well-suited alternative; no
one has mastered everything. They may be working under a management
mandate to employ certain technology, or operating under constraints
that render "better" tooling unsupportable in the environment in
question. The pressures of cost minimization can knock things out of
reach in the product phase that were easily sustained during research
and development.
Similarly, the *roff language imposes constraints on us. I suspect you
and I both would be reaching for Perl-like associative arrays to handle
the bookmark storage and lookup issue if *roff had them to offer us.
We should always be on the lookout for better solutions. Languages,
tools, and expertise evolve. This is what refactoring is all about.
Furthermore, we (or at least I) learn more about code by taking a
specimen that works and re-expressing it than by virtually any other
method. When we have a good automated test suite to establish that we
haven't regressed it, this can be "refactoring" at its best. If the end
result is smaller, clearer, and/or more performant code, it's a win.
> The structure we were discussing was a simple key/value pair.
Yes, but not so simple in *roff. *roff identifiers are so "liberal in
what they expect" that people get seduced into thinking you can put
anything into them, which leads to episodes like Savannah #64202.
> I have noticed that you tend to call things obfuscated when you have
> difficulty understanding them.
Once again, I confess that I am a bear of very little brain.
But I also figure that even if I'm roughly as sharp as the average
person on this list--if I may make such a boast even hypothetically--
then plenty of other people will have at least as much difficulty as I.
> Encoding a key into a well known address space (the hex numbers) is
> not obfuscation.
V pbaprqr gung lbh znl or noyr gb ernq n urknqrpvzny rapbqvat bs NFPVV
jvgu sne terngre rnfr guna V pna. Ohg vs abg, gura V fhttrfg gung lbh
erpbafvqre lbhe pynvz urer.
If you're old enough to have grandchildren I trust that you recognize
that form of non-obfuscation immediately. And USENET still lives...
> Consider BASE64 in mail systems, is that obfuscation or a valid method
> of protecting an essentially ascii environment,
It was an expedient for transmitting 8-bit data payloads across
non-8-bit clean data communications channels. Would Base64 or UU
encoding ever have been developed if 8-bit clean channels had been
ubiquitous in the first place?
I think you're inferring value judgments from my observations that
you're not warranted in making. The bookmark-encoding-in-an-identifier
problem sucks, that's for sure. I don't blame anyone for doing what
they can to escape its constraints. But I think it's worth exploring
the space of solutions available to us.
Recall that Base64 has costs in time and space. And of course, most
people can't just read a Base64-encoded email as-is, that is by, say,
catting or grepping their inbox. These are downsides we should be
forthright about.
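The space cost, at least, is simple to quantify: Base64 emits four output bytes for every three input bytes, plus padding.

```python
# Base64's space overhead: 4 encoded bytes per 3 raw bytes.
import base64

raw = bytes(range(256)) * 12        # 3072 bytes of arbitrary binary data
enc = base64.b64encode(raw)
print(len(raw), len(enc), len(enc) / len(raw))
# → 3072 4096 1.3333333333333333
```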
> to my mind it's a sensible compromise in systems designed before
> unicode (rings a bell?).
I'm not passionate about Base64 one way or the other. If the day
arrives when we no longer need it for anything because the problems that
motivated its development disappear, then I wouldn't mourn its demise.
It's an instrument, as stringhex, or some other solution to the same
problem, would be.
> I'm afraid you give the impression that your ideas on how I should do
> my voluntary contribution to groff have more weight than my own, is
> that how you see it?
Not at all. We've had conflict over this point before, so I will ask
again whether you would prefer that gropdf be in the contrib/ directory.
Peter Schaffter, for instance, has made a simple statement of where he
regards the boundaries of mom to be even within the contrib/mom
directory; he's not terribly concerned with, as I recall, asserting
control over the Automake script or the groff_mom(7) man page, both of
which originated with others in the first place. So, I pretty much keep
my fingers out of om.tmac. (I think I've run some one-liner changes, at
about the level of typo fixes, by him on one or two occasions, in
response to Savannah tickets. I simply don't grasp the package well
enough to do more.)
...but, Peter is also wholly responsible for mom's documentation.
If you leave to others the task of writing the documentation for your
contributions, they are (I am) going to study them as closely as they
(I) need to in order to communicate their function to third parties.
Specifically, when I study them closely enough, I start thinking like an
engineer again instead of just a technical writer. And I don't think
that's unwarranted since I serve both roles in this project. Further,
if a code module lives in the non-contrib part of the groff source
distribution, I think the strong implication is that responsibility for
its maintenance falls to the groff team as a whole.
(I'll grant that the distinction is less clear than it could be, because
groff has relatively few active contributors and a lot of the stuff in
the contrib/ directory is maintained either by me or by no one at all,
because the original authors have wandered away over the past 20 years.)
> I welcome people who can find issues with my code, and by issues I
> mean if it produces output other than intended, fails on edge cases I
> have not considered, falls over given valid input.
Certainly. If you didn't welcome that sort of feedback I think you'd be
neglecting your responsibilities as a software developer. Fortunately
that scenario is counterfactual.
> I am quite sure there will be "bugs" in my code, it is fairly complex,
> but subjecting it to a "code review" without even running it to see if
> it does what it says on the box, is not helpful.
I think you've pretty badly mistaken my perspective. One of the reasons
I stick my long nose into your code in this way is because I don't worry
that you won't produce correct results. You have an established record
of delivering solutions that work as advertised. (None of us is
perfect, but you do as well as anyone, as far as I can tell. It
wouldn't surprise me in the least if your defect rate were lower than
mine. I depend heavily on iterative development processes and automated
testing to protect myself from faceplants, and I still suffer those
occasionally. Here's a recent example
<https://savannah.gnu.org/bugs/?65225>. We/I didn't have a regression
test for tbl's '\R' feature, and I got bitten. I overlooked the forest
for the wood lice of invalid inputs. We have that regression test now.)
Regards,
Branden
device-X-1.diff
Description: Text Data
device-X-2.diff
Description: Text Data