Re: 1.23: UTF-8 device produces mysterious characters


From: Dave Kemper
Subject: Re: 1.23: UTF-8 device produces mysterious characters
Date: Tue, 13 Sep 2022 15:56:18 -0500

On 9/13/22, G. Branden Robinson <g.branden.robinson@gmail.com> wrote:
>> Or look at the Unicode standard, where real great minds with
>> incredible multi-national professional life careers are involved,
>> get the official PDF (hr-hrm, i have not updated since Unicode
>> 13..), combined words are separated with hyphen-minus, _not_
>> hyphen.
>
> I am dubious of this claim.  I would like to see how you verified it.

http://en.wikipedia.org/wiki/Hyphen-minus#Description also makes this
claim: "Though the Unicode Standard states that the U+2010 hyphen is
'preferred' over the hyphen-minus, the Standard itself uses the
hyphen-minus as its hyphen character."  It has citations for both
these statements for anyone who wants to dig further.  It seems an odd
choice, but is hardly the Unicode committee's worst sin.  (That
distinction I still reserve for their recommendation to use U+2019 as
both a closing single quotation mark and an apostrophe, an asinine
overloading that obscures the vast semantic difference between the
two.)
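
For anyone who wants to check which characters are under discussion,
here is a minimal sketch that looks up their official names.  It
assumes Python 3 and its standard unicodedata module; the list of code
points is my own selection from this thread, not anything taken from
the Standard's text.

```python
import unicodedata

# Code points mentioned in this thread (selection is illustrative).
chars = ["\u002d", "\u2010", "\u2013", "\u2014", "\u2019"]

# Map "U+XXXX" labels to the official Unicode character names.
names = {f"U+{ord(c):04X}": unicodedata.name(c) for c in chars}

for label, name in names.items():
    print(label, name)
```

Note that U+2019 carries only the one name, which is the overloading
complained about above.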

> However, by default groff does _not_ break after en dashes.  I
> don't know why this is the case; it has been true for a long time.

I don't know why either, but I can speculate that for most hyphen and
em dash usages, a break following the dash is acceptable, whereas for
one common use of the en dash -- indicating a number range -- a break
would look odd.

> My hypothesis is that less(1) treats '.' as standing for any single code
> point rather than any single byte in the input stream.

My version (less 563) doesn't behave that way, even in a UTF-8 locale.
Maybe it's the older version I'm using, or maybe some other
environmental factor.
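
The distinction in the hypothesis can be illustrated outside of less.
The following is only a sketch using Python 3's re module, which is an
assumption of mine and says nothing about how less itself is
implemented: the same pattern is run once against decoded text (where
'.' stands for one code point) and once against the raw UTF-8 bytes
(where '.' stands for one byte).

```python
import re

# U+2010 HYPHEN occupies three bytes in UTF-8 (E2 80 90).
text = "a\u2010b"
raw = text.encode("utf-8")

# Code-point semantics: a single '.' consumes the whole hyphen.
match_as_text = re.search(r"a.b", text) is not None

# Byte semantics: '.' consumes one byte, so one '.' is not enough...
match_as_bytes = re.search(rb"a.b", raw) is not None

# ...but three of them span the encoded hyphen.
match_three_bytes = re.search(rb"a...b", raw) is not None

print(match_as_text, match_as_bytes, match_three_bytes)
# prints: True False True
```

A pager that matched "a.b" against the hyphen would be behaving like
the first case; one that needed "a...b" would be behaving like the
last.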

> [3] The "hyphen-minus" was, I gather, an entity unknown to typographers
>     in, say, 1970.  It exists because early computer character encoding
>     standards, like ASCII, had limited glyph repertoires and overloaded
>     many glyphs.

There is also the fact that computer keyboards were largely based on
typewriter keyboards, which going back decades earlier had only one
midlevel horizontal bar that had to pull duty as a hyphen, minus sign,
and (typed multiple times) dash.  So even had they had more slots to
work with, the ASCII designers had only the one all-purpose input
character to encode.


