Re: z/OS porting issues, UTF-8 support, and the groff man(1) page
From: Mike Fulton
Subject: Re: z/OS porting issues, UTF-8 support, and the groff man(1) page
Date: Sat, 1 Apr 2023 16:47:25 -0700
On Fri, Mar 31, 2023 at 2:55 PM G. Branden Robinson <
g.branden.robinson@gmail.com> wrote:
> [adding Dave to CC; seek your name below for my magical summons]
>
> At 2023-03-31T13:05:09-0700, Mike Fulton wrote:
> > On Fri, Mar 31, 2023 at 8:57 AM G. Branden Robinson <
> > > As a groff developer, I'm interested in minimizing the number of
> > > patches you have to carry "downstream" to support groff.
> > >
> > Definitely - I have not yet been able to build with the 'git' dev
> > build but instead have been building from the tarball. I was planning
> > to work to upstream changes once I had the 'git' build working (we are
> > getting there now that we have more tools in place - it's a circuitous
> > process!)
>
> When you're ready to make that shift, be sure to read the "INSTALL.REPO"
> file in the root of the repository or distribution archive.
>
Bruno Haible has provided an enhancement to GNU libiconv that now 'falls
back' from the mathematical angle brackets to < and >.
The net effect of that change is that 'man groff' now works for me, which
is great!
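For anyone whose iconv lacks this fallback, here is a minimal sketch of the same degradation done by hand. It assumes the formatter emitted U+27E8 and U+27E9 for \(la and \(ra, which is what I understand groff's utf8 device to do. The function name ascii_angle_fallback is made up for illustration; it is not part of groff or libiconv.

```shell
#!/bin/sh
# Hypothetical helper: degrade the mathematical angle brackets
# U+27E8 and U+27E9 to the basic Latin signs < and >.
# The brackets are spelled as octal UTF-8 byte sequences so that the
# script itself stays pure ASCII.
LA=$(printf '\342\237\250')   # U+27E8 MATHEMATICAL LEFT ANGLE BRACKET
RA=$(printf '\342\237\251')   # U+27E9 MATHEMATICAL RIGHT ANGLE BRACKET

ascii_angle_fallback() {
    # Reads UTF-8 text on stdin, writes the degraded text on stdout.
    # LC_ALL=C makes sed treat the input as plain bytes.
    LC_ALL=C sed -e "s/$LA/</g" -e "s/$RA/>/g"
}
```

It would be used as a filter, e.g. "groff -man -Tutf8 page.1 | ascii_angle_fallback", where page.1 is a placeholder file name.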
I _do_ want to tackle the other things that are brought up here as well
(in particular getting a proper fix for my sed hack) and I want to figure
out how to build man so that I can get 'true' UTF-8 support in my man pages.
I am going to take a crack at getting the 'git build' going. I will reach
out once I have made progress with that. Hopefully it won't be too hard -
depends on how many other tools are required for bootstrap/configure. It
sounds like that may also help with my 'sed' problems (see below).
>
> > > I assume the change here:
> > >
> > >
> > > https://github.com/ZOSOpenTools/groffport/blob/main/patches/makevarescape.sed.patch
> > >
> > > is due to a limitation of the system's sed(1)?
> > >
> > Yes - that is the change. No - it's not because of sed. We have ported
> > sed and could rely on it as a dependency. The issue we hit is a bit
> > ugly. Because z/OS is a 'multi-tenant' operating system, we want
> > people to be able to install into a particular location of their
> > choice (either as developer _or_ as a consumer of the binary).
>
> ...without a recompile, I assume?
>
Correct. Without a recompile.
>
> > To make that work, we run a post-process on the files when someone
> > downloads them to change the install 'root' location from where we
> > built the code to the target location they want to install into. It's
> > ugly and we end up doing a find across files to do this trick. If that
> > 'sed' change is in there, we end up 'missing' some particular updates
> > because the string gets changed on us for the 'root' and so I took out
> > that sed update (a complete hack that I need to do better).
>
> Ah. Hmm. I can think of a better way, although it won't (completely)
> help groff 1.22.4.
>
> For groff 1.23, I revised our man pages to be much more careful about
> documenting full file specifications to groff-installed files and to
> compute their values based on the build's configuration
> parameters--stuff like "./configure --prefix=/home/foobar".
>
I will check this out - maybe the problem 'goes away' in 1.23.
>
> Something I think you could do starting with the 1.23.0 release
> candidates--if you keep the groff build tree around somewhere--is to
> perform your sed operation on all the *.man files in the source tree
> (and build tree, if it is separate), sniping any of the existing fodder
> for sed replacement that you find appropriate.
>
> To be concrete, I'm talking about this stuff:
>
>
> https://git.savannah.gnu.org/cgit/groff.git/tree/Makefile.am?id=e3824d611be904bad22176f4f4eb282a5352509d#n864
>
> So your multi-tenancy assistance script could do something like this:
>
> MANS=$(find groff-source-dir groff-build-dir -name "*.man")
> sed -i 's#@BINDIR@#'"$TENANT_HOME"'/bin#g' $MANS
> cd groff-build-dir
> make man-all # You can thank Keith Marshall for suggesting this.
>
I will try the 'git build' first and see what that looks like.
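For my own notes, a minimal sketch of the substitution step described above, written without "sed -i" because POSIX does not specify that option. The function name relocate_man_sources is hypothetical, and the sketch assumes the tenant path contains no '#', '&', or backslash characters, since those are special in the sed replacement.

```shell
#!/bin/sh
# Hypothetical relocation helper: replace the @BINDIR@ placeholder in
# every *.man file under the given directories with "<tenant_home>/bin".
# Usage: relocate_man_sources TENANT_HOME DIR...
relocate_man_sources() {
    tenant_home=$1
    shift
    # Read names line by line so that paths containing spaces survive.
    find "$@" -name '*.man' -type f | while IFS= read -r f; do
        tmp="$f.tmp.$$"
        # '#' is the s-command delimiter, so the path may contain '/'.
        sed 's#@BINDIR@#'"$tenant_home"'/bin#g' "$f" > "$tmp" &&
            mv "$tmp" "$f"
    done
}
```

With the names from the quoted script this would be called as relocate_man_sources "$TENANT_HOME" groff-source-dir groff-build-dir, followed by "make man-all" in the build directory.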
>
> ...and as Emeril Lagasse would say, "bam!" The pages will be
> regenerated with correct file specifications with no cumbersome
> workarounds. And thanks to makevarescape.sed, if the file names wind up
> being long, they'll break in pleasant locations and won't be hyphenated.
>
> Or so I predict, not having actually done this concretely.
>
> If you're wondering why you need to search both the build and source
> directories for .man documents, that's my fault.
>
>
> https://git.savannah.gnu.org/cgit/groff.git/commit/?id=31536c517dfe49b4e4a715a732f76b701531e90a
>
> > > Interestingly, this meshes closely with groff's assumptions. Due to
> > > its chronological origins ca. 1990, it does not accept UTF-8 input,
> > > but it is aware of UTF-8 and can produce it as output. The formatter,
> > > troff(1), accepts ISO Latin-1 input, except on systems where the C
> > > preprocessor macro "IS_EBCDIC_HOST" evaluates true; it then assumes
> > > that its input is encoded using code page 1047.
> > >
> > From my perspective, we can drop support for 1047 altogether. However,
> > I don't know if someone else has done their own 'separate' port. I
> > haven't seen it if there is one. Correct. I don't set that symbol.
>
> Ooh, this is tempting. Can you tell me if "OS/390 Unix" is the same
> product as "z/OS"? Or, if not, if such a thing as "OS/390 Unix" is
> still supported? I apologize for not knowing much about IBM operating
> systems. (I've heard wonderful stories about SMIT, though...)
>
Over the years, the operating system has evolved from MVS to OS/390 to z/OS.
What is shipped with the operating system has evolved too. Up until the
80's, there was no POSIX environment available. That was added in the early
90's as 'Open Edition'. Back in the 90's it was optional, but now it's
always available on the z/OS system (although you can still restrict users
from _using_ the POSIX environment if you want).
So, we now have lots of names for the same thing: OS/390 Unix, Unix System
Services, Open Edition. Some services still spit out the old names (so that
tools don't get broken), so you will see comparisons to 'OS/390' and
sometimes to 'z/OS'.
It's important to note that the hardware (e.g. a z16) runs a variety of
operating systems, including Linux, z/OS, z/VM, z/TPF, and z/VSE. The Z
hardware family is typically referenced as 's390x'.
That was a very long background to say 'Yes - OS/390 Unix can be thought of
as "the same" as z/OS', although z/OS has a lot more stuff in it than just
the POSIX environment that we now refer to as z/OS Unix System Services.
> > > I reckon you've already dealt with this if necessary, and ensured
> > > that your groff 1.22.4 build does not define that symbol.
> > >
> > > Is code page 1047 deprecated or obsolescent on z/OS? If groff
> > > dropped support for it, do you suspect any z/OS users would be
> > > inconvenienced?
> > >
> > I would say neither. An application can choose whether it wants to work
> > in UTF-8/ASCII or whether it wants to work in EBCDIC (or both if it's
> > careful).
> > I wrote a blog on this awhile back:
> >
> > https://makingdeveloperslivesbetter.wordpress.com/2022/01/07/is-z-os-ascii-or-ebcdic-yes/
>
> It looks like what's going on here is that z/OS has metadata available
> for any file of interest to a Unix-like environment that tags a given
> file as ISO 8859-1- or EBCDIC-encoded (if it has to be interpreted as a
> character stream encoded using a single byte).
>
Correct. We can 'tag' a file (via the chtag command) in the hierarchical
file system with the CCSID, and we have some nice services for
'autoconversion' between ASCII and EBCDIC that can be used.
>
> I presume there are facilities to permute the encodings (since ISO
> 8859-1 and code page 1047 are equivalent except for ordering)
> dynamically as well as statically; for the latter you recommend iconv.
>
Yes - the OS provides the iconv C function and the iconv shell command,
although they are more limited than GNU libiconv.
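As a rough sketch of the static conversion with the iconv command: the encoding names below are assumptions, because spellings differ between iconv implementations (IBM-1047, IBM1047, and CP1047 are all in use), so they are left overridable. The function names are made up for illustration.

```shell
#!/bin/sh
# Encoding names are assumptions; check what your iconv accepts and
# override EBCDIC or LATIN1 in the environment if needed.
EBCDIC=${EBCDIC:-IBM-1047}
LATIN1=${LATIN1:-ISO8859-1}

# Convert stdin from code page 1047 to ISO 8859-1.
ebcdic_to_latin1() { iconv -f "$EBCDIC" -t "$LATIN1"; }

# Convert stdin from ISO 8859-1 to code page 1047.
latin1_to_ebcdic() { iconv -f "$LATIN1" -t "$EBCDIC"; }
```

For example, "ebcdic_to_latin1 < page.ebcdic > page.latin1" (placeholder file names) would produce input that troff can read without its own code page 1047 handling.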
>
> So, instead of maintaining groff's own facilities to interpret code page
> 1047 input, we would simply advise affected users to (convert and) tag
> their input files with z/OS's "chtag" command.
>
> This would indeed make possible a nice simplification to GNU troff's
> input processing.
>
> I do not yet assume it would be wise to kill off grotty(1)'s support for
> generating code page 1047 _output_...but maybe we can. Is it possible
> to configure the environment on z/OS such that that is the case? How do
> you spell the standard C locale variables for this scenario?
>
> "LC_ALL=en_US.EBCDIC"?
>
I'm no locale expert, but I think it's the other way around, where it's
assumed to be EBCDIC, e.g. LC_ALL=fr_FR.UTF-8
>
> This may be important for ensuring that we keep nroff(1) working.
>
> > > If there is no longer an audience for code page 1047, several
> > > aspects of groff could be simplified, and it might make the
> > > transition of GNU troff's internal type to int32_t easier. (I
> > > started down this road once before.)
> >
> > This makes sense to me. I know for Perl, we made sure to keep EBCDIC
> > there, but the z/OS Open Tools community doesn't build with EBCDIC.
>
> I think for groff the main win will be to make it easier for people to
> learn and contribute to the project without this additional layer of
> translation in input processing (at least). The significant challenges
> of coping nicely with UTF-8 input were going to be there anyway, arising
> from the narrow-character architecture.
>
> > > > Would others also find it valuable to be able to have the
> > > > mathematical angle brackets in UTF-8 be transliterated to angle
> > > > brackets in ISO8859-1?
> > >
> > > Unless you mean degradation to basic Latin less than and greater
> > > than signs, U+003C and U+003E, then I don't think there are any
> > > valid transliteration targets in ISO Latin-1. The "left-" and
> > > "right-pointing double angle quotation mark"s (U+00AB and U+00BB)
> > > are indeed visually similar but semantically pretty distinct. I
> > > don't think I'd want to impose such a fallback in general. (There
> > > are multiple ways groff users could provide fallbacks for
> > > themselves.)
> >
> > Fair enough!
> >
> > > > If so, perhaps a 'starter fix' would be if I worked with the
> > > > libiconv folks to see if that can be added (I opened a similar
> > > > question in the libiconv channel since honestly I'm not sure the
> > > > best way to fix this).
> > >
> > > You can pursue both lines of attack independently, especially if the
> > > iconv developers have a good reason for not performing this fallback
> > > already.
> > >
> > > I'm not sure groff has a good reason for not performing this
> > > fallback. At this point I think I will tap Dave Kemper, another
> > > groff developer who has a fairly strong interest in the fallback
> > > issue.
> >
> > Thank you.
>
> Dave, what do you think about fallbacks for \(la and \(ra?
>
> > > To cut out yet another source of trouble, if your terminal emulator
> > > has more than 765 lines of scrollback buffer, you can omit paging
> > > the groff(1) document entirely.
> >
> > I did this and it _does_ look good! When I ran it through less -R I
> > did hit problems with the angled brackets - that may be an issue with
> > less.
>
> Okay--let us know if the problem returns to the groff court.
>
> > > I would next inspect groff's device-independent output (which I call
> > > "grout" for short) to see what's being handed to groff's terminal
> > > output driver (grotty(1)).
> > >
> > > $ zcat $(man -w groff) | groff -man -Tutf8 | less
>
> I forgot an important part here.
>
> $ zcat $(man -w groff) | groff -Z -man -Tutf8 | less
>
> Gotta have that "-Z" flag.
>
> > > Around line 459 you should see a sequence of lines like this.
> > >
> > > tGNU
> > > wh24
> > > Cla
> > > h24
> > > thttp://www.gnu.org
> > > Cra
> > > h24
> > > t.
> > >
> > > Those "Cla" and "Cra" lines are key. If they are absent, then you
> > > have almost certainly found a bug in groff.
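Rather than paging to a particular line, the check can be sketched as a small filter. The function name find_angle_glyphs is hypothetical; it only assumes that the glyph-selection commands appear on lines of their own, as in the excerpt quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read groff intermediate output ("grout") on stdin
# and print, with line numbers, the commands that select the special
# characters la and ra. The exit status is non-zero when none are found.
find_angle_glyphs() {
    grep -n -E '^C(la|ra)$'
}
```

With the corrected pipeline from this thread it would be used as: zcat $(man -w groff) | groff -Z -man -Tutf8 | find_angle_glyphs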
>
> > > Another thing I would do is to view the groff_char(7) man page.
> > >
> > > $ man groff_char
> >
> > I don't get warnings here, but the Output and Input columns under:
> > 8-bit Character Codes 160 to 255
> > are all
> > � �
>
> Don't worry about that. The man page in groff 1.22.4 is wrong in that
> respect. It's fixed in the groff 1.23.0 release candidates.
>
>
> https://git.savannah.gnu.org/cgit/groff.git/commit/?id=3e583c9541e4f764c175d7507a9aea1f8eeaaa55
>
> Regards,
> Branden
>