Re: [Groff] Having a problem with parsing output to html...
From: Keith Marshall
Subject: Re: [Groff] Having a problem with parsing output to html...
Date: Fri, 25 Mar 2011 09:53:18 +0000
On 25 March 2011 04:38, Werner LEMBERG wrote:
>
> Justin,
>
> a simple example says more than a thousand words... So please give us
> an example we can examine.
Hear! Hear!
> At a first glance, it seems you have an encoding problem (but this
> doesn't explain the strange things you see). The default encoding of
> groff is latin1, and your input file is probably UTF8. Starting with
> version 1.20, groff can handle UTF8 by using a new preprocessor.
>
> The HTML output driver is still experimental (and basically
> unmaintained currently due to lack of time and interest); it is easily
> possible that you've found a bug.
Equally -- perhaps more -- likely, Justin has encountered a hyphenation
issue. This:
> On the 11th in my groff file, an "â" character is found after 64
> characters have been printed, within the word hamburger, the text gets
> parsed and printed as "hamâburger". If I change hamburger to donations
> I have the "â" character show up at the 60th character on the line,
> with donations being "donaâtions".
is reminiscent of an issue I myself observed earlier this week. I had
run some informally structured ASCII text through a sed filter, and then
through nroff (v1.20.1), to produce an alternative layout. Although I
had suppressed hyphenation (.hy 0), I did have several explicit ASCII
hyphen characters in the input stream; each of these was replaced, in
the output stream, by the three-byte octal sequence 342 200 220 (which
I guess is the UTF-8 encoding of U+2010, the Unicode hyphen that
groff_char(7) documents as the output form for hyphen).
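For anyone who wants to check that guess, here is a minimal Python sketch (not part of the original exchange) that encodes U+2010 as UTF-8 and prints the bytes in octal. The variable names are purely illustrative.

```python
# U+2010 HYPHEN, the character guessed at in the paragraph above.
hyphen = "\u2010"

# Encode it as UTF-8 and format each byte in octal, much as od(1) would show it.
utf8_bytes = hyphen.encode("utf-8")
octal = " ".join(f"{b:o}" for b in utf8_bytes)

print(utf8_bytes)  # b'\xe2\x80\x90'
print(octal)       # 342 200 220
```

The three octal values match the sequence seen in the nroff output, which supports reading it as UTF-8 encoded U+2010.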
Viewing this output with "less", on my UTF-8 aware console, it looked
absolutely fine, but after uploading it as a package description file on my
SourceForge downloads page, each hyphen was rendered, by Firefox, with
unwanted whitespace surrounding it; rendered by Internet Explorer, each
hyphen was replaced by three characters of garbage, among them the
"â" observed by Justin, IIRC.
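The three characters of garbage are consistent with a viewer decoding UTF-8 bytes as a single-byte encoding. The sketch below assumes Latin-1 for illustration; which encoding Internet Explorer actually chose is not known from this thread, and the sample word is borrowed from Justin's report only as a hypothetical input.

```python
# Hypothetical input: a word containing U+2010 where groff placed a hyphen.
text = "ham\u2010burger"

# What a UTF-8 aware tool writes out: the hyphen becomes three bytes.
utf8_bytes = text.encode("utf-8")

# What a viewer sees if it wrongly assumes one byte per character (Latin-1 here).
misread = utf8_bytes.decode("latin-1")

print(repr(misread))  # 'hamâ\x80\x90burger'
```

Under that assumption the first byte, 0xE2, shows up as "â", and the other two bytes land on control codes that a browser may render as blanks or further junk.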
So yes, I guess what you actually see is dependent on encoding, (and how
the viewer interprets the u2010 sequence, however it is encoded). In my
case, I wanted real ASCII hyphens in my output stream; adding "-Tascii"
to my nroff command gave me that.
--
Regards,
Keith.