Re: lynx-dev 0x9A bug
From: Karel Kulhavy
Subject: Re: lynx-dev 0x9A bug
Date: Tue, 5 Oct 1999 16:37:50 +0200
> On Tue, 5 Oct 1999, Karel Kulhavy wrote:
>
> [ Reformatted for quoting - watch your line length! ]
>
> > I've found out that when I run lynx in -dump -raw mode, Lynx removes
> > characters 0x9A from the original source.
>
> > This bug is in version 2.7.1 as well as in version 2.8.2rel.1.
>
> > I have a html file containing Czech text in cp1250 encoding. Some
> > Czech words contain char 0x9A which is small letter 's' with
> > caron. After running lynx -dump -raw on this local html file, the
> > 0x9A character is left out in the output, although the characters
> > around this character forming the word are left untouched.
>
> Depending on circumstances this may be expected, as a precaution
> against having this byte (and others in the range 0x80..0x9F)
> act as a control character. It depends on your environment whether
> that makes sense or not; but if you want lynx to spit out such
> bytes as if they were normal displayable characters, you have to
> *tell it* that your Display Character Set is one where these characters
> are allowed.
Doesn't the "-raw" option tell Lynx that everything above 0x80
is a normal letter?
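
To make the point about 0x80..0x9F concrete, here is a small sketch of how the same byte reads under two character sets. It only illustrates the byte interpretation and says nothing about Lynx's internal logic; the helper name describe is hypothetical.

```python
import unicodedata

BYTE = b"\x9a"

# The same byte decoded under two different character sets.
as_cp1250 = BYTE.decode("cp1250")   # windows-1250: small letter s with caron
as_latin1 = BYTE.decode("latin-1")  # ISO-8859-1: a C1 control code

def describe(ch):
    """Return the code point and Unicode general category of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.category(ch))

print(describe(as_cp1250))  # U+0161 Ll (lowercase letter)
print(describe(as_latin1))  # U+009A Cc (control character)
```

So whether 0x9A is a letter or a control character depends entirely on which character set the byte is assumed to be in.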
>
> For this use of -dump, lynx uses basically the same logic as for
> normal interactive display. So you should see the same effect.
> With -dump, lynx should use the D.C.S. saved from the Options Screen
> (in .lynxrc) or set in lynx.cfg (called simply CHARACTER_SET there).
>
>
> What OS are you using?
Linux 2.2.12 on i386
>
> Are you *sure* that it is lynx that is removing the character?
> Just echoing the file to the screen may not be enough to check -
> since the byte may actually act as a control character.
Yes. I had bum.html on disk. Then I ran:
lynx bum.html -dump -raw >a
mc (Midnight Commander)
Then I viewed "a" with the built-in viewer, switched to hex mode, and saw that
the character is missing from its place.
I am also using lynx to get formatted text into my perverse web browser,
where the 0x9A is missing too, so it's not a bug in Midnight Commander.
The actual mechanism for getting data from lynx into my program consists of
pipe(), fork(), dup2() onto stdout of lynx, and execlp(lynx).
Then I viewed bum.html itself, and the character was in its place.
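
The pipe(), fork(), dup2(), execlp() sequence mentioned above can be sketched roughly as follows. This is only a minimal illustration in Python, using the os module wrappers around the same system calls, and is not the actual code of my program; the function name run_and_capture is hypothetical.

```python
import os

def run_and_capture(argv):
    """Run argv as a child process and return the raw bytes it writes to stdout."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: make the pipe's write end its stdout, then exec the program.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            if write_fd != 1:
                os.close(write_fd)
            os.execlp(argv[0], *argv)
        finally:
            # Reached only if execlp failed; never return into the parent's code.
            os._exit(127)
    # Parent: close the unused write end and read until the child closes its end.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return b"".join(chunks)

# Hypothetical usage with the command line from above:
#   dumped = run_and_capture(["lynx", "bum.html", "-dump", "-raw"])
```

Because the output is read as raw bytes and never decoded, a missing 0x9A in the result cannot have been introduced on the reading side.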
>
> Does this happen only with 0x9A, or also with other characters in the
> range 0x80..0x9F?
I don't know.
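
One way to find out would be to tally the bytes in 0x80..0x9F before and after the dump. The sketch below assumes that such bytes occur only in the visible text of the source file (not in markup that lynx drops anyway); the function names are hypothetical.

```python
from collections import Counter

def c1_range_counts(data):
    """Count occurrences of each byte value in 0x80..0x9F within data."""
    return Counter(b for b in data if 0x80 <= b <= 0x9F)

def dropped_bytes(source, dumped):
    """Map each 0x80..0x9F byte that is rarer in dumped than in source
    to the number of occurrences that went missing."""
    before = c1_range_counts(source)
    after = c1_range_counts(dumped)
    return {b: before[b] - after[b] for b in sorted(before) if after[b] < before[b]}

# Hypothetical usage with the files from above:
#   with open("bum.html", "rb") as f: source = f.read()
#   with open("a", "rb") as f: dumped = f.read()
#   for b, n in dropped_bytes(source, dumped).items():
#       print("0x%02X missing %d time(s)" % (b, n))
```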
>
> So what is your effective Display Character Set? Is it actually
> what you want to get out of lynx?
I pay no attention to the display character set because I believe that when I
switch -raw on, Lynx ignores all encoding issues and just dumps the bytes.
>
> Have you set an ASSUME_LOCAL_CHARSET and/or ASSUME_CHARSET in
> lynx.cfg? You should set e.g. the first one if lynx should assume
> that local files are all in the windows-1250 charset. Then the -raw
> should not be needed for your local file example. (Leave it out
> when it isn't needed - it might actually confuse things.)
>
> Does the file contain a META tag with charset specification?
> (In that case, ASSUME_* would not be used.)
>
> With which screen handling library was lynx compiled? (curses/ncurses/
> slang?) There could be some relevant code differences.
Doesn't -dump mean that Lynx forgets there is a display at all?
Clock