lynx-dev

Re: lynx-dev tech. question: translating strings to different charsets


From: Vlad Harchev
Subject: Re: lynx-dev tech. question: translating strings to different charsets
Date: Sun, 5 Sep 1999 10:18:58 +0500 (SAMST)

On Fri, 3 Sep 1999, Klaus Weide wrote:

 OK, as I reported, hyphenation already works. I had to change the approach
slightly: now each word is hyphenated and LY_SOFT_HYPHEN is inserted at the
rightmost possible hyphen position of each word, updating
text->permissible_split (since the last word on the line may be unhyphenatable).
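
 The rule above (one marker at the rightmost allowed break of a word) can be
sketched in C as follows. This is only an illustration, not the code from the
patch: SOFT_HYPHEN_MARK and insert_rightmost_soft_hyphen() are hypothetical
names, the marker value is a placeholder rather than the real LY_SOFT_HYPHEN,
and the breaks[] array is assumed to have been filled in by the hyphenator.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder marker (assumption); the real LY_SOFT_HYPHEN value differs. */
#define SOFT_HYPHEN_MARK '\001'

/*
 * breaks[i] != 0 means a hyphen is allowed after word[i].
 * Copies word to out (which must hold len + 2 bytes), inserting one marker
 * after the rightmost allowed break. Returns the index of the character the
 * marker follows, or -1 if the word has no allowed break.
 */
static int insert_rightmost_soft_hyphen(const char *word,
                                        const unsigned char *breaks,
                                        size_t len, char *out)
{
    size_t i = len;

    /* A break after the last character would be pointless, so skip it. */
    while (i > 1) {
        i--;
        if (breaks[i - 1]) {
            memcpy(out, word, i);
            out[i] = SOFT_HYPHEN_MARK;
            memcpy(out + i + 1, word + i, len - i);
            out[len + 1] = '\0';
            return (int)(i - 1);
        }
    }
    /* No allowed break: copy the word unchanged. */
    memcpy(out, word, len);
    out[len] = '\0';
    return -1;
}
```

 Scanning from the right means the first hit is already the rightmost break,
so the word is copied only once.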

 Here is more basic information on hyphenation rules:
they are patterns that specify the possible hyphen positions within the part
of a word that the pattern matches. Such patterns are built by running special
programs over native text (I don't know the algorithm exactly). The order of
patterns in the dictionary is not significant. Hyphenation exceptions are
expressed in terms of patterns too (at least in libhnj). (Using the plain
hydict, "linux" is hyphenated as lin-ux - is this correct?) Pattern matching is
implemented as a finite-state machine in libhnj (the transitions are calculated
when reading the hydict). Apparently, if two languages use disjoint character
codes, it's possible to concatenate their hydicts to get hyrules that hyphenate
both languages at the same time. Conversely, I'm afraid English phrases like
StarDivision will be hyphenated incorrectly if the hydict for French is loaded,
since AFAIK French and English both use the latin-1 encoding (at least the
character codes of the two languages are not disjoint).
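
 How such patterns are applied to a word can be sketched in C as below. This
assumes the usual TeX-style pattern syntax (digits between letters give a
priority, the highest digit at each position wins, odd means "hyphen allowed").
It is a naive scan for illustration, not libhnj's finite-state machine, it
ignores minimum prefix/suffix lengths, and the three patterns in the usage note
are made up for the example rather than taken from a real hydict.

```c
#include <stddef.h>
#include <string.h>

#define MAX_WORD 64 /* arbitrary limit for this sketch */

/* Apply one pattern such as "li2n" at every place it matches in dotted[]. */
static void apply_pattern(const char *pat, const char *dotted, size_t n,
                          unsigned char *levels)
{
    char letters[MAX_WORD];
    unsigned char digits[MAX_WORD + 1];
    size_t plen = 0, i, pos;

    memset(digits, 0, sizeof digits);
    for (; *pat; pat++) {
        if (*pat >= '0' && *pat <= '9')
            digits[plen] = (unsigned char)(*pat - '0'); /* before letter plen */
        else if (plen < MAX_WORD)
            letters[plen++] = *pat;
    }
    if (plen == 0 || plen > n)
        return;
    for (pos = 0; pos + plen <= n; pos++) {
        if (memcmp(dotted + pos, letters, plen) != 0)
            continue;
        for (i = 0; i <= plen; i++)     /* keep the maximum at each position */
            if (digits[i] > levels[pos + i])
                levels[pos + i] = digits[i];
    }
}

/*
 * Sets breaks[i] to 1 when a hyphen is allowed after word[i], else 0.
 * Returns 0 on success, -1 if the word is too long for this sketch.
 */
static int hyphenate(const char *word, const char *const *patterns,
                     size_t npat, unsigned char *breaks)
{
    char dotted[MAX_WORD];
    unsigned char levels[MAX_WORD + 1];
    size_t len = strlen(word), n = len + 2, i;

    if (n > MAX_WORD)
        return -1;
    dotted[0] = '.';                    /* '.' marks the word boundaries */
    memcpy(dotted + 1, word, len);
    dotted[len + 1] = '.';
    memset(levels, 0, sizeof levels);
    for (i = 0; i < npat; i++)
        apply_pattern(patterns[i], dotted, n, levels);
    /* levels[k] sits just before dotted[k]; word[i] is dotted[i + 1]. */
    for (i = 0; i < len; i++)
        breaks[i] = (unsigned char)(i + 1 < len && (levels[i + 2] & 1));
    return 0;
}
```

 With the toy patterns "n1u", "i1n" and "li2n", the word "linux" gets a single
break after the 'n' (lin-ux): the even 2 in "li2n" overrides the odd 1 of
"i1n". Because only the maximum is kept, the result does not depend on the
order in which the patterns are listed.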

 As for translation, here are my thoughts:
* to avoid the performance decrease due to LYUCFullyTranslateString_1, the
  following can be used:
    the translation of each character used in the hydict chset (aka "human
    letter") to the d.c.s. (display character set) can be precalculated (since
    translation of even unicode "characters" is a zero-state machine) - so it
    seems flexibility is regained - the user will have to specify either in
    the hydict (as a comment) or in lynx.cfg the chset used in the hydict to
    make such translation possible. As for Unicode, IMO even in its present
    state (without modification) libhnj is suitable for this - there will
    simply be extra states used by UTF prefixes (which can be avoided with a
    cleverer approach - using 'int' instead of 'char').
* IMO we can turn lynx into a powerful charset translator with a very cheap
    hack (I mean adding something like 'lynx -recode utf-8 koi8-r < in >out').
    IMO it is worth it.
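
 The precalculated translation from the first point can be sketched in C as a
256-entry table that is built once and then applied in place, with no
allocation per word. All names here (ByteTable, byte_table_build,
byte_table_apply, demo_translate) are hypothetical. In particular
demo_translate is a toy stand-in that only swaps ASCII case. A real build step
would ask lynx's charset code for each byte instead.

```c
#include <stddef.h>
#include <string.h>

/* Maps one source byte to one target byte, or returns -1 if there is no
 * single-byte equivalent. Hypothetical interface, not a lynx function. */
typedef int (*translate_byte_fn)(unsigned char c);

typedef struct {
    unsigned char map[256];   /* source byte -> target byte */
    unsigned char valid[256]; /* 1 if map[] entry is usable */
} ByteTable;

/* Toy translator for the example: swaps ASCII case, rejects bytes >= 0x80. */
static int demo_translate(unsigned char c)
{
    if (c >= 0x80)
        return -1;
    if (c >= 'a' && c <= 'z')
        return c - 'a' + 'A';
    if (c >= 'A' && c <= 'Z')
        return c - 'A' + 'a';
    return c;
}

/* Build the table once, e.g. when the hydict is loaded or the d.c.s. changes. */
static void byte_table_build(ByteTable *t, translate_byte_fn translate)
{
    int c;

    memset(t, 0, sizeof *t);
    for (c = 0; c < 256; c++) {
        int out = translate((unsigned char)c);
        if (out >= 0 && out <= 255) {
            t->map[c] = (unsigned char)out;
            t->valid[c] = 1;
        }
    }
}

/* Translate s in place; returns how many bytes had no mapping (left as-is). */
static size_t byte_table_apply(const ByteTable *t, unsigned char *s, size_t len)
{
    size_t i, missed = 0;

    for (i = 0; i < len; i++) {
        if (t->valid[s[i]])
            s[i] = t->map[s[i]];
        else
            missed++;
    }
    return missed;
}
```

 A non-zero return value from byte_table_apply() tells the caller that the
word contains something other than a mapped "human letter", so it can simply
be left unhyphenated.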


>[...] 
> >  I was really upset (seriously) by the "performance of the in-place
> > byte-to-byte table translation" to the "currently only possible
> > LYUCFullyTranslateString_1 translation" ratio, not that it's not convenient.
> > 
> >  Implementing hyphenation, I have the following assumptions:
> > 1) CJK texts needn't be hyphenated
> > 2) The display charset is not utf8 (in its real sense - no multibyte chars
> >    present in output) mostly due to 1)
> 
> UTF-8 as display character set has nothing to do with CJK.

 I was wrong. 

> > 3) Input is not multibyte text mostly due to 1)
> >
> >  Due to 1),2),3), I assume that size of the translated string won't change
> > - so dynamic allocation won't be necessary. Correct me if I'm wrong.
> 
> You are wrong.
> 
> It seems you are not about to add hyphenation support to lynx, but hyphenation-
> support-that-maybe-works-for-some-charsets-only.  Just at the time when a
> UTF-8 capable xterm has appeared, so we can expect more demand for UTF-8
> output, you pretend it doesn't exist at all.

 I plan to add better support for hyphenation to lynx than it currently has :).
 And I don't wish to spend all my life on a complete implementation of it.
 But IMO the approach I described above is flexible enough.
 At least something can be added later.
 
> Even if I accept your 1),2),3) for the sake of argument, the input string
> length just isn't always the translated string length.  What makes you think
> so?  You must have noticed that there are strings longer than 1 character
> in the *.tbl files.  You must have seen "(c)" for a copyright character.
> You should have seen how Cyrillic text appears in 7-bit Approximations.

 As for Cyrillic 7-bit approximations, they are of little use with hyphenation,
since they will collide with the English hyphenation rules (resources in
English are very useful), so I think the user will turn it off anyway.

> Anyway it's more important to be correct and to be more general (w.r.t
> charsets) than it is to squeeze the last bit of speed out of it, don't
> you think so?  Especially for what you are doing, an unnecessary
> extra.  People who want to use it can _expect_ lynx to be slower!
> 
> >  The most important thing is:
> >     does LYUCFullyTranslateString_1 reallocate the string being translated
> >     (if the size of the translation matches the size of the original)?
> 
> It tries to avoid it when it is not necessary.  That's one reason why it
> is complicated.  I have never tested how much it succeeds in that, but I
> believe it does well.  If you really want to know - well you know about
> ltrace. :)

 Obviously, performance won't depend on *FullyTranslate* speed with the approach
described at the beginning of this message.

> >[...] 
> >  Another variant (you mentioned it) - assume that charset of the hy rules is
> > the same as display chset - but IMO this is less flexible (but more logical -
> > seems that display chset is changed _VERY_ infrequently).
> 
> You're using linux.  Give --enable-font-switch a try!

 I found it unstable (or that version of the kernel console driver was
unreliable), and I don't know any languages other than English and Russian,
which can be displayed at the same time without changing the d.c.s.

> > Then no word translation will be necessary. Frankly speaking, I love this
> > variant very much - do others (and do you in particular)?
> 
> I can't say I love *any* of this.  There are too many assumptions already -
> don't just "assume" that things will hopefully match!
> 
> More flexible == good, bound to a specific display character set == bad.
> I shouldn't get different (or wrong) results when I switch from one
> d.c.s. (that supports a given language) to another one (that has all the
> same necessary characters) because I reconnect from a different terminal etc.
> I shouldn't have to recompile lynx or translate an external file and load
> that etc. in order to get the "right" hyphenation with a different d.c.s.
> I definitely should not get complete non-support (or breakage) when I
> use UTF-8 as d.c.s.!

 I plan to detect d.c.s. changes and recalculate the lookup tables, so no
translation will be necessary. Will you use hyphenation? Even if not, I
recommend compiling it in - with hyphenation and justification, lynx becomes a
very good html->txt translator (we have a stylesheets implementation pending
for more flexibility), and --with-backspaces complements this. At the least,
I'll inform the Linux Documentation Project coordinator about these lynx
capabilities (they are using some stupid programs to translate sgml -> txt
with backspaces).

> >  Hy rules is a plain text file, so they can be distributed in only one chset.
> > Users with display chset non-matching chset of rules will be able to translate
> > the hy rules to their display chset with any program (e.g. GNU recode).
> > 
> >  Now I decided to select this variant as main - at least someone who thinks
> > this is inflexible can implement translation later.
> 
> I think it is too inflexible, especially given that translation mechanisms
> are already present in lynx this looks just like laziness.  Apparently
> you are only interested in *your* charset, and don't care much about
> Russian users that happen to use a different display character set.
> Or about users that don't have a Cyrillic font at all, but can still
> read the "default" Latin approximation.  Or about users with a font
> with characters for several different scripts that can view Cyrillic and
> various others at the same time using UTF-8.  By applying hyphenation to
> the _characters_ (for which Unicode is our general representation) instead
> of a specific _encoding_, you could provide Russian hyphenation for all of
> them!  That requires (probably) changes in the way how text is passed
> around in lynx currently.  But then those changes should be made, it's
> probably a good idea anyway to use (some form of) unicode representation
> more before text reaches HText.  If you add your code for the limited
> approach now, that will just become more confusing to do later.
> 
> A somewhat different topic, it is already bad enough that, although lynx
> itself has all that generality of choosing and switching display character
> sets, NLS message catalogs can only be used with one ("the right") d.c.s.
> But that's just inherited from the gettext approach, not really a
> limitation of our own making.  (We should automatically char-translate
> messages as appropriate [but that's not so trivial as just calling some
> translation function because gettext() returns static strings].)  I don't
> like to see more "features" that unnecessarily depend on the d.c.s. for
> non-broken behavior like that.
> 
> > > > But using this function, especially due to the fact that it uses
> > > > dynamic allocation, will take a lot CPU time. 
> > > 
> > > Do you know how much it takes?
> > 
> >  It can take arbitrary long time due to dynamic allocation/deallocation, and
> > will fragment heap (so entire lynx code will be slower due to dynamic
> > (de)allocations with fragmented heap).
> 
> In other words, wild guessing...  look how much lynx already uses heap
> memory.  Look at every single HTSprintf0 for example.
 
 lynx has awful memory management IMO. I thought about rewriting it, but it
uses dynamic allocation so frequently that using chunks for HTLine content,
etc., won't help (due to very intensive allocations for other purposes).

 Lynx takes as much memory as NS does. (After 5 hours of browsing, a single
instance takes 35 Mb of virtual memory - due to severe memory
fragmentation.)

>[...] 
> 
> Where is that libhnj site?

 There is no libhnj site. It's in GNOME CVS. There is a www gateway at
http://cvs.labs.redhat.com/lxr/source/libhnj/
 The author is Raph Levien <address@hidden> (he has a site, www.levien.com, but
libhnj is not mentioned there). The libhnj cvs tree lacks a script for
converting TeX rules into the libhnj format - this is a perl script that runs
for 17 seconds on a P100. Because of this I wrote a C program that does it in
0.33 secs on a 5x86-133. I haven't checked it in yet, so contact me if you
need hydicts for languages other than English. Last Friday Raph said that he
will check the perl script in and mail me that script too.
 
> >  I forgot to say, that info "human letters" will be derived from the table of
> > "tolower" mapping, so the table of "tolower" mapping will be the only
> > information included in the hyrules. Currently hyphenation file with hy rules
> > allow comments (that start with '%'). The "tolower" mapping will be placed in
> > the comment too. I decided to use the following syntax:
> > 
> > %! Aa Bb Xx Mm Cc
> > %! Ee Pp Hh
> > % more lines follow - mapping characters can be done in any order.
> > % Mapping Information will be aggregated from several lines.
> > 
> > (that tells that lowercased version of symbol 'A' is 'a', 'B' -> 'b' etc).  
> > Note: the support for parsing this syntax will be added to libhnj.
> 
> You are reinventing the Unicode standard here (a small part of it).
> I don't know it well, but I know that upper/lowercase mapping and all
> kinds of character properties - including whether something is a letter -
> are well defined there.  And everything is available in machine-readable
> format.  You don't need to invent your own rules for what's a letter and
> what isn't.

 But info about lowercase/uppercase mapping is absent from lynx.
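
 For reference, reading the "%!" mapping lines quoted above into a byte table
could look roughly like this in C. It is a sketch under assumptions, not the
parser planned for libhnj: parse_tolower_line() is a hypothetical name, the
table is assumed to be zeroed beforehand so that 0 means "not a human letter",
and every character is assumed to be a single byte.

```c
#include <string.h>

/* Returns non-zero for characters that end a pair list or separate pairs. */
static int is_blank_or_end(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\0';
}

/*
 * Folds one hydict line into lower[]. Lines starting with "%!" carry pairs
 * such as "Aa", meaning that the lowercase version of 'A' is 'a'. Both
 * members of a pair become "human letters". Returns the number of pairs
 * read, or 0 for any other line (plain comments, patterns).
 */
static int parse_tolower_line(const char *line, unsigned char lower[256])
{
    const unsigned char *p = (const unsigned char *)line;
    int pairs = 0;

    if (p[0] != '%' || p[1] != '!')
        return 0;
    p += 2;
    for (;;) {
        while (*p == ' ' || *p == '\t')
            p++;
        if (is_blank_or_end(p[0]) || is_blank_or_end(p[1]))
            break;              /* end of line, or a dangling single char */
        lower[p[0]] = p[1];     /* uppercase maps to its lowercase form */
        lower[p[1]] = p[1];     /* lowercase maps to itself */
        pairs++;
        p += 2;
    }
    return pairs;
}
```

 Calling it on every line of the file aggregates the mapping across several
"%!" lines, in any order, as the quoted description requires.
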
 Due to the syntax chosen, it will be somewhat difficult to handle a
utf8-encoded d.c.s. and hyrules, so I won't add support for that right now (so
the byte-to-byte mapping for "human letters" will still be mandatory, since
chars that render as "(c)" are not "human letters"). What will be left to do
is to write utf8 character gathering (in the case of a utf8 d.c.s.), converting
each character to lowercase and then to the hyrules charset.
 I don't have time to implement the complete thing (hacking libhnj will be
necessary, shipping unicode tables will be required ...).
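
 The "utf8 character gathering" step mentioned above is small on its own. A
minimal C sketch follows, assuming the modern definition of UTF-8 (sequences
of at most 4 bytes). utf8_gather() is a hypothetical name, not an existing
lynx or libhnj function, and the lowercase and charset conversion steps are
not shown.

```c
#include <stddef.h>

/*
 * Decodes one UTF-8 sequence starting at s (at most len bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is malformed, overlong or truncated.
 */
static size_t utf8_gather(const unsigned char *s, size_t len,
                          unsigned long *cp)
{
    /* Smallest code point that legitimately needs 1..4 bytes. */
    static const unsigned long min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t need, i;
    unsigned long c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        need = 1; c = s[0];
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 4; c = s[0] & 0x07;
    } else {
        return 0;               /* stray continuation or invalid lead byte */
    }
    if (need > len)
        return 0;               /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[need] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;               /* overlong form, out of range, surrogate */
    *cp = c;
    return need;
}
```

 The resulting code point would then be lowercased and mapped to the hyrules
charset, which is where the Unicode tables come in.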
 Anyway, I'll try to help people solve their problems with hyphenation.
People who only speak or read English won't have any problems. However, people
who use documents with several (say) latin-1 encoded languages will be unable
to use hyphenation at all (since the hydict for only one of those languages
can be loaded, because the chsets are not disjoint), so they'll get incorrect
hyphenation for words in the other languages. To solve this problem, <span
lang=x> must be used (it's hard to convince a German writer to surround
"debian" with <span lang=en></span>, though such words can be added to the
hyphenation exceptions. My experience tells me that collisions will be
unlikely, since hyphenation patterns are built by scanning a bunch of
native-language documents, so probably "debian" and other English words won't
be hyphenated at all with German hyrules).
 And IMO, as long as UTF8 is not widely used _in_documents_ (not on
terminals), the problem with documents mixing several, say, latin-1 encoded
languages will remain.

>[a-lot-of-obsolete-stuff-skipped]
> All you are doing here seems to me like a giant step back in the past.

 But it is a bigger step forward to add any support for hyphenation at all
(even for a limited set of settings).

>[...] 
> Of course we cannot (currently) translate all charsets to Unicode.  (But
> hyphenation does not apply to CJK characters anyway, from what I gather.)
> And in practice German is rarely written in Cyrillic letters, so it doesn't
> make sense to include e.g. Cyrillic letter patterns in the set for German.

 As I said, the hyrules for these particular languages can be concatenated to
get hyrules for Cyrillic and German - they have disjoint sets of character
codes.

>[...]

 So, I'll add support for any d.c.s. other than utf8 and the like, provided
the chset of the hyrules is not utf8 either.
 But expect the patch at the end of next week - I spent too much time
writing/testing the converter from TeX rules to libhnj rules and adding support
for hyphenation to lynx - I have other mandatory duties.

 As I remember, you have to post some patch to lynx too :)

 Best regards,
  -Vlad

