Re: lynx-dev tech. question: translating strings to different charsets


From: Klaus Weide
Subject: Re: lynx-dev tech. question: translating strings to different charsets
Date: Fri, 3 Sep 1999 20:03:22 -0500 (CDT)

On Fri, 3 Sep 1999, Vlad Harchev wrote:
> On Thu, 2 Sep 1999, Klaus Weide wrote:
> > On Fri, 3 Sep 1999, Vlad Harchev wrote:
> > > On Wed, 1 Sep 1999, Klaus Weide wrote:
> > > > On Thu, 2 Sep 1999, Vlad Harchev wrote:

>  I was really upset (seriously) by the "performance of the in-place
> byte-to-byte table translation" to the "currently only possible
> LYUCFullyTranslateString_1 translation" ratio, not that it's not convenient.
> 
>  Implementing hyphenation, I have the following assumptions:
> 1) CJK texts needn't be hyphenated
> 2) The display charset is not utf8 (in its real sense - no multibyte chars
>    present in output) mostly due to 1)

UTF-8 as display character set has nothing to do with CJK.

> 3) Input is not multibyte text mostly due to 1)
>
>  Due to 1),2),3), I assume that the size of the translated string won't
> change - so dynamic allocation won't be necessary. Correct me if I'm wrong.

You are wrong.

It seems you are not about to add hyphenation support to lynx, but hyphenation-
support-that-maybe-works-for-some-charsets-only.  Just at the time when a
UTF-8 capable xterm has appeared, so we can expect more demand for UTF-8
output, you pretend it doesn't exist at all.

Even if I accept your 1),2),3) for the sake of argument, the input string
length just isn't always the translated string length.  What makes you think
so?  You must have noticed that there are strings longer than 1 character
in the *.tbl files.  You must have seen "(c)" for a copyright character.
You should have seen how Cyrillic text appears in 7-bit Approximations.
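
To make that concrete, here is a minimal sketch (not lynx's actual chartrans
API; the replacement table is a made-up parameter) of why an in-place,
same-length loop cannot work once a single character can map to a
multi-character approximation such as "(c)":

    #include <stdlib.h>
    #include <string.h>

    /* repl[] is a hypothetical per-byte replacement table: each input
     * byte maps to a string of one or more output bytes. */
    char *translate_copy(const char *src, const char *const repl[256])
    {
        const unsigned char *p;
        char *out, *q;
        size_t need = 1;

        for (p = (const unsigned char *) src; *p; p++)
            need += strlen(repl[*p]);     /* output length is data-dependent */
        out = q = malloc(need);
        if (out == NULL)
            return NULL;
        for (p = (const unsigned char *) src; *p; p++) {
            size_t n = strlen(repl[*p]);
            memcpy(q, repl[*p], n);       /* "(c)" takes three bytes, not one */
            q += n;
        }
        *q = '\0';
        return out;
    }

The measure-allocate-fill pattern is essentially what any correct general
translation has to do; the in-place assumption only holds for strictly
one-to-one charset pairs.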

Anyway, it's more important to be correct and to be more general (w.r.t.
charsets) than it is to squeeze the last bit of speed out of it, don't
you think so?  Especially for what you are doing, an unnecessary
extra.  People who want to use it can _expect_ lynx to be slower!

>  The most important thing is:
>       does LYUCFullyTranslateString_1 reallocate the string being translated
>       (if the size of the translation matches the size of the original)?

It tries to avoid it when it is not necessary.  That's one reason why it
is complicated.  I have never tested how much it succeeds in that, but I
believe it does well.  If you really want to know - well you know about
ltrace. :)

> > > In order
> > > for hyphenation to work, the last word on each line, the one that doesn't
> > > fit on that line, should be hyphenated. In order for it to be hyphenated,
> > > it should be translated to the charset of the hyphenation rules (that are
> > > loaded at startup).
> > 
> > Then maybe the whole approach is flawed.
> 
>  Another variant (you mentioned it) - assume that the charset of the hy rules
> is the same as the display chset - but IMO this is less flexible (but more
> logical - it seems that the display chset is changed _VERY_ infrequently).

You're using linux.  Give --enable-font-switch a try!

> Then no word 
> translation will be necessary. Frankly speaking, I love this variant very much
> - do others (and do you in particular)?

I can't say I love *any* of this.  There are too many assumptions already -
don't just "assume" that things will hopefully match!

More flexible == good, bound to a specific display character set == bad.
I shouldn't get different (or wrong) results when I switch from one
d.c.s. (that supports a given language) to another one (that has all the
same necessary characters) because I reconnect from a different terminal etc.
I shouldn't have to recompile lynx or translate an external file and load
that etc. in order to get the "right" hyphenation with a different d.c.s.
I definitely should not get complete non-support (or breakage) when I
use UTF-8 as d.c.s.!

>  Hy rules are a plain text file, so they can be distributed in only one chset.
> Users with a display chset not matching the chset of the rules will be able to
> translate the hy rules to their display chset with any program (e.g. GNU recode).
> 
>  Now I decided to select this variant as the main one - at least someone who
> thinks this is inflexible can implement translation later.

I think it is too inflexible; especially given that translation mechanisms
are already present in lynx, this looks just like laziness.  Apparently
you are only interested in *your* charset, and don't care much about
Russian users that happen to use a different display character set.
Or about users that don't have a Cyrillic font at all, but can still
read the "default" Latin approximation.  Or about users with a font
with characters for several different scripts that can view Cyrillic and
various others at the same time using UTF-8.  By applying hyphenation to
the _characters_ (for which Unicode is our general representation) instead
of a specific _encoding_, you could provide Russian hyphenation for all of
them!  That (probably) requires changes in the way text is currently passed
around in lynx.  But then those changes should be made; it's
probably a good idea anyway to use (some form of) Unicode representation
more before text reaches HText.  If you add your code for the limited
approach now, that will just become more confusing to do later.
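
To sketch the direction I mean (hypothetical code - neither hyphenate_chars()
nor hyphenate_word() exists in lynx or libhnj, and the locale-based mbstowcs()
only stands in for whatever chartrans conversion would really be used): decode
the word into character values once, and key the patterns on those values
rather than on one particular byte encoding.

    #include <stdlib.h>
    #include <wchar.h>

    /* Hypothetical: mark break positions using patterns that are indexed
     * by character (Unicode) values.  Stubbed out here. */
    static void hyphenate_chars(const wchar_t *cp, size_t len, char *breaks)
    {
        size_t i;
        for (i = 0; i < len; i++)
            breaks[i] = 0;              /* real pattern lookup would go here */
    }

    /* One rule set then serves KOI8-R, ISO-8859-5, UTF-8, ... alike. */
    int hyphenate_word(const char *word, char *breaks)
    {
        wchar_t cp[64];
        size_t len = mbstowcs(cp, word, 64);    /* bytes -> characters */

        if (len == (size_t) -1 || len >= 64)
            return -1;                  /* undecodable or too long: skip word */
        hyphenate_chars(cp, len, breaks);
        return 0;
    }

The point is only that the pattern table is consulted with character values,
so the display character set stops mattering.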

A somewhat different topic: it is already bad enough that, although lynx
itself has all that generality of choosing and switching display character
sets, NLS message catalogs can only be used with one ("the right") d.c.s.
But that's just inherited from the gettext approach, not really a
limitation of our own making.  (We should automatically char-translate
messages as appropriate [but that's not so trivial as just calling some
translation function, because gettext() returns static strings].)  I don't
like to see more "features" that unnecessarily depend on the d.c.s. for
non-broken behavior like that.
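
(Just to illustrate that complication, a rough sketch - translate_to_dcs() is
a made-up stand-in for whatever chartrans call would be used: because
gettext() returns a pointer into its own static storage, any translation has
to work on a copy, and somebody then has to own and free that copy.)

    #include <libintl.h>
    #include <string.h>

    /* Hypothetical: translate a malloc'ed string to the display character
     * set, possibly reallocating it; returns the (possibly new) pointer. */
    extern char *translate_to_dcs(char *s);

    /* Unlike gettext(), the caller must free() the result. */
    char *gettext_dcs(const char *msgid)
    {
        char *copy = strdup(gettext(msgid));   /* don't touch the static string */
        return copy ? translate_to_dcs(copy) : NULL;
    }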

> > > But using this function, especially due to the fact that it uses
> > > dynamic allocation, will take a lot of CPU time.
> > 
> > Do you know how much it takes?
> 
>  It can take an arbitrarily long time due to dynamic allocation/deallocation,
> and will fragment the heap (so the entire lynx code will be slower due to
> dynamic (de)allocations with a fragmented heap).

In other words, wild guessing...  Look how much lynx already uses heap
memory.  Look at every single HTSprintf0, for example.

> > It can't be too bad, at least in normal cases.  We're running most
> > attribute strings that are handled in some way through it.
> 
>  Compared to direct in-place byte-to-byte table translation, it can be 100
> times slower - IMO malloc and free take the most time for 5-letter words'
> translation.

So what?  People running a bloated lynx with luxury features can expect
that.  Not that I think your factor of 100, even if realistic, has much
relevance for the actual impact of the code in LYUC*, when compared to
what the rest of lynx does.

> > > So I ask, what is the preferable way to get rid of dynamic allocation:
> > > 1) Use UC* tables directly (maybe hack them in order to do this)
> > > 2) Ensure (looking at the code) or hack LYUCFullyTranslateString_1 so it
> > > won't reallocate the buffer - so it will be possible to pass a pointer to
> > > static storage to it (but it will be slow due to the generality).
> > 
> > It's good enough for "normal" use.  I don't see why it's suddenly not fast
> > enough for _your_ purpose.
>  
>  Maybe guys with Alphas or P][ won't notice the performance decrease, but I
> expect the parsing+rendering speed of lynx to decrease by 3Kb/sec on the
> 5x86-133 I have.

It is a non-problem.  Look at the rest of the code.  A couple more mallocs per
line won't do that much.  Even for people who want to use your code.  I expect
that the impact of loading the tables and size bloat caused by that is more
important.

> > > PS: information about what characters are "human letters" and the
> > > "tolower" mapping will be included in the file with hyphenation rules,
> > > so the translation from one charset to another is the only performance
> > > problem I see.
> > 
> 
>  Here is more detailed info:
> 
>  Information about what characters are letters in a given charset is absent in
> the UC* tables, so we must have it. Also, since hyphenation rules are for
> lowercased text only, the information about the "tolower" mapping is necessary
> (which is also absent in the UC* tables). Given my assumptions, it seems to me
> that the approach is not flawed at all.
>  However flawed this approach is, that information must be
> supplied in order for the hy rules to work. But IMO the end user shouldn't care
> about this, since the hy dictionaries will be distributed separately from lynx
> (probably on the libhnj site) - so the end user will have to download one, copy
> the dictionary to some place, recode it (if necessary) and then point to that
> place in lynx.cfg. (The hy dict for US English will be shipped with lynx,
> with comments in lynx.cfg pointing to the location where other dictionaries
> are stored).

Where is that libhnj site?

>  I forgot to say that the info about "human letters" will be derived from the
> table of the "tolower" mapping, so the table of the "tolower" mapping will be
> the only information included in the hy rules. Currently the hyphenation file
> with hy rules allows comments (that start with '%'). The "tolower" mapping will
> be placed in the comments too. I decided to use the following syntax:
> 
> %! Aa Bb Xx Mm Cc
> %! Ee Pp Hh
> % more lines follow - mapping characters can be done in any order.
> % Mapping Information will be aggregated from several lines.
> 
> (that says that the lowercased version of symbol 'A' is 'a', 'B' -> 'b', etc.).
> Note: support for parsing this syntax will be added to libhnj.

You are reinventing the Unicode standard here (a small part of it).
I don't know it well, but I know that upper/lowercase mapping and all
kinds of character properties - including whether something is a letter -
are well defined there.  And everything is available in machine-readable
format.  You don't need to invent your own rules for what's a letter and
what isn't.
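
(For instance, the standard C wide-character functions already expose exactly
that classification and case data; a minimal, self-contained illustration -
standard C, not lynx code:)

    #include <locale.h>
    #include <wchar.h>
    #include <wctype.h>

    int main(void)
    {
        const wchar_t *word = L"HyPhEnAtIoN";
        const wchar_t *p;

        setlocale(LC_CTYPE, "");            /* pick up the user's locale */
        for (p = word; *p; p++)
            if (iswalpha((wint_t) *p))      /* "is a letter" comes from the library */
                putwchar(towlower((wint_t) *p));  /* and so does the case mapping */
        putwchar(L'\n');
        return 0;
    }

With a Russian locale the same two calls classify and lowercase Cyrillic
letters, so neither the letter set nor the tolower table needs to be spelled
out by hand in the pattern file.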

> > I think your approach is terribly flawed.
> 
>  I don't think so, due to the assumptions I made.

But the assumptions are wrong and unnecessary, or only necessary if
you _want_ to use a limited and limiting method.

> > Given charsets A and B, in the general case
> >  - You cannot assume that A can be translated to B at all.
> 
>  Let the user think about this.

That sounds like you want to provide a half-finished solution, and
disclaim any interest in completing it in the future.

> >  - You cannot assume that strings will stay the same length.
>  This is mandatory (the same length of strings).

Don't let facts get in the way...

But I assume your change of plan means you don't want to translate
anything any more (in addition to what lynx already does) after all,
so this may be irrelevant.

> >  - You cannot assume that translations are reversible without loss.
>  This is not a problem if we choose the variant with the chset of the hy rules
> matching the display chset.

So what do you do if they _don't_ match?  Use the rules anyway, and produce
completely wrong output?

> > Maybe you should translate your "rules", not the text.

I still think that; more specifically, they should be in (or, at some stage,
be transformable to) the UCS.

> > But having hyphenation rules bound to a specific character encoding is
> > flawed from the outset.  They should be expressed in terms of _characters_,
> > i.e. Unicode values.  Everything else is a hack.
> 
>  I don't think that Unicode is necessary (until the majority of files are
> Unicode'd, or even until the display chset is UTF-8 on the majority of
> computers).

All you are doing here seems to me like a giant step back into the past.

Hyphenation rules have to be language-specific, of course.  But on top of
that, in your approach, they will also be charset-specific.  Yet charset is
just one of many encoding schemes for (a sub-repertoire of) characters.
And all characters can be expressed in one "Universal Character Set",
ISO 10646 or (the same for our purpose here) Unicode.

Things like hyphenation logically operate on characters, not on some
ephemeral encoding of characters.  At least they should.  In full generality,
to cover N languages in M encodings (charsets), you need N x M different
sets of hyphenation rules (language is orthogonal to charset!).  If you
express everything in one universal character set, you need only N sets
of rules.

Of course we cannot (currently) translate all charsets to Unicode.  (But
hyphenation does not apply to CJK characters anyway, from what I gather.)
And in practice German is rarely written in Cyrillic letters, so it doesn't
make sense to include e.g. Cyrillic letter patterns in the set for German.
But there are several charsets that can be used and _are_ used for Russian
(as you know), English, German, and so on.  Putting your algorithm into
lynx - a program that already can deal with many character encodings, and
transform between them and Unicode - in a form that works only on one or
a few specific character encodings just doesn't make sense, in this day
and age.

So I repeat (with only spelling corrections :)):
> > Yes, that probably means to change the
> > whole chartrans thing, as to when/where things get transformed.  If you
> > try to add hyphenation support without that, you are trying to take the
> > easy way out which I predict won't work reliably.
> 
>  If it will work, it will work reliably.

It may work for you, as long as the hyphenation patterns happen to be in
a form that fits your display character set.  I wouldn't like to see
something so specialized added to lynx, or even provided as an extra
library by you (because it will require hooks in the lynx code in a
specific way), when the framework is already there to do it better.

Please take some time to figure out a better and more general approach
before you add more stuff that makes it still more difficult to do it right
later.  Please take some time to learn more about Unicode (you'll find
a lot of information on the web; it is not that hard to find).


   Klaus

