
Re: lynx-dev tech. question: translating strings to different charsets


From: Vlad Harchev
Subject: Re: lynx-dev tech. question: translating strings to different charsets
Date: Fri, 3 Sep 1999 15:17:18 +0500 (SAMST)

On Thu, 2 Sep 1999, Klaus Weide wrote:

> On Fri, 3 Sep 1999, Vlad Harchev wrote:
> 
> > On Wed, 1 Sep 1999, Klaus Weide wrote:
> > 
> > > On Thu, 2 Sep 1999, Vlad Harchev wrote:
> > > 
> > > >  I started implementing support for hyphenation. I need to know how I
> > > > can translate a string from the current charset of the document to
> > > > some other, given by chset handle (don't want to dig through lynx
> > > > headers to discover this).
> > > > No entity conversion desired.
> > > 
> > > The most complete function for that is in LYCharUtils.c,
> > > LYUCFullyTranslateString_1 or one of the wrappers around it
> > > (or make a new one).
> > > 
> > > I hope you know what you mean with "current charset of the document".
> > 
> >  OK, now I understand that I will need the translation from the display
> > charset to some charset (the name of which will be known after reading
> > lynx.cfg).
> >  I checked LYUCFullyTranslateString_1 - and was very upset. 
> 
> You are _upset_?  You want to add a luxury item like hyphenation support
> and then you get _upset_ when you find out it's not as convenient as you
> thought?  I hope you're not serious...

 I was really upset (seriously) by the ratio between the performance of an
in-place byte-to-byte table translation and that of the only translation
currently possible, LYUCFullyTranslateString_1 - not by the inconvenience.

 In implementing hyphenation, I make the following assumptions:
1) CJK texts needn't be hyphenated
2) The display charset is not utf8 (in its real sense - no multibyte chars
   present in output) mostly due to 1)
3) Input is not multibyte text mostly due to 1)

 Due to 1), 2) and 3), I assume that the size of the translated string won't
 change - so dynamic allocation won't be necessary. Correct me if I'm wrong.
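
 Under these assumptions, an in-place byte-to-byte table translation could
look roughly like the sketch below. The names xlat_init_identity and
xlat_inplace are hypothetical, not lynx functions, and filling the table with
a real charset mapping (e.g. derived from the UC* tables) is not shown.

```c
#include <stddef.h>

/* Hypothetical helper (not part of lynx): start from the identity mapping,
 * so that bytes without an explicit entry translate to themselves. */
static void xlat_init_identity(unsigned char table[256])
{
    int i;

    for (i = 0; i < 256; i++)
        table[i] = (unsigned char) i;
}

/* Translate buf in place: one byte in, one byte out.
 * The length never changes, so no allocation is needed and the
 * buffer may live in static or stack storage. */
static void xlat_inplace(unsigned char *buf, size_t len,
                         const unsigned char table[256])
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}
```

 This only works while both charsets are single-byte; as soon as one side is
multibyte, the output length can differ and the scheme breaks down.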
 
 The most important thing is:
        does LYUCFullyTranslateString_1 reallocate the string being translated
        (if the size of the translation matches the size of the original)?

> > In order for
> > hyphenation to work, the last word on each line, that doesn't fit on
> > that line, should be hyphenated. In order for it to be hyphenated, it
> > should be translated to the charset of the hyphenation rules (that are
> > loaded at startup).
> 
> Then maybe the whole approach is flawed.

 Another variant (you mentioned it) - assume that the charset of the hy rules
is the same as the display chset - but IMO this is less flexible (though more
logical - it seems that the display chset is changed _VERY_ infrequently).
Then no word translation will be necessary. Frankly speaking, I like this
variant very much - do others (and you in particular)?
 The hy rules are a plain text file, so they can be distributed in only one
chset. Users whose display chset doesn't match the chset of the rules will be
able to translate the hy rules to their display chset with any program (e.g.
GNU recode).

 I have now decided to make this variant the main one - someone who thinks
this is inflexible can at least implement translation later.

> > But using this function, especially due to the fact that it uses
> > dynamic allocation, will take a lot of CPU time. 
> 
> Do you know how much it takes?

 It can take an arbitrarily long time due to dynamic allocation/deallocation,
and it will fragment the heap (so the entire lynx code will be slower due to
dynamic (de)allocations with a fragmented heap).

> It can't be too bad, at least in normal cases.  We're running most
> attribute strings that are handled in some way through it.

 Compared to direct in-place byte-to-byte table translation, it can be 100
times slower - IMO malloc and free take most of the time when translating
5-letter words.
 
> > So I ask, what is the
> > preferable way to get rid of dynamic allocation:
> > 1) Use UC* tables directly (maybe hack them in order to do this)
> > 2) Ensure (looking at the code) or hack LYUCFullyTranslateString_1 so it
> > won't reallocate the buffer - so it will be possible to pass a pointer to
> > static storage to it (but it will be slow due to the generality).
> 
> It's good enough for "normal" use.  I don't see why it's suddenly not fast
> enough for _your_ purpose.
 
 Maybe guys with Alphas or P][ won't notice a performance decrease, but I
expect the parsing+rendering speed of lynx to decrease by 3Kb/sec on the
5x86-133 I have.

> > PS: information about what characters are "human letters" and the
> >  "tolower" mapping will be included in the file with hyphenation rules,
> >  so the translation from one charset to another is the only performance
> >  problem I see.
> 

 Here is more detailed info:

 Information about which characters are letters in a given charset is absent
from the UC* tables, so we must supply it. Also, since hyphenation rules are
for lowercased text only, the "tolower" mapping is necessary (and it is also
absent from the UC* tables). Given my assumptions, it seems to me that the
approach is not flawed at all.
 Independent of how terribly this approach is flawed, that information must be
supplied in order for the hy rules to work. But IMO the end user shouldn't
care about this, since the hy dictionaries will be distributed separately from
lynx (probably on the libhnj site) - so the end user will have to download a
dictionary, copy it to some place, recode it (if necessary) and then point to
that place in lynx.cfg. (The hy dict for US English will be shipped with lynx,
with comments in lynx.cfg pointing to the location where other dictionaries
are stored). 

 I forgot to say that the "human letters" info will be derived from the
"tolower" mapping table, so that table will be the only information added to
the hy rules. Currently the hyphenation rules file allows comments (which
start with '%'). The "tolower" mapping will be placed in comments too. I
decided to use the following syntax:

%! Aa Bb Xx Mm Cc
%! Ee Pp Hh
% more lines follow - mapping characters can be done in any order.
% Mapping Information will be aggregated from several lines.

(this says that the lowercased version of symbol 'A' is 'a', 'B' -> 'b', etc).
Note: the support for parsing this syntax will be added to libhnj.
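
 A minimal sketch of how such "%!" lines could be parsed, assuming a
single-byte charset and exactly two bytes per pair. The struct hy_case_map and
both functions are hypothetical names for illustration; this is not the libhnj
code. The "human letters" set is derived from the same pairs, as described
above.

```c
/* Hypothetical structure (not libhnj's): per-byte case info for one charset. */
struct hy_case_map {
    unsigned char lower[256];     /* lower[c] = lowercase of c (identity by default) */
    unsigned char is_letter[256]; /* nonzero if c appeared in any "%!" pair */
};

static void hy_case_map_init(struct hy_case_map *m)
{
    int i;

    for (i = 0; i < 256; i++) {
        m->lower[i] = (unsigned char) i;
        m->is_letter[i] = 0;
    }
}

static int hy_is_blank(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Feed one line of the rules file.
 * Lines that do not start with "%!" are ignored (returns 0).
 * Returns the number of pairs read, or -1 on a malformed pair
 * (pairs read before the malformed one stay in the map). */
static int hy_case_map_add_line(struct hy_case_map *m, const char *line)
{
    const unsigned char *p = (const unsigned char *) line;
    int pairs = 0;

    if (p[0] != '%' || p[1] != '!')
        return 0;
    p += 2;
    for (;;) {
        while (hy_is_blank(*p))
            p++;
        if (*p == '\0')
            break;
        /* a pair is an uppercase byte immediately followed by its lowercase */
        if (p[1] == '\0' || hy_is_blank(p[1]))
            return -1;
        m->lower[p[0]] = p[1];
        m->is_letter[p[0]] = 1;
        m->is_letter[p[1]] = 1;
        p += 2;
        pairs++;
    }
    return pairs;
}
```

 Calling hy_case_map_add_line on every line of the file aggregates the mapping
from several lines; ordinary '%' comments and the rules themselves are skipped.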

> I think your approach is terribly flawed.

 I don't think so, given the assumptions I made.

> Given charsets A and B, in the general case
>  - You cannot assume that A can be translated to B at all.

 Let the user think about this.
 
>  - You cannot assume that strings will stay the same length.
 This is mandatory (strings must keep the same length).
>  - You cannot assume that translations are reversible without loss.
 This is not a problem if we choose the variant where the chset of the hy
rules matches the display chset. 

> Maybe you should translate your "rules", not the text.
> 
> But having hyphenation rules bound to a specific character encoding is
> flawed from the outset.  They should be expressed in terms of _characters_,
> i.e. Unicode values.  Everything else is a hack.

 I don't think that Unicode is necessary (until the majority of files are in
Unicode, or even until the display chset is UTF-8 on the majority of
computers).

> That means you should apply them to the text strings ("words") while they
> are in a compatible encoding.   Yes, that probably means to change the
> whole chartrans thing, as to when/where things get transformed.  If you
> try to add hyphenation support without that, you are trying to take the
> easy way out which I predict won't work reliably.

 If it works, it will work reliably.

>    Klaus
> 

 Best regards,
  -Vlad

