Re: lynx-dev hyphenation (was tech. question: translating strings)


From:	Vlad Harchev
Subject:	Re: lynx-dev hyphenation (was tech. question: translating strings)
Date:	Tue, 7 Sep 1999 16:18:20 +0500 (SAMST)
On Mon, 6 Sep 1999, Klaus Weide wrote:

> [ last part of a series of replies ]
> On Sun, 5 Sep 1999, Vlad Harchev wrote:
> 
> >  I plan to add better support for hyphenation to lynx than it currently
> > has :).
> 
> It's an open question whether incomplete support for hyphenation, that
> relies on a specific display character setting or it will hyphenate
> wrong, is better than no hyphenation at all.
> 
> Hyphenation by lynx isn't exactly a feature that many people are
> missing.  That's my impression so far, based on interest expressed on
> the list in response to your ideas (iirc, basically none or negative.)
> I certainly don't miss it.
>
   Yes, there was no response from others.
 
> By the way (not that you said anything else) Lynx _does_ have "support
> for hyphenation" already.  Support for author-provided hyphenation that
> is, in the form of &shy; or equivalent.

   Of course, but it seems that it can't be compared to mine in visual results
and flexibility :)

>[...] 
> The problem is that fixing it later will be more difficult than
> implementing it in a general way (more general than your immediate
> needs) from the start.  Every patch added to lynx for some extra
> feature that makes some assumptions binds the hands of other lynx
> developers more, for changing the way things work later (especially
> if the patches look like what you did to SGML.c).

 Sorry for SGML.c, but it seems there is no better way to do this (just shorten
the macro names, write more generic macros - IMO only that can be done).
 
> For example, and specifically relevant to the topic of hyphenation,
> HText_append* currently gets its input fed in the current_char_set
> (i.e. already translated to the d.c.s.).  That need not remain so, in
> fact it would be better IMO, for several reasons, to eventually feed
> characters to the HText object in a 'standard' form (probably UTF-8).

 Wow, you plan to make it even slower :) But what for? And please finish
what you started (or post it here at least).

> Translation to the d.c.s. would then occur in GridText.c.  The four
> UCStages kept in the HTParentAnchor object are already designed to
> account for this variation of procedure.  Now if you add hyphenation
> at the HText_append* level making the assumption that (charset of the
> character stream)==d.c.s., and start writing code around this
> assumption (including configuration, messages, documentation),
> changing the assumption cannot be done without breaking your stuff.
> 
> Actually I feel I should start making those changes _now_, before
> patches from you get added that make unwise assumptions.

 It's up to you, but what for, again? IMO lynx is still missing a lot of
user-level features, and you plan to do some internal redesign; any user
will notice that lynx has become slower (or won't notice it if he/she has a
good CPU). Also, say, Russian users will notice that lynx uses much more
memory, since Russian text will be stored as utf8.
 And IMO the d.c.s. is a rather permanent setting (it isn't changed very
often) - why translate some c.s. to utf8 and then to another c.s. when
displaying? I don't see the goal of keeping HText in utf8. What is it,
again?
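
 To put a rough number on the memory point: the sketch below is not lynx
code, and the names utf8_len and utf8_size are made up for illustration. It
only counts how many bytes UTF-8 needs per code point. Cyrillic letters
(U+0400..U+04FF) take 2 bytes each, against 1 byte in an 8-bit charset such
as KOI8-R, while ASCII stays at 1 byte, so real text grows by less than a
factor of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs for one code point (valid input assumed). */
static size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;               /* ASCII */
    if (cp < 0x800)
        return 2;               /* includes Cyrillic, U+0400..U+04FF */
    if (cp < 0x10000)
        return 3;
    return 4;
}

/* Total UTF-8 size, in bytes, of a run of n code points. */
static size_t utf8_size(const uint32_t *cps, size_t n)
{
    size_t i, total = 0;

    for (i = 0; i < n; i++)
        total += utf8_len(cps[i]);
    return total;
}
```

 A five-letter Russian word therefore occupies 10 bytes in utf8 instead of 5
in an 8-bit d.c.s.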

>[...] 
> That argument is completely based on the assumption that hyphenation
> gets applied _after_ translation to the d.c.s.  Which is exactly what
> I am trying to tell you not to assume.  Of course the hyphenation should
> be applied to the "real characters" (which would be Cyrillic characters
> in this case), not to their ASCII replacement representation!

 I don't see any advantages of this (except the problems with 2 words you
called "useful").

> And that is one good reason why translation to the d.c.s. should be
> deferred to a later stage, i.e. it should be done as late as possible
> (GridText.c instead of SGML.c) so that various pieces of code that look
> at the data stream can assume it is in a standard encoding.

 Better to have it in wide characters rather than in utf8 then. But I don't see
any use for it, really (it would be useful for a generalized 'isalpha()',
'tolower()', etc., but IMO those are used only when searching for strings).

>[...]
> > > You're using linux.  Give --enable-font-switch a try!
> > 
> >  I found it unstable (or that version of the kernel console driver was
> > unreliable), and I don't know any languages except English and Russian,
> > which can be displayed at the same time without changing the d.c.s.
> 
> It depends on what kbd font files you have installed.  It works only
> for some fonts, and just doesn't do anything if you switch to a d.c.s.
> it doesn't know about.  So in that case you have to do the font loading
> or other manipulation still externally - or if you can't, you shouldn't
> have selected that d.c.s in the first place.  Still it works well enough
> for me in various situations (I know the limitations).  If it does not
> work right for you in a situation where you think it should, report a
> bug.  (I have some changes to UCAuto.c that should help.)
>

 I had the following problems:
 When exiting from lynx, something went wrong with the console driver: each
letter was doubled in height (i.e. each letter occupied 2 rows). When I invoked
'reset', the height of each letter returned to 1 row, but only the upper half
of the display was used, while the lower half kept changing with some strange
stuff. I had to reboot linux to fix this (I didn't try to set the console
dimensions to match the real ones). And I have no reason to change fonts:
Russian, pseudographics and ASCII symbols fit in one font.

> >  I plan to detect d.c.s changes to recalculate lookup tables, so no
> > translation will be necessary. Will you use hyphenation? 
> 
> No, as far as I know now.  If you make it easy enough to apply the
> necessary extra files, I will probably test it out of curiosity.
> But I don't need it, don't really want it.  Why should I, lynx's text
> display generally looks fine (or at least if it doesn't it's not the
> fault of missing hyphenation, but mostly the fault of HTML (ab)use by
> authors that has nothing to do with hyphenation).

 IMO it's better to use lynx with hyphenation - then its output is visually very
attractive. And IMO it will help to produce better rendering of tables (to be
implemented).

> > If not, I recommend
> > to compile it in - with it and justification, lynx becomes a very good
> > html->txt translator (we have a stylesheets implementation pending for more
> > flexibility), and --with-backspaces complements this. At least I'll inform
> > the Linux Documentation Project coordinator about lynx's capabilities (they
> > are using some stupid programs to translate sgml -> txt with backspaces).
> 
> Thanks, but I already have man and groff and various other text
> processing tools (most of them unused).  Yeah, those LDP people are
> probably stupid enough to use SGML tools for an SGML job, instead of
> a text HTML browser, how could they?

 I meant that that program doesn't do justification (or hyphenation, of
course). It's probably a perl script - I don't remember - so the produced files
look very ugly.

> >  Lynx takes as much memory as NS does. (After 5 hours of browsing, a single
> > instance takes 35 Mb of virtual memory - due to terrific memory
> > fragmentation).
> 
> Time for you to compile with --enable-find-leaks then.  You should do
> that anyway after making significant changes, or any changes that
> use malloc etc. unless you are very sure you have not introduced memory
> leaks.

 When loading a 900Kb file as the main page, with and without source_cache, the
VSS is 29Mb. (The lss-disabled lynx's VSS is 4Mb on this file.)
 Yes, this is probably due to leaks (I tried the lss-disabled lynx for the 1st
time on that file). Style changes can't take that much IMO.

>[...] 
> So you have to incorporate it from somewhere.  You might as well use
> the universal source then, instead of requiring each hyphenation file
> provider to redo the work.

 IMO it's easier for the provider to type
Aa Bb Cc Dd Ee <etc>, rather than to find out the unicode values for each of
the characters and to write special awk and perl scripts to translate the
TeX hyrules file.
 And anyway, the upper->lower and 'isalpha' mappings would have to be provided
somehow in the unicode case.
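
 A minimal sketch of how such an "Aa Bb Cc" line could be turned into lookup
tables, assuming an 8-bit charset where every letter is exactly one byte.
This is not code from my patch; hy_lower, hy_alpha and hy_load_pairs are
names invented for the example.

```c
static unsigned char hy_lower[256];   /* byte -> its lowercase byte */
static unsigned char hy_alpha[256];   /* nonzero if the byte is a letter */

static int hy_is_blank(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

/*
 * Build both tables from a blank-separated list of upper/lower pairs,
 * e.g. "Aa Bb Cc".  Returns the number of pairs read, or -1 if an
 * entry is not exactly two bytes long.
 */
static int hy_load_pairs(const char *pairs)
{
    const unsigned char *p = (const unsigned char *) pairs;
    int i, n = 0;

    for (i = 0; i < 256; i++) {       /* default: identity, not a letter */
        hy_lower[i] = (unsigned char) i;
        hy_alpha[i] = 0;
    }
    while (*p != '\0') {
        if (hy_is_blank(*p)) {
            p++;
            continue;
        }
        if (p[1] == '\0' || hy_is_blank(p[1]))
            return -1;                /* lone byte: malformed entry */
        if (p[2] != '\0' && !hy_is_blank(p[2]))
            return -1;                /* entry longer than two bytes */
        hy_lower[p[0]] = p[1];
        hy_alpha[p[0]] = 1;
        hy_alpha[p[1]] = 1;
        p += 2;
        n++;
    }
    return n;
}
```

 After a successful call, hy_lower[c] does the job of 'tolower' and
hy_alpha[c] the job of 'isalpha' for that charset, with no unicode values
involved.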

> > The
> > thing that will be left to do is to write utf8 character gathering (in case
> > of utf8 d.c.s), converting it to lowercase and then to hyrules charset.
> 
> I don't understand the details of what you're saying here.  Just
> the notion of having a "hyrules charset" seems wrong (unless that's
> a character encoding scheme that provides for all possible characters,
> you know what I mean...)

 "gathering" means calculating the unicode character code (ie 32 bit value
from multibyte utf8-encoded character).
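
 As an illustration of that "gathering" step, here is a small sketch; the
function name utf8_gather is hypothetical and this is not code from lynx. It
follows the standard UTF-8 bit layout and, to stay short, does not reject
overlong encodings or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence starting at s, with at most len bytes
 * available.  Stores the code point in *out and returns the number of
 * bytes consumed, or 0 if the sequence is truncated or malformed.
 */
static size_t utf8_gather(const unsigned char *s, size_t len, uint32_t *out)
{
    uint32_t cp;
    size_t need, i;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                    /* 0xxxxxxx: ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2 bytes */
        cp = s[0] & 0x1F;
        need = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3 bytes */
        cp = s[0] & 0x0F;
        need = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4 bytes */
        cp = s[0] & 0x07;
        need = 4;
    } else {
        return 0;                         /* stray continuation byte etc. */
    }
    if (len < need)
        return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)        /* must be 10xxxxxx */
            return 0;
        cp = (cp << 6) | (uint32_t) (s[i] & 0x3F);
    }
    *out = cp;
    return need;
}
```

 For example, the two bytes 0xD0 0x96 are gathered into U+0416 (CYRILLIC
CAPITAL LETTER ZHE).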

> 
> >  I don't have time to implement the complete thing (hacking libhnj will be
> > necessary, shipping unicode tables will be required ...)
> >  Anyway, I'll try to help people to solve their problems with hyphenation.
> > English-speaking-or-reading-only people won't have any problems.
> 
> I never believe claims that such-and-such people will not have any 
> problems.

 But it seems my statement is correct.
 
> >                                                              Though people
> > that use documents with several (say) latin-1 encoded languages will be
> > unable to use hyphenation at all (since the hydict for only one of those
> > languages can be loaded due to the fact that chsets are not disjoint), so
> > they'll get incorrect hyphenation for words in other languages. To solve
> > this problem, <span lang=x> must be used (it's hard to convince a German
> > writer to surround "debian" with <span lang=en></span>, though such words
> > can be added to the hyphenation exceptions. My experience tells me that
> > collisions will be unlikely, since hyphenation patterns are built by
> > scanning a bunch of native-language documents, so probably "debian" and
> > other English words won't be hyphenated at all with German hyrules).
> 
> You haven't looked at really multilingual texts, with more than a few
> single words from a different than the "main" language.  Such texts
> are rare.  But lynx should support them, at least not mess them up,
> when they do occur.  Authors of such pages will use LANG attributes if
> they care about correct handling, since that is the HTML way of doing
> it.  If they don't care, there isn't much lynx can do about it, except
> allowing the user to switch between several assumptions.  For documents
> where the author did care: even if hyphenation can be done only for one
> language "at a time" (where "at a time" could mean for one document),
> the hyphenation algorithm should at least be turned off in <SPAN
> LANG=fr> text portions where the specified language differs from that
> of the hyphenation rules (like this one)</SPAN>.

 I plan to support "lang" attribute.
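
 A rough sketch of the check this needs, under the assumption that only the
primary language subtag matters (so "en-US" counts as "en"). The function
name lang_matches is made up for the example; it is not from lynx or from my
patch.

```c
#include <ctype.h>

/*
 * Return 1 if the LANG attribute value `tag` (e.g. "fr", "en-US") has
 * the same primary subtag as `rules_lang`, the language of the loaded
 * hyphenation rules; return 0 otherwise.  Comparison ignores case.
 */
static int lang_matches(const char *tag, const char *rules_lang)
{
    while (*tag != '\0' && *tag != '-' &&
           *rules_lang != '\0' && *rules_lang != '-') {
        if (tolower((unsigned char) *tag) !=
            tolower((unsigned char) *rules_lang))
            return 0;
        tag++;
        rules_lang++;
    }
    /* both primary subtags must end at the same position */
    return (*tag == '\0' || *tag == '-') &&
           (*rules_lang == '\0' || *rules_lang == '-');
}
```

 Hyphenation would then simply be switched off for a text portion whose
LANG value does not match the language of the loaded hyrules.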

> >  And IMO, as long as UTF8 is not widely used _in_documents_ (not on
> > terminals), the problem with documents mixing several, say, latin-1
> > encoded languages will remain.
> 
> What does UTF-8 in documents have to do with mixing several languages
> that use the same repertoire in one document?  Nothing as far as I
> can tell.  UTF-8 is just a transmission format.  And its slow rate
> of adoption in the outside world has not kept lynx from using it
> internally.

 I'm glad that you understand that UTF-8 (and UCS*) doesn't have anything to do
with "mixing several languages that use the same repertoire in one document"
(I thought you thought that this was a solution). The 'lang=' is for solving
this. Why do you push "unicode" everywhere?

> Be ready for the future.  Lynx has been for years, in some respects.
> Maybe the world will catch up sometime.
> 
> > > And in practice German is rarely written in Cyrillic letters, so it
> > > doesn't make sense to include e.g. Cyrillic letter patterns in the set
> > > for German.
> > 
> >  As I said, the hyrules for these particular languages can be concatenated
> > to get hyrules for Cyrillic and German - they have disjoint sets of
> > character codes.
> 
> Merely an accident (as said elsewhere), and does it really work in your
> approach unless you have a display character set with both LATIN
> CAPITAL LETTER A WITH DIAERESIS and CYRILLIC CAPITAL LETTER IO?

 I assume you mean these letters have equal char. codes in the d.c.s.

 If I were encountering such documents, I'd compose or choose another font -
that means that these 2 chars would have different character codes in that
d.c.s. Or (another, loser's solution) - use the hyrules for only one of the
languages. Or (good hacker's solution) - fix the code that deals with
hyphenation (i.e. my code). Or (TeX user's solution) - don't do anything, since
Russian and German hyrules won't collide - in order for them to collide, at
least 2 characters of either language must be present in a pattern (frankly
speaking, this is incorrect, since one Russian hydictionary uses patterns
with 1 letter, but there are other dictionaries).
So, I rely on the 'good hackers'.
 As you saw from the lynx.cfg setting I plan to introduce, with domain name
matching and file matching, the problem can be solved for www-based docs if
one language dominates over the others.

 But let's talk about how complete and flexible your TRST is :)
 In general, neither you nor I have time to cover all cases.

> > >[...]
> > 
> >  So, I'll add support for any d.c.s other than utf8 and like, provided
> > chset of hyrules is not utf8 too.
> 
> I don't exactly understand the meat of this promise, too many "other
> than" and "like" and "provide".

 "and like" means CJK texts (hyphenation doesn't make sense for J, but for C
and K I don't know). As for utf8-encoded hyrules - the hyphenation simply
won't work, or the dictionary won't be loaded by libhnj. In other words, each
single byte in the hyrules denotes a single "human letter", and each single
byte in the d.c.s. denotes a single "human letter" (and not part of a letter) -
to make direct table-driven translation possible.
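
 To show what "direct table-driven translation" could look like, here is a
minimal sketch. It assumes two hypothetical 256-entry arrays giving the
unicode value of every byte in the d.c.s. and in the hyrules charset (0
meaning "unassigned"); the names build_xlat and translate_word are invented
for the example and are not lynx or libhnj functions. The table would be
rebuilt whenever the d.c.s. changes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fill xlat[] so that xlat[b] is the hyrules-charset byte for d.c.s.
 * byte b, or 0 if the letter does not exist in the hyrules charset.
 */
static void build_xlat(const uint16_t dcs_to_uni[256],
                       const uint16_t hy_to_uni[256],
                       unsigned char xlat[256])
{
    int d, h;

    for (d = 0; d < 256; d++) {
        xlat[d] = 0;
        if (dcs_to_uni[d] == 0)
            continue;                 /* unassigned in the d.c.s. */
        for (h = 1; h < 256; h++) {
            if (hy_to_uni[h] == dcs_to_uni[d]) {
                xlat[d] = (unsigned char) h;
                break;
            }
        }
    }
}

/*
 * Translate an n-byte word byte by byte.  Returns 1 on success, or 0
 * if some byte has no equivalent (the word is then left unhyphenated).
 */
static int translate_word(const unsigned char xlat[256],
                          const unsigned char *in, size_t n,
                          unsigned char *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (xlat[in[i]] == 0)
            return 0;
        out[i] = xlat[in[i]];
    }
    return 1;
}
```

 The per-word cost is one array lookup per byte, which only works because
one byte is one letter on both sides; with utf8 on either side this table
cannot be used as is.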

> >  As I remember, you have to post some patch to lynx too :)
> 
> Yea yea.  You're just keeping me from it.:)

 Yes, our chats are becoming too long.. Please, more conclusions next time.
 
 And I ask you to finish TRST - add support for lss (at least write the code
without extensive testing).

>    Klaus
> 

To lynx-dev: sorry for the huge message.

 Best regards,
  -Vlad