lynx-dev
lynx-dev hyphenation (was tech. question: translating strings)


From: Klaus Weide
Subject: lynx-dev hyphenation (was tech. question: translating strings)
Date: Mon, 6 Sep 1999 09:08:33 -0500 (CDT)

[ last part of a series of replies ]
On Sun, 5 Sep 1999, Vlad Harchev wrote:

>  I plan to add better support for hyphenation to lynx than it currently has 
> :).

It's an open question whether incomplete support for hyphenation, which
hyphenates wrongly unless a specific display character set is selected,
is better than no hyphenation at all.

Hyphenation by lynx isn't exactly a feature that many people are
missing.  That's my impression so far, based on interest expressed on
the list in response to your ideas (iirc, basically none or negative.)
I certainly don't miss it.

By the way (not that you said anything else) Lynx _does_ have "support
for hyphenation" already.  Support for author-provided hyphenation that
is, in the form of &shy; or equivalent.

>  And I don't wish to spend all my life on complete implementation of it.

You wouldn't have to.  Gross exaggeration.

>  But IMO the approach I described above is flexible enough.
>  At least something can be added later.

The problem is that fixing it later will be more difficult than
implementing it in a general way (more general than your immediate
needs) from the start.  Every patch added to lynx for some extra
feature that makes some assumptions binds the hands of other lynx
developers more, for changing the way things work later (especially
if the patches look like what you did to SGML.c).

For example, and specifically relevant to the topic of hyphenation,
HText_append* currently gets its input fed in the current_char_set
(i.e. already translated to the d.c.s.).  That need not remain so, in
fact it would be better IMO, for several reasons, to eventually feed
characters to the HText object in a 'standard' form (probably UTF-8).
Translation to the d.c.s. would then occur in GridText.c.  The four
UCStages kept in the HTParentAnchor object are already designed to
account for this variation of procedure.  Now if you add hyphenation
at the HText_append* level making the assumption that (charset of the
character stream)==d.c.s., and start writing code around this
assumption (including configuration, messages, documentation),
changing the assumption cannot be done without breaking your stuff.

Actually I feel I should start making those changes _now_, before
patches from you get added that make unwise assumptions.

> > Even if I accept your 1),2),3) for the sake of argument, the input string
> > length just isn't always the translated string length.  What makes you think
> > so?  You must have noticed that there are strings longer than 1 character
> > in the *.tbl files.  You must have seen "(c)" for a copyright character.
> > You should have seen how Cyrillic text appears in 7-bit Approximations.
> 
>  As for Cyrillic 7-bit approximations, it's quite unuseful with hyphenation,
> since english hyphenation rules (resources in english are very useful) will
> collide, so the user will turn it off anyway I think.

That argument is completely based on the assumption that hyphenation
gets applied _after_ translation to the d.c.s.  Which is exactly what
I am trying to tell you not to assume.  Of course the hyphenation should
be applied to the "real characters" (which would be Cyrillic characters
in this case), not to their ASCII replacement representation!

And that is one good reason why translation to the d.c.s. should be
deferred to a later stage, i.e. it should be done as late as possible
(GridText.c instead of SGML.c) so that various pieces of code that look
at the data stream can assume it is in a standard encoding.
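
To make the length point concrete, here is a minimal sketch in Python
(not lynx code).  The replacement table is hypothetical and only loosely
modelled on the multi-character strings in the *.tbl files, and
break_points stands in for the output of a real hyphenation algorithm.
It shows break positions being chosen on the real characters, with
translation to the display form happening last:

```python
# Hypothetical 7-bit replacement table; the entries are illustrative only
# and are not copied from any lynx *.tbl file.
SEVEN_BIT = {
    "\u00a9": "(c)",   # COPYRIGHT SIGN
    "\u0436": "zh",    # CYRILLIC SMALL LETTER ZHE
    "\u0449": "shch",  # CYRILLIC SMALL LETTER SHCHA
}


def to_display(text, table=SEVEN_BIT):
    """Translate text to its display form, one input character at a time."""
    return "".join(table.get(ch, ch) for ch in text)


def hyphenate(word, break_points):
    """Insert soft hyphens (U+00AD) at the given character offsets.

    break_points is an assumption: it stands in for whatever a real
    hyphenation algorithm would compute for this word.
    """
    pieces, last = [], 0
    for pos in sorted(break_points):
        pieces.append(word[last:pos])
        last = pos
    pieces.append(word[last:])
    return "\u00ad".join(pieces)


word = "\u0436\u0449"          # two Cyrillic letters
early = hyphenate(word, [1])   # break chosen between the two real characters
late = to_display(early)       # translation to the display form comes last

print(len(word), len(to_display(word)))
print(repr(late))
```

An offset computed on the two-character input would point into the
middle of "zh" if it were applied after translation, which is why the
order of the two steps matters.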

>  Obviously, performance won't depend on *FullyTranslate* speed with approach
> described at the beginning of the message.

So *FullyTranslate* is a non-issue.  Even if it wasn't, there was no
good reason for bringing up its speed in this context, without any
indication that it _is_ slow in normal use (I think it is reasonably
fast and does not unnecessarily allocate memory).

> > >[...] 
> > >  Another variant (you mentioned it) - assume that charset of the hy rules 
> > > is
> > > the same as display chset - but IMO this is less flexible (but more 
> > > logical - 
> > > seems that display chset is changed _VERY_ infrequently). 
> > 
> > You're using linux.  Give --enable-font-switch a try!
> 
>  I found it unstable (or that version of kernel console driver was
> unreliable), and I don't know any languages except English and Russian - that
> can be displayed in at the same time without changing d.c.s.

It depends on what kbd font files you have installed.  It works only
for some fonts, and just doesn't do anything if you switch to a d.c.s.
it doesn't know about.  So in that case you have to do the font loading
or other manipulation still externally - or if you can't, you shouldn't
have selected that d.c.s. in the first place.  Still it works well enough
for me in various situations (I know the limitations).  If it does not
work right for you in a situation where you think it should, report a
bug.  (I have some changes to UCAuto.c that should help.)

>  I plan to detect d.c.s changes to recalculate lookup tables, so no
> translation will be necessary. Will you use hyphenation? 

No, as far as I know now.  If you make it easy enough to apply the
necessary extra files, I will probably test it out of curiosity.
But I don't need it, don't really want it.  Why should I, lynx's text
display generally looks fine (or at least if it doesn't it's not the
fault of missing hyphenation, but mostly the fault of HTML (ab)use by
authors that has nothing to do with hyphenation).

> If not, I recommend
> to compile it in - with and and justification, lynx becomes a very good
> html->txt translator (we have stylesheets implementation pending for more
> flexibility), --with-backspaces complements this. At least I'll inform Linux
> Documentation Project coordinator about the lynx capabilities (they are using
> some stupid programs to translate sgml -> txt with backspaces).

Thanks, but I already have man and groff and various other text
processing tools (most of them unused).  Yeah, those LDP people are
probably stupid enough to use SGML tools for an SGML job, instead of
a text HTML browser, how could they?

Will Lynx be able to make toast and boil eggs too?

I didn't know we have stylesheets implementation pending, where does it
pend?

>  lynx has awful memory management IMO. I thought about rewriting it, but it uses
> dynamic allocation so frequently, that using chunks for HTLine content, etc,
> won't help (due to very intensive allocations for other purposes).
> 
>  Lynx takes as much memory as NS does. (After 5 hours of browsing, single
> instance takes 35 Mb of virtual memory - due to terrific memory
> fragmentation).

Time for you to compile with --enable-find-leaks then.  You should do
that anyway after making significant changes, or any changes that
use malloc etc. unless you are very sure you have not introduced memory
leaks.

I don't think I've ever seen something like lynx taking 35 Mb of
virtual memory - unless there was a good reason for it like having
several HUGE documents in memory cache at the same time.
Maybe you are using SOURCE_CACHE:MEMORY, and it doesn't clean up all
cached files.  And you are using color-style which is a memory hog
anyway.

Use ulimit to limit the memory available to lynx.  Then you'll find
out whether lynx really _needs_ that memory.

>  But info about lowercase/uppercase mapping is absent in the lynx.

So you have to incorporate it from somewhere.  You might as well use
the universal source then, instead of requiring each hyphenation file
provider to redo the work.
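
As an illustration of what a universal source buys you, here is a small
Python sketch (not lynx code).  It assumes the "universal source" means
the Unicode character data, which Python's str.lower() and unicodedata
module already embed; lower_for_lookup is a hypothetical helper name.

```python
import unicodedata


def lower_for_lookup(word):
    """Lowercase a word using the Unicode case mappings built into Python."""
    return word.lower()


# The same call covers Latin and Cyrillic letters alike, with no
# per-language or per-charset case table supplied by the caller.
for ch in ("\u00c4", "\u0401", "A"):
    low = lower_for_lookup(ch)
    print(unicodedata.name(ch), "->", unicodedata.name(low))
```

No hyphenation file has to say anything about case in this scheme; the
mapping is independent of both the language and the display character
set.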

>  Due to the syntax chosen, it will be somewhat difficult to handle d.c.s and
> hyrules utf8-encoded, so I won't add support for it right now (so the
> byte-to-byte mapping for "human letters" will be still mandatory, since
> chars that render into "(c)" are not "human letter"). 

All letters, in fact all characters in Unicode are human, with the
possible exception of Klingon and suchlike (which to my knowledge is
not yet official)...

> The
> thing that will be left to do is to write utf8 character gathering (in case 
> of utf8
> d.c.s), converting it to lowercase and then to hyrules charset.

I don't understand the details of what you're saying here.  Just
the notion of having a "hyrules charset" seems wrong (unless that's
a character encoding scheme that provides for all possible characters,
you know what I mean...)

>  I don't have time to implement complete thing (hacking libnhj will be
> necessary, shipping unicode tables will be required ...)
>  Anyway, I'll try to help people to solve their problems with hyphenation.
> English-speaking-or-reading-only people won't have any problems.

I never believe claims that such-and-such people will not have any 
problems.

>                                                              Though people
> that use documents with several (say) latin-1 encoded languages will be unable
> to use hyphenation at all (since hydict for only one of those languages can be
> loaded due to the fact that chsets are not disjoint), so they'll get incorrect
> hyphenation for words in other languages. To solve this problem, <span lang=x>
> must be used (it's hard to convince german writer to surround "debian" with
> <span lang=en></span>, thou' such words can be added to the hyphenation
> exceptions. My experience can tell that collisions will be unlikely, since
> hyphenation patterns are built by scanning a bunch of native-language
> documents, so probably "debian" and other english words won't be hyphenated
> at all with german hyrules).

You haven't looked at really multilingual texts, with more than a few
single words from a language other than the "main" one.  Such texts
are rare.  But lynx should support them, at least not mess them up,
when they do occur.  Authors of such pages will use LANG attributes if
they care about correct handling, since that is the HTML way of doing
it.  If they don't care, there isn't much lynx can do about it, except
allowing the user to switch between several assumptions.  For documents
where the author did care: even if hyphenation can be done only for one
language "at a time" (where "at a time" could mean for one document),
the hyphenation algorithm should at least be turned off in <SPAN
LANG=fr> text portions where the specified language differs from that
of the hyphenation rules (like this one)</SPAN>.
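
A rough sketch of that last rule, in Python rather than lynx's C and
with hypothetical names throughout (LangAwareText, rules_lang,
default_lang): track the LANG in effect while parsing, and mark each
text run as eligible for hyphenation only when it matches the language
of the loaded rules.

```python
from html.parser import HTMLParser


class LangAwareText(HTMLParser):
    """Collect text runs, each flagged with whether hyphenation may apply."""

    def __init__(self, rules_lang, default_lang):
        super().__init__()
        self.rules_lang = rules_lang        # language of the loaded rules
        self.lang_stack = [default_lang]    # language currently in effect
        self.runs = []                      # list of (text, may_hyphenate)

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lowercased; elements without LANG inherit.
        lang = dict(attrs).get("lang")
        self.lang_stack.append(lang or self.lang_stack[-1])

    def handle_endtag(self, tag):
        # Simplification: assumes well-nested markup without void elements.
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        self.runs.append((data, self.lang_stack[-1] == self.rules_lang))


parser = LangAwareText(rules_lang="de", default_lang="de")
parser.feed("Ein Text <SPAN LANG=fr>avec du fran\u00e7ais</SPAN> und mehr.")
parser.close()
for text, may_hyphenate in parser.runs:
    print(may_hyphenate, repr(text))
```

The sketch compares language tags exactly, so it would treat "de-AT" as
different from "de"; a real implementation would need a more careful
comparison.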

>  And IMO, as long as UTF8 is not widely used _in_documents_ (not on terminals),
> the problem with documents mixing several,say, latin-1 encoded languages will
> remain.

What does UTF-8 in documents have to do with mixing several languages
that use the same repertoire in one document?  Nothing as far as I
can tell.  UTF-8 is just a transmission format.  And its slow rate
of adoption in the outside world has not kept lynx from using it
internally.

Be ready for the future.  Lynx has been for years, in some respects.
Maybe the world will catch up sometime.

> > And in practice German is rarely written in Cyrillic letters, so it doesn't
> > make sense to include e.g. Cyrillic letter patterns in the set for German.
> 
>  As I said, the hyrules for these particular languages can be concatenated to
> get hyrules for Cyrillic and German - they have disjoint set of character
> codes.

Merely an accident (as said elsewhere), and does it really work in your
approach unless you have a display character set with both LATIN
CAPITAL LETTER A WITH DIAERESIS and CYRILLIC CAPITAL LETTER IO?
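
The question can be checked mechanically.  Below is a small Python
sketch (not lynx code) in which Python's codecs stand in for lynx's
display character sets and displayable is a hypothetical helper:

```python
def displayable(text, charset):
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


A_DIAERESIS = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS
IO = "\u0401"           # CYRILLIC CAPITAL LETTER IO

# Python codec names are used here as stand-ins for display character sets.
for charset in ("iso-8859-1", "koi8-r", "utf-8"):
    print(charset, displayable(A_DIAERESIS, charset), displayable(IO, charset))
```

Of the three, only UTF-8 can represent both letters, so concatenated
German and Cyrillic rules keyed by d.c.s. byte values would each cover
only one of the two letters under either 8-bit charset.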

> >[...]
> 
>  So, I'll add support for any d.c.s other than utf8 and like, provided 
> chset of hyrules is not utf8 too.

I don't exactly understand the meat of this promise, too many "other
than" and "like" and "provided".

>  As I remember, you have to post some patch to lynx too :)

Yea yea.  You're just keeping me from it.:)

   Klaus

