lynx-dev

Re: lynx-dev hyhenation (was tech. question: translating strings)


From: Vlad Harchev
Subject: Re: lynx-dev hyhenation (was tech. question: translating strings)
Date: Thu, 9 Sep 1999 09:44:28 +0500 (SAMST)

On Tue, 7 Sep 1999, Klaus Weide wrote:

> On Tue, 7 Sep 1999, Vlad Harchev wrote:
> 
> > On Mon, 6 Sep 1999, Klaus Weide wrote:
> 
> >  Libhnj builds a finite state machine while reading hyrules. After they are
> > read, that state machine is used for hyphenation. Obviously, the characters
> > of the word become the 'input' for this state machine. A 'pattern' is
> > associated with each state - it's 'yielded' to aggregate into the info about
> > how the given word can be hyphenated.
> >  Using utf for non-English languages that don't use Latin letters, such as
> > Russian, will increase the number of states in that state machine by the
> > length in bytes of the utf8-encoded Russian letter, or
> 
> So is it possible at all or not, to apply the algorithm in utf-8 form?
> In some other messages I got the impression that it wasn't.  I am just
> asking for clarification, not saying it would be a good idea to do it
> that way.

 Yes, it's possible to apply it in utf8 form too (obviously the hyrules must be
in utf8, or they must be converted to utf8 before building the state machine).
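
 To illustrate the cost, here is a minimal sketch that counts states when the
same strings are fed in as code points versus utf8 bytes. It uses a plain trie
rather than the real libhnj automaton, and the three fragments are made up for
the example (they are not real hyphenation patterns), so treat the numbers as
an indication only.

```python
def count_trie_states(words, as_bytes):
    """Count the nodes of a trie whose edges are UTF-8 bytes or code points."""
    root = {}
    states = 1  # the root state
    for word in words:
        # Iterating over bytes yields ints; iterating over str yields characters.
        symbols = word.encode("utf-8") if as_bytes else word
        node = root
        for sym in symbols:
            if sym not in node:
                node[sym] = {}
                states += 1
            node = node[sym]
    return states


# Hypothetical fragments, chosen only to show the effect.
patterns = ["на", "но", "по"]

print(count_trie_states(patterns, as_bytes=False))  # 6
print(count_trie_states(patterns, as_bytes=True))   # 9
```

 The byte-level machine needs more states for the same input, though shared
lead bytes keep the growth below a full doubling here.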

> >  if wide character strings will be used, the input to this state machine
> > will be int's - so it should be hacked.
> 
> I still think that would be the best way in which the algorithm should be
> applied.  Not necessarily for storing the rules, or for representing text
> within lynx (except temporarily while applying the rules).

 Maybe ... But looking up various info about characters (what is the lowercase
form of a given char, is the wide character with a given code a letter) is
very difficult. What data structures do you recommend? Maybe hashtables keyed
by the code of the character, so that the hash function is just (x&255), with
a singly-linked list in each bucket?
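
 A minimal sketch of that lookup structure, assuming 256 buckets selected by
(x&255). The class and method names are invented for illustration, and a
Python list stands in for the singly-linked chain that C code would use.

```python
class CharInfoTable:
    """Per-character info in 256 buckets indexed by (code & 255)."""

    def __init__(self):
        self.buckets = [[] for _ in range(256)]

    def add(self, code, lower=None):
        # A letter without a lowercase equivalent maps to itself.
        entry = (code, code if lower is None else lower)
        self.buckets[code & 255].append(entry)

    def lookup(self, code):
        # Walk the chain of the one bucket this code can live in.
        for entry_code, lower in self.buckets[code & 255]:
            if entry_code == code:
                return lower
        return None

    def isalpha(self, code):
        return self.lookup(code) is not None

    def tolower(self, code):
        lower = self.lookup(code)
        return code if lower is None else lower


table = CharInfoTable()
table.add(0x61)            # 'a'
table.add(0x41, 0x61)      # 'A' -> 'a'
table.add(0x0142)          # U+0142, lowercase l with stroke
table.add(0x0141, 0x0142)  # U+0141 shares bucket 0x41 with 'A'
```

 Codes that differ only in their high bits land in the same bucket, so the
chain length depends on how many scripts the rules cover.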
 On reflection, it seems to me that it's not so difficult, but IMO the hyrules
must provide special info about the 'isalpha' and 'tolower' mappings in this
case. Consider comments in the hyrules of the form:
% a 0x61
% A 0x41 a
% b 0x62
% B 0x42 b

 This fragment declares that 'A' in the hyrules has Unicode character code 0x41
and that 'a' is the lowercase form of 'A'. If no lowercase equivalent is given,
the letter is already lowercase or has no lowercase equivalent.
 Definition of 'b' must precede definition of 'B'.
 Libhnj will collect these definitions and will build the automaton with 'int'
or 'short' as its input (instead of 'char').
 So you get the idea. What do you think about this? I haven't looked at the
libhnj code to decide how much hacking this would require.
 And I will probably delay my hy. patch for several weeks (but I don't promise
that I'll integrate unicode support, though most likely I will).

>[...] 
> >  I plan to implement the following configuration settings and commandline
> > options:
> 
> I hope Henry will have something to say here, so I am not going to talk
> about the number of new top-level options...

 This seems to be the minimal set of settings that allows this degree of
flexibility.
 
> > HYPHENATE:TRUE #or FALSE
> > 
> >  HYPHENDIR:dirname
> > # the name of the directory where hyrules files are located, if their name
> > # is not absolute.
> > 
> >  HYPHENDICT:tag:<FILESPEC>:CHSET
> > # each set of files with hyrules can be assigned a tag - a string without
> > # ':' in the name - that tag will be used in referring to it.
> > # <FILESPEC> specifies the filenames which should be concatenated to get the
> > # required set of hyrules. It has the following grammar:
> > # <FILESPEC>: <FILE> | <FILE>+<FILESPEC>
> > # ie a list of file names (some of them can be non-absolute) separated
> > # with '+'
> > # CHSET is chset of the resultant set of hyrules - the name of the chset
> > # known to lynx. If omitted, iso-8859-1 will be assumed.
> 
> You should use a different separator than '+'.  Didn't we go through this
> a while ago wrt. INCLUDE?

 Ok, then it will be 
HYPHENDICT:tag:CHSET:<FILESPEC>
and CHSET is mandatory.
And <FILESPEC> is a list of space-separated file names, with spaces quoted
with '\'.
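
 A minimal sketch of parsing the revised setting. The function names and the
sample file names are made up, and treating a backslash before any character
(not only a space) as an escape is my assumption.

```python
def split_filespec(spec):
    """Split on unescaped spaces; a backslash quotes the next character."""
    names, current, escaped = [], [], False
    for ch in spec:
        if escaped:
            current.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == " ":
            if current:
                names.append("".join(current))
                current = []
        else:
            current.append(ch)
    if escaped:
        current.append("\\")  # keep a trailing backslash literally
    if current:
        names.append("".join(current))
    return names


def parse_hyphendict(line):
    """Parse 'HYPHENDICT:tag:CHSET:<FILESPEC>' into (tag, chset, files)."""
    # maxsplit=3 leaves any ':' inside file names untouched.
    key, tag, chset, filespec = line.split(":", 3)
    if key != "HYPHENDICT":
        raise ValueError("not a HYPHENDICT line: %r" % line)
    return tag, chset, split_filespec(filespec)


example = parse_hyphendict(r"HYPHENDICT:ru:koi8-r:hyphen.ru my\ extra.ru /usr/share/hy/common")
```

 Putting <FILESPEC> last is what makes the simple split(":", 3) sufficient.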
 
> You don't explain the nature of the binding between CHSET and TAG (it
> should be capitalized).  I.e. it's not just the charset in which the
> rules are given, but the rules only apply if that charset is selected
> as d.c.s.

 It seems it's my fault that you didn't understand: CHSET and d.c.s. needn't
match - the d.c.s. simply must be any chset other than utf8.

> > HYPHENCTL:TAG:<LANGSPEC>:<URLSPEC>
> > # specifies the conditions of activating set of hyrules tagged with TAG. If
> > # TAG is '-', then no hyphenation will be applied
> > # LANGSPEC specifies the content-language provided by http or <html lang=>
> > # or <META http-equiv ..>. It has the following grammar:
> > # <LANGSPEC>: * | <CONCRETE_LANGSPEC>
> > # <CONCRETE_LANGSPEC>: LANGNAME | LANGNAME,<CONCRETE_LANGSPEC>
> 
> I find your BNF-like syntax hard to understand, and I do understand the
> form that is used in RFCs.  This will be incomprehensible for most
> people.  As the prettysrc settings, but I digress.  A form where you use
> 'LANGNAME[...]' and explain in that that the '...' means optional more,
> comma separated, has a better chance to be understood.

 I found that commas are not required - the <LANGSPEC> should be just a
space-separated list of language names (according to RFC1766).
 But I plan to merge <LANGSPEC> and <URLSPEC> into <LANG_OR_URL_SPEC>
(since it's possible to distinguish a url pattern (in either of its forms)
from a langspec pattern).
 The items will be space-separated - the comma is not needed here either.
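
 One way the merged list could be split back apart, assuming that every url
pattern contains '@' and no language name does. The function name is
hypothetical; lowercasing relies on RFC1766 tags being case-insensitive.

```python
def classify_items(spec):
    """Split a space-separated <LANG_OR_URL_SPEC> into (langs, url_patterns)."""
    langs, url_patterns = [], []
    for item in spec.split():
        if "@" in item:
            url_patterns.append(item)  # assumed marker of a url pattern
        else:
            langs.append(item.lower())  # language tags compare case-insensitively
    return langs, url_patterns


langs, url_patterns = classify_items("en RU @.edu")
```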

> > # Ie '*' (that matches unspecified language) or list of language names such
> > # as 'en' (defined by RFC1766).
> > # <URLSPEC> specifies URLs for which it's applicable:
> > # <URLSPEC>: * | <URLSPEC_PATTERNS>
> > # <URLSPEC_PATTERNS>: <URLSPEC_PATTERN> |
> > #   <URLSPEC_PATTERN>,<URLSPEC_PATTERNS>
> > # where <URLSPEC_PATTERN> can be one of the following:
> > # address@hidden
> > # @domain_suffix
> > # where path will be matched from the beginning of the remote path, and
> > # domain_suffix will be matched from the end of the domain name excluding
> > # port number (e.g. "@.edu", "translations/address@hidden")
> 
> Yet another and _completely_ unintuitive way for specifying URL matching,
> that's just too horrible to be true.  The only possible reason is that you
> are too lazy to parse URL patterns that are given in normal URL order,
> so you dump it on the user to learn a new syntax.

 What other way of URL matching are you talking about?
 If you mean plain substring matching (e.g. .ru will be found in www.rules.nl)
you are wrong - the key is to specify which part of the URL must be matched.
IMO the syntax I chose will solve the problem. (Another variant: put a dot
before the string to be matched in the domain name (e.g. .ru), and put a dot
after the string to be matched from the beginning of the URL, e.g.
'www.linux.org/fr/.').
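
 To make the difference concrete, a small sketch of anchored matching: the
domain suffix is compared against the end of the host name (port excluded) and
the path prefix against the start of the remote path. The function name and
its two separate arguments are assumptions for the example, not the proposed
lynx.cfg syntax, and www.lynx.ru and www.some.edu are made-up hosts.

```python
from urllib.parse import urlsplit


def url_matches(url, domain_suffix="", path_prefix=""):
    """True if the host ends with domain_suffix and the path starts with path_prefix."""
    parts = urlsplit(url)
    host = parts.hostname or ""    # lowercased, port already stripped
    path = parts.path.lstrip("/")  # remote path without the leading '/'
    return host.endswith(domain_suffix.lower()) and path.startswith(path_prefix)


# Plain substring search finds '.ru' inside 'www.rules.nl' ...
print(".ru" in "http://www.rules.nl/")                      # True
# ... while suffix matching on the host does not.
print(url_matches("http://www.rules.nl/", ".ru"))           # False
print(url_matches("http://www.lynx.ru:8080/docs/", ".ru"))  # True
```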

> > # This setting will help to try to avoid collision of hyrules for languages
> > # that have common letters used in human words (like German and English).
> 
> It still doesn't make sense to talk about collisions.  That seems to
> imply that a "collision-free" mode is somehow the normal case.  But
> there isn't one, for nearly every combination of a human language with
> nearly any other (English without accents being the big exception).
> You have to explain how collision-free combination could work if you
> talk about collision at all.

 That was not the final text to be added to lynx.cfg; it was a bare
description of the controls the user will have (so it didn't contain a
definition of "collisions", etc).
 I'll post the final version of the text once the code is written.

 Best regards,
  -Vlad

