lynx-dev

Re: lynx-dev hyphenation


From: Leonid Pauzner
Subject: Re: lynx-dev hyphenation
Date: Thu, 29 Jul 1999 21:58:07 +0400 (MSD)

29-Jul-99 23:13 Vlad Harchev wrote:
> On Thu, 29 Jul 1999, Klaus Peter Wegge wrote:

>> > 1) how to get information about the language of the current html file
>> > (based on the charset name of the current document or user setups).

No language <--> charset mapping is possible:
ISO-8859-1 covers a dozen Western European languages,
ISO-8859-2 covers several languages,
windows-1251 covers ALL Cyrillic-based languages
while ISO-8859-5 covers Russian only, etc.
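
To make the one-to-many relation concrete, here is a minimal sketch in Python that checks which sample texts fit into a given charset. The sample phrases and the helper name `languages_encodable` are illustrative assumptions and have nothing to do with Lynx itself; only the standard codecs are used.

```python
# Hypothetical sample phrases, one per language (illustration only).
SAMPLES = {
    "French": "déjà vu",
    "German": "Grüße",
    "Spanish": "señor",
    "Russian": "привет",
}

def languages_encodable(charset, samples=SAMPLES):
    """Return the languages whose sample text fits entirely in `charset`."""
    result = []
    for language, text in samples.items():
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this charset
        result.append(language)
    return sorted(result)

print(languages_encodable("iso-8859-1"))
print(languages_encodable("windows-1251"))
```

Several languages pass for ISO-8859-1, so the charset name alone cannot tell them apart.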

There is a `Content-Language' HTTP/1.0/1.1 header which can be set by the
server. (I assume AltaVista guesses the document's language from this parameter.)
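
As a rough illustration of how a client might read that header, here is a small Python sketch. It assumes the header value is a comma-separated list of language tags such as "de, en-US"; the function names are hypothetical and this is not how Lynx handles the header.

```python
def parse_content_language(header_value):
    """Split a Content-Language value into lowercase language tags."""
    if not header_value:
        return []
    return [tag.strip().lower() for tag in header_value.split(",") if tag.strip()]

def primary_language(header_value):
    """Return the primary subtag of the first listed language, or None."""
    tags = parse_content_language(header_value)
    if not tags:
        return None
    # "en-us" -> "en"; a bare "de" stays "de"
    return tags[0].split("-")[0]

print(parse_content_language("de, en-US"))
print(primary_language("fr-CA"))
```

A missing or empty header yields None, in which case the client would still need a fallback (user setting or a guess).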

[Just for completeness: the document may contain text in different languages,
say, English and French. In theory, there is a language attribute (`lang') in
HTML 4.0 which could be set for each individual section, but I have never seen
such attributes in the real world.]
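
For reference, a minimal Python sketch of collecting per-element language attributes with the standard `html.parser` module. The class and function names are made up for this example, and the sample markup is invented.

```python
from html.parser import HTMLParser

class LangCollector(HTMLParser):
    """Record every (tag, lang) pair found in start tags."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        for name, value in attrs:
            if name == "lang" and value:
                self.langs.append((tag, value))

def collect_langs(html):
    parser = LangCollector()
    parser.feed(html)
    parser.close()
    return parser.langs

page = '<html lang="en"><body><p lang="fr">Bonjour</p><p>Hi</p></body></html>'
print(collect_langs(page))
```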

Another problem with implementing hyphenation may be charsets:
(1) document charset,
(2) display charset, and
(3) charset of the hyphenation rules.
What to do when (2) != (3), and especially when ((2)!=(3) && (1)!=(3))?
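
The mismatch can be shown with a short Python sketch. It assumes one possible way out, namely decoding everything to Unicode before matching and re-encoding for display; that is only a suggestion, not something Lynx is known to do. The single prefix "rule" and the function name `hyphenate` are toy assumptions, not a real hyphenation algorithm.

```python
word = "привет"
rule_bytes = "при".encode("koi8_r")   # (3) rules stored in KOI8-R
doc_bytes = word.encode("cp1251")     # (1) document in windows-1251

# Comparing raw bytes across charsets fails: same letters, different codes.
naive_match = doc_bytes.startswith(rule_bytes)
print(naive_match)

def hyphenate(doc_bytes, doc_charset, prefix_bytes, rules_charset, display_charset):
    """Toy rule: allow a break after a known prefix, via a Unicode pivot."""
    text = doc_bytes.decode(doc_charset)
    prefix = prefix_bytes.decode(rules_charset)
    if text.startswith(prefix) and len(text) > len(prefix):
        text = prefix + "-" + text[len(prefix):]
    # Characters missing from the display charset (2) are replaced with "?".
    return text.encode(display_charset, errors="replace")

shown = hyphenate(doc_bytes, "cp1251", rule_bytes, "koi8_r", "iso8859_5")
print(shown.decode("iso8859_5"))
```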


>> Most specs on German sites are wrong. I tried to use this mechanism
>> for choosing the right speech synthesizer for reading the site to a
>> multitasking user. I think the wrong specs come from the common use
>> of generators for HTML files, which are not configured very well.
>> I think it's the same for other languages.
>> A colleague of mine played around with a small word statistics tool:
>> very fast, heuristic, and good detection for a lot of languages.
>> As I remember, the implementation was done in about 500 lines of Pascal.
>> If you are interested, I'll give you more details.
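
The tool itself is not shown in this thread. Purely as an idea of how a word-statistics heuristic might work, here is a hypothetical Python sketch that counts common function words per language. The tiny word lists and the name `guess_language` are assumptions for illustration; a usable detector would need much larger tables.

```python
import re

# Hypothetical, deliberately tiny lists of frequent function words.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "des", "un", "une"},
}

def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    scores = {
        lang: sum(1 for word in words if word in stop)
        for lang, stop in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("Das ist nicht die Antwort und der Hund ist zu alt"))
print(guess_language("The cat is in the house and it is old"))
```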

>  Please provide the details about the word statistics tool (how big are the
> dictionary files it needs, is there a URL for this tool, is it open source,
> does it handle multiple charsets for a given language...).
>  And it seems that we need a mapping from charset name to language name (if
> a mapping in the strict sense is possible, i.e. the given charset name is used
> for encoding only one language) - otherwise the user will have to select the
> right language for the current document manually.

>> Klaus
>>

>  Best regards,
>   -Vlad
