Re: [silpa-discuss] Re: Transliteration changes to kannada
From: Vasudev Kamath
Subject: Re: [silpa-discuss] Re: Transliteration changes to kannada
Date: Wed, 02 Jun 2010 08:29:56 +0530
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4
On 6/1/2010 10:11 PM, Alok G. Singh wrote:
> Vasudev Kamath wrote:
>
>> Which reads out to be Rayu. So I don't think this is a bug in the code but
>> in the CMU dictionary's pronunciation.
>
> Yes, I think that is a fair comment. IMHO, in most Indian languages j is
> pronounced as a fricative, and so giving it the 'y' phoneme would be wrong.
>
Yeah.
Santhosh, do we have any Indian pronunciation dictionary, or can we use
the phoneme db from the Dhvani TTS here? Below is an idea from
Laxminarayana Kamath, which he started as a discussion in Google Wave;
I'm just pasting it here. Let us know what you think.
Laxminarayan wrote:
Transliterating Indic scripts, while not without technical challenges,
is theoretically straightforward, owing to the similarities among Indic
scripts. Another helpful fact is that Indic scripts on the Internet, or
on computers in general, have been used to express only Indic languages
(at least for the most part). But when it comes to using the English
script in the Indian context, it gets used in different ways. Even more
challenging, it might be used in different ways even in the same
article or even the same sentence.
Examples:
1. "Chandan ko bolo hospital se chabi lao. There is no way to stop
this nonsense. Rajesh ko apni had ka malum honi chahiye"
2. "Chandan ko bolo hospital se chAbi lAo. There is no way to stop
this nonsense. Rajesh ko apni had kA mAlUm honi chAhiye"
3. "चन्दन को बोलो hospital से चाबी लाओ. There is no way to stop this
nonsense. राजेश को अपनी हद का मालूम होनी चाहिए"
These three examples show, respectively, how the English script can be
used to write Hindi in a roughly phonetic way, in ITRANS format, or
simply to mix English words into Hindi text. This is by no means a
complete set of examples, but it illustrates the kind of challenges we
will face when programming the transliterator to handle a word written
in the English script.
Not only will the text have multiple possibilities, the user too
might have different needs while transliterating.
Example scenarios:
1. The user wants everything transliterated.
2. The user might be interested in transliterating only the Indic text
to his/her native script (probably Kannada), as in the 3rd example
above.
3. The user might want to transliterate only names (he/she just
wants to know who was involved).
4. The user wants to transliterate only those words found in a
particular dictionary.
5. The user might know what language the text is in, but still want to
transliterate it to some other language. For example, the user might
know the language used is Hindi, but want to transliterate it to Tamil
so that he/she could read it. And so on and so forth.
The solution:
One solution I have in mind is to use the concept of filters and
filter chains. (If you have used iptables NAT, you can almost directly
translate the gist of that idea to this.)
Each word is passed through each filter in the chain. The word is
processed by the filter in the way it knows best. The processed text
is returned along with a result indicating whether the processing was
successful or not.
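A minimal sketch of this filter-chain mechanism in Python (all names here are hypothetical, not existing Silpa code):

```python
def run_chain(word, filters):
    """Pass the word through each filter in order; each filter returns
    (processed_text, success).  The first successful filter wins."""
    for flt in filters:
        text, ok = flt(word)
        if ok:
            return text
    return word  # no filter handled the word; pass it through unchanged

def dont_transliterate(word):
    # Always succeeds with the word untouched, so English script is
    # routed to the output as-is.
    return word, True

def dictionary_filter(dictionary):
    # Succeeds only when the word exists in the given dictionary
    # (the CMU or Indian-names dictionary, for example).
    def flt(word):
        if word in dictionary:
            return dictionary[word], True
        return word, False
    return flt
```

A chain like `[dictionary_filter(names), dont_transliterate]` then transliterates known names and passes everything else through unchanged.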
We could have filters like [Dont transliterate], [CMU dictionary],
[Indian names dictionary], [itrans], [phonetic], [skip real English],
[Intelligent phonetic multi output], and so on.
Also, a filter could take other filters as inputs (a bit like
nesting). For now, there is only one such filter: [fuzzy dictionary
based selector].
[Dont transliterate] just returns the same word with the result set as
successful, so that English gets routed to the output as-is.
[CMU dictionary] checks the dictionary; if the word exists in the
dictionary, it returns the word from the dictionary and sets the
result as successful, otherwise it sets the result as failed.
[Indian names dictionary] is similar to [CMU dictionary].
[itrans] tries to parse the word as itrans. It fails if the word is
not itrans-based.
[skip real English] checks if the word is in an English dictionary. If
it exists, it returns successful, so that the word won't get
transliterated. The idea is, if the word is real English, it should
remain English; otherwise it should get transliterated. Example:
"maine kal saaDhe saat ke train pakadkar galti ki" should get
transliterated to "मैने कल साढ़े सात के train पकड़कर गलती की"
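The [skip real English] filter might be sketched like this (the English word set is an assumed input, and the (word, success) return convention follows the chain description above):

```python
def skip_real_english(english_words):
    """Succeed (returning the word unchanged) for real English words so
    they stay English; fail otherwise so a later filter in the chain
    can transliterate the word."""
    def flt(word):
        return word, word.lower() in english_words
    return flt
```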
[phonetic/IPA] tries to pronounce the word using some English TTS that
gives IPA output, then transliterates the resulting IPA to the target
Indic language.
Now for the magic filter you might be looking for:
It's a combination of [Intelligent phonetic multi output], [fuzzy
dictionary based selector], and a custom group of "provider" filters.
[Intelligent phonetic multi output] does not return one word. Instead,
it returns a list, in the selected language, of all possible
pronunciations of the word in English script.
[fuzzy dictionary based selector] takes a list of filters as
"provider" filters. (Right now, itrans is the only sensible filter for
this, but I have others in mind for which I have not yet found names.)
First, [fuzzy dictionary based selector] will run the word through
each of the given group of filters, each time checking if the
successfully returned word is in the dictionary for the target
language. If it exists, it returns the word. Otherwise, it repeats
with the next available filter. If the available filters are exhausted
without success, it repeats this for every word returned by
[Intelligent phonetic multi output]. I later realized we might need
this filter without [Intelligent phonetic multi output]. If that is
needed, we can always include it in the group of filters manually.
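Sketched in Python, the selector loop could look like this (the multi-output function is a stub here; real provider filters would be itrans, phonetic, and so on):

```python
def fuzzy_dictionary_selector(providers, multi_output, target_dictionary):
    """Accept the first provider output found in the target-language
    dictionary; if all providers fail, fall back to the candidates
    produced by the multi-output filter."""
    def flt(word):
        # First pass: try each provider filter in order.
        for provider in providers:
            candidate, ok = provider(word)
            if ok and candidate in target_dictionary:
                return candidate, True
        # Fallback: try every candidate from the multi-output filter.
        for candidate in multi_output(word):
            if candidate in target_dictionary:
                return candidate, True
        return word, False
    return flt
```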
Example situations:
Consider this text:
Santhosh, Vasudev, Laxminarayan and Silpa were sitting in a
park. Santhosh ne Silpa se kahA 'I want you to improve your
transliteration'. tho Silpa ne kahaa '???? ?? ?? ??? program ????? ??. ??
???? ?? ??? ?? ??????? ?? ??.'
You wouldn't usually find so much of a mix in one single text, but
this will help cover many situations.
Situations:
1. I can't read the Devanagari script. I want to read the Hindi text
in the sentence, so I tell Silpa to transliterate to Kannada. Now, if
Silpa implements English transliteration and it encounters English in
the text, it will transliterate the English to Kannada as well. I
don't want it to waste time over English; I just want the Hindi to be
transliterated to Kannada. A possible English chain consists of only
one filter:
[Dont transliterate]
This means, if it encounters an English word, it will give back the
exact word; it won't transliterate it at all. It will only
transliterate the others.
2. I know that whatever is in English script, but isn't English, is in
itrans. I want it to transliterate those things that aren't English
but itrans, to Hindi. (The content might be too intermixed to manually
select and transliterate only the itrans parts.) So the English chain
will be:
[skip real English (dict=full English dict)] [fuzzy dictionary based
selector (itrans(target_language, phonetic), dictionary=dict(CMU
pronunciation dictionary))]
3. I want to transliterate everything to Bengali, because my neighbour
is good at understanding English, but bad at reading it.
[Indian names] [CMU] [fuzzy dictionary based selector (itrans,
phonetic, intelligent_phonetic_multi_output, dictionary=dict(CMU
pronunciation dictionary))] [phonetic]
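Situation 3's chain can be exercised with a tiny driver like this (the dictionaries and the phonetic fallback are stand-ins for the real filters, not actual transliteration output):

```python
def run_chain(word, filters):
    # Each filter returns (text, success); the first success wins.
    for flt in filters:
        text, ok = flt(word)
        if ok:
            return text
    return word

def dict_lookup(dictionary):
    # Stand-in for the [Indian names] / [CMU] dictionary filters.
    return lambda w: (dictionary.get(w, w), w in dictionary)

def phonetic(word):
    # Stand-in for the [phonetic] fallback: always succeeds.
    return "phonetic(" + word + ")", True

# Names dictionary first, then CMU, then the phonetic fallback.
chain = [dict_lookup({"rajesh": "rajesh-from-names-dict"}),
         dict_lookup({"train": "train-from-cmu"}),
         phonetic]
```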
I am also thinking of a filter which automatically checks if the
beginning or the ending part of a word is in the dictionary (thus
being able to detect whether the word is actually just a combination
of multiple words). That idea is still not fully formed, though.
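That word-splitting check could be sketched as a filter that tries every split point against the dictionary (a rough illustration of the idea, not a worked-out design):

```python
def compound_splitter(dictionary):
    """Succeed when the word splits into two parts that are both in the
    dictionary, suggesting it is a combination of known words."""
    def flt(word):
        for i in range(1, len(word)):
            head, tail = word[:i], word[i:]
            if head in dictionary and tail in dictionary:
                return (head, tail), True
        return word, False
    return flt
```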
Thanks and Regards
Vasudev Kamath