Re: [silpa-discuss] Re: Transliteration changes to kannada
From: Vasudev Kamath
Subject: Re: [silpa-discuss] Re: Transliteration changes to kannada
Date: Wed, 02 Jun 2010 08:29:56 +0530
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4
On 6/1/2010 10:11 PM, Alok G. Singh wrote:
> Vasudev Kamath wrote:
>
>> Which reads out to be Rayu. So I don't think this is a bug in the code but
>> in the CMU dictionary's pronunciation.
>
> Yes, I think that is a fair comment. IMHO, in most Indian languages j is
> pronounced as a fricative, and so giving it the 'y' phoneme would be wrong.
>
Yeah.
Santhosh, do we have any Indian pronunciation dictionary, or can we use
the phoneme db from the Dhvani TTS here? Below is an idea from
Laxminarayana Kamath, which he started as a discussion in Google Wave;
I'm just pasting it here. Let us know what you think.
Laxminarayan wrote:
Transliterating Indic scripts, while not without technical challenges,
is theoretically straightforward, owing to the similarities among Indic
scripts. Another helpful fact is that Indic scripts on the Internet, or
on computers in general, have been used to express only Indic languages
(at least for the most part). But when it comes to using the English
script in the Indian context, it gets used in different ways. Even more
challenging, it might be used in different ways even in the same
article or even the same sentence.
Examples:
1. "Chandan ko bolo hospital se chabi lao. There is no way to stop
this nonsense. Rajesh ko apni had ka malum honi chahiye"
2. "Chandan ko bolo hospital se chAbi lAo. There is no way to stop
this nonsense. Rajesh ko apni had kA mAlUm honi chAhiye"
3. "चन्दन को बोलो hospital से चाबी लाओ. There is no way to stop this
nonsense. राजेश को अपनी हद का मालूम होनी चाहिए"
These three examples show, respectively, how the English script can be
used to write Hindi in a roughly phonetic way, in ITRANS format, or
simply to mix English words into Hindi text. This is by no means a
complete set of examples, but it illustrates the kind of challenges we
will face when programming the transliterator to handle a word written
in the English script.
Not only will the text have multiple possibilities, the user too
might have different needs while transliterating.
Example scenarios:
1. The user wants everything transliterated.
2. The user might be interested in transliterating only the Indic text
to his/her native script (probably Kannada), as in the 3rd example
above.
3. The user might want to transliterate only names (he/she just
wants to know who was involved).
4. The user wants to transliterate only those words found in a
particular dictionary.
5. The user might know what language the text is in, but still want to
transliterate it to some other language. For example, the user might
know the language used is Hindi, but want to transliterate it to Tamil
so that he/she could read it. And so on and so forth.
The solution:
One solution I have in mind is to use the concept of filters and
filter chains. (If you have used iptables NAT, you can almost directly
translate the gist of that idea to this.)
Each word is passed through each filter in the chain. The word is
processed by the filter in the way it knows best. The processed text
is returned along with a result indicating whether the processing was
successful or not.
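A minimal sketch of this filter-chain mechanism in Python (all names here are hypothetical, not existing Silpa code):

```python
def run_chain(word, filters):
    """Pass the word through each filter in order; each filter returns
    (processed_text, success).  The first successful filter wins."""
    for flt in filters:
        text, ok = flt(word)
        if ok:
            return text
    return word  # no filter handled the word; pass it through unchanged

def dont_transliterate(word):
    # Always succeeds with the word untouched, so English script is
    # routed to the output as-is.
    return word, True

def dictionary_filter(dictionary):
    # Succeeds only when the word exists in the given dictionary
    # (the CMU or Indian-names dictionary, for example).
    def flt(word):
        if word in dictionary:
            return dictionary[word], True
        return word, False
    return flt
```

A chain like `[dictionary_filter(names), dont_transliterate]` then transliterates known names and passes everything else through unchanged.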
We could have filters like [Dont transliterate], [CMU dictionary],
[Indian names dictionary], [itrans], [phonetic], [skip real English],
[Intelligent phonetic multi output], and so on.
Also, a filter could take other filters as inputs (a bit like
nesting). For now, there is only one such filter: [fuzzy dictionary
based selector].
[Dont transliterate] just returns the same word with the result set as
successful, so that English gets routed to the output as-is.
[CMU dictionary] checks the dictionary; if the word exists in the
dictionary, it returns the word from the dictionary and sets the
result as successful, otherwise it sets the result as failed.
[Indian names dictionary] is similar to [CMU dictionary].
[itrans] tries to parse the word as itrans. It fails if the word is
not itrans-based.
[skip real English] checks if the word is in an English dictionary. If
it exists, it returns successful, so that the word won't get
transliterated. The idea is, if the word is real English, it should
remain English; otherwise it should get transliterated. Example:
"maine kal saaDhe saat ke train pakadkar galti ki" should get
transliterated to "मैने कल साढ़े सात के train पकड़कर गलती की"
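The [skip real English] filter might be sketched like this (the English word set is an assumed input, and the (word, success) return convention follows the chain description above):

```python
def skip_real_english(english_words):
    """Succeed (returning the word unchanged) for real English words so
    they stay English; fail otherwise so a later filter in the chain
    can transliterate the word."""
    def flt(word):
        return word, word.lower() in english_words
    return flt
```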
[phonetic/IPA] tries to pronounce the word using some English TTS that
gives IPA output, then transliterates the resulting IPA to the target
Indic language.
Now for the magic filter you might be looking for:
It's a combination of [Intelligent phonetic multi output], [fuzzy
dictionary based selector], and a custom group of "provider" filters.
[Intelligent phonetic multi output] does not return one word. Instead,
it returns a list, in the selected language, of all possible
pronunciations of the word in English script.
[fuzzy dictionary based selector] takes a list of filters as
"provider" filters. (Right now, itrans is the only sensible filter for
this, but I have others in mind for which I have not yet found names.)
First, [fuzzy dictionary based selector] will run the word through
each of the given group of filters, each time checking if the
successfully returned word is in the dictionary for the target
language. If it exists, it returns the word. Otherwise, it repeats
with the next available filter. If the available filters are exhausted
without success, it repeats this for every word returned by
[Intelligent phonetic multi output]. I later realized we might need
this filter without [Intelligent phonetic multi output]. If that is
needed, we can always include it in the group of filters manually.
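Sketched in Python, the selector loop could look like this (the multi-output function is a stub here; real provider filters would be itrans, phonetic, and so on):

```python
def fuzzy_dictionary_selector(providers, multi_output, target_dictionary):
    """Accept the first provider output found in the target-language
    dictionary; if all providers fail, fall back to the candidates
    produced by the multi-output filter."""
    def flt(word):
        # First pass: try each provider filter in order.
        for provider in providers:
            candidate, ok = provider(word)
            if ok and candidate in target_dictionary:
                return candidate, True
        # Fallback: try every candidate from the multi-output filter.
        for candidate in multi_output(word):
            if candidate in target_dictionary:
                return candidate, True
        return word, False
    return flt
```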
Example situations:
Consider this text:
Santhosh, Vasudev, Laxminarayan and Silpa were sitting in a
park. Santhosh ne Silpa se kahA 'I want you to improve your
transliteration'. tho Silpa ne kahaa '???? ?? ?? ??? program ????? ??. ??
???? ?? ??? ?? ??????? ?? ??.'
You wouldn't usually find so much of a mix in one single text, but
this will help cover many situations.
Situations:
1. I can't read the Devanagari script. I want to read the Hindi text
in the sentence, so I tell Silpa to transliterate to Kannada. Now, if
Silpa implements English transliteration and it encounters English in
the text, it will transliterate the English to Kannada as well. I
don't want it to waste time over English; I just want the Hindi to be
transliterated to Kannada. A possible English chain consists of only
one filter:
[Dont transliterate]
This means, if it encounters an English word, it will give back the
exact word; it won't transliterate it at all. It will only
transliterate the others.
2. I know that whatever is in English script, but isn't English, is in
itrans. I want it to transliterate those things that aren't English
but itrans, to Hindi. (The content might be too intermixed to manually
select and transliterate only the itrans parts.) So the English chain
will be:
[skip real English (dict=full English dict)] [fuzzy dictionary based
selector (itrans(target_language, phonetic), dictionary=dict(CMU
pronunciation dictionary))]
3. I want to transliterate everything to Bengali, because my neighbour
is good at understanding English, but bad at reading it.
[Indian names] [CMU] [fuzzy dictionary based selector (itrans,
phonetic, intelligent_phonetic_multi_output, dictionary=dict(CMU
pronunciation dictionary))] [phonetic]
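Situation 3's chain can be exercised with a tiny driver like this (the dictionaries and the phonetic fallback are stand-ins for the real filters, not actual transliteration output):

```python
def run_chain(word, filters):
    # Each filter returns (text, success); the first success wins.
    for flt in filters:
        text, ok = flt(word)
        if ok:
            return text
    return word

def dict_lookup(dictionary):
    # Stand-in for the [Indian names] / [CMU] dictionary filters.
    return lambda w: (dictionary.get(w, w), w in dictionary)

def phonetic(word):
    # Stand-in for the [phonetic] fallback: always succeeds.
    return "phonetic(" + word + ")", True

# Names dictionary first, then CMU, then the phonetic fallback.
chain = [dict_lookup({"rajesh": "rajesh-from-names-dict"}),
         dict_lookup({"train": "train-from-cmu"}),
         phonetic]
```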
I am also thinking of a filter which automatically checks if the
beginning or the ending part of a word is in the dictionary (thus
being able to detect whether the word is actually just a combination
of multiple words). That idea is still not fully formed, though.
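That word-splitting check could be sketched as a filter that tries every split point against the dictionary (a rough illustration of the idea, not a worked-out design):

```python
def compound_splitter(dictionary):
    """Succeed when the word splits into two parts that are both in the
    dictionary, suggesting it is a combination of known words."""
    def flt(word):
        for i in range(1, len(word)):
            head, tail = word[:i], word[i:]
            if head in dictionary and tail in dictionary:
                return (head, tail), True
        return word, False
    return flt
```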
Thanks and Regards
Vasudev Kamath