
Re: [Speech-reco] Speech recognition and synthesis ideas


From: Eric S. Johansson
Subject: Re: [Speech-reco] Speech recognition and synthesis ideas
Date: Sun, 03 Oct 2010 10:13:15 -0400
User-agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4

On 10/3/2010 9:08 AM, Bill Cox wrote:
> I feel traditional speech recognition approaches have two weaknesses.
> First, many rely on Mel-cepstral features.  If my understanding is
> correct, then these programs first throw away most of the speech
> information, and convert each frame into 25 frequency energies, on the
> Mel frequency scale.  This eliminates any possibility of generating
> speech from the model.  Then, to further reduce the data, an FFT is
> taken of these 25 energies, resulting in one or two "features" that
> get stored in the HMM for matching.  If this is really how it's done,
> it's simply incredible that any words are matched at all.  The second
> problem I see is that everyone seems to store their speech models in
> standard Hidden Markov Models, or HMMs.  While mathematically very
> cool, HMMs seem to be a poor match to reality.  The assumption that
> really kills me is that speech features are not correlated between
> samples.  So, for example, the longer I say "eeeeee" in the word
> "speech", the less likely the model is to believe that I actually said
> "e"!  Do I understand this correctly?  Shouldn't any sane model work
> better on longer samples?
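
For what it's worth, the textbook front end is close to what you describe, with two corrections: the second transform is a DCT of the log filterbank energies rather than an FFT of the raw energies, and typically the first 12-13 cepstral coefficients are kept, not one or two. A minimal numpy sketch of that standard pipeline (the frame size, filter count, and coefficient count below are just the conventional choices, not anything engine-specific):

import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_filters=26, n_coeffs=13):
    # Windowed power spectrum of one ~25 ms frame (e.g. 400 samples).
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2

    # Mel-spaced triangular filterbank energies.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    energies = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        rise = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fall = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        energies[i] = np.dot(np.concatenate([rise, fall]), spectrum[lo:hi])

    # Log-compress, then decorrelate with a DCT (not an FFT); the
    # first n_coeffs cepstral coefficients are the "MFCC" features.
    log_e = np.log(energies + 1e-10)
    n = np.arange(n_filters)
    return np.array([np.dot(log_e, np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                     for k in range(n_coeffs)])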

well, "speech" and speeeeeeeech" are two different sets of sounds so you should get different results. Hidden Markov models mimic the human audio experience. This is why they're a good thing. I do believe that some research out there that's on a phonetic basis but I don't know if it's just the same old garbage regurgitated from the 19 60s or if it is some new thinking.
> Unfortunately, I haven't found anyone who knows anything about either
> speech recognition or TTS to talk to, and I'm left having to do all of
> my own research from scratch.  I'm ok with this in general.
> Eventually I'll stumble around enough to figure out what I'm doing,
> but if there are guys out there who actually have a clue about these
> topics, I'd love to chat!

The main reason you probably can't find anyone to talk to is that if they are actively working in the field, they are so bound up by nondisclosure agreements that they can't open their mouths. Having said that, I have a contact at MIT whom I've used to validate some speech UI ideas, and he might be able to point you to some grad students willing to spend some time working with you on this issue. There is also another recognizer developed at MIT that might be useful. Unfortunately, the person who now owns it is a bit of a recluse and hasn't answered e-mail from a few of us Boston-area speech recognition users for a while. I have a couple of other contacts in the industry whom I will ask whether they can point you at anyone, or at any books, that would be useful.

If you are dead set on making a recognition engine, I think the Simon approach is the way to go, even though it depends on a nonfree library. I would go this route because you need something to compare your final solution to. Until you actually have a recognizer running and failing miserably, you won't know what's a good choice or a bad choice on the more subtle levels which, in turn, have a great effect on the end-user's experience.
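
For that comparison, the usual yardstick is word error rate: word-level edit distance between a reference transcript and the recognizer's output, divided by the reference length. A minimal sketch (the function name and sample strings are just illustrative):

# Minimal word error rate (WER): Levenshtein distance over words,
# divided by the number of reference words.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# Score your engine and the baseline on the same test set, e.g.:
print(wer("switch to command mode", "switch the command mode"))  # 0.25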

Here are some relatively ancient references. The Wikipedia article probably has the best overview and a decent set of pointers to the next part of your research.

http://portal.acm.org/citation.cfm?id=153687
http://www.faqs.org/docs/Linux-HOWTO/Speech-Recognition-HOWTO.html
http://en.wikipedia.org/wiki/Speech_recognition

I really do wish people would focus on the needs of the user first. We have recognition engines that work. Yes, they are not free by any stretch of the imagination, but they give us a working base for UIs and APIs. Focusing on the needs of the end-user first lets disabled people take care of their own needs without being dependent on someone else. Once disabled people are working again, we will have a pool of people able to participate in something important to their future, versus sitting on the sidelines, giving up their careers, and whingeing about not being able to do what they need.

Freeing people, disabled or otherwise, comes before freeing software.

Doing otherwise forces disabled users down the path of extensive nonfree software use, making the transition to a free software base extremely difficult. Incremental approaches allow disabled users to gradually adopt free software as disability-friendly free software tools evolve. I personally would prefer to see someone working with 90% free software now than with 100% nonfree software for the next 5 to 10 years.


