[Speech-reco] Speech recognition and synthesis ideas

From: Bill Cox
Subject: [Speech-reco] Speech recognition and synthesis ideas
Date: Sun, 3 Oct 2010 09:08:03 -0400

I need a good forum to discuss implementing GNU-licensed speech
recognition and TTS.  So far, address@hidden is totally dead,
which is why I'm copying vinux-dev, but it would be great to get a few
guys who are interested in this stuff to start talking to each other.

Here's the basic game plan I'm thinking of implementing in my own
effort.  I would like to model speech in the speech-recognition engine
accurately enough to use the same database for TTS.  Two techniques
come to mind.  I can match either linear-prediction-based roots
(rather than coefficients, reflection coefficients, or LSPs) or
FFT-extracted spectra to recognise speech.  For the LPC version, I'd
compute the actual roots of the LPC denominator polynomial (the poles
of the all-pole model), which I suspect are more closely correlated
with how speech is perceived.  For either the LPC or FFT approach, I'm
thinking of breaking voiced speech up into frames one pitch period
long, so that each frame captures one glottal pulse.  I suspect this
may reduce variation in extracted speech features, improving matching.
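
As a rough sketch of the LPC-root idea above (not a finished front
end), the code below fits LPC coefficients to one frame with the
autocorrelation method and Levinson-Durbin, then finds the roots of
the denominator polynomial and reports each as a frequency and a pole
radius.  The names lpc, poly_roots and lpc_poles, the Durand-Kerner
root finder, and the synthetic single-resonance test frame are all
assumptions made for illustration.  Pitch-synchronous framing is not
shown, and an all-zero frame is not handled.

```python
import cmath
import math


def lpc(frame, order):
    """Autocorrelation-method LPC via Levinson-Durbin; returns [1, a1, ..., ap]."""
    n = len(frame)
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def poly_roots(coeffs, iterations=200):
    """Durand-Kerner roots of a monic polynomial, highest power first."""
    p = len(coeffs) - 1
    roots = [(0.4 + 0.9j) ** k for k in range(p)]  # distinct starting guesses
    for _ in range(iterations):
        for i in range(p):
            value = sum(c * roots[i] ** (p - j) for j, c in enumerate(coeffs))
            spread = 1.0
            for j in range(p):
                if j != i:
                    spread *= roots[i] - roots[j]
            roots[i] -= value / spread
    return roots


def lpc_poles(frame, order, sample_rate):
    """Return (frequency_hz, radius) for each pole in the upper half plane."""
    poles = poly_roots(lpc(frame, order))
    return sorted((cmath.phase(z) * sample_rate / (2 * math.pi), abs(z))
                  for z in poles if z.imag > 0)


# Synthetic test frame: impulse response of one resonance (assumed values).
radius, theta = 0.9, 0.5
frame = [0.0] * 200
for n in range(200):
    frame[n] = 1.0 if n == 0 else 0.0
    if n >= 1:
        frame[n] += 2 * radius * math.cos(theta) * frame[n - 1]
    if n >= 2:
        frame[n] -= radius ** 2 * frame[n - 2]

print(lpc_poles(frame, 2, 8000.0))
```

For real speech the order would be higher (something like 10 to 16 is
a common choice at 8 to 16 kHz), and matching would then compare the
resulting (frequency, radius) pairs between frames.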

I feel traditional speech recognition approaches have two weaknesses.
First, many rely on Mel-cepstral features.  If my understanding is
correct, then these programs first throw away most of the speech
information, and convert each frame into 25 frequency energies, on the
Mel frequency scale.  This eliminates any possibility of generating
speech from the model.  Then, to further reduce the data, an FFT is
taken of these 25 energies, resulting in one or two "features" that
get stored in the HMM for matching.  If this is really how it's done,
it's simply incredible that any words are matched at all.  The second
problem I see is that everyone seems to store their speech models in
standard Hidden Markov Models, or HMMs.  While mathematically very
cool, HMMs seem to be a poor match to reality.  The assumption that
really kills me is that speech features are not correlated between
samples.  So, for example, the longer I say "eeeeee" in the word
"speech", the less likely the model is to believe that I actually said
"e"!  Do I understand this correctly?  Shouldn't any sane model work
better on longer samples?
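
One way to see the duration effect described above: in a plain HMM, a
state with self-transition probability a is occupied for exactly d
frames with probability a^(d-1) * (1 - a), which only shrinks as d
grows.  The small sketch below tabulates that.  The function name
duration_probability and the value 0.9 are assumptions chosen for
illustration, not taken from any particular recogniser, and the sketch
covers only the transition term, not the per-frame emission
likelihoods.

```python
def duration_probability(self_loop, frames):
    """P(staying in one HMM state for exactly `frames` frames)."""
    return self_loop ** (frames - 1) * (1.0 - self_loop)


self_loop = 0.9  # assumed self-transition probability, for illustration only
for frames in (1, 5, 10, 20, 40):
    print(frames, round(duration_probability(self_loop, frames), 5))

# Mean duration implied by the self-loop probability.
expected_frames = 1.0 / (1.0 - self_loop)
print(expected_frames)
```

So the transition model by itself does favour short stays in a state,
with the mean stay fixed by the self-loop probability.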

Unfortunately, I haven't found anyone who knows anything about either
speech-recognition or TTS to talk to, and I'm left having to do all of
my own research from scratch.  I'm ok with this in general.
Eventually I'll stumble around enough to figure out what I'm doing,
but if there are guys out there who actually have a clue about these
topics, I'd love to chat!
