
Re: [gnuspeech-contact] Re: Parameter names and meanings


From: David Hill
Subject: Re: [gnuspeech-contact] Re: Parameter names and meanings
Date: Fri, 3 Feb 2006 14:24:09 -0800

Hi Eric,

This is the first time I've seen this message, so maybe there was a list problem. Very odd!


On Feb 3, 2006, at 4:40 AM, Eric Zoerner wrote:

I haven't seen a reply to this message which was sent to the list on 3 Jan. Is this information not available?

Thanks,
Eric



On 3 Jan 2006, at 18:47, Eric Zoerner wrote:

Maybe I'm not looking in the right place, but there seems to be some information missing in the GnuSpeech documentation which is causing some difficulty for me.

I am having some difficulty with the names used for the parameters, both at the posture (Synthesizer) and suprasegmental (Monet) levels, in that I cannot find documentation on how the parameters relate to the tools. There is general discussion of the parameters in some cases, but there is no direct mapping of the concepts to the parameter names.

I'll have to look into this.


For example, the Synthesizer app allows you to adjust the "breathiness" of a posture, but there is no reference to how adjusting this affects the output parameters of the Synthesizer app. My best guess is that it may affect the "fricBW" parameter.

The "breathiness" parameter actually injects noise as part of the glottal excitation. It simulates the fact that in some voices part of the glottis does not close fully; air passing through that unclosed portion as the main glottal closure increases causes "breathy" noise at the glottis itself. This is one of the features that distinguishes most female voices from most male voices ("she had a really husky voice" indicates a more extreme case).

You can find the source for the software Tube Resonance Model engine as "tube.c" on the GNU site under
"trillium/src/softwareTRM/tube.c".

The most relevant part of the code is:

            /*  CREATE GLOTTAL PULSE (OR SINE TONE)  */
            pulse = oscillator(f0);

            /*  CREATE PULSED NOISE  */
            pulsed_noise = lp_noise * pulse;

            /*  CREATE NOISY GLOTTAL PULSE  */
            pulse = ax * ((pulse * (1.0 - breathinessFactor)) +
                          (pulsed_noise * breathinessFactor));

            /*  CROSS-MIX PURE NOISE WITH PULSED NOISE  */
            if (modulation) {
                crossmix = ax * crossmixFactor;
                crossmix = (crossmix < 1.0) ? crossmix : 1.0;
                signal = (pulsed_noise * crossmix) +
                    (lp_noise * (1.0 - crossmix));
            }


Note that when voiced fricatives (especially "z") are synthesised, the frication noise is pulsed at glottal frequency. This is a different effect, and it is what the noise cross-mix is all about.


It would be helpful to get a description of each parameter, what the abbreviation stands for, what it means, and how it is adjusted by the tools.

Agreed.  It shall be done.


r1 through r8 are fairly obvious, and I've figured out the meanings of fricVol, fricPos, etc., but the ones I'm not sure of include fricCF (is this the throat transmission Cutoff Frequency?), and fricBW (bandwidth??).

This is "fricative centre frequency". Fricatives are well imitated by noise with a particular bandwidth and centre frequency (real fricatives are more complex if you get down to analysing them, and also pretty variable). Thus a "sh" sound has a wide bandwidth and low CF (2600 Hz and 2500 Hz respectively), whilst an "s" sound has a narrow bandwidth and higher CF (500 Hz and 5500 Hz respectively). This kind of information is in the posture data entries.


In the Monet documentation appendix, the BW and AX are listed, but no explanation of what these mean that I can find. (I did find explanation of qss, and duration parameters in transitions).

BW is "bandwidth" -- I am not sure of the context you had in mind.

AX is an old term that should have been updated. It dates back to the days of Lawrence's "Parametric Artificial Talker" (or "PAT"), probably the first fully functional formant-based synthesiser (like the later MITalk and DECTalk), which toured the US in 1953 with its British inventor. AX simply stands for "larynx amplitude" -- the A being "amplitude" and the X standing for "larynx" -- more properly referred to as the glottis or vocal folds these days. Don't ask me to justify it ;-) Computer memory was at a premium back in those days, and PAT did not, at that time, even use a computer, but the techniques of character saving spilled over, I suppose.

So AX stands for the amplitude of the glottal waveform, though even there some qualification is needed: a waveform's "amplitude" can be specified as peak amplitude, RMS amplitude, energy flow (power), and so on. We are using peak amplitude. An energy measure might be better. This is like the debate between VU readings and other measures of audio output for audio equipment.

Please keep asking questions. I am very happy to supply answers, and it will help me to see what is missing and steer me to producing a document to fill in the gaps.

I'll check out what might be needed and make a start asap, but I am currently trying to get a working version of the "Synthesizer" application up (it is going quite well), and I also want to get a "real-time Monet" working too.

Thanks for writing.

All good wishes.

david
----
David Hill
Imagination is more important than knowledge. (Albert Einstein)
Kill your television!



