[fdo] Re: TTS API
Milan Zamazal
pdm at brailcom.org
Mon Nov 1 09:07:16 PST 2004
>>>>> "WW" == Willie Walker <William.Walker at Sun.COM> writes:
WW> How the driver chooses to handle lengthy input text seems like
WW> it needs to be more of an implementation detail than a driver
WW> interface specification.
WW> I think the overall requirement here is that the time to first
WW> sample be "very short"
Yes.
WW> and the time to cancel a request in process should also be "very
WW> short".
In the case of hardware synthesizers, yes.  In the case of software
synthesizers, "reasonably short", if only to avoid wasting too much CPU
time when there is no longer any need to synthesize the rest of a 1 MB
text that was just sent to the synthesizer.
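To make this a bit more concrete, a driver interface meeting these two
requirements might look roughly like the sketch below; the names
(tts_speak, tts_cancel) are purely hypothetical and not taken from any
existing specification:

  /* Hypothetical driver entry points -- just a sketch of the two
     requirements above, not a concrete proposal. */
  typedef struct tts_request tts_request_t;

  /* Starts synthesis of TEXT and returns immediately; the first audio
     samples should become available very shortly after the call, even
     for long input. */
  tts_request_t *tts_speak (const char *text);

  /* Asks the driver to stop processing REQUEST.  For software
     synthesizers this need not be instantaneous, but the driver should
     stop spending CPU time on the remaining text reasonably quickly. */
  void tts_cancel (tts_request_t *request);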
WW> The other level of detail that needs to be worked out is whether
WW> a stream of audio data is returned to the app or whether the app
WW> supplies the driver with a place to send the audio to. IMO,
WW> this appears to be a bit of a stylistic thing and I don't see
WW> strong benefits or drawbacks one way or the other.
Perhaps the sink might be better, because the audio data appears in some
form of stream during its higher level processing.  Also, I guess you
had some reason to choose the sink approach in FreeTTS.  OTOH the marker
issue can make things more complicated.
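For illustration, the two styles could be sketched like this (again, all
names are hypothetical, not drawn from FreeTTS or any other existing
interface):

  #include <stddef.h>

  /* Variant A: the driver returns an audio stream which the client
     reads from. */
  typedef struct tts_stream tts_stream_t;
  tts_stream_t *tts_speak_stream (const char *text);

  /* Variant B: the client supplies a sink callback and the driver
     pushes audio into it as it is produced. */
  typedef void (*tts_audio_sink_t) (const short *samples, size_t count,
                                    void *client_data);
  void tts_speak_to_sink (const char *text,
                          tts_audio_sink_t sink, void *client_data);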
WW> BTW, an overlying problem that needs to be worked out across the
WW> whole OS is the notion of managing contention for the audio
WW> output device. For example, do you queue, mix, cancel,
WW> etc. multiple audio requests? This situation will happen very
WW> frequently in the case of an OS that plays audio when windows
WW> appear and disappear - this behavior causes contention with a
WW> screen reader that wants to also say the name of the window that
WW> was just shown.
IMO there's no need to care about this; we can expect that a
speech-enabled desktop is properly configured.
WW> I think the main thing to think about here is how the app is to
WW> get the events. A few methods:
WW> 1) Sending an audio stream and a marker index to the client.
WW> This gives the client more control and allows it to manage its
WW> own destiny. Adds a bit of complexity to the client, though.
WW> 2) Sending a linear sequence of audio data and marker data.
WW> Similar to #1, but I'm not so sure you're going to find a
WW> synthesizer that implements things this way.
festival-freebsoft-utils does it this way, because it's a simpler
interface than 1).
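A rough sketch of such a linear sequence, as I understand 2), might be a
stream of chunks where each chunk is either audio data or a marker.  The
names below are hypothetical and this is not the festival-freebsoft-utils
interface itself, just an illustration of the idea:

  #include <stddef.h>

  typedef struct tts_request tts_request_t;  /* as in the earlier sketch */

  /* Each chunk in the sequence carries either audio samples or a marker
     event, in the order they occur in the synthesized speech. */
  typedef enum { TTS_CHUNK_AUDIO, TTS_CHUNK_MARKER } tts_chunk_type_t;

  typedef struct {
      tts_chunk_type_t type;
      union {
          struct { const short *samples; size_t count; } audio;
          struct { const char *name; } marker;   /* e.g. an index mark */
      } u;
  } tts_chunk_t;

  /* Returns the next chunk of the request, or NULL when the request is
     finished; the client sends audio chunks to its audio output and
     acts on marker chunks as they arrive. */
  const tts_chunk_t *tts_next_chunk (tts_request_t *request);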
WW> 3) The MRCP way (I think), which is to have separate things
WW> for playing and handling events. The synthesizer will spew data
WW> to the audio sink and events to the clients. The timing issues
WW> here are a bit odd to me, because one can never be sure the
WW> client receives the event the moment (or even near the moment)
WW> the audio is played. In any case, this is most similar to what
WW> the hardware synthesizers are doing.
In addition to the timing issues, for the purpose of software speech the
sink is rarely a direct audio output, so this method is completely
unsuitable for software synthesis.
>>> I'd suggest using SSML instead of VoiceXML. If I'm not
>>> mistaken, SSML is what is aimed at TTS, while the purpose of
>>> VoiceXML is different.
WW> Just as a clarification: SSML is a sub-spec of the VoiceXML
WW> effort. It is based upon JSML, the Java Speech API Markup
WW> Language, which was created by my group here at Sun.
I'm a bit confused.  According to http://www.w3c.org/Voice/, VoiceXML is
an XML markup language which, together with SSML, is part of the Voice
Browser activity.
Regards,
Milan Zamazal
--
I think any law that restricts independent use of brainpower is suspect.
-- Kent Pitman in comp.lang.lisp