[fdo] Re: TTS API

Willie Walker William.Walker at Sun.COM
Sat Oct 30 13:45:23 PDT 2004


Wow!  The number of people on this list is sure growing!

>> Or more generally a sequence of audio samples.  Motivation: I think
>> most software synthesizers we are likely to support process the
>> whole text in several steps, only the last of them being writing the
>> whole produced audio sample somewhere.  When synthesizing long texts,
>> it is desirable to allow the synthesizer to split the input into
>> several pieces so that we don't wait too long for the first audio
>> data to arrive.
> KTTSD already does this, and I think it would be a duplication of
> work to do it in every driver if the higher-level speech system can
> take care of this.  Doing it before sending the phrases to the
> engines allows one to interrupt a longer text with warnings, etc.

Is this level of detail something that needs to be exposed at
the driver level?  How the driver chooses to handle lengthy
input text seems more like an implementation detail than
something that belongs in a driver interface specification.

I think the overall requirement here is that the time to first
sample be "very short" and the time to cancel a request in process
should also be "very short".  (You guys define what "very short"
means).
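
To make that concrete, here is a minimal sketch of what such a
driver contract could look like in C.  All names are invented for
illustration; the point is that the latency promises are part of the
interface while chunking of long input stays behind it:

    /* Hypothetical driver contract: the only timing promises exposed
     * are "speak() starts producing audio quickly" and "cancel()
     * returns quickly"; how long input gets chunked is internal.    */
    typedef struct tts_driver tts_driver;

    typedef struct {
        /* Start synthesis of the given text; returns a request id.
         * The first audio must arrive "very soon" after this call.  */
        int  (*speak)(tts_driver *self, const char *text);

        /* Abort an in-flight request; must also return "very soon". */
        void (*cancel)(tts_driver *self, int request_id);
    } tts_driver_ops;

    struct tts_driver {
        const tts_driver_ops *ops;
        void                 *priv;   /* engine-specific state */
    };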

The other level of detail that needs to be worked out is whether
a stream of audio data is returned to the app or whether the app
supplies the driver with a place to send the audio to.  IMO, this
appears to be a bit of a stylistic thing and I don't see strong
benefits or drawbacks one way or the other.  If someone gives you
audio, you can send it to a sink.  If someone allows you to give
the driver a sink, you can write your sink to give you audio.  In
either case, the functions of pause/resume/cancel/ff/rev all
introduce a fair amount of complexity, especially when the driver
is trying to multithread things to give you the best possible
performance.
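
For comparison, the two styles boil down to roughly the following
signatures (hypothetical names again; a real spec would also have to
carry sample format, rate, channels, and so on):

    #include <stddef.h>

    typedef struct tts_driver tts_driver;

    /* Style 1: the driver hands audio back to the app, which pulls
     * it in chunks and routes it wherever it likes.                 */
    size_t tts_read_audio(tts_driver *drv, short *buf, size_t frames);

    /* Style 2: the app hands the driver a sink, and the driver
     * pushes audio into it as soon as it is synthesized.            */
    typedef void (*tts_audio_sink)(void *userdata,
                                   const short *buf, size_t frames);
    int tts_speak_to_sink(tts_driver *drv, const char *text,
                          tts_audio_sink sink, void *userdata);

Either form can be wrapped around the other, which is why the choice
feels mostly stylistic; the hard part in both cases is keeping
pause/resume/cancel well defined while audio is in flight.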

In FreeTTS, we chose the latter (i.e., give FreeTTS a sink to send
the audio to).  It works OK, but is a little unintuitive and our
implementation kind of puts the app at the mercy of FreeTTS when it
comes to the timing.  For the former (i.e., have the driver give you
audio), I'm not sure how many other engines out there really support
this level of flexibility.

BTW, an overlying problem that needs to be worked out across the
whole OS is the notion of managing contention for the audio output
device.  For example, do you queue, mix, cancel, etc., multiple
audio requests?  This situation will happen very frequently in the
case of an OS that plays audio when windows appear and disappear -
this behavior causes contention with a screen reader that wants to
also say the name of the window that was just shown.
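
Purely as a sketch of that design space (names invented here), one
way to frame it is a per-request policy hint that a single
system-wide arbiter interprets when requests collide:

    /* What should happen when a new audio request arrives while
     * something else is already playing?                            */
    typedef enum {
        AUDIO_QUEUE,    /* wait until the current output finishes    */
        AUDIO_MIX,      /* play both at the same time                */
        AUDIO_PREEMPT   /* cancel the current output and play now    */
    } audio_contention_policy;

    typedef struct {
        const char              *source;  /* e.g. window manager or
                                             screen reader           */
        audio_contention_policy  policy;
    } audio_request;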

>> Markers should be available for both software and hardware synthesis.
>> But they differ in their form: While with hardware synthesis feedback
>> should be received whenever the marker is reached in the audio output,
>> with software synthesis the positions of the markers in the returned
>> audio sample should be returned.  Or the audio sample can be returned
>> in several pieces as described above; in particular, it can be split
>> at marker positions, and the returned list could contain not only the
>> audio samples but also the reached markers.

I think the main thing to think about here is how the app is to
get the events.  A few methods:

    1) Sending an audio stream and a marker index to the client
       (see the sketch after this list).  This gives the client
       more control and allows it to manage its own destiny.  Adds
       a bit of complexity to the client, though.

    2) Sending a linear sequence of audio data and marker data.
       Similar to #1, but I'm not so sure you're going to find
       a synthesizer that implements things this way.

    3) The MRCP way (I think), which is to have separate things
       for playing and handling events.  The synthesizer will
       spew data to the audio sink and events to the clients.
       The timing issues here are a bit odd to me, because one
       can never be sure the client receives the event the
       moment (or even near the moment) the audio is played.
       In any case, this is most similar to what the hardware
       synthesizers are doing.
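
Method #1, expressed as data, might look something like this
(hypothetical types): each chunk carries its samples plus the markers
that fall inside it, as offsets into those samples.

    #include <stddef.h>

    typedef struct {
        const char *name;        /* marker name from the input markup */
        size_t      sample_pos;  /* offset of the marker within this
                                    chunk's samples                   */
    } tts_marker;

    typedef struct {
        const short      *samples;
        size_t            num_samples;
        const tts_marker *markers;
        size_t            num_markers;
    } tts_audio_chunk;

The client then knows exactly which sample each marker corresponds to
and can raise its own events as it feeds the audio to the output
device.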

>> I'd suggest using SSML instead of VoiceXML.  If I'm not mistaken, SSML
>> is what is aimed at TTS, while the purpose of VoiceXML is different.

Just as a clarification:  SSML is a sub-spec of the VoiceXML effort.
It is based upon JSML, the Java Speech API Markup Language, which was
created by my group here at Sun.
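
For what it's worth, marked-up input in SSML looks roughly like the
fragment below; here it is just a string handed to the hypothetical
speak() call sketched earlier, with a <mark/> element that the driver
would report back as a marker event:

    /* A small SSML fragment containing a named marker.  The speak()
     * call is the hypothetical one from the earlier sketch.         */
    const char *ssml =
        "<speak>"
        "  The window "
        "  <mark name=\"before-title\"/>"
        "  Untitled Document "
        "  has just been opened."
        "</speak>";

    /* drv->ops->speak(drv, ssml); */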

Will


