[fdo] Re: TTS API

Milan Zamazal pdm at brailcom.org
Mon Nov 1 09:07:16 PST 2004


>>>>> "WW" == Willie Walker <William.Walker at Sun.COM> writes:

    WW> How the driver chooses to handle lengthy input text seems like
    WW> it needs to be more of an implementation detail than a driver
    WW> interface specification.

    WW> I think the overall requirement here is that the time to first
    WW> sample be "very short" 

Yes.

    WW> and the time to cancel a request in process should also be "very
    WW> short".  

In case of hardware synthesizers, yes.  In case of software
synthesizers, "reasonably short", if only so as not to waste too much
CPU time when there's no longer any need to synthesize the rest of a
1 MB long text just sent to the synthesizer.
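
Just to illustrate the software synthesizer case, here's a rough
Python sketch of one possible driver-side loop; all the names in it
are invented, it's not a proposal for the actual interface.  The text
is synthesized chunk by chunk and a cancel flag is checked between the
chunks, so the first samples arrive quickly and a cancelled 1 MB text
doesn't keep eating CPU:

    import threading

    def synthesize_sentence(sentence):
        # Stand-in for the real engine call; returns fake audio samples.
        return b"\x00\x00" * 1000

    def speak(text, deliver, cancelled):
        # 'deliver' hands one chunk of audio to the client (stream or
        # sink), 'cancelled' is a threading.Event the client may set
        # at any time.
        for sentence in text.split("."):   # crude sentence splitting
            if cancelled.is_set():
                return                     # the rest is never synthesized
            deliver(synthesize_sentence(sentence))

    cancelled = threading.Event()
    speak("A long text.  Another sentence.", deliver=print,
          cancelled=cancelled)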

    WW> The other level of detail that needs to be worked out is whether
    WW> a stream of audio data is returned to the app or whether the app
    WW> supplies the driver with a place to send the audio to.  IMO,
    WW> this appears to be a bit of a stylistic thing and I don't see
    WW> strong benefits or drawbacks one way or the other.

Perhaps the sink might be better, because the audio data appears in
some form of stream during its higher level processing anyway.  Also, I
guess you had some reason to choose the sink approach in FreeTTS.  OTOH
the marker issue can make things more complicated.
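
To make the difference concrete, the two styles might look roughly
like this from the driver's side (again just a Python sketch with
invented names):

    def synthesize_sentence(sentence):
        return b"\x00\x00" * 1000   # stand-in for the real engine call

    # Style 1: the driver returns a stream of audio data, the app
    # pulls it.
    def synth_as_stream(text):
        for sentence in text.split("."):
            yield synthesize_sentence(sentence)

    # Style 2: the app supplies a sink and the driver pushes into it.
    def synth_to_sink(text, sink):
        for sentence in text.split("."):
            sink(synthesize_sentence(sentence))

Without markers the two are trivially convertible into each other,
which is probably why it looks like a matter of style; the markers are
what can break this symmetry.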

    WW> BTW, an overlying problem that needs to be worked out across the
    WW> whole OS is the notion of managing contention for the audio
    WW> output device.  For example, do you queue, mix, cancel,
    WW> etc. multiple audio requests?  This situation will happen very
    WW> frequently in the case of an OS that plays audio when windows
    WW> appear and disappear - this behavior causes contention with a
    WW> screen reader that wants to also say the name of the window that
    WW> was just shown.

IMO there's no need to care about this; we can expect that a
speech-enabled desktop is properly configured.

    WW> I think the main thing to think about here is how the app is to
    WW> get the events.  A few methods:

    WW>     1) Sending an audio stream and a marker index to the client.
    WW> This gives the client more control and allows it to manage its
    WW> own destiny.  Adds a bit of complexity to the client, though.

    WW>     2) Sending a linear sequence of audio data and marker data.
    WW> Similar to #1, but I'm not so sure you're going to find a
    WW> synthesizer that implements things this way.

festival-freebsoft-utils does it this way, because it's a simpler
interface than 1).
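
Just to illustrate the idea of 2), the linear sequence can simply
interleave the two kinds of items.  This is only a Python sketch with
invented names, not how festival-freebsoft-utils is actually written:

    def synthesize_sentence(sentence):
        return b"\x00\x00" * 1000   # stand-in for the real engine call

    def synth_with_markers(fragments):
        # 'fragments' is a list of (marker_name, sentence) pairs; the
        # result is one sequence mixing audio chunks and marker events
        # in the order in which they occur in the speech.
        for name, sentence in fragments:
            yield ("marker", name)
            yield ("audio", synthesize_sentence(sentence))

    for kind, payload in synth_with_markers([("s1", "Hello."),
                                             ("s2", "Good bye.")]):
        if kind == "marker":
            print("reached marker", payload)
        else:
            pass                    # play or forward the audio chunk

The client gets each marker exactly where it belongs in the audio,
without a separate event channel to synchronize.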

    WW>     3) The MRCP way (I think), which is to have separate things
    WW> for playing and handling events.  The synthesizer will spew data
    WW> to the audio sink and events to the clients.  The timing issues
    WW> here are a bit odd to me, because one can never be sure the
    WW> client receives the event the moment (or even near the moment)
    WW> the audio is played.  In any case, this is most similar to what
    WW> the hardware synthesizers are doing.

In addition to the timing issues, the sink in software speech is rarely
a direct audio output, so this method is completely unsuitable for
software synthesis.

    >>> I'd suggest using SSML instead of VoiceXML.  If I'm not
    >>> mistaken, SSML is what is aimed at TTS, while the purpose of
    >>> VoiceXML is different.

    WW> Just as a clarification: SSML is a sub-spec of the VoiceXML
    WW> effort.  It is based upon JSML, the Java Speech API Markup
    WW> Language, which was created by my group here at Sun.

I'm a bit confused.  According to http://www.w3c.org/Voice/, VoiceXML
is an XML markup language which, together with SSML, is part of the
Voice Browser activity.
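
BTW, SSML also covers the markers discussed above with its mark
element; if I read the spec correctly, a minimal document might look
like this (the name value is arbitrary):

    <speak version="1.0" xml:lang="en"
           xmlns="http://www.w3.org/2001/10/synthesis">
      The window <mark name="win-title"/> Terminal has just been opened.
    </speak>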

Regards,

Milan Zamazal

-- 
I think any law that restricts independent use of brainpower is suspect.
                                               -- Kent Pitman in comp.lang.lisp

