[fdo] Re: TTS API

Willie Walker William.Walker at Sun.COM
Mon Nov 1 10:17:55 PST 2004


>     WW> and the time to cancel a request in process should also be
>     WW> "very short".
>
> In case of hardware synthesizers, yes.  In case of software
> synthesizers, "reasonably short", if only so as not to waste too much
> CPU time when there's no longer a need to synthesize the rest of a
> 1 MB text that was just sent to the synthesizer.

Sorry - I was unclear about what I meant here.  The main requirement
is that the time between issuing a cancel request and the time the
next speak request is handled should be "very short."  That is, the
synthesizer should become available very quickly the moment a cancel
request is issued.  The idea here is that as I type/tab away, the
thing currently being spoken can be interrupted and the new thing
spoken immediately.

How or what the synthesizer does to handle this is an implementation
detail.
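
To make that concrete, here is a rough sketch of the interaction
pattern I have in mind.  It is written in Python against a purely
hypothetical client interface -- the names TTSClient, speak() and
cancel() are illustrative only and not part of any proposed API:

    class TTSClient:
        """Hypothetical client; the real transport and engine binding
        are omitted -- this only illustrates the calling pattern."""

        def speak(self, text):
            # Queue the text for synthesis and return immediately.
            ...

        def cancel(self):
            # Stop the current utterance and drop any queued audio.
            # After this returns, the engine must be ready to accept
            # and start a new speak() request "very quickly".
            ...

    client = TTSClient()
    client.speak("File menu")    # user is tabbing through menus
    client.cancel()              # user tabs again before it finishes
    client.speak("Edit menu")    # this must begin speaking right away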

>     WW> BTW, an overlying problem that needs to be worked out across
>     WW> the whole OS is the notion of managing contention for the
>     WW> audio output device.  For example, do you queue, mix, cancel,
>     WW> etc. multiple audio requests?  This situation will happen very
>     WW> frequently in the case of an OS that plays audio when windows
>     WW> appear and disappear - this behavior causes contention with a
>     WW> screen reader that wants to also say the name of the window
>     WW> that was just shown.
>
> IMO there's no need to care about this; we can expect the
> speech-enabled desktop to be properly configured.

One would hope.  In the past month, however, I've seen evidence to the
contrary.  We engine providers are on the front line -- my comments to
this group are based on real world experiences and requests from real
users (I'm guessing your comments are, too, so please don't take my
comment as a back-handed attack -- I mean nothing of the sort).

Perhaps as an addendum to the requirements document, we need to make a
list of assumptions; this would be one of them.

>     WW> Just as a clarification: SSML is a sub-spec of the VoiceXML
>     WW> effort.  It is based upon JSML, the Java Speech API Markup
>     WW> Language, which was created by my group here at Sun.
>
> I'm a bit confused.  According to http://www.w3c.org/Voice/, VoiceXML
> is an XML markup which, together with SSML, is part of the Voice
> Browser activity.

I think we're saying the same thing here.  VoiceXML is the overarching
specification.  It is the language one uses to define the
speech dialog.  There are many facets to a dialog, including defining
what people can say over the course of the dialog and what the system
says to the user.  Many of these facets are defined by sub-specs.  For
example, the W3C Speech Synthesis Markup Language (SSML) is used for
speech synthesis, and the Speech Recognition Grammar Specification
(SRGS) is for speech recognition.

The main point is that SSML can stand alone and does not have
dependencies on the other specs.  That is, the requirement would
be clearer by stating support for SSML instead of VoiceXML.  One
might also further refine this requirement as is done in the
MRCP V2 spec (the quotes are MRCP terms and statements):

    1) a "basic synthesizer," which provides very limited
       capabilities.  It only needs to support the <speak>,
       <audio>, <sayas>, and <mark> elements and can achieve
       synthesis "by playing out concatenated audio clips."

    2) a "speech synthesizer," which is "capable of rendering
       regular speech and SHOULD have full SSML support."
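
For concreteness, a minimal document that stays within that "basic"
element set might look something like the sketch below.  I'm using the
element spellings from the SSML 1.0 recommendation (<say-as> rather
than <sayas>), and the audio file name is made up:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xml:lang="en-US">
      You have <say-as interpret-as="cardinal">3</say-as> new messages.
      <mark name="before-chime"/>
      <audio src="chime.wav"/>
    </speak>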

Will


