[Accessibility] TTS API document updated
hanke at brailcom.org
Wed Feb 22 23:04:58 PST 2006
Thank you for your comments.
> > 1.4. All language dependent functionality should be covered here,
> > not in applications.
> Comment: I don't understand this. Perhaps we could elaborate more?
Is this more clear?
1.4 All language-dependent functionality with respect to text processing
for speech synthesis should be covered in the synthesizers or synthesis
drivers, not in applications.
This means applications should not be required to do sentence boundary
detection, replacement of '?' with ``question mark'', etc.
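For illustration, here is a minimal Python sketch (function and table names
are mine, not part of the API) of the kind of language-dependent replacement
that should happen in the synthesizer or driver rather than in the application:

```python
# Hypothetical sketch: verbalize punctuation in English.
# A real driver would do this per language and per punctuation mode.
PUNCTUATION_NAMES = {
    "?": "question mark",
    "!": "exclamation mark",
    ".": "period",
}

def verbalize_punctuation(text):
    """Replace punctuation characters with their spoken English names."""
    out = []
    for ch in text:
        if ch in PUNCTUATION_NAMES:
            out.append(" " + PUNCTUATION_NAMES[ch] + " ")
        else:
            out.append(ch)
    # Normalize whitespace introduced by the replacements.
    return " ".join("".join(out).split())

print(verbalize_punctuation("Ready?"))  # Ready question mark
```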
> > 3.2. SHOULD HAVE: An application will be able to specify the default
> > prosody and style elements for a voice. These elements will match
> > those defined in the SSML specification, and the synthesizer may
> > choose which attributes it wishes to support. Note that prosody,
> > voice and style elements specified in SSML sent as a `speak'
> > request will override the default values.
> Suggest: request will temporarily override the default values.
> Suggest: Change to SHOULD HAVE and add
> 4.2.1 MUST HAVE: The application is able to discover if the
> synthesizer supports SSML markup at all.
This already is a MUST HAVE, see (1.7). However...
> If a synthesizer cannot support SSML, it should at least be able to ignore the markup and speak the
> contained text, but if not, it must at least inform the application so the
> application can remove the markup before sending it to the synthesizer.
...removing the markup is easy enough that it can be done in the
synthesizer driver, so the TTS API can always (MUST HAVE) accept SSML.
Why should applications have to remove SSML themselves when this
can be implemented in one place for the synthesizers that can't do it?
Speech Dispatcher output modules handle SSML this way right now
for Flite and for Generic. Applications don't have to care.
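As an illustration of how a driver could fall back to plain text, here is a
rough Python sketch (the function name is mine, and real drivers may not use
an XML parser at all) that discards SSML markup and keeps only the text content:

```python
import xml.etree.ElementTree as ET

def strip_ssml(document):
    """Fallback for synthesizers without SSML support: parse the
    markup and keep only the character data, never speaking tags."""
    root = ET.fromstring(document)
    # itertext() yields all text and tail fragments in document order.
    return " ".join("".join(root.itertext()).split())

print(strip_ssml("<speak>Hello <emphasis>world</emphasis>!</speak>"))
# Hello world!
```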
> Speaking the markup would be the worst of all possibilities.
> If a synth author doesn't want to support SSML, I'd prefer they be able
> to at least meet the rest of this specification. If SSML is MUST HAVE,
> a synth author might say to himself, "Well I can't meet this MUST HAVE
> requirement, so I'll ignore the entire specification."
This document doesn't *directly* specify what the synthesizers
themselves need to support. It specifies what API should be provided on
the system for us to interface with them. Indirectly, it also suggests to
the synthesizer creators what is important for us.
The requirements on the synthesizers themselves would be very similar
to this, but there might be some small differences. You pointed
out one of them.
> > 4.4. MUST HAVE: An application must be able to cancel a synthesis
> > operation in progress. In case of hardware synthesizers, this
> > means cancelling the audio output as well.
> Change to: In case of hardware synthesizers, or synthesizers that produce
> their own audio, this means..
Yes, good point.
> > 4.13. SHOULD HAVE: The synthesizer should not try to split the
> > contents of the `s' SSML element into several independent pieces,
> > unless required by a markup inside.
> > Rationale: An application may have better information about the
> > synthesized text and perform its own utterance chunking.
> Comment: Synthesis authors might balk at this, as utterance chunking is
> usually an integral part of synthesis. There could be confusion over
> terminology here. For example, in Festival, "chunking" is the process of
> analyzing a sentence for parts of speech and grouping the sentence into noun
> phrases, verb phrases, etc. I'm not sure anymore why this is even here.
I think what was meant is rather what KTTSD calls sentence boundary
detection, where the text is cut into several pieces for performance
reasons. But I don't know more. Can someone clarify that, please?
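If sentence boundary detection is indeed what was meant, a naive sketch
might look like this (purely illustrative; real chunking has to handle
abbreviations, ellipses, numbers, and language-specific rules):

```python
import re

def split_sentences(text):
    """Naive sentence boundary detection: cut after '.', '!' or '?'
    followed by whitespace, so long texts can be synthesized and
    queued piece by piece."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello there. How are you? Fine."))
# ['Hello there.', 'How are you?', 'Fine.']
```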
> > 1 (SHOULD HAVE) characters
> > The format 'glyphs' is supported.
> Comment: glyphs??
It's the classical spelling; see the SSML specification for say-as.
> > 4 (NICE TO HAVE) tts:digits
> > Indicates the content of the element is a number.
> > The attribute "detail" is supported and can take a numerical
> > value, meaning how many digits should the synthesizer group
> > for reading. The value of 0 means the number should be
> > pronounced as a whole, while any non-zero value means that a
> Suggest: I would use "grouping" rather than "detail".
It seems that the SSML specification prefers not to introduce attributes
that would only have a meaning with one value of another attribute, so as
to keep the number of attributes minimal.
You can see that in a couple of places. I think the most obvious is
right in the say-as definition: if you look at 'ordinals', there
'detail' contains a list of separators. This seems to be a deliberate
decision, but it is not explicitly explained anywhere in the specification.
I just kept the style of the SSML specs, but I don't have any strong
opinion on that.
> Comment: Is the "detail" attribute really needed? Couldn't I do the same
> thing using markup like this:
If we decide the value of 'detail' is not needed, then the whole attribute
is not needed, since it would be better to use the attribute 'ordinal' in
your example. (Rant: Your example is incorrect. tts:digits is a say-as
attribute value, so the tag is <say-as ...>.)
However, I think that splitting this explicitly is not the task of the
application. The typical situation will likely be: the user currently
prefers to have the digits read grouped by 3 because it suits the
situation he is in, so he asks his application for that grouping and the
application asks TTS API via global settings.
But we decided earlier that those global settings should match the
settings you are able to express via SSML.
Several users had that requirement for Speech Dispatcher.
And I think it is useful in itself too. It is not clear from
your example whether you have three three-digit numbers or one nine-digit
number.
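To illustrate what grouping would mean for the listener, here is a
hypothetical Python sketch of how a synthesizer might interpret the
numerical 'detail' value (the function name and exact behavior are my
assumptions, not part of the draft):

```python
def group_digits(number_text, grouping):
    """Group a digit string for reading.  grouping == 0 means read the
    number as a whole; a non-zero value inserts a pause boundary after
    every `grouping` digits."""
    if grouping == 0:
        return number_text
    return " ".join(number_text[i:i + grouping]
                    for i in range(0, len(number_text), grouping))

print(group_digits("123456789", 3))  # 123 456 789
print(group_digits("123456789", 0))  # 123456789
```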
> Comment: Need to specify where the <tts:style> element may occur within SSML
> and whether it contains content. I think you intend for it to occur within a
> <s> or <p> elements and contain the content to be spoken in the indicated
> style. For example,
> <s>The abbreviation <tts:style field="capital_letters"
> mode="spell">TTS</tts:style> stands for text to speech.</s>
Not really. You might want to read whole paragraphs with punctuation
information; actually, this is what users most often want to do.
So it must be possible to include most other markup in <tts:style>.
You are right that this needs to be specified.
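For the record, a sketch of what I mean, with <tts:style> wrapping a whole
paragraph that itself contains other markup (the element and attribute
names follow the draft under discussion; the content is invented):

```xml
<speak>
  <tts:style field="punctuation" mode="all">
    <p>
      <s>Is this clear?</s>
      <s>The abbreviation <tts:style field="capital_letters"
         mode="spell">TTS</tts:style> stands for text to speech.</s>
    </p>
  </tts:style>
</speak>
```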