[Accessibility] TTS API document updated

Gary Cramblitt garycramblitt at comcast.net
Thu Feb 23 07:11:00 PST 2006


On Thursday 23 February 2006 02:04, Hynek Hanke wrote:
> > >     1.4. All language dependent functionality should be covered here,
> > > not in applications.
> >
> > Comment: I don't understand this.  Perhaps we could elaborate more?
>
> Is this more clear?
>
> 1.4 All language dependent functionality with respect to text processing
> for speech synthesis should be covered in the synthesizers or synthesis
> drivers, not in applications.
>
> It means the applications should not be required to do sentence boundary
> detection, replacement of '?' with "question mark", etc.

Yes, that is clearer, especially with the last part.
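
To make the intent concrete, this is roughly the kind of language-dependent 
substitution that would live in the synthesizer or its driver rather than in 
every application.  A toy Python sketch only; the table and the function name 
are invented for illustration and are not taken from any real driver:

    # Invented English-only table; a real driver would need one per language
    # and would honour the punctuation level requested through the API.
    PUNCTUATION_WORDS = {
        "?": "question mark",
        "!": "exclamation mark",
        ";": "semicolon",
    }

    def verbalize_punctuation(text):
        """Replace punctuation characters with their spoken names."""
        out = []
        for ch in text:
            if ch in PUNCTUATION_WORDS:
                out.append(" " + PUNCTUATION_WORDS[ch])
            else:
                out.append(ch)
        return "".join(out)

    # verbalize_punctuation("Ready?")  ->  "Ready question mark"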

> > Suggest:  Change to SHOULD HAVE and add
> >
> >         4.2.1 MUST HAVE: The application is able to discover if the
> > synthesizer supports SSML markup at all.
>
> This already is a MUST HAVE, see (1.7). However...
>
> > If a synthesizer cannot support SSML, it should at least be able to
> > ignore the markup and speak the contained text, but if not, it must at
> > least inform the application so the application can remove the markup
> > before sending it to the synthesizer.
>
> ...removing the markup is easy enough that it can be done in the
> synthesizer driver. So the TTS API can always (MUST HAVE) accept SSML.
> Why should the applications have to remove SSML themselves if this
> can be implemented in one place for the synthesizers that can't do it
> themselves?
>
> Speech Dispatcher output modules are handling SSML this way right now
> for Flite and for Generic. Applications don't have to care.
>
> > Speaking the markup would be the worst of all possibilities.
>
> Yes.
>
> > If a synth author doesn't want to support SSML, I'd prefer they be able
> > to at least meet the rest of this specification.  If SSML is MUST HAVE,
> > a synth author might say to himself, "Well I can't meet this MUST HAVE
> > requirement, so I'll ignore the entire specification."
>
> This document doesn't *directly* specify what the synthesizers
> themselves need to support. It specifies what API should be provided on
> the system for us to interface with them. Indirectly, it also suggests to
> the synthesizer creators what is important for us.
>
> The requirements on the synthesizers themselves would be very similar
> to this, but there might be some little differences. You pointed
> out one of them.

I'm confused, probably my own fault.  I thought this document was intended for 
distribution to synthesis authors.  Because of that, I'm operating under the 
assumption that we are asking for the minimum functionality that we need from 
them.  If there is some additional functionality that can be layered in a 
platform API such as Gnome Speech, KTTS, or Speech Dispatcher, we should 
leave that out of this specification, or at least put it into the NICE TO 
HAVE category.

I went back to the discussion on this list from October 2004 and it seems this 
issue was never really clearly settled.  I thought it was settled because we 
explicitly left things like queueing, prioritization, and scheduling as OUT 
OF SCOPE.
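
Coming back to the SSML point above: I agree that stripping the markup in the 
driver is cheap.  Something roughly like the following would do for a synth 
that cannot interpret SSML at all.  This is only a sketch, not Speech 
Dispatcher's actual output module code, and error handling is omitted:

    import xml.etree.ElementTree as ET

    def strip_ssml(ssml_document):
        """Return the plain text of an SSML document with all markup removed."""
        root = ET.fromstring(ssml_document)
        return "".join(root.itertext()).strip()

    # strip_ssml('<speak>Hello <emphasis level="strong">world</emphasis>!</speak>')
    # -> 'Hello world!'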

> > >     4.13. SHOULD HAVE: The synthesizer should not try to split the
> > >       contents of the `s' SSML element into several independent pieces,
> > >       unless required by a markup inside.
> > >       Rationale: An application may have better information about the
> > >       synthesized text and perform its own utterance chunking.
> >
> > Comment: Synthesis authors might balk at this, as utterance chunking is
> > usually an integral part of synthesis.  There could be confusion over
> > terminology here.  For example, in Festival, "chunking" is the process of
> > analyzing a sentence for parts of speech and grouping the sentence into
> > noun phrases, verb phrases, etc.  I'm not sure anymore why this is even
> > here.
>
> I think what was meant is rather what KTTSD calls sentence boundary
> detection, where the text is cut into several pieces for performance
> reasons. But I don't know more. Can someone clarify that, please?

Yes, KTTSD does its own sentence boundary detection largely because of the 
limitations of the current synths.  If all synths followed this 
specification, sentence boundary detection would not be necessary (however 
see below).  I recommend removing 4.13.
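
For reference, sentence boundary detection at the KTTS level is essentially a 
split like the following.  This is a naive sketch; KTTS's real rules are 
configurable and more involved:

    import re

    def split_sentences(text):
        """Naively split at '.', '!' or '?' followed by whitespace."""
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    # split_sentences("Dr. Smith arrived. He sat down!")
    # -> ['Dr.', 'Smith arrived.', 'He sat down!']
    # Note the bad split after the abbreviation -- exactly the kind of
    # language-dependent problem we would rather leave to the synthesizer.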

> > > 	4 (NICE TO HAVE) tts:digits
> > > 		Indicates the content of the element is a number.
> > > 	The attribute "detail" is supported and can take a numerical
> > > 	value, meaning how many digits should the synthesizer group
> > > 	for reading. The value of 0 means the number should be
> > > 	pronounced as a whole, while any non-zero value means that a
> >
> > Suggest: I would use "grouping" rather than "detail".
>
> It seems that the SSML specification prefers not to introduce attributes
> that only have a meaning in combination with one value of another attribute,
> so as to keep the number of attributes minimal.
>
> You can see that in a couple of places. I think the most obvious is
> right in the say-as definition. If you look at 'ordinals', 'detail' there
> contains a list of separators. It seems to be a deliberate decision, but it
> is not explicitly explained anywhere in the specification.
>
> I just kept the style of the SSML specs. But I don't have any strong
> opinion on that.

OK.  Consistency is good.
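
One more thought on tts:digits, just to check that I read the "detail" 
semantics correctly.  Something like the sketch below is what I have in mind; 
the left-to-right grouping direction is my own guess, since the draft does not 
say:

    def group_digits(number_text, detail=0):
        """detail == 0: speak as a whole; detail == n: group n digits at a time."""
        if detail == 0:
            return number_text
        return " ".join(number_text[i:i + detail]
                        for i in range(0, len(number_text), detail))

    # group_digits("19141918", 4)  ->  "1914 1918"
    # group_digits("19141918", 0)  ->  "19141918"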

I want to add one other comment, which is also a concern T.V. Raman raised in 
a private email.

>     4.12. SHOULD HAVE: The application must be able to pause and resume
>       a synthesis operation in progress while still being able to handle
>       other synthesis requests in the meantime.  In case of hardware
>       synthesizers, this means pausing and if possible resuming the
>       audio output as well.

Once again, assuming this document is targeted towards synthesis authors, this 
is asking them for a lot.  It means they must write multi-threaded synths, or 
at least be prepared to save the synth state into a queue in order to process 
another request.  This would greatly complicate their code in a task that is 
already quite complex.  We need to think carefully about this one.  Is there 
a way this can be implemented in higher-level modules?  This is one of the 
reasons KTTS does its own sentence boundary detection.  Since KTTS 
sends only one sentence at a time to the synth, it can always pause one job 
(on a sentence boundary), do something else, and resume later, even if the 
synth does not support a pause or stop operation.  Of course, sentence 
boundary detection brings its own problems to the table -- adapting to 
multiple languages being one of them.
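
In other words, the higher layer gets pause and resume almost for free once it 
feeds the synth one sentence at a time.  A minimal sketch of that idea, with 
synthesize_sentence standing in for whatever driver call we end up with (it is 
not a real API):

    from collections import deque

    class SentenceJob:
        """Speak a list of sentences, pausable on sentence boundaries."""

        def __init__(self, sentences):
            self.queue = deque(sentences)
            self.paused = False

        def pause(self):
            self.paused = True      # takes effect at the next boundary

        def resume(self, synthesize_sentence):
            self.paused = False
            self.run(synthesize_sentence)

        def run(self, synthesize_sentence):
            while self.queue and not self.paused:
                synthesize_sentence(self.queue.popleft())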

Actually, in KTTS it's more complicated than that.  KTTS gets the audio from 
the synth and plays it itself.  This permits it to pause and stop instantly 
in mid-sentence.  While one sentence is playing, the synths are kept busy 
synthesizing the next few sentences.

Throw hardware synths into the picture, which do their own audio playback, and 
things get a lot more complicated.

All things considered, I don't think it is reasonable to ask synth authors to 
solve these problems.  We need to ask them for just enough basic 
functionality so that we can solve the problems ourselves.  I don't think we 
have a consensus yet on just what that solution is, so it needs to be discussed.

-- 
Gary Cramblitt (aka PhantomsDad)
KDE Text-to-Speech Maintainer
http://accessibility.kde.org/developer/kttsd/index.php

