[fdo] Re: TTS API
Gary Cramblitt
garycramblitt at comcast.net
Thu Nov 4 07:48:47 PST 2004
On Thursday 04 November 2004 07:15 am, Milan Zamazal wrote:
> I expanded Olaf's initial document and tried to summarize the
> current state of our discussion about the requirements and related
> issues in it. Please tell me if anything is missing or unclear and
> let's try to finish the requirements section by resolving the open
> questions.
>
> Regards,
>
> Milan Zamazal
>
> Common TTS Interface
> ====================
Nice work. This is very helpful.
> The synthesis process:
>
> - Synthesis of a given piece of text expressed in a markup format
> [unresolved: Which one? SSML or a reasonable subset of it? Isn't SSML
> covered by patents preventing Free Software and Open Source programs
> from using it? How about character encoding -- would it suffice to
> use UTF-8 everywhere?].
Based on input I received in another thread, SSML does not appear to be
encumbered. IANAL. I also think SSML is probably the best choice as the
basis for the markup format, although we may need to simplify and/or extend
it. For
example, SSML provides a number of ways to control volume, including
relative/absolute numerics, etc. Within KTTSD we just use soft/normal/loud.
(If users want finer control than that, they can adjust the volume of their
audio device.) What we want is a reasonable subset that provides a
rich-enough speech environment to meet the needs of applications (especially
screen readers) without burdening driver authors with having to do a full
SSML implementation. For KTTSD, I chose the following, but this is getting
down into the details we can thrash out later:
volume="soft/medium/loud"
gender="male/female/neutral" (neutral means the synth doesn't know)
rate="slow/medium/fast"
and for synths that provide multiple voices and languages:
name="voice name"
lang="language-code"
A synth might want to provide full volume and rate attribute support
(numerics, etc.) but that would be optional.
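For illustration only (a sketch, not a proposal -- note that in full SSML
the language comes from xml:lang and the volume/rate attributes live on
the <prosody> element), a request using such a subset might look like:

  <speak xml:lang="en-US">
    <voice name="kal" gender="male">
      <prosody rate="medium" volume="soft">
        The quick brown fox jumps over the lazy dog.
      </prosody>
    </voice>
  </speak>

A driver supporting only the coarse values would map anything finer to
the nearest of soft/medium/loud and slow/medium/fast.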
>
> - Synthesis of characters and key names [possibly using custom SSML
> attribute values?]. Rationale: It's not possible to express them just
> in the form of ordinary text without language-specific knowledge.
SSML provides the "say-as" element for this but does not specify the actual
attribute values. The W3C working group is supposedly working on a separate
document for these.
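For example (treating "characters" as a placeholder value until the
official list is published):

  <say-as interpret-as="characters">KDE</say-as>

would be spoken as "K D E". Key names would presumably need a value we
define ourselves, e.g. something like interpret-as="x-key" for input
such as "Ctrl+C".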
> - [Unresolved: Should the driver be able to receive the markup text to
> synthesize in several pieces? The motivation is to ease processing of
> texts in KTTSD a bit. I personally don't think it's a valid reason to
> complicate the interface, considering it has nothing to do with the
> speech synthesis process. But maybe I'm still missing something.]
I'm not sure where this requirement came from or what is meant here. I'm the
KTTSD maintainer, and KTTSD does not need this for the low-level drivers.
KTTSD does provide this capability at higher levels, i.e., to
applications. I agree that a low-level driver interface for this would be
an unnecessary complication.
KTTSD breaks text given to it by applications into smaller pieces (sentence
parsing) before passing them on to the synthesis engine, but it gives each
piece to the engine one at a time. Sentence parsing is done for several
reasons, mostly to get around the limitations of the currently-available
engines:
1. It permits rewinding by sentence or part ("I didn't quite understand
that last sentence. Repeat it, please."), as well as fast forwarding,
without having to re-synthesize all the way from the beginning.
2. It permits progress monitoring. (See marker discussion below.)
3. It permits faster time from request to first audio. (Well, yes and no.
This is a complex problem. Some engines do a good job of this; others
don't.) By breaking long text into sentences, KTTSD endeavors to keep the
synth engine synthesizing while simultaneously audibilizing the sentences
already synthesized.
4. It permits injection of higher-priority messages into normal text, i.e.,
interruption. "And here we have a detailed data analysis. <ding> Incoming
phone call from Mom. <ding> The x axis shows the widget count while the y
axis shows.."
5. It permits graceful aborting of speech when the synth engine doesn't
provide a stop capability. In that case, there are two alternatives: 1)
abort the process and pay the price in restart time, or 2) just let the
synth finish, but throw away the result. When long text has been broken
into smaller pieces, alternative 2 can give good performance.
On the other hand, sentence parsing can be a difficult problem, and is
language and context dependent. (A naive split on periods mangles
abbreviations like "Dr." or "e.g.", for instance.) Some would argue that
it should be done by the synthesis engine, since most engines must
already do so to some extent.
If a synth engine were available that had all the capabilities I mentioned
above (and probably others I haven't mentioned), there would be no need to do
sentence parsing in KTTSD, but adding these capabilities to the low-level
driver would greatly complicate its interface. All things considered, I
think the low-level API should not provide a capability to receive text in
pieces. Leave that to higher levels.
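To make the shape of that concrete, here is a minimal sketch in C (the
type and function names are mine, purely illustrative, not a proposal):

  /* Opaque per-driver handle. */
  typedef struct tts_driver tts_driver_t;

  /* One complete, self-contained piece of markup per call; the higher
     layer (KTTSD or its equivalent) does any sentence parsing first. */
  int tts_say(tts_driver_t *drv, const char *markup_utf8);

  /* Optional. A driver may report that it cannot stop mid-utterance,
     in which case the higher layer falls back to alternative 2 above. */
  int tts_stop(tts_driver_t *drv);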
> Software synthesis:
>
> - [Should it be allowed to return the audio data in several separate
> pieces? It complicates returning them, but what if the synthesizer
> splits a long input text and is unable to merge the resulting wave
> forms? Should the driver be responsible for handling this?]
>
> Index markers:
>
> - Support for identifying where or when given places (index markers) in
> the input text are reached. [Not all synthesizers can support this --
> should index marking be optional?]
>
> - Software synthesis must identify the positions of index markers within
> the returned audio data. [The question is how to do it. If we are
> able to return the audio output in several pieces, then we can think
> about a linear sequence of audio samples and marker identifiers, where
> each marker is placed at its position between separate audio samples.
> Another possible way is to write times of reaching the markers in the
> produced audio data to a separate stream; this works with single audio
> output but it requires certain precautions to ensure the marker is not
> missed on the marker stream when playing data from the audio stream.]
>
> - [The KTTSD approach to warnings and messages suggests it could be
> useful if some sort of index markers could be inserted into the input
> texts automatically, at breakable places, i.e. places where the
> audio output can be interrupted without breaking the speech at an
> unsuitable point (e.g. in the middle of a word or short sentence).
> This can be useful for pausing the speech or for speaking unrelated
> important messages when reading longer pieces of text. What do you
> think?]
Hmm. More complication. Since KTTSD already does sentence parsing, markers
are easy to support, as long as accuracy is only required to the sentence
level. If we agree that sentence parsing should be handled in higher
layers, then marker support is unnecessary for software synths, and
dropping it also eliminates the need for wave form merging in the
low-level driver. (Marker support might still be needed for hardware
synths.)
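If markers do stay in the API (for hardware synths, say), one way to
report them -- again just a sketch with made-up names -- would be a
callback carrying the marker name and its position:

  /* For a software synth, sample_offset is the marker's position within
     the returned audio; for a hardware synth, the callback would fire
     when the marker is actually reached during playback. */
  typedef void (*tts_marker_cb)(const char *marker_name,
                                unsigned long sample_offset,
                                void *user_data);

This corresponds to Milan's separate-stream alternative; the interleaved
audio-plus-marker sequence only makes sense if audio can be returned in
several pieces.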
>
> Setting speech parameters:
>
> - It should be possible to switch reading modes of the synthesizer,
> namely: punctuation mode, capital letter signalization mode, spelling
> mode.
SSML "say-as" element.
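These modes would presumably map onto say-as values, partly standard and
partly our own extensions (here marked with an x- prefix, since the
official values are not yet published). A sketch:

  spelling mode:     <say-as interpret-as="characters">kde</say-as>
  punctuation mode:  <say-as interpret-as="x-punctuation-all">Hi, there.</say-as>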
>
> Retrieving available parameter values:
>
> - It should be possible to return a list of supported languages and
> voices (identified by parameters matching the voice selection mechanism
> of the chosen input text markup). [Other parameters?] [Shouldn't this
> be optional? I'm not sure all synthesizers are able to provide this
> information.]
If there is more than one driver available, how else would higher levels
decide which one to use? I'm missing something here.
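Concretely, a query along these lines (again, invented names) seems hard
to avoid, even if a driver is allowed to return an empty list:

  /* Description of one available voice. */
  typedef struct {
      const char *name;    /* e.g. "kal" */
      const char *lang;    /* e.g. "en" or "en-US" */
      const char *gender;  /* "male", "female", or "neutral" */
  } tts_voice_t;

  /* Returns a NULL-terminated array; empty if the driver genuinely
     cannot enumerate its voices. */
  const tts_voice_t **tts_list_voices(tts_driver_t *drv);

This would make the capability formally optional (an empty list) while
still giving higher levels something to base the decision on.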
--
Gary Cramblitt (aka PhantomsDad)
KDE Text-to-Speech Maintainer
http://accessibility.kde.org/developer/kttsd/index.php