[fdo] Re: TTS API
Milan Zamazal
pdm at freebsoft.org
Thu Nov 4 04:15:21 PST 2004
I have expanded Olaf's initial document and tried to summarize in it
the current state of our discussion about the requirements and related
issues. Please tell me if anything is missing or unclear, and let's
try to finish the requirements section by resolving the open
questions.
Regards,
Milan Zamazal
Common TTS Interface
====================
* Introduction
The purpose of this document is to define a common interface to speech
synthesizers. The common interface is needed to avoid duplicate work
when implementing speech synthesizer drivers for different free software
higher level speech frameworks (like gnome-speech, KTTSD or Speech
Dispatcher) and to allow uniform access to all supported speech
synthesizers from within those frameworks.
* Scope of the Document
The specification covers all the necessary text-to-speech functions
that are directly related to speech synthesizers, in particular all
language dependent text-to-speech functions. The particular areas
covered by this specification include:
- Synthesizing a given piece of text.
- Stopping the synthesis process.
- Managing audio output of hardware speech synthesizers.
- Setting basic speech synthesis parameters like pitch, rate or volume.
- Using different languages, voices and reading modes.
- Index marking.
- Configuration interface.
- Definition of a low level interface to be used to access the drivers.
On the other hand, the following areas are explicitly omitted:
- Message management (queueing, ordering, interleaving, etc.). This is
a job of the higher level speech frameworks, not of speech
synthesizers.
- Interfacing with audio devices. Again, this is a job of specialized
software (higher level speech frameworks and sound servers), not of
speech synthesizers. However, in the case of hardware speech
synthesizers that use solely their own audio output, it is necessary
to manage that audio output too.
- Higher level interfaces accessing the drivers, like IPC or socket
based interfaces. Different projects are designed differently and use
different forms of interprocess communication, and given the current
state of things it is unlikely that a consensus on a common high level
communication interface to the drivers can be reached. Those
interfaces are therefore left to be implemented separately by the
projects that need them; they are expected to be written as wrappers
around the common low level access interface.
- Interaction with other components of the operating system. The
speech synthesis process itself is unlikely to interfere with other
parts of the operating system in any unusual way. This may not apply
to higher level speech frameworks, but those are out of the scope of
this document.
* General Requirements on the TTS Interface
The synthesis process:
- Synthesis of a given piece of text expressed in a markup format; a
sketch of what such a call might look like follows this list
[unresolved: Which one? SSML or a reasonable subset of it? Isn't
SSML covered by patents preventing Free Software and Open Source
programs from using it? How about character encoding -- would it
suffice to use UTF-8 everywhere?].
- Synthesis of characters and key names [possibly using custom SSML
attribute values?]. Rationale: It is not possible to express them as
ordinary text without language specific knowledge.
- A short time before the first playable audio data is delivered after
initiating a new synthesis, even when another, previously issued
synthesis request needs to be stopped first.
- No significant performance hits (like long response times or wasted
CPU time or memory) when many synthesis requests arrive shortly after
each other, each cancelling the previous one.
- [Unresolved: Should the driver be able to receive the markup text to
synthesize in several pieces? The motivation is to ease processing of
texts in KTTSD a bit. I personally don't think this is a valid reason
to complicate the interface, considering it has nothing to do with the
speech synthesis process. But maybe I am still missing something.]
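To make the requirements above a bit more concrete, here is a purely
illustrative sketch of how the core synthesis entry points of a driver
might look. Nothing here is settled; the names (tts_handle, tts_open,
tts_say, tts_say_char, tts_say_key, tts_cancel) are placeholders of
mine, and SSML is assumed as the input markup only for the sake of the
example.

  /* tts_driver.h -- hypothetical sketch, not an agreed specification */
  #ifndef TTS_DRIVER_H
  #define TTS_DRIVER_H

  /* Opaque driver instance. */
  typedef struct tts_handle tts_handle;

  /* Open and close a driver instance. */
  tts_handle *tts_open(const char *driver_config);
  void tts_close(tts_handle *h);

  /* Start synthesizing a UTF-8 encoded SSML document.  If another
     request is still in progress, the driver is expected to cancel it
     quickly, so that the first audio data of the new request is
     available with a minimal delay. */
  int tts_say(tts_handle *h, const char *ssml_utf8);

  /* Synthesize a single character or a key name; plain text is not
     sufficient here, because the pronunciation is language specific. */
  int tts_say_char(tts_handle *h, const char *utf8_char);
  int tts_say_key(tts_handle *h, const char *key_name);

  /* Cancel the current request as fast as possible. */
  int tts_cancel(tts_handle *h);

  #endif /* TTS_DRIVER_H */

A call could then look like
tts_say(h, "<speak xml:lang=\"en\">Hello <mark name=\"m1\"/>world.</speak>"),
with UTF-8 used everywhere.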
Software synthesis:
- [There is still no clear consensus on how to return the synthesized
audio data. Maybe we could agree it should be written to a given
binary stream? But in which format? What if the audio data is split
into several pieces (see below)? And how to return index marker
positions (see below)? One conceivable callback based approach is
sketched after this list.]
- [Should it be allowed to return the audio data in several separate
pieces? It complicates returning them, but what if the synthesizer
splits a long input text and is unable to merge the resulting wave
forms? Should the driver be responsible for handling this?]
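Just to have something concrete to discuss, the following sketch shows
one conceivable way of returning the audio data: the driver pushes
consecutive chunks of audio to a caller supplied callback, which would
also naturally accommodate output split into several pieces. The
names and the raw PCM description are again only mine, not a proposal
of the final format.

  #include <stddef.h>

  struct tts_handle;  /* opaque driver instance, as in the sketch above */

  /* Description of the raw audio produced by the driver; given here
     only as an example, the actual format question is still open. */
  typedef struct {
      int sample_rate;      /* e.g. 16000 Hz */
      int channels;         /* e.g. 1 */
      int bits_per_sample;  /* e.g. 16 bit signed, native endianness */
  } tts_audio_format;

  /* Called repeatedly with consecutive pieces of synthesized audio;
     final == 1 marks the last piece of the current request. */
  typedef void (*tts_audio_callback)(const void *samples,
                                     size_t size_bytes,
                                     int final,
                                     void *user_data);

  /* Register the callback; the driver fills in *format with the
     description of the audio data it is going to deliver. */
  int tts_set_audio_callback(struct tts_handle *h,
                             tts_audio_format *format,
                             tts_audio_callback callback,
                             void *user_data);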
Hardware synthesis:
- Management of the audio output: immediate stopping, pausing and
resuming.
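The corresponding driver calls for hardware synthesizers might then
simply be the following (hypothetical names again):

  struct tts_handle;  /* opaque driver instance */

  /* Audio output control for hardware synthesizers that use their own
     audio output.  All of these are expected to take effect
     immediately. */
  int tts_audio_stop(struct tts_handle *h);
  int tts_audio_pause(struct tts_handle *h);
  int tts_audio_resume(struct tts_handle *h);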
Index markers:
- Support for identifying where or when given places (index markers) in
the input text are reached. [Not all synthesizers can support this --
should index marking be optional?]
- Hardware synthesis must signal reaching index markers via callbacks,
at the moment the index marker is actually reached in the played audio
output.
- Software synthesis must identify the positions of index markers within
the returned audio data. [The question is how to do it. If we are
able to return the audio output in several pieces, then we can think
of the result as a linear sequence of audio pieces and marker
identifiers, where each marker is placed at its position between two
audio pieces. Another possible way is to write the times at which the
markers are reached in the produced audio data to a separate stream;
this works with a single audio output, but it requires certain
precautions to ensure a marker is not missed on the marker stream when
playing data from the audio stream.]
- [The KTTSD approach to warnings and messages suggests it could be
useful if some sort of index markers could be inserted into the input
texts automatically, at breakable places, i.e. places where the
audio output can be interrupted without breaking the speech at an
unsuitable point (e.g. in the middle of a word or a short sentence).
This can be useful for pausing the speech or for speaking unrelated
important messages while reading longer pieces of text. What do you
think?]
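To illustrate the two cases, here is one conceivable shape of the
marker interface. The event based representation for software
synthesis corresponds to the "linear sequence" idea above; all the
names are placeholders.

  #include <stddef.h>

  struct tts_handle;  /* opaque driver instance */

  /* Hardware synthesis: the driver invokes the callback at the moment
     the marker is actually reached in the audio output. */
  typedef void (*tts_marker_callback)(const char *marker_name,
                                      void *user_data);
  int tts_set_marker_callback(struct tts_handle *h,
                              tts_marker_callback callback,
                              void *user_data);

  /* Software synthesis: the result is a sequence of events, each of
     them either a piece of audio or a marker placed between two
     pieces. */
  typedef enum {
      TTS_EVENT_AUDIO,
      TTS_EVENT_MARKER
  } tts_event_type;

  typedef struct {
      tts_event_type type;
      const void *samples;     /* valid for TTS_EVENT_AUDIO */
      size_t size_bytes;       /* valid for TTS_EVENT_AUDIO */
      const char *marker_name; /* valid for TTS_EVENT_MARKER */
  } tts_event;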
Setting speech parameters:
- It should be possible to set basic speech parameters like language,
voice, rate, pitch and volume. Rationale: The parameters can be set
by the input text markup, but there should be a way to set the
defaults.
- It should be possible to switch reading modes of the synthesizer,
namely: punctuation mode, capital letter signalization mode, spelling
mode.
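As an illustration, the parameter and reading mode part of the
interface could be as simple as the following; the names and value
ranges are only examples of mine.

  struct tts_handle;  /* opaque driver instance */

  typedef enum {
      TTS_PUNCTUATION_NONE,
      TTS_PUNCTUATION_SOME,
      TTS_PUNCTUATION_ALL
  } tts_punctuation_mode;

  /* Default parameters; markup in the input text can still override
     them.  The ranges are illustrative only. */
  int tts_set_language(struct tts_handle *h, const char *language); /* e.g. "en" */
  int tts_set_voice(struct tts_handle *h, const char *voice_name);
  int tts_set_rate(struct tts_handle *h, int rate);     /* -100 .. 100 */
  int tts_set_pitch(struct tts_handle *h, int pitch);   /* -100 .. 100 */
  int tts_set_volume(struct tts_handle *h, int volume); /* 0 .. 100 */

  /* Reading modes. */
  int tts_set_punctuation_mode(struct tts_handle *h, tts_punctuation_mode mode);
  int tts_set_capital_letter_signalization(struct tts_handle *h, int on);
  int tts_set_spelling_mode(struct tts_handle *h, int on);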
Retrieving available parameter values:
- It should be possible to return a list of supported languages and
voices (identified by parameters matching the voice selection
mechanism of the chosen input text markup). [Other parameters?]
[Shouldn't this be optional? I'm not sure all synthesizers are able
to provide this information.] Rationale: This allows the higher level
speech frameworks and/or applications to make decisions about
selecting a supported language (when more language alternatives are
available on the input) or about selecting a particular supported
voice for a given piece of text, without the danger that the voice
gets quietly mapped onto the same voice as the surrounding text.
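One conceivable way of returning the list, with each voice described
by attributes roughly matching the voice selection attributes of the
input markup (names and fields purely illustrative):

  struct tts_handle;  /* opaque driver instance */

  typedef struct {
      const char *name;      /* synthesizer specific voice name */
      const char *language;  /* e.g. "en", "cs" */
      const char *dialect;   /* e.g. "en-US", NULL if unknown */
      const char *gender;    /* "male", "female" or "neutral" */
      int age;               /* 0 if unknown */
  } tts_voice;

  /* Returns the number of available voices and points *voices to an
     array owned by the driver, or returns -1 if the driver cannot
     provide the information (the facility being optional). */
  int tts_list_voices(struct tts_handle *h, const tts_voice **voices);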
Configuration:
- Getting and setting configuration parameters of the synthesizer.
[Should this be here or in a standard API for driver configuration
libraries?]
- This facility is optional.
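If the configuration facility stays in this interface, it might be
just a string keyed get/set pair along these lines (hypothetical
again):

  #include <stddef.h>

  struct tts_handle;  /* opaque driver instance */

  /* Both calls return 0 on success and -1 if the parameter is unknown
     or the facility is not supported by the driver. */
  int tts_config_set(struct tts_handle *h, const char *name,
                     const char *value);
  int tts_config_get(struct tts_handle *h, const char *name,
                     char *value, size_t value_size);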
* Interface Definition
[This is to be defined later, after the general requirements are
settled. In the meantime, we can think about an appropriate form of
the low level interface. Do we agree it should have the form of a
shared library accompanied by corresponding C header files?]
[Definition of the interface functions.]
[How to access the drivers in the operating system environment.]
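Assuming we agree on the shared library form, a higher level framework
might load a driver roughly like this. The entry point name tts_open,
the driver file name and its location are of course only placeholders.

  /* A rough sketch of loading a driver; compile with -ldl. */
  #include <dlfcn.h>
  #include <stdio.h>

  struct tts_handle;  /* opaque driver instance */

  /* Hypothetical driver entry point, see the sketches above. */
  typedef struct tts_handle *(*tts_open_fn)(const char *driver_config);

  int main(void)
  {
      /* The driver file name and location are placeholders. */
      void *lib = dlopen("/usr/lib/tts-drivers/festival.so", RTLD_NOW);
      if (lib == NULL) {
          fprintf(stderr, "cannot load driver: %s\n", dlerror());
          return 1;
      }
      tts_open_fn open_driver = (tts_open_fn) dlsym(lib, "tts_open");
      if (open_driver == NULL) {
          fprintf(stderr, "not a TTS driver: %s\n", dlerror());
          dlclose(lib);
          return 1;
      }
      struct tts_handle *handle = open_driver(NULL);
      /* ... use the driver through the common interface ... */
      (void) handle;
      dlclose(lib);
      return 0;
  }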
* Final Remarks
It might be useful to extend the specification process (in separate
documents) to higher level speech frameworks and audio output systems
in the future.
* Copying
[Any idea about a good free license for the final document?]