[fdo] Re: TTS API
Milan Zamazal
pdm at freebsoft.org
Thu Nov 4 04:15:21 PST 2004
I have expanded Olaf's initial document and tried to summarize in it
the current state of our discussion about the requirements and related
issues. Please tell me if anything is missing or unclear, and let's
try to finish the requirements section by resolving the open
questions.
Regards,
Milan Zamazal
Common TTS Interface
====================
* Introduction
The purpose of this document is to define a common interface to speech
synthesizers. The common interface is needed to avoid duplicate work
when implementing speech synthesizer drivers for different free software
higher level speech frameworks (like gnome-speech, KTTSD or Speech
Dispatcher) and to allow uniform access to all supported speech
synthesizers from within those frameworks.
* Scope of the Document
The specification covers all the necessary text-to-speech functions
that are directly related to speech synthesizers, in particular all
language dependent text-to-speech functions. The particular areas
covered by this specification include:
- Synthesizing a given piece of text.
- Stopping the synthesis process.
- Managing audio output of hardware speech synthesizers.
- Setting basic speech synthesis parameters like pitch, rate or volume.
- Using different languages, voices and reading modes.
- Index marking.
- Configuration interface.
- Definition of a low level interface to be used to access the drivers.
On the other hand, the following areas are explicitly omitted:
- Message management (queueing, ordering, interleaving, etc.). This is
a job of the higher level speech frameworks, not of speech
synthesizers.
- Interfacing with audio devices. Again, this is a job of specialized
software (higher level speech frameworks and sound servers), not of
speech synthesizers. However, in the case of hardware speech
synthesizers that use solely their own audio output, it is necessary
to manage that audio output too.
- Higher level interfaces accessing the drivers, like IPC or socket
based interfaces. Different projects are designed differently and use
different forms of interprocess communication, and given the current
state of things it is unlikely that a consensus on a common high level
communication interface to the drivers can be reached. Those
interfaces are therefore left to be implemented separately by the
projects that need them; they are expected to be written as wrappers
around the common low level access interface.
- Interaction with other components of the operating system. The
speech synthesis process itself is unlikely to interfere with other
parts of the operating system in any unusual way. This may not apply
to higher level speech frameworks, but those are out of the scope of
this document.
* General Requirements on the TTS Interface
The synthesis process:
- Synthesis of a given piece of text expressed in a markup format; a
sketch of what such a call might look like follows this list
[unresolved: Which one? SSML or a reasonable subset of it? Isn't
SSML covered by patents preventing Free Software and Open Source
programs from using it? How about character encoding -- would it
suffice to use UTF-8 everywhere?].
- Synthesis of characters and key names [possibly using custom SSML
attribute values?]. Rationale: It is not possible to express them as
ordinary text without language specific knowledge.
- A short time before the first playable audio data is delivered after
initiating a new synthesis, even when another, previously issued
synthesis request needs to be stopped first.
- No significant performance hits (like long response times or wasted
CPU time or memory) when many synthesis requests arrive shortly after
each other, each cancelling the previous one.
- [Unresolved: Should the driver be able to receive the markup text to
synthesize in several pieces? The motivation is to ease processing of
texts in KTTSD a bit. I personally don't think this is a valid reason
to complicate the interface, considering it has nothing to do with the
speech synthesis process. But maybe I am still missing something.]
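To make the requirements above a bit more concrete, here is a purely
illustrative sketch of how the core synthesis entry points of a driver
might look. Nothing here is settled; the names (tts_handle, tts_open,
tts_say, tts_say_char, tts_say_key, tts_cancel) are placeholders of
mine, and SSML is assumed as the input markup only for the sake of the
example.

  /* tts_driver.h -- hypothetical sketch, not an agreed specification */
  #ifndef TTS_DRIVER_H
  #define TTS_DRIVER_H

  /* Opaque driver instance. */
  typedef struct tts_handle tts_handle;

  /* Open and close a driver instance. */
  tts_handle *tts_open(const char *driver_config);
  void tts_close(tts_handle *h);

  /* Start synthesizing a UTF-8 encoded SSML document.  If another
     request is still in progress, the driver is expected to cancel it
     quickly, so that the first audio data of the new request is
     available with a minimal delay. */
  int tts_say(tts_handle *h, const char *ssml_utf8);

  /* Synthesize a single character or a key name; plain text is not
     sufficient here, because the pronunciation is language specific. */
  int tts_say_char(tts_handle *h, const char *utf8_char);
  int tts_say_key(tts_handle *h, const char *key_name);

  /* Cancel the current request as fast as possible. */
  int tts_cancel(tts_handle *h);

  #endif /* TTS_DRIVER_H */

A call could then look like
tts_say(h, "<speak xml:lang=\"en\">Hello <mark name=\"m1\"/>world.</speak>"),
with UTF-8 used everywhere.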
Software synthesis:
- [There is still no clear consensus on how to return the synthesized
audio data. Maybe we could agree it should be written to a given
binary stream? But in which format? What if the audio data is split
into several pieces (see below)? And how to return index marker
positions (see below)? One conceivable callback based approach is
sketched after this list.]
- [Should it be allowed to return the audio data in several separate
pieces? It complicates returning them, but what if the synthesizer
splits a long input text and is unable to merge the resulting wave
forms? Should the driver be responsible for handling this?]
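Just to have something concrete to discuss, the following sketch shows
one conceivable way of returning the audio data: the driver pushes
consecutive chunks of audio to a caller supplied callback, which would
also naturally accommodate output split into several pieces. The
names and the raw PCM description are again only mine, not a proposal
of the final format.

  #include <stddef.h>

  struct tts_handle;  /* opaque driver instance, as in the sketch above */

  /* Description of the raw audio produced by the driver; given here
     only as an example, the actual format question is still open. */
  typedef struct {
      int sample_rate;      /* e.g. 16000 Hz */
      int channels;         /* e.g. 1 */
      int bits_per_sample;  /* e.g. 16 bit signed, native endianness */
  } tts_audio_format;

  /* Called repeatedly with consecutive pieces of synthesized audio;
     final == 1 marks the last piece of the current request. */
  typedef void (*tts_audio_callback)(const void *samples,
                                     size_t size_bytes,
                                     int final,
                                     void *user_data);

  /* Register the callback; the driver fills in *format with the
     description of the audio data it is going to deliver. */
  int tts_set_audio_callback(struct tts_handle *h,
                             tts_audio_format *format,
                             tts_audio_callback callback,
                             void *user_data);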
Hardware synthesis:
- Management of the audio output: immediate stopping, pausing and
resuming.
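The corresponding driver calls for hardware synthesizers might then
simply be the following (hypothetical names again):

  struct tts_handle;  /* opaque driver instance */

  /* Audio output control for hardware synthesizers that use their own
     audio output.  All of these are expected to take effect
     immediately. */
  int tts_audio_stop(struct tts_handle *h);
  int tts_audio_pause(struct tts_handle *h);
  int tts_audio_resume(struct tts_handle *h);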
Index markers:
- Support for identifying where or when given places (index markers) in
the input text are reached. [Not all synthesizers can support this --
should index marking be optional?]
- Hardware synthesis must signal reaching index markers via callbacks,
at the moment the index marker is actually reached in the played audio
output.
- Software synthesis must identify the positions of index markers within
the returned audio data. [The question is how to do it. If we are
able to return the audio output in several pieces, then we can think
of the result as a linear sequence of audio pieces and marker
identifiers, where each marker is placed at its position between two
audio pieces. Another possible way is to write the times at which the
markers are reached in the produced audio data to a separate stream;
this works with a single audio output, but it requires certain
precautions to ensure a marker is not missed on the marker stream when
playing data from the audio stream.]
- [The KTTSD approach to warnings and messages suggests it could be
useful if some sort of index markers could be inserted into the input
texts automatically, at breakable places, i.e. places where the
audio output can be interrupted without breaking the speech at an
unsuitable point (e.g. in the middle of a word or a short sentence).
This can be useful for pausing the speech or for speaking unrelated
important messages while reading longer pieces of text. What do you
think?]
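To illustrate the two cases, here is one conceivable shape of the
marker interface. The event based representation for software
synthesis corresponds to the "linear sequence" idea above; all the
names are placeholders.

  #include <stddef.h>

  struct tts_handle;  /* opaque driver instance */

  /* Hardware synthesis: the driver invokes the callback at the moment
     the marker is actually reached in the audio output. */
  typedef void (*tts_marker_callback)(const char *marker_name,
                                      void *user_data);
  int tts_set_marker_callback(struct tts_handle *h,
                              tts_marker_callback callback,
                              void *user_data);

  /* Software synthesis: the result is a sequence of events, each of
     them either a piece of audio or a marker placed between two
     pieces. */
  typedef enum {
      TTS_EVENT_AUDIO,
      TTS_EVENT_MARKER
  } tts_event_type;

  typedef struct {
      tts_event_type type;
      const void *samples;     /* valid for TTS_EVENT_AUDIO */
      size_t size_bytes;       /* valid for TTS_EVENT_AUDIO */
      const char *marker_name; /* valid for TTS_EVENT_MARKER */
  } tts_event;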
Setting speech parameters:
- It should be possible to set basic speech parameters like language,
voice, rate, pitch and volume. Rationale: The parameters can be set
by the input text markup, but there should be a way to set the
defaults.
- It should be possible to switch reading modes of the synthesizer,
namely: punctuation mode, capital letter signalization mode, spelling
mode.
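As an illustration, the parameter and reading mode part of the
interface could be as simple as the following; the names and value
ranges are only examples of mine.

  struct tts_handle;  /* opaque driver instance */

  typedef enum {
      TTS_PUNCTUATION_NONE,
      TTS_PUNCTUATION_SOME,
      TTS_PUNCTUATION_ALL
  } tts_punctuation_mode;

  /* Default parameters; markup in the input text can still override
     them.  The ranges are illustrative only. */
  int tts_set_language(struct tts_handle *h, const char *language); /* e.g. "en" */
  int tts_set_voice(struct tts_handle *h, const char *voice_name);
  int tts_set_rate(struct tts_handle *h, int rate);     /* -100 .. 100 */
  int tts_set_pitch(struct tts_handle *h, int pitch);   /* -100 .. 100 */
  int tts_set_volume(struct tts_handle *h, int volume); /* 0 .. 100 */

  /* Reading modes. */
  int tts_set_punctuation_mode(struct tts_handle *h, tts_punctuation_mode mode);
  int tts_set_capital_letter_signalization(struct tts_handle *h, int on);
  int tts_set_spelling_mode(struct tts_handle *h, int on);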
Retrieving available parameter values:
- It should be possible to return a list of supported languages and
voices (identified by parameters matching the voice selection
mechanism of the chosen input text markup). [Other parameters?]
[Shouldn't this be optional? I'm not sure all synthesizers are able
to provide this information.] Rationale: This allows the higher level
speech frameworks and/or applications to make decisions about
selecting a supported language (when more language alternatives are
available on the input) or about selecting a particular supported
voice for a given piece of text, without the danger that the voice
gets quietly mapped onto the same voice as the surrounding text.
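One conceivable way of returning the list, with each voice described
by attributes roughly matching the voice selection attributes of the
input markup (names and fields purely illustrative):

  struct tts_handle;  /* opaque driver instance */

  typedef struct {
      const char *name;      /* synthesizer specific voice name */
      const char *language;  /* e.g. "en", "cs" */
      const char *dialect;   /* e.g. "en-US", NULL if unknown */
      const char *gender;    /* "male", "female" or "neutral" */
      int age;               /* 0 if unknown */
  } tts_voice;

  /* Returns the number of available voices and points *voices to an
     array owned by the driver, or returns -1 if the driver cannot
     provide the information (the facility being optional). */
  int tts_list_voices(struct tts_handle *h, const tts_voice **voices);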
Configuration:
- Getting and setting configuration parameters of the synthesizer.
[Should this be here or in a standard API for driver configuration
libraries?]
- This facility is optional.
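If the configuration facility stays in this interface, it might be
just a string keyed get/set pair along these lines (hypothetical
again):

  #include <stddef.h>

  struct tts_handle;  /* opaque driver instance */

  /* Both calls return 0 on success and -1 if the parameter is unknown
     or the facility is not supported by the driver. */
  int tts_config_set(struct tts_handle *h, const char *name,
                     const char *value);
  int tts_config_get(struct tts_handle *h, const char *name,
                     char *value, size_t value_size);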
* Interface Definition
[This is to be defined later, after the general requirements are
settled. In the meantime, we can think about an appropriate form of
the low level interface. Do we agree it should have the form of a
shared library accompanied by corresponding C header files?]
[Definition of the interface functions.]
[How to access the drivers in the operating system environment.]
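Assuming we agree on the shared library form, a higher level framework
might load a driver roughly like this. The entry point name tts_open,
the driver file name and its location are of course only placeholders.

  /* A rough sketch of loading a driver; compile with -ldl. */
  #include <dlfcn.h>
  #include <stdio.h>

  struct tts_handle;  /* opaque driver instance */

  /* Hypothetical driver entry point, see the sketches above. */
  typedef struct tts_handle *(*tts_open_fn)(const char *driver_config);

  int main(void)
  {
      /* The driver file name and location are placeholders. */
      void *lib = dlopen("/usr/lib/tts-drivers/festival.so", RTLD_NOW);
      if (lib == NULL) {
          fprintf(stderr, "cannot load driver: %s\n", dlerror());
          return 1;
      }
      tts_open_fn open_driver = (tts_open_fn) dlsym(lib, "tts_open");
      if (open_driver == NULL) {
          fprintf(stderr, "not a TTS driver: %s\n", dlerror());
          dlclose(lib);
          return 1;
      }
      struct tts_handle *handle = open_driver(NULL);
      /* ... use the driver through the common interface ... */
      (void) handle;
      dlclose(lib);
      return 0;
  }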
* Final Remarks
It might be useful to extend the specification process (in separate
documents) to higher level speech frameworks and audio output systems
in the future.
* Copying
[Any idea about a good free license for the final document?]