[Accessibility] TTS API document + introduction
hanke at brailcom.org
Mon Mar 6 06:26:24 PST 2006
Here is the latest version of the TTS API document with a new
introduction section that tries to summarize the previous private and
public discussions on this topic. Comments are welcome.
* Introduction was written (clarification of intent, scope)
* Clarification of the meaning of MUST HAVE, SHOULD HAVE
* Point (4.11) was removed as not directly important for accessibility
(after discussions with Willie Walker who requested the point)
* Point (4.13) was removed because its purpose is not clear.
Even if this functionality is needed, the 's' SSML element is
not a good way to do it.
* Reformulation of (1.4), added 'temporarily' to (3.2), 'software
synthesizers' in (4.4), terminology in (4.13),
clarification in (B.1.3/2) and (B.1.3/3)
Common TTS Driver Interface
Document version: 2006-03-06
The purpose of this document is to define a common low-level interface
to access the various speech synthesizers on Free Software and Open
Source platforms. It is designed to be used by applications that do
not need advanced functionality such as message management and by
applications providing high-level interfaces (such as Speech
Dispatcher, Gnome Speech, KTTSD etc.) The purpose of this document is
not to define and force an API on the speech synthesizers. The
synthesizers might use different interfaces that will be handled by
their drivers.
This interface will be implemented by a simple layer integrating
available speech synthesis drivers and in some cases emulating some of
the functionality missing in the synthesizers themselves.
Advanced capabilities not directly related to speech, like message
management, priorities, synchronization etc. are left out of scope for
this low-level interface. They will be dealt with by higher-level
interfaces. (It is desirable to be able to agree on a common
higher-level interface too, but agreeing first on a low-level
interface is an easier task to accomplish.) Such a high-level interface
(not necessarily limited to speech) will make good use of the already
existing low-level interface.
It is desirable that simple applications can use this API in a simple
way. However, the API must also be complex enough so that it doesn't
limit more advanced applications in their use of the synthesizers.
The first part (A) of this document describes the requirements
gathered between projects like Gnome Speech, Speech Dispatcher, KTTSD,
Emacspeak and SpeakUp of what they might reasonably expect from speech
synthesis on a system. These requirements are not meant to be the
requirements on the synthesizers, although they might be a guide to
synthesizer authors as they plan future features and capabilities for
their products. Parts (B) and (C) describe the XML/SSML markup in use
and part (D) defines the interface.
Temporary note: The goal of this interface is a real implementation in
the foreseeable future. The next step will be merging the available
engine drivers in the various accessibility projects under this
interface and using this interface. For this reason, we need all
accessibility projects that want to participate in this common effort
to make sure that all their requirements on a low-level speech output
interface are met and that the interface defined here is suitable for
their needs.
Temporary note: Any comments about this draft are welcome and
useful. But since the goal of these requirements is real
implementation, we need to avoid endless discussions and keep the
comments focused and to the point.
This section defines a set of requirements on the interface and on
speech synthesizer drivers that need to support assistive
technologies on free software platforms.
1. Design Criteria
The Common TTS Driver Interface requirements will be developed
within the following broad design criteria:
1.1. Focus on supporting assistive technologies first. These
assistive technologies can be written in any programming language
and may provide specific support for particular environments such
as KDE or GNOME.
1.2. Simple and specific requirements win out over complex and
general ones.
1.3. Use existing APIs and specs when possible.
1.4. All language-dependent functionality with respect to text
processing for speech synthesis should be covered in the
synthesizers or synthesis drivers, not in applications.
1.5. Requirements will be categorized in the following priority
order: MUST HAVE, SHOULD HAVE, and NICE TO HAVE.
The priorities have the following meanings with respect
to the drivers available under this API:
MUST HAVE: All drivers must satisfy this requirement.
SHOULD HAVE: The driver will be usable without this feature, but
it is expected the feature is implemented in all drivers
intended for serious use.
NICE TO HAVE: Optional features.
Regardless of the priority, the full interface will be provided
by the API, even when the given functionality is actually not
implemented behind the interface.
1.6. Requirements outside the scope of this document will be
labelled as OUTSIDE SCOPE.
1.7. An application must be able to determine if SHOULD HAVE
and NICE TO HAVE features are supported for a given driver.
2. Synthesizer Discovery Requirements
2.1. MUST HAVE: An application will be able to discover all speech
synthesizer drivers available to the machine.
2.2. MUST HAVE: An application will be able to discover all possible
voices available for a particular speech synthesizer driver.
2.3. MUST HAVE: An application will be able to determine the
supported languages, possibly including also a dialect or a
country, for each voice available for a particular speech
synthesizer driver.
Rationale: Knowledge about available voices and languages is
necessary to select a proper driver and to be able to select a
supported language or different voices in an application.
2.4. MUST HAVE: Applications may assume their interaction with the
speech synthesizer driver doesn't affect other operating system
components in any unexpected way.
2.5. OUTSIDE SCOPE: Higher level communication interfaces
to the speech synthesizer drivers. Exact form of the
communication protocol (text protocol, IPC etc).
Note: It is expected they will be implemented by particular
projects (Gnome Speech, KTTSD, Speech Dispatcher) as wrappers
around the low-level communication interface defined below.
3. Synthesizer Configuration Requirements
3.1. MUST HAVE: An application will be able to specify the default
voice to use for a particular synthesizer, and will be able to
change the default voice in between `speak' requests.
3.2. SHOULD HAVE: An application will be able to specify the default
prosody and style elements for a voice. These elements will match
those defined in the SSML specification, and the synthesizer may
choose which attributes it wishes to support. Note that prosody,
voice and style elements specified in SSML sent as a `speak' request
will temporarily override the default values.
3.3. SHOULD HAVE: An application should be able to provide the
synthesizer with application-specific pronunciation lexicon
addenda. Note that using the `phoneme' element in SSML is another way
to accomplish this on a very localized basis, and will override
any pronunciation lexicon data for the synthesizer.
Rationale: This feature is necessary so that the application is
able to speak artificial words or words with explicitly modified
pronunciation (e.g. "the word ... is often mispronounced as ...
by foreign speakers").
3.4. MUST HAVE: Applications may assume they have their own local
copy of a synthesizer and voice. That is, one application's
configuration of a synthesizer or voice should not conflict with
another application's configuration settings.
3.5. MUST HAVE: Changing the default voice or voice/prosody element
attributes does not affect a `speak' in progress.
4. Synthesis Process Requirements
4.1. MUST HAVE: The speech synthesizer driver is able to process
plain text (i.e. text that is not marked up via SSML) encoded in
the UTF-8 character encoding.
4.2. MUST HAVE: The speech synthesizer driver is able to process
text formatted using extended SSML markup defined in part B of
this document and encoded in UTF-8. The synthesizer may choose
to ignore markup it cannot handle or even to ignore all markup
as long as it is able to process the text inside the markup.
4.3. SHOULD HAVE: The speech synthesizer driver is able to properly
process the extended SSML markup defined in part B of this
document as SHOULD HAVE. The same applies analogously to NICE TO HAVE.
4.4. MUST HAVE: An application must be able to cancel a synthesis
operation in progress. In case of hardware synthesizers, or
synthesizers that produce their own audio, this means cancelling
the audio output as well.
4.5. MUST HAVE: The speech synthesizer driver must be able to
process long input texts in such a way that the audio output
starts to be available for playing as soon as possible. An
application is not required to split long texts into smaller
pieces.
4.6. SHOULD HAVE: The speech synthesizer driver should honor the
Performance Guidelines described below.
4.7. NICE TO HAVE: It would be nice if a synthesizer were able to
support "rewind" and "repeat" functionality for an utterance (see
related descriptions in the MRCP specification).
Rationale: This allows moving over long texts without the need to
synthesize the whole text and without losing context.
4.8. NICE TO HAVE: It would be nice if a synthesizer were able to
support multilingual utterances.
4.9. SHOULD HAVE: A synthesizer should support notification of
`mark' elements, and the application should be able to align
these events with the synthesized audio.
4.10. NICE TO HAVE: It would be nice if a synthesizer supported
"word started" and "word ended" events and allowed alignment of
the events similar to that in 4.9.
Rationale: This is useful to update cursor position as a displayed
text is spoken.
4.11. REMOVED (not directly important for accessibility)
The former version: It would be nice if a synthesizer supported
timing information at the phoneme level and allowed alignment of
the events similar to that in 4.9. Rationale: This is useful
for talking heads.
4.12. SHOULD HAVE: The application must be able to pause and resume
a synthesis operation in progress while still being able to handle
other synthesis requests in the meantime. In case of hardware
synthesizers, this means pausing and if possible resuming the
audio output as well.
4.13. REMOVED (not clear purpose, the SSML specs do not require
the 's' element to work this way)
The synthesizer should not try to split the
contents of the `s' SSML element into several independent pieces,
unless required by a markup inside.
Rationale: An application may have better information about the
synthesized text and perform its own splitting of sentences.
4.14. OUTSIDE SCOPE: Message management (queueing, ordering,
4.15. OUTSIDE SCOPE: Interfacing software synthesis with audio
4.16. OUTSIDE SCOPE: Specifying the audio format to be used by a
5. Performance Guidelines
In order to make the speech synthesizer driver actually usable with
assistive technologies, it must satisfy certain performance
expectations. The following text gives driver implementors a rough
idea of what is needed in practice.
Typical scenarios when working with a speech enabled text editor:
5.1. Typed characters are spoken (echoed).
Reading of the characters and cancelling the synthesis must be
very fast, to catch up with a fast typist or even with
autorepeat. Consider a typical autorepeat rate of 25 characters per
second. Ideally, within each 40 ms interval, synthesis should
begin, produce some audio output and stop. Performing all these
actions within 100 ms (considering a fast typist and some overhead
of the application and the audio output) on common hardware is
very desirable.
Appropriate character reading performance may be difficult to
achieve with contemporary software speech synthesizers, so it may
be necessary to use techniques like caching of the synthesized
characters. Also, it is necessary to ensure there is no initial
pause ("breathing in") within the synthesized character.
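The 40 ms figure in 5.1 follows directly from the autorepeat rate. A minimal sketch of the arithmetic, where the function and constant names are only illustrative and not part of the interface:

```python
def interval_ms(chars_per_second: float) -> float:
    """Time available per character, in milliseconds."""
    return 1000.0 / chars_per_second


AUTOREPEAT_RATE = 25  # characters per second, as quoted in 5.1
BUDGET_MS = 100       # the "very desirable" upper bound from 5.1

per_char = interval_ms(AUTOREPEAT_RATE)
print(per_char)              # 40.0
print(BUDGET_MS / per_char)  # 2.5 key repeats elapse within the 100 ms budget
```

In other words, a driver that meets only the 100 ms bound still lags about two and a half characters behind autorepeat, which is why cancelling must be cheap.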
5.2. Moving over words or lines, each of them is spoken.
The sound sample needn't be available as quickly as in case of the
typed characters, but it still should be available without clearly
noticeable delay. As the user moves over the words or lines, he
must hear the text immediately. Cancelling the synthesis of the
previous word or line must be instant.
5.3. Reading a large text file.
In such a case, it is not necessary to start speaking instantly,
because reading a large text is not a very frequent operation.
A one-second delay at the start is acceptable, although not
comfortable. Cancelling the speech must still be instant.
B. XML (extended SSML) Markup in Use
This section defines the set of XML markup and special
attribute values for use in input texts for the drivers.
The markup consists of two namespaces: 'SSML' (default)
and 'tts', where 'tts' introduces several new attributes
to be used with the 'say-as' element and a new element
'tts:style'.
If an SSML element is supported, all its mandatory attributes
by the definition of SSML 1.0 must be supported even if they
are not explicitly mentioned in this document.
This section also defines which functions the API
needs to provide for default prosody, voice and style settings,
according to (3.2).
Note: According to available information, SSML is not known
to suffer from any IP issues.
B.1. SHOULD HAVE: The following elements are supported
B.1.1. These SPEAK attributes are supported
1 (SHOULD HAVE): xml:lang
B.1.1. These VOICE attributes are supported
1 (SHOULD HAVE): xml:lang
2 (SHOULD HAVE): name
3 (NICE TO HAVE): gender
4 (NICE TO HAVE): age
5 (NICE TO HAVE): variant
B.1.2. These PROSODY attributes are supported
1 (SHOULD HAVE): pitch (with +/- %, "default")
2 (SHOULD HAVE): rate (with +/- %, "default")
3 (SHOULD HAVE): volume (with +/- %, "default")
4 (NICE TO HAVE): range (with +/- %, "default")
5 (NICE TO HAVE): 'pitch', 'rate', 'range'
with absolute value parameters
Note: The corresponding global relative prosody settings
commands (not markup) in TTS API represent the percentage
value as a percentage change with respect to the default
value for the given voice and parameter, not with respect
to previous settings.
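The note above can be illustrated with a short sketch. The function name and the default rate of 200 are hypothetical; the only point shown is that each relative setting is computed from the voice's default, so repeating the same command does not accumulate.

```python
def apply_relative(default_value: float, percent_change: float) -> float:
    """Return the effective value for a relative prosody setting.

    The change is always computed from the voice's default value,
    never from the value that is currently in effect.
    """
    return default_value * (1.0 + percent_change / 100.0)


default_rate = 200.0  # hypothetical default rate of some voice

first = apply_relative(default_rate, +50)   # 300.0
second = apply_relative(default_rate, +50)  # still 300.0, not 450.0
third = apply_relative(default_rate, -25)   # 150.0
```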
B.1.3. The SAY-AS attribute 'interpret-as'
is supported with the following values
1 (SHOULD HAVE) characters
The format 'glyphs' is supported.
Rationale: This provides capability for spelling.
2 (SHOULD HAVE) tts:char
Indicates the content of the element is a single
character and it should be pronounced as a character.
The element's contents (CDATA) should only contain
a single character.
This is different from the interpret-as value "characters"
described in B.1.3.1. While "characters" is intended
for spelling words and sentences, "tts:char" means
pronouncing the given character (which might be subject
to different settings, as for example using sound icons to
If more than one character is present as the contents
of the element, this is considered an error.
Rationale: It is useful to have a separate attribute
for "single characters" as this can be used in TTS
configuration to distinguish the situation when
the user is moving with cursor over characters
from the situation of spelling. As well as in other
situations where the concept of "single character"
has some logical meaning.
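As a sketch of how an application might produce this markup, the helper below wraps one character in a say-as element and rejects longer content, mirroring the error rule above. The helper name is hypothetical, and "one character" is assumed here to mean one Unicode code point, which this document does not state.

```python
from xml.sax.saxutils import escape


def say_as_char(ch: str) -> str:
    """Wrap a single character in a say-as element with interpret-as="tts:char"."""
    # Assumption: a "single character" is one Unicode code point.
    if len(ch) != 1:
        raise ValueError("tts:char content must be exactly one character")
    # escape() turns &, < and > into entities so the result stays well-formed.
    return '<say-as interpret-as="tts:char">' + escape(ch) + "</say-as>"


print(say_as_char("a"))  # <say-as interpret-as="tts:char">a</say-as>
print(say_as_char("<"))  # <say-as interpret-as="tts:char">&lt;</say-as>
```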
3 (SHOULD HAVE) tts:key
The content of the element should be interpreted
as the name of a keyboard key or combination of keys. See
section (C) for possible string values of content of this
element. If a string is given which is not defined in section
(C), the behavior of the synthesizer is undefined.
4 (NICE TO HAVE) tts:digits
Indicates the content of the element is a number.
The attribute "detail" is supported and can take a numerical
value specifying how many digits the synthesizer should group
for reading. The value of 0 means the number should be
pronounced as a whole, as appropriate for the language, while any
non-zero value means that groups of that many digits should be
formed for reading, starting from the left.
Example: The string "5431721838" would normally be read
as "five billion four hundred thirty one million ..." but
when enclosed in the above say-as with detail set to 3, it
would be read as "five hundred forty three, one hundred
seventy two etc." or "five, four, three, one etc." with
detail set to 1.
Note: This is an extension to SSML not defined in the
format itself, introduced under the namespace 'tts' (as
allowed in SSML 'say-as' specifications).
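The grouping rule for the "detail" attribute of tts:digits can be sketched as follows. This is only an illustration of the left-to-right grouping described above, not a prescribed implementation; the function name is hypothetical, and turning each group into words is left to the synthesizer.

```python
from typing import List


def group_digits(number: str, detail: int) -> List[str]:
    """Split a digit string into groups of `detail` digits, starting from the left.

    detail == 0 means the number is read as a whole.
    """
    if not (number.isascii() and number.isdigit()):
        raise ValueError("content must consist of digits only")
    if detail < 0:
        raise ValueError("detail must be zero or positive")
    if detail == 0:
        return [number]
    return [number[i:i + detail] for i in range(0, len(number), detail)]


print(group_digits("5431721838", 0))  # ['5431721838']
print(group_digits("5431721838", 3))  # ['543', '172', '183', '8']
```

Note that grouping from the left leaves any shorter remainder group at the end of the number.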
B.2. NICE TO HAVE: The following elements are supported
B.2.1. NICE TO HAVE: These P attributes are supported:
B.2.2. NICE TO HAVE: These S attributes are supported
B.3. SHOULD HAVE: An element `tts:style' (not defined in SSML 1.0)
This element can occur anywhere inside the SSML document.
It may contain all SSML elements except the element 'speak'
and it may also contain the element 'tts:style'.
It has two mandatory attributes 'field'
and 'mode' and an optional string attribute 'detail'. The
attribute 'field' can take the following values
If the attribute 'field' is set to 'punctuation',
the 'mode' attribute can take the following values
3) (NICE TO HAVE) some
When set to 'none', no punctuation characters are explicitly
indicated. When it is set to 'all', all punctuation characters
in the text should be indicated by the synthesizer. When
set to 'some', the synthesizer will pronounce those
punctuation characters enumerated in the additional attribute
'detail' or will only speak those characters according to its
settings if no 'detail' attribute is specified.
The attribute detail takes the form of a string containing
the punctuation characters to read.
<tts:style field="punctuation" mode="some" detail=".?!">
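A driver-side reading of the three punctuation modes might look like the sketch below. The function name, the `engine_default` parameter (standing in for the synthesizer's own settings) and the use of ASCII punctuation for 'all' are assumptions made for illustration only.

```python
import string
from typing import Optional, Set


def punctuation_to_speak(mode: str,
                         detail: Optional[str] = None,
                         engine_default: str = ".?!") -> Set[str]:
    """Return the set of punctuation characters to indicate explicitly."""
    if mode == "none":
        return set()
    if mode == "all":
        # Assumption: only ASCII punctuation is considered in this sketch.
        return set(string.punctuation)
    if mode == "some":
        # Use 'detail' when given, otherwise fall back to the engine's settings.
        return set(detail) if detail is not None else set(engine_default)
    raise ValueError("unknown punctuation mode: " + mode)


print(sorted(punctuation_to_speak("some", detail=".?!")))  # ['!', '.', '?']
```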
If the attribute 'field' is set to 'capital_letters',
the 'mode' attribute can take the following values
3) (NICE TO HAVE) icon
4) (NICE TO HAVE) pitch
When set to 'no', capital letters are not explicitly
indicated. When set to 'spell', capital letters are
spelled (e.g. "capital a"). When set to 'icon', a sound
is inserted before the capital letter, possibly leaving
the letter/word/sentence intact. When set to 'pitch',
the capital letter is pronounced with a higher pitch,
possibly leaving the letter/word/sentence intact.
Rationale: These are basic capabilities well established
in accessibility. However, SSML does not support them.
Introducing this additional element does not break the
possibility of outside applications to send valid SSML
into TTS API.
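For applications that generate this markup, a small builder can check the 'field' and 'mode' values before sending text to a driver. The sketch below is hypothetical: the helper name is invented, the allowed values are the ones described in B.3, and the content is treated as plain text (nested SSML elements are not handled).

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

# Modes described in B.3 for each value of the 'field' attribute.
ALLOWED_MODES = {
    "punctuation": {"none", "all", "some"},
    "capital_letters": {"no", "spell", "icon", "pitch"},
}


def tts_style(field: str, mode: str, content: str,
              detail: Optional[str] = None) -> str:
    """Wrap plain text in a tts:style element after validating its attributes."""
    if field not in ALLOWED_MODES:
        raise ValueError("unknown field: " + field)
    if mode not in ALLOWED_MODES[field]:
        raise ValueError("mode %r is not valid for field %r" % (mode, field))
    attrs = 'field="%s" mode="%s"' % (field, mode)
    if detail is not None:
        attrs += " detail=" + quoteattr(detail)  # quoteattr adds the quotes
    return "<tts:style %s>%s</tts:style>" % (attrs, escape(content))


print(tts_style("capital_letters", "spell", "GNOME"))
# <tts:style field="capital_letters" mode="spell">GNOME</tts:style>
```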
B.4. NICE TO HAVE: Support for the rest of elements and attributes
defined in SSML 1.0. However, this is of lower priority than
the enumerated subset above.
Open Issue: In many situations, it will be desirable to
preserve whitespace characters in the incoming document.
Should we require the application to use the 'xml:space'
attribute for the speak element or should we state 'preserve'
is the default value for 'xml:space' in the root 'speak'
element in this case?
C. Key names
A key name may contain any character excluding control characters (the
characters in the range 0 to 31 in the ASCII table and other
``invisible'' characters), spaces, dashes and underscores.
C.1 The recognized key names are:
1) Any single UTF-8 character, excluding the exceptions defined
2) Any of the symbolic key names defined below.
3) A combination of key names defined below using the
'_' (underscore) character for concatenation.
Examples of valid key names:
C.2 List of symbolic key names
C.2.1 Escaped keys
C.2.2 Auxiliary Keys
C.2.3 Control Character Keys
f2 ... f24
kp-1 ... kp-9
D. Interface Description
This section defines the low-level TTS driver interface for use by
all assistive technologies on free software platforms.
1. Speech Synthesis Driver Discovery
2. Speech Synthesis Driver Interface
Open Issue: There is still no clear consensus on how to return the
synthesized audio data (if at all). The main issue here is
mostly with how to align marker and other time-related events
with the audio being played on the audio output device.
Proposal: There will be two possible ways to do it. The synthesized
data can be returned to the application (case A) or the
application can ask for the data to be played on the audio device
(which will not be the task of TTS API, but will be handled by
another API) (case B).
In (case A), each time the application gets a piece of audio
data, it also gets a time-table of index marks and events
in that piece of data. This will be done on a separate socket
in asynchronous mode. (This is possible for software
synthesizers only, however.)
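To make case A more concrete, the sketch below shows one way an application could turn per-chunk time-tables into positions on the overall audio timeline. The `AudioChunk` structure and its fields are hypothetical; this draft does not define the actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AudioChunk:
    """One piece of synthesized audio plus its time-table (hypothetical layout)."""
    duration_ms: int
    # Each entry: (offset in ms from the start of this chunk, mark or event name)
    marks: List[Tuple[int, str]] = field(default_factory=list)


def absolute_mark_times(chunks: List[AudioChunk]) -> List[Tuple[int, str]]:
    """Convert chunk-relative offsets into offsets from the start of the audio."""
    timeline = []
    start = 0
    for chunk in chunks:
        for offset, name in chunk.marks:
            timeline.append((start + offset, name))
        start += chunk.duration_ms  # the next chunk begins where this one ends
    return timeline


chunks = [AudioChunk(500, [(120, "m1")]), AudioChunk(300, [(50, "m2")])]
print(absolute_mark_times(chunks))  # [(120, 'm1'), (550, 'm2')]
```

The application can then fire each event when its own audio playback reaches the computed position.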
In (case B), the application will get asynchronous callbacks
(they might be realized by sending a defined string over
a socket, by calling a callback function or in some other
way -- the particular way of doing it is considered an
Rationale: Both approaches are useful in different situations
and each of them provides some capability that the other one
Open Issue: Will the interaction with the driver be synchronous
or asynchronous? For example, will a call to `speak'
wait to return until all the audio has been processed? If
not, what happens when a call to "speak" is made while the
synthesizer is still processing a prior call to "speak?"
Proposal: With the exception of events and index marks signalling,
the communication will be synchronous. When a speak request
is issued while the synthesizer is still processing a prior call to speak
and the application didn't call pause before, this is
considered an error.
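The proposed rule can be pictured as a small state model. Everything in the sketch below is hypothetical, including the class and method names and the assumption that a paused request is kept aside until resumed; it only illustrates that a second speak request is rejected unless the first one has finished or been paused.

```python
class SpeakInProgressError(Exception):
    """Raised when speak is called while an earlier request is still active."""


class DriverSketch:
    def __init__(self) -> None:
        self.active = None  # text currently being synthesized, if any
        self.paused = []    # requests suspended by pause()

    def speak(self, text: str) -> None:
        if self.active is not None:
            raise SpeakInProgressError("prior speak request not finished or paused")
        self.active = text

    def pause(self) -> None:
        if self.active is not None:
            self.paused.append(self.active)
            self.active = None

    def resume(self) -> None:
        if self.active is None and self.paused:
            self.active = self.paused.pop()

    def finished(self) -> None:
        """Simulates the synthesizer reporting the end of the active request."""
        self.active = None


driver = DriverSketch()
driver.speak("long text")
driver.pause()               # without this call, the next speak would raise
driver.speak("short notice")
driver.finished()
driver.resume()              # "long text" becomes the active request again
```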
E. Related Specifications
(see requirements at the following URL:
SSML 'say-as' element attribute values:
F. Copying This Document
Copyright (C) 2006 ...
This specification is made available under a BSD-style license ...