[Accessibility] TTS API document updated

Gary Cramblitt garycramblitt at comcast.net
Wed Feb 22 10:59:35 PST 2006


On Sunday 19 February 2006 10:55, Hynek Hanke wrote:
> Hello everyone,
>
> I'm trying to move our work on TTS API forward as this is something
> really important for accessibility right now. I'm sending an updated
> version of the TTS API requirements document (below) with the following
> updates and I request your comments. Most important, I've worked
> on the SSML subset specifications and tried to resolve the open issues.

Thank you for your work on this, Hynek.  I've made a few comments and 
corrections below, mostly cosmetic.

> Common TTS Driver Interface
> ============================
> Document version: 2006-02-18
>
> The purpose of this document is to define a common way to access
> speech synthesizers on Free Software and Open Source platforms.  It
> contains a list of general requirements on the speech synthesizer
> interface drivers and the definition of a low-level interface that can
> be used to access the speech synthesizer drivers.
>
> A. Requirements
>
>   This section defines a set of requirements on speech synthesizer
>   drivers that need to support assistive technologies on free software
>   platforms.
>
>   1. Design Criteria
>
>     The Common TTS Driver Interface requirements will be developed
>     within the following broad design criteria:
>
>     1.1. Focus on supporting assistive technologies first.  These
>       assistive technologies can be written in any programming language
>       and may provide specific support for particular environments such
>       as KDE or GNOME.
>
>     1.2. Simple and specific requirements win out over complex and
>       general requirements.
>
>     1.3. Use existing APIs and specs when possible.
>
>     1.4. All language-dependent functionality should be covered here,
>       not in applications.

Comment: I don't understand this.  Perhaps we could elaborate more?

>
>     1.5. Requirements will be categorized in the following priority
>       order: MUST HAVE, SHOULD HAVE, and NICE TO HAVE.
>
>       The priorities have the following meanings:
>
>       MUST HAVE: All conforming drivers must satisfy this
>         requirement.
>
>       SHOULD HAVE: The driver will be usable without this feature, but
>         it is expected the feature is implemented in all drivers
>         intended for serious use.
>
>       NICE TO HAVE: Optional features.
>
>       Regardless of the priority, the full interface must always be
>       provided, even when the given functionality is not actually
>       implemented behind the interface.
>
>     1.6. Requirements outside the scope of this document will be
>       labelled as OUTSIDE SCOPE.
>
>     1.7. An application must be able to determine if SHOULD HAVE
>       and NICE TO HAVE features are supported.
>
>
>   2. Synthesizer Discovery Requirements
>
>     2.1. MUST HAVE: An application will be able to discover all speech
>       synthesizer drivers available to the machine.
>
>     2.2. MUST HAVE: An application will be able to discover all possible
>       voices available for a particular speech synthesizer driver.
>
>     2.3. MUST HAVE: An application will be able to determine the
>       supported languages, possibly including also a dialect or a
>       country, for each voice available for a particular speech
>       synthesizer driver.
>
>       Rationale: Knowledge about available voices and languages is
>       necessary to select a proper driver and to be able to select a
>       supported language or different voices in an application.
>
>     2.4. MUST HAVE: Applications may assume their interaction with the
>       speech synthesizer driver doesn't affect other operating system
>       components in any unexpected way.
>
>     2.5. OUTSIDE SCOPE: Higher level communication interfaces to the
>       speech synthesizer drivers.  Exact form of the communication
>       protocol (text protocol, IPC, etc.).
>
>       Note: It is expected they will be implemented by particular
>       projects (Gnome Speech, KTTSD, Speech Dispatcher) as wrappers
>       around the low-level communication interface defined below.
>
>
>   3. Synthesizer Configuration Requirements
>
>     3.1. MUST HAVE: An application will be able to specify the default
>       voice to use for a particular synthesizer, and will be able to
>       change the default voice in between `speak' requests.
>
>     3.2. SHOULD HAVE: An application will be able to specify the default
>       prosody and style elements for a voice.  These elements will match
>       those defined in the SSML specification, and the synthesizer may
>       choose which attributes it wishes to support.  Note that prosody,
>       voice and style elements specified in SSML sent as a `speak'
>       request will override the default values.

Suggest: request will temporarily override the default values.
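
Comment: For illustration, an application could set the defaults once and 
then override them in an individual request with markup along these lines 
(the values are only illustrative):

<speak>
<prosody rate="+20%">This request temporarily speaks faster.</prosody>
</speak>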

>
>     3.3. SHOULD HAVE: An application should be able to provide the
>       synthesizer with application-specific pronunciation lexicon
>       addenda.  Note that using the `phoneme' element in SSML is another
>       way to accomplish this on a very localized basis, and will
>       override any pronunciation lexicon data for the synthesizer.
>
>       Rationale: This feature is necessary so that the application is
>       able to speak artificial words or words with explicitly modified
>       pronunciation (e.g. "the word ... is often mispronounced as ...
>       by foreign speakers").
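
Comment: For the localized case mentioned in the rationale, the SSML 
`phoneme' element would look something like this (the IPA transcription is 
only illustrative):

<speak>
The word <phoneme alphabet="ipa" ph="ˈlɪnʊks">Linux</phoneme> is often
mispronounced.
</speak>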
>
>     3.4. MUST HAVE: Applications may assume they have their own local
>       copy of a synthesizer and voice.  That is, one application's
>       configuration of a synthesizer or voice should not conflict with
>       another application's configuration settings.
>
>     3.5. MUST HAVE: Changing the default voice or voice/prosody element
>       attributes does not affect a `speak' in progress.
>
>   4. Synthesis Process Requirements
>
>     4.1. MUST HAVE: The speech synthesizer driver is able to process
>       plain text (i.e. text that is not marked up via SSML) encoded in
>       the UTF-8 character encoding.
>
>     4.2. MUST HAVE: The speech synthesizer driver is able to process
>       text formatted using extended SSML markup defined in part B of
>       this document and encoded in UTF-8.  The synthesizer may choose
>       to ignore markup it cannot handle or even to ignore all markup
>       as long as it  is able to process the text inside the markup.

Suggest:  Change to SHOULD HAVE and add

        4.2.1 MUST HAVE: The application is able to discover if the 
synthesizer supports SSML markup at all.

        Rationale: It is recognized that supporting SSML is difficult to 
implement.  Synthesis authors should strive to support it because it is 
needed for fully functional accessibility support.  If a synthesizer cannot 
support SSML, it should at least be able to ignore the markup and speak the 
contained text, but if not, it must at least inform the application so the 
application can remove the markup before sending it to the synthesizer.  
Speaking the markup would be the worst of all possibilities.

Comment: My thinking here is that many current synths don't support SSML.  
Adding support for SSML would typically involve adding dependencies on XML 
parsers, which synth authors might not want to do.  If a synth author doesn't 
want to support SSML, I'd prefer they be able to at least meet the rest of 
this specification.  If SSML is MUST HAVE, a synth author might say to 
himself, "Well I can't meet this MUST HAVE requirement, so I'll ignore the 
entire specification."

>
>     4.3. SHOULD HAVE: The speech synthesizer driver is able to properly
>       process the extended SSML markup defined in part B of this
>       document as SHOULD HAVE.  The same applies analogously to
>       NICE TO HAVE markup.
>
>     4.4. MUST HAVE: An application must be able to cancel a synthesis
>       operation in progress.  In case of hardware synthesizers, this
>       means cancelling the audio output as well.

Change to: In case of hardware synthesizers, or synthesizers that produce 
their own audio, this means..

>
>     4.5. MUST HAVE: The speech synthesizer driver must be able to
>       process long input texts in such a way that the audio output
>       starts to be available for playing as soon as possible.  An
>       application is not required to split long texts into smaller
>       pieces.
>
>     4.6. SHOULD HAVE: The speech synthesizer driver should honor the
>       Performance Guidelines described below.
>
>     4.7. NICE TO HAVE: It would be nice if a synthesizer were able to
>       support "rewind" and "repeat" functionality for an utterance (see
>       related descriptions in the MRCP specification).
>
>       Rationale: This allows moving over long texts without the need to
>       synthesize the whole text and without losing context.
>
>     4.8. NICE TO HAVE: It would be nice if a synthesizer were able to
>       support multilingual utterances.
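
Comment: For 4.8, I assume a multilingual utterance would be marked up with 
xml:lang on nested elements, for example:

<speak xml:lang="en">
The German greeting <voice xml:lang="de">Guten Tag</voice> means good day.
</speak>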
>
>     4.9. SHOULD HAVE: A synthesizer should support notification of
>       `mark' elements, and the application should be able to align
>       these events with the synthesized audio.
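
Comment: For 4.9, the application would insert marks at the positions it 
wants to track and expect a notification as the audio reaches each of them, 
e.g.

<speak>
<mark name="s1"/>First sentence. <mark name="s2"/>Second sentence.
</speak>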
>
>     4.10. NICE TO HAVE: It would be nice if a synthesizer supported
>       "word started" and "word ended" events and allowed alignment of
>       the events similar to that in 4.9.
>
>       Rationale: This is useful to update cursor position as a displayed
>       text is spoken.
>
>     4.11. NICE TO HAVE: It would be nice if a synthesizer supported
>       timing information at the phoneme level and allowed alignment of
>       the events similar to that in 4.9.
>
>       Rationale: This is useful for talking heads.
>
>     4.12. SHOULD HAVE: The application must be able to pause and resume
>       a synthesis operation in progress while still being able to handle
>       other synthesis requests in the meantime.  In case of hardware
>       synthesizers, this means pausing and if possible resuming the
>       audio output as well.
>
>     4.13. SHOULD HAVE: The synthesizer should not try to split the
>       contents of the `s' SSML element into several independent pieces,
>       unless required by a markup inside.
>
>       Rationale: An application may have better information about the
>       synthesized text and perform its own utterance chunking.

Comment: Synthesis authors might balk at this, as utterance chunking is 
usually an integral part of synthesis.  There could be confusion over 
terminology here.  For example, in Festival, "chunking" is the process of 
analyzing a sentence for parts of speech and grouping the sentence into noun 
phrases, verb phrases, etc.  I'm not sure anymore why this is even here.

>
>     4.14. OUTSIDE SCOPE: Message management (queueing, ordering,
>       interleaving, etc.).
>
>     4.15. OUTSIDE SCOPE: Interfacing software synthesis with audio
>       output.
>
>     4.16. OUTSIDE SCOPE: Specifying the audio format to be used by a
>       synthesizer.
>
>    5. Performance Guidelines
>
>      In order to make the speech synthesizer driver actually usable with
>      assistive technologies, it must satisfy certain performance
>      expectations.  The following text gives driver implementors a
>      rough idea of what is needed in practice.
>
>      Typical scenarios when working with a speech enabled text editor:
>
>      5.1. Typed characters are spoken (echoed).
>
>        Reading of the characters and cancelling the synthesis must be
>        very fast, to catch up with a fast typist or even with
>        autorepeat.  Consider a typical autorepeat rate of 25 characters
>        per second.  Ideally, within each of the 40 ms intervals,
>        synthesis should begin, produce some audio output and stop.
>        Performing all these actions within 100 ms (considering a fast
>        typist and some overhead of the application and the audio
>        output) on common hardware is very desirable.
>
>        Appropriate character reading performance may be difficult to
>        achieve with contemporary software speech synthesizers, so it may
>        be necessary to use techniques like caching of the synthesized
>        characters.  Also, it is necessary to ensure there is no initial
>        pause ("breathing in") within the synthesized character.
>
>     5.2. Moving over words or lines, each of them is spoken.
>
>       The sound sample needn't be available as quickly as in case of the
>       typed characters, but it still should be available without clearly
>       noticeable delay.  As the user moves over the words or lines, he
>       must hear the text immediately.  Cancelling the synthesis of the
>       previous word or line must be instant.
>
>     5.3. Reading a large text file.
>
>       In such a case, it is not necessary to start speaking instantly,
>       because reading a large text is not a very frequent operation.
>       One second long delay at the start is acceptable, although not
>       comfortable.  Cancelling the speech must still be instant.
>
>
> B. XML (extended SSML) Markup in Use
>
>   This section defines the set of XML markup and special
>   attribute values for use in input texts for the drivers.
>   The markup consists of two namespaces: 'SSML' (default)
>   and 'tts', where 'tts' introduces several new attributes
>   to be used with the 'say-as' element and a new element
>   'style'.
>
>   If an SSML element is supported, all its mandatory attributes
>   by the definition of SSML 1.0 must be supported even if they
>   are not explicitly mentioned in this document.
>
>   This section also defines which functions the API
>   needs to provide for default prosody, voice and style settings,
>   according to (3.2).
>
>   Note: According to available information, SSML is not known
>   to suffer of any IP issues.

Correction: Note: According to available information, SSML is not known to 
suffer from any IP issues.

>
>
>   B.1. SHOULD HAVE: The following elements are supported
> 	speak
> 	voice
> 	prosody
> 	say-as
>
>   B.1.1. These SPEAK attributes are supported
> 	1 (SHOULD HAVE): xml:lang
>
>   B.1.1. These VOICE attributes are supported
> 	1 (SHOULD HAVE):  xml:lang
> 	2 (SHOULD HAVE):  name
> 	3 (NICE TO HAVE): gender
> 	4 (NICE TO HAVE): age
> 	5 (NICE TO HAVE): variant
>
>   B.1.2. These PROSODY attributes are supported
> 	1 (SHOULD HAVE): pitch  (with +/- %, "default")
> 	2 (SHOULD HAVE): rate   (with +/- %, "default")
> 	3 (SHOULD HAVE): volume (with +/- %, "default")
> 	4 (NICE TO HAVE): range  (with +/- %, "default")
> 	5 (NICE TO HAVE): 'pitch', 'rate', 'range'
>  	 		with absolute value parameters
>
>    Note: The corresponding global relative prosody settings
>    commands (not markup) in TTS API represent the percentage
>    value as a percentage change with respect to the default
>    value for the given voice and parameter, not with respect
>    to previous settings.
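
Comment: To make sure I understand: a global (non-markup) rate setting of 
+25% always means 25% faster than the voice default, whereas markup such as

<speak>
<prosody rate="+25%" pitch="-10%">Slightly faster and lower.</prosody>
</speak>

follows the usual SSML rules.  Is that the intent?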
>
>
>   B.1.3. The SAY-AS attribute 'interpret-as'
> 	is supported with the following values
>
> 	1 (SHOULD HAVE) characters
> 		The format 'glyphs' is supported.

Comment: glyphs??

>
> 	Rationale: This provides capability for spelling.
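
Comment: For example, spelling out an abbreviation:

<speak>
<say-as interpret-as="characters">KDE</say-as>
</speak>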
>
> 	2 (SHOULD HAVE) tts:char
> 		Indicates the content of the element is a single
> 	character and it should be pronounced as a character.
> 	The elements CDATA should only contain a single character.

Correction: The element's contents should contain only a single character.

>
> 	This is different than the interpret-as value "characters"
> 	described in B.1.3.1. While "characters" is intended
> 	for spelling words and sentences, "tts:char" means
> 	pronouncing the given character (which might be subject
> 	to different settings, as for example using sound icons to
> 	represent symbols).
>
> 	If more than one character is present as the contents
> 	of the element, this is considered an error.
>
> 	Example:
> 	<speak>
> 	<say-as interpret-as="tts:char">@</say-as>
> 	</speak>
>
> 	Rationale: It is useful to have a separate attribute
> 	for "single characters" as this can be used in TTS
> 	configuration to distinguish the situation when
> 	the user is moving with the cursor over characters
> 	from the situation of spelling, as well as in other
> 	situations where the concept of "single character"
> 	has some logical meaning.
>
> 	3 (SHOULD HAVE) tts:key
> 		The content of the element should be interpreted
> 	as the name of a key. See section (C) for possible string

Correction: as the name of a keyboard key or combination of keys.

> 	values of content of this element. If a string is given
> 	which is not defined in section (C), the behavior of the
> 	synthesizer is undefined.
>
> 	Example:
> 	<speak>
> 	<say-as interpret-as="tts:key">shift_a</say-as>
> 	</speak>
>
> 	4 (NICE TO HAVE) tts:digits
> 		Indicates the content of the element is a number.
> 	The attribute "detail" is supported and can take a numerical
> 	value, meaning how many digits the synthesizer should group
> 	for reading. The value of 0 means the number should be
> 	pronounced as a whole, while any non-zero value means that a

Correction: pronounced as a whole appropriate for the language, while ..

Suggest: I would use "grouping" rather than "detail".

> 	groups of so many digits should be formed for reading,
> 	starting from left.
>
> 	Example: The string "5431721838" would normally be read
> 	as "five billions four hundred thirty seven millions ..." but

Correction: The string "5431721838" would normally be read in English as 
"five billion four hundred thirty-one million ..."

> 	when enclosed in the above say-as with detail set to 3, it
> 	would be read as "five hundred forty three, one hundred
> 	seventy two etc." or "five, four, three, one etc." with
> 	detail 1.
>
> 	Note: This is an extension to SSML not defined in the
> 	format itself, introduced under the namespace 'tts' (as
> 	allowed	in SSML 'say-as' specifications).

Comment: Is the "detail" attribute really needed?  Couldn't I do the same 
thing using markup like this:

<say-as interpret-as="tts:digits">543</say-as>
<say-as interpret-as="tts:digits">172</say-as>
<say-as interpret-as="tts:digits">183</say-as>...
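
For comparison, the form with the proposed 'detail' attribute would 
presumably be a single element:

<say-as interpret-as="tts:digits" detail="3">5431721838</say-as>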

>
>
>   B.2. NICE TO HAVE: The following elements are supported
> 	mark
> 	s
> 	p
> 	phoneme
> 	sub
>
>   B.2.1. NICE TO HAVE: These P attributes are supported:
> 	1 xml:lang
>
>   B.2.2. NICE TO HAVE: These S attributes are supported
> 	1 xml:lang
>
>   B.3. SHOULD HAVE: An element `tts:style' (not defined in SSML 1.0)
> 	is supported. It has two mandatory attributes 'field'
> 	and 'mode' and an optional string attribute 'detail'. The
> 	attribute 'field' can take the following values
> 		1) punctuation
> 		2) capital_letters
> 	defined bellow.

Correction: below

>
> 	If the parameter field is set to 'punctuation',
> 	the 'mode' attribute can take the following values
> 		1) none
> 		2) all
> 		3) (NICE TO HAVE) some
> 	When set to 'none', no punctuation characters are explicitly
> 	indicated. When it is set to 'all', all punctuation characters
> 	in the text should be indicated by the synthesizer.  When
> 	set to 'some', the synthesizer will pronounce those
> 	punctuation characters enumerated in the additional attribute
>         'detail' or will only speak those characters according to its
> 	settings if no 'detail' attribute is specified.
>
> 	The attribute detail takes the form of a string containing
> 	the punctuation characters to read.
>
> 	Example:
> 	<tts:style field="punctuation" mode="some" detail=".?!">
>
> 	If the parameter field is set to 'capital_letters',
> 	the 'mode' attribute can take the following values
> 		1) no
> 		2) spelling
> 		3) (NICE TO HAVE) icon
> 		4) (NICE TO HAVE) pitch
>
> 	When set to 'no', capital letters are not explicitly
> 	indicated. When set to 'spell', capital letters are
> 	spelled (e.g. "capital a"). When set to 'icon', a sound
> 	is inserted before the capital letter, possibly leaving
> 	the letter/word/sentence intact. When set to 'pitch',
> 	the capital letter is pronounced with a higher pitch,
> 	possibly leaving the letter/word/sentence intact.
>
>
> 	Rationale: These are basic capabilities well established
> 	in accessibility. However, SSML does not support them.
> 	Introducing this additional element does not prevent
> 	outside applications from sending valid SSML into
> 	TTS API.

Comment: Need to specify where the <tts:style> element may occur within SSML 
and whether it contains content.  I think you intend for it to occur within 
<s> or <p> elements and contain the content to be spoken in the indicated 
style.  For example,

<s>The abbreviation <tts:style field="capital_letters" 
mode="spell">TTS</tts:style> stands for text to speech.</s>

>
>   B.4. NICE TO HAVE: Support for the rest of elements and attributes
> 	defined in SSML 1.0. However, this is of lower priority than
> 	the enumerated subset above.
>
>   Open Issue: In many situations, it will be desirable to
>    preserve whitespace characters in the incoming document.
>    Should we require the application to use the 'xml:space'
>    attribute for the speak element or should we state 'preserve'
>    is the default value for 'xml:space' in the root 'speak'
>    element in this case?
>
> C. Key names
>
> A key name may contain any character excluding control characters (the
> characters in the range 0 to 31 in the ASCII table and other
> ``invisible'' characters), spaces, dashes and underscores.
>
>   C.1 The recognized key names are:
>    1) Any single UTF-8 character, excluding the exceptions defined
>       above.
>
>    2) Any of the symbolic key names defined below.
>
>    3) A combination of key names defined below using the
> 	'_' (underscore) character for concatenation.
>
>    Examples of valid key names:
> 	A
> 	shift_a
> 	shift_A
> 	$
> 	enter
> 	shift_kp-enter
> 	control
> 	control_alt_delete
>
>   C.2 List of symbolic key names
>
>   C.2.1 Escaped keys
> 	space
> 	underscore
> 	dash
>
>   C.2.2 Auxiliary Keys
> 	alt
> 	control
> 	hyper
> 	meta
> 	shift
> 	super
>
>   C.2.3 Control Character Keys
> 	backspace
> 	break
> 	delete
> 	down
> 	end
> 	enter
> 	escape
> 	f1
> 	f2 ... f24
> 	home
> 	insert
> 	kp-*
> 	kp-+
> 	kp--
> 	kp-.
> 	kp-/
> 	kp-0
> 	kp-1 ... kp-9
> 	kp-enter
> 	left
> 	menu
> 	next
> 	num-lock
> 	pause
> 	print
> 	prior
> 	return
> 	right
> 	scroll-lock
> 	space
> 	tab
> 	up
> 	window
>
> D. Interface Description
>
>   This section defines the low-level TTS driver interface for use by
>   all assistive technologies on free software platforms.
>
>   1. Speech Synthesis Driver Discovery
>
>   ...
>
>   2. Speech Synthesis Driver Interface
>
>   ...
>
>   Open Issue: There is still no clear consensus on how to return the
> 	synthesized audio data (if at all).  The main issue here is
> 	mostly how to align marker and other time-related events with
> 	the audio being played on the audio output device.
>
>   Proposal: There will be two possible ways to do it.  The synthesized
> 	data can be returned to the application (case A) or the
> 	application can ask for it to be played on the audio output
> 	(this will not be the task of TTS API, but will be handled by
> 	another API) (case B).
>
> 	In (case A), each time the application gets a piece of audio
> 	data, it also gets a time-table of index marks and events
> 	in that piece of data. This will be done on a separate socket
> 	in asynchronous mode. (This is possible for software
> 	synthesizers only, however.)
>
> 	In (case B), the application will get asynchronous callbacks
> 	(they might be realized by sending a defined string over
> 	a socket, by calling a callback function or in some other
> 	way -- the particular way of doing it is considered an
> 	implementation detail).
>
> 	Rationale: Both approaches are useful in different situations
> 	and each of them provides some capability that the other one
> 	doesn't.
>
>   Open Issue: Will the interaction with the driver be synchronous
> 	or asynchronous?  For example, will a call to `speak'
> 	wait to return until all the audio has been processed?  If
> 	not, what happens when a call to "speak" is made while the
>  	synthesizer is still processing a prior call to "speak?"
>
>   Proposal: With the exception of events and index marks signalling,
> 	the communication will be synchronous. When a speak request
> 	is issued while the synthesizer is still processing a prior
> 	call to speak and the application didn't call pause before,
> 	this is considered an error.

Correction: With the exception of events and index marks signalling, and 
pause requests, the communication will be synchronous.

Comment: I assume synchronous is desired because that places the least burden 
on the synthesis authors.  But I assume that virtually all processing will 
have to be done asynchronously at some level.

>
> E. Related Specifications
>
>     SSML: http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
>           (see requirements at the following URL:
> http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/#ref-reqs)
>
>     SSML 'say-as' element attribute values:
>  	  http://www.w3.org/TR/2005/NOTE-ssml-sayas-20050526/
>
>     MRCP: http://www.ietf.org/html.charters/speechsc-charter.html
>
> F. Copying This Document
>
>   Copyright (C) 2006 ...
>   This specification is made available under a BSD-style license ...

-- 
Gary Cramblitt (aka PhantomsDad)
KDE Text-to-Speech Maintainer
http://accessibility.kde.org/developer/kttsd/index.php

