[Accessibility] TTS API document updated

Hynek Hanke hanke at brailcom.org
Sun Feb 19 07:55:12 PST 2006



Hello everyone,

I'm trying to move our work on the TTS API forward, as this is something
really important for accessibility right now. I'm sending an updated
version of the TTS API requirements document (below) with the following
updates, and I request your comments. Most importantly, I've worked
on the SSML subset specification and tried to resolve the open issues.

* I moved open issue 1.7 (whether an application must be able to discover
if SHOULD HAVE and NICE TO HAVE capabilities are supported) to the MUST HAVEs.

* Point 4.12 (dealing with pause and resume) was edited so that it is
possible to synthesize another text while one is paused. It is
essential for accessibility that pausing doesn't block the user from
doing other things in the meantime.

* Since there were no suggestions for open issue 4.16 (Should we specify
which audio formats to use?), I changed it into OUTSIDE SCOPE. I believe
that's a problem to solve between synthesizers and the multimedia framework,
not in the TTS API.

* I've defined the SSML subset to cover the functionality currently provided
by SSIP, Speakup, Gnome Speech and KTTSD (to my knowledge), trying to sort
elements and attributes into SHOULD HAVE and NICE TO HAVE. This is in part B.

The SAY-AS element specification draft has been released by the W3C in
the meantime. However, it still doesn't fully cover our needs, as discussed
before. Namely, 'say-as' can't contain other markup, so it is not suitable
for punctuation signalling modes. Also, it lacks a mode for reading keys
and some other minor things.

For this purpose, I've investigated the possibility of extending SSML to our
needs and found this can well be done by introducing a new namespace,
thereby adding some new attributes to 'say-as' and one new element
for the purpose of signalling punctuation and capital letters.

* There were two more important open issues left which I felt are more
interface-definition issues than requirements, so I moved them
to part D (interface description). It seems this group doesn't have strong
opinions on these issues, as nobody has said anything for the past year or
so. But we need to decide on them, and rather soon. So I have proposed a
solution; please speak up if you see a problem with it for your project.


Happy to hear any comments,
Hynek Hanke

Common TTS Driver Interface
============================
Document version: 2006-02-18

The purpose of this document is to define a common way to access
speech synthesizers on Free Software and Open Source platforms.  It
contains a list of general requirements on the speech synthesizer
interface drivers and the definition of a low-level interface that can
be used to access the speech synthesizer drivers.

A. Requirements

  This section defines a set of requirements on speech synthesizer
  drivers that need to support assistive technologies on free software
  platforms.

  1. Design Criteria

    The Common TTS Driver Interface requirements will be developed
    within the following broad design criteria:

    1.1. Focus on supporting assistive technologies first.  These
      assistive technologies can be written in any programming language
      and may provide specific support for particular environments such
      as KDE or GNOME.

    1.2. Simple and specific requirements win out over complex and
      general requirements.

    1.3. Use existing APIs and specs when possible.

    1.4. All language-dependent functionality should be covered here,
      not in applications.

    1.5. Requirements will be categorized in the following priority
      order: MUST HAVE, SHOULD HAVE, and NICE TO HAVE.

      The priorities have the following meanings:
          
      MUST HAVE: All conforming drivers must satisfy this
        requirement.

      SHOULD HAVE: The driver will be usable without this feature, but
        it is expected the feature is implemented in all drivers
        intended for serious use.

      NICE TO HAVE: Optional features.

      Regardless of priority, the full interface must always be
      provided, even when the given functionality is not actually
      implemented behind the interface.

    1.6. Requirements outside the scope of this document will be
      labelled as OUTSIDE SCOPE.

    1.7. An application must be able to determine if SHOULD HAVE
      and NICE TO HAVE features are supported.


  2. Synthesizer Discovery Requirements

    2.1. MUST HAVE: An application will be able to discover all speech
      synthesizer drivers available on the machine.

    2.2. MUST HAVE: An application will be able to discover all possible
      voices available for a particular speech synthesizer driver.

    2.3. MUST HAVE: An application will be able to determine the
      supported languages, possibly including also a dialect or a
      country, for each voice available for a particular speech
      synthesizer driver.

      Rationale: Knowledge about available voices and languages is
      necessary to select proper driver and to be able to select a
      supported language or different voices in an application.

    2.4. MUST HAVE: Applications may assume their interaction with the
      speech synthesizer driver doesn't affect other operating system
      components in any unexpected way.

    2.5. OUTSIDE SCOPE: Higher level communication interfaces to the
      speech synthesizer drivers. The exact form of the communication
      protocol (text protocol, IPC, etc.).

      Note: It is expected they will be implemented by particular
      projects (Gnome Speech, KTTSD, Speech Dispatcher) as wrappers
      around the low-level communication interface defined below.


  3. Synthesizer Configuration Requirements

    3.1. MUST HAVE: An application will be able to specify the default
      voice to use for a particular synthesizer, and will be able to
      change the default voice in between `speak' requests.

    3.2. SHOULD HAVE: An application will be able to specify the default
      prosody and style elements for a voice.  These elements will match
      those defined in the SSML specification, and the synthesizer may
      choose which attributes it wishes to support.  Note that prosody,
      voice and style elements specified in SSML sent as a `speak'
      request will override the default values.

    3.3. SHOULD HAVE: An application should be able to provide the
      synthesizer with an application-specific pronunciation lexicon
      addendum.  Note that using the `phoneme' element in SSML is another
      way to accomplish this on a very localized basis, and it will
      override any pronunciation lexicon data for the synthesizer.

      Rationale: This feature is necessary so that the application is
      able to speak artificial words or words with explicitly modified
      pronunciation (e.g. "the word ... is often mispronounced as ...
      by foreign speakers").
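
      For illustration only (not part of the requirements), such a
      localized pronunciation override could be expressed with the SSML
      `phoneme' element; the word and IPA transcription below are just
      an example:

      ```xml
      <speak xml:lang="en">
        The word
        <phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme>
        is often mispronounced by foreign speakers.
      </speak>
      ```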

    3.4. MUST HAVE: Applications may assume they have their own local
      copy of a synthesizer and voice.  That is, one application's
      configuration of a synthesizer or voice should not conflict with
      another application's configuration settings.

    3.5. MUST HAVE: Changing the default voice or voice/prosody element
      attributes does not affect a `speak' in progress.
          
  4. Synthesis Process Requirements

    4.1. MUST HAVE: The speech synthesizer driver is able to process
      plain text (i.e. text that is not marked up via SSML) encoded in
      the UTF-8 character encoding.

    4.2. MUST HAVE: The speech synthesizer driver is able to process
      text formatted using the extended SSML markup defined in part B of
      this document and encoded in UTF-8.  The synthesizer may choose
      to ignore markup it cannot handle, or even to ignore all markup,
      as long as it is able to process the text inside the markup.

    4.3. SHOULD HAVE: The speech synthesizer driver is able to properly
      process the extended SSML markup defined as SHOULD HAVE in part B
      of this document.  The same applies analogously to NICE TO HAVE.

    4.4. MUST HAVE: An application must be able to cancel a synthesis
      operation in progress.  In case of hardware synthesizers, this
      means cancelling the audio output as well.

    4.5. MUST HAVE: The speech synthesizer driver must be able to
      process long input texts in such a way that the audio output
      starts to be available for playing as soon as possible.  An
      application is not required to split long texts into smaller
      pieces.

    4.6. SHOULD HAVE: The speech synthesizer driver should honor the
      Performance Guidelines described below.

    4.7. NICE TO HAVE: It would be nice if a synthesizer were able to
      support "rewind" and "repeat" functionality for an utterance (see
      related descriptions in the MRCP specification).

      Rationale: This allows moving over long texts without the need to
      synthesize the whole text and without losing context.

    4.8. NICE TO HAVE: It would be nice if a synthesizer were able to
      support multilingual utterances.

    4.9. SHOULD HAVE: A synthesizer should support notification of
      `mark' elements, and the application should be able to align
      these events with the synthesized audio.

    4.10. NICE TO HAVE: It would be nice if a synthesizer supported
      "word started" and "word ended" events and allowed alignment of
      the events similar to that in 4.9.

      Rationale: This is useful to update cursor position as a displayed
      text is spoken.

    4.11. NICE TO HAVE: It would be nice if a synthesizer supported
      timing information at the phoneme level and allowed alignment of
      the events similar to that in 4.9. 

      Rationale: This is useful for talking heads.

    4.12. SHOULD HAVE: The application will be able to pause and resume
      a synthesis operation in progress while still being able to handle
      other synthesis requests in the meantime.  In the case of hardware
      synthesizers, this means pausing, and if possible resuming, the
      audio output as well.

    4.13. SHOULD HAVE: The synthesizer should not try to split the
      contents of the `s' SSML element into several independent pieces,
      unless required by a markup inside.

      Rationale: An application may have better information about the
      synthesized text and perform its own utterance chunking.

    4.14. OUTSIDE SCOPE: Message management (queueing, ordering,
      interleaving, etc.).

    4.15. OUTSIDE SCOPE: Interfacing software synthesis with audio
      output.

    4.16. OUTSIDE SCOPE: Specifying the audio format to be used by a
      synthesizer.

   5. Performance Guidelines

     In order to make the speech synthesizer driver actually usable with
     assistive technologies, it must satisfy certain performance
     expectations.  The following text gives driver implementors a rough
     idea of what is needed in practice.

     Typical scenarios when working with a speech enabled text editor:

     5.1. Typed characters are spoken (echoed).
     
       Reading of the characters and cancelling the synthesis must be
       very fast, to catch up with a fast typist or even with
       autorepeat.  Consider a typical autorepeat rate of 25 characters
       per second.  Ideally, within each of these 40 ms intervals,
       synthesis should begin, produce some audio output and stop.  To
       perform all these actions within 100 ms (considering a fast
       typist and some overhead in the application and the audio output)
       on common hardware is very desirable.

       Appropriate character reading performance may be difficult to
       achieve with contemporary software speech synthesizers, so it may
       be necessary to use techniques like caching of the synthesized
       characters.  Also, it is necessary to ensure there is no initial
       pause ("breathing in") within the synthesized character.

    5.2. Moving over words or lines, each of them is spoken.

      The sound sample needn't be available as quickly as in the case of
      typed characters, but it still should be available without clearly
      noticeable delay.  As the user moves over the words or lines, they
      must hear the text immediately.  Cancelling the synthesis of the
      previous word or line must be instant.

    5.3. Reading a large text file.

      In such a case, it is not necessary to start speaking instantly,
      because reading a large text is not a very frequent operation.  A
      one-second delay at the start is acceptable, although not
      comfortable.  Cancelling the speech must still be instant.


B. XML (extended SSML) Markup in Use

  This section defines the set of XML markup and special
  attribute values for use in input texts for the drivers.
  The markup consists of two namespaces: 'SSML' (default)
  and 'tts', where 'tts' introduces several new attributes
  to be used with the 'say-as' element and a new element
  'style'.

  If an SSML element is supported, all its mandatory attributes
  by the definition of SSML 1.0 must be supported even if they
  are not explicitly mentioned in this document.

  This section also defines which functions the API
  needs to provide for default prosody, voice and style settings,
  according to (3.2).

  Note: According to available information, SSML is not known
  to suffer from any IP issues.


  B.1. SHOULD HAVE: The following elements are supported
	speak
	voice
	prosody
	say-as

  B.1.1. These SPEAK attributes are supported
	1 (SHOULD HAVE): xml:lang

  B.1.2. These VOICE attributes are supported
	1 (SHOULD HAVE):  xml:lang
	2 (SHOULD HAVE):  name
	3 (NICE TO HAVE): gender
	4 (NICE TO HAVE): age
	5 (NICE TO HAVE): variant

  B.1.3. These PROSODY attributes are supported
	1 (SHOULD HAVE): pitch  (with +/- %, "default")
	2 (SHOULD HAVE): rate   (with +/- %, "default")
	3 (SHOULD HAVE): volume (with +/- %, "default")
	4 (NICE TO HAVE): range  (with +/- %, "default")
	5 (NICE TO HAVE): 'pitch', 'rate', 'range'
		with absolute value parameters
		
   Note: The corresponding global relative prosody settings
   commands (not markup) in TTS API represent the percentage
   value as a percentage change with respect to the default
   value for the given voice and parameter, not with respect
   to previous settings.


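  As an informal sketch combining the voice and prosody elements above
  (the voice name is hypothetical, and relative values follow SSML 1.0
  conventions):

  ```xml
  <speak xml:lang="en">
    <voice xml:lang="en" name="example-voice">
      <prosody rate="+25%" pitch="-10%">
        This text is spoken faster and at a lower pitch
        than the voice's defaults.
      </prosody>
    </voice>
  </speak>
  ```
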
  B.1.4. The SAY-AS attribute 'interpret-as'
	is supported with the following values

	1 (SHOULD HAVE) characters
		The format 'glyphs' is supported.

	Rationale: This provides capability for spelling.

	2 (SHOULD HAVE) tts:char
		Indicates the content of the element is a single
	character and that it should be pronounced as a character.
	The element's CDATA should contain only a single character.

	This is different from the interpret-as value "characters"
	described in B.1.4.1. While "characters" is intended
	for spelling words and sentences, "tts:char" means
	pronouncing the given character (which might be subject
	to different settings, such as using sound icons to
	represent symbols).

	If more than one character is present as the contents
	of the element, this is considered an error.

	Example:
	<speak>
	<say-as interpret-as="tts:char">@</say-as>
	</speak>		

	Rationale: It is useful to have a separate attribute
	for "single characters" as this can be used in TTS
	configuration to distinguish the situation when
	the user is moving with cursor over characters
        from the situation of spelling. As well as in other
	situations where the concept of "single character"
	has some logical meaning.
		
	3 (SHOULD HAVE) tts:key
		The content of the element should be interpreted
	as the name of a key. See section (C) for the possible string
	values of the content of this element. If a string is given
	which is not defined in section (C), the behavior of the
	synthesizer is undefined.

	Example:
	<speak>
	<say-as interpret-as="tts:key">shift_a</say-as>
	</speak>

	4 (NICE TO HAVE) tts:digits
		Indicates the content of the element is a number.
	The attribute "detail" is supported and can take a numerical
	value, indicating how many digits the synthesizer should group
	together for reading. The value 0 means the number should be
	pronounced as a whole, while any non-zero value means that
	groups of that many digits should be formed for reading,
	starting from the left.

	Example: The string "5431721838" would normally be read
	as "five billion four hundred thirty-one million ..." but
	when enclosed in the above say-as with detail set to 3, it
	would be read as "five hundred forty-three, one hundred
	seventy-two etc.", or, with detail set to 1, as "five, four,
	three, one etc."

	Note: This is an extension to SSML not defined in the
	format itself, introduced under the namespace 'tts' (as
	allowed	in SSML 'say-as' specifications).
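
	In markup, the grouped-digits example above would read
	(a sketch using the extension proposed here):

	```xml
	<speak>
	<say-as interpret-as="tts:digits" detail="3">5431721838</say-as>
	</speak>
	```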


  B.2. NICE TO HAVE: The following elements are supported
	mark
	s
	p
	phoneme
	sub

  B.2.1. NICE TO HAVE: These P attributes are supported:
	1 xml:lang

  B.2.2. NICE TO HAVE: These S attributes are supported 
	1 xml:lang

  B.3. SHOULD HAVE: An element `tts:style' (not defined in SSML 1.0)
	is supported. It has two mandatory attributes, 'field'
	and 'mode', and an optional string attribute 'detail'. The
	attribute 'field' can take the following values
		1) punctuation
		2) capital_letters
	defined below.

	If the attribute 'field' is set to 'punctuation',
	the 'mode' attribute can take the following values
		1) none
		2) all
		3) (NICE TO HAVE) some
	When set to 'none', no punctuation characters are explicitly
	indicated. When set to 'all', all punctuation characters
	in the text should be indicated by the synthesizer.  When
	set to 'some', the synthesizer will pronounce the
	punctuation characters enumerated in the additional attribute
	'detail', or, if no 'detail' attribute is specified, will speak
	only those characters selected by its own settings.

	The attribute 'detail' takes the form of a string containing
	the punctuation characters to read.

	Example:
	<tts:style field="punctuation" mode="some" detail=".?!">

	If the attribute 'field' is set to 'capital_letters',
	the 'mode' attribute can take the following values
		1) no
		2) spelling
		3) (NICE TO HAVE) icon
		4) (NICE TO HAVE) pitch

	When set to 'no', capital letters are not explicitly
	indicated. When set to 'spelling', capital letters are
	spelled (e.g. "capital a"). When set to 'icon', a sound
	is inserted before the capital letter, possibly leaving
	the letter/word/sentence otherwise intact. When set to 'pitch',
	the capital letter is pronounced with a higher pitch,
	possibly leaving the letter/word/sentence otherwise intact.
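
	As an informal sketch (assuming 'tts:style' encloses the
	affected text, analogously to the punctuation example above):

	```xml
	<speak>
	<tts:style field="capital_letters" mode="icon">
	Reading GNU inserts a sound icon before each capital letter.
	</tts:style>
	</speak>
	```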


	Rationale: These are basic capabilities well established
	in accessibility. However, SSML does not support them.
	Introducing this additional element does not prevent outside
	applications from sending valid SSML to the TTS API.

  B.4. NICE TO HAVE: Support for the rest of the elements and attributes
	defined in SSML 1.0. However, this is of lower priority than
	the enumerated subset above.

  Open Issue: In many situations, it will be desirable to
   preserve whitespace characters in the incoming document.
   Should we require the application to use the 'xml:space'
   attribute for the speak element or should we state 'preserve'
   is the default value for 'xml:space' in the root 'speak'
   element in this case?

C. Key names

A key name may contain any character excluding control characters (the
characters in the range 0 to 31 in the ASCII table and other
``invisible'' characters), spaces, dashes and underscores.

  C.1 The recognized key names are:
   1) Any single UTF-8 character, excluding the exceptions defined
      above.

   2) Any of the symbolic key names defined below.

   3) A combination of key names defined below, using the
	'_' (underscore) character for concatenation.

   Examples of valid key names:
	A
	shift_a
	shift_A
	$
	enter
	shift_kp-enter
	control
	control_alt_delete
  
  C.2 List of symbolic key names

  C.2.1 Escaped keys
	space
	underscore
	dash

  C.2.2 Auxiliary Keys
	alt
	control
	hyper
	meta
	shift
	super

  C.2.3 Control Character Keys
	backspace
	break
	delete
	down
	end
	enter
	escape
	f1
	f2 ... f24
	home
	insert
	kp-*
	kp-+
	kp--
	kp-.
	kp-/
	kp-0
	kp-1 ... kp-9
	kp-enter
	left
	menu
	next
	num-lock
	pause
	print
	prior
	return
	right
	scroll-lock
	space
	tab
	up
	window

D. Interface Description

  This section defines the low-level TTS driver interface for use by
  all assistive technologies on free software platforms.

  1. Speech Synthesis Driver Discovery
   
  ...

  2. Speech Synthesis Driver Interface

  ...

  Open Issue: There is still no clear consensus on how to return the
	synthesized audio data (if at all).  The main issue here is
	mostly how to align marker and other time-related events
	with the audio being played on the audio output device.

  Proposal: There will be two possible ways to do it. The synthesized
	data can be returned to the application (case A), or the
	application can ask for it to be played on the audio output,
	which will not be the task of the TTS API but will be handled
	by another API (case B).

	In case A, each time the application gets a piece of audio
	data, it also gets a time-table of index marks and events
	in that piece of data. This will be done on a separate socket
	in asynchronous mode. (This is possible for software
	synthesizers only, however.)

	In case B, the application will get asynchronous callbacks
	(they might be realized by sending a defined string over
	a socket, by calling a callback function or in some other
	way -- the particular way of doing it is considered an
	implementation detail).

	Rationale: Both approaches are useful in different situations
	and each of them provides some capability that the other one
	doesn't.

  Open Issue: Will the interaction with the driver be synchronous
	or asynchronous?  For example, will a call to `speak'
	wait to return until all the audio has been processed?  If
	not, what happens when a call to "speak" is made while the
 	synthesizer is still processing a prior call to "speak?"

  Proposal: With the exception of event and index mark signalling,
	the communication will be synchronous. When a speak request
	is issued while the driver is still processing a prior call to
	speak and the application didn't call pause before, this is
	considered an error.

E. Related Specifications

    SSML: http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
          (see requirements at the following URL:
          http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/#ref-reqs)
	
    SSML 'say-as' element attribute values:
 	  http://www.w3.org/TR/2005/NOTE-ssml-sayas-20050526/

    MRCP: http://www.ietf.org/html.charters/speechsc-charter.html

F. Copying This Document

  Copyright (C) 2006 ...
  This specification is made available under a BSD-style license ...


