[Accessibility] Updated requirements document

Milan Zamazal pdm at brailcom.org
Mon Nov 15 14:42:50 PST 2004


Here is an updated version of the requirements document.  It is based
on a rewrite of the document by Willie Walker (we also discussed the
requirements in private, resulting in some changes and additions to
them).  It should reflect the current state of the whole discussion, so
if something is wrong or missing, tell me.

For now, I have put everything into a single document; it's simpler for
me.

If there are other parties who might be interested in the TTS interface,
someone should inform them about the mailing list.  Please note I'm not
going to do it!

Common TTS Driver Interface
===========================
Document version: 2004-11-15

The purpose of this document is to define a common way to access speech
synthesizers on Free Software and Open Source platforms.  It contains a
list of general requirements on the speech synthesizer interface drivers
implementing this specification and the definition of a low-level
interface that can be used to access the speech synthesizer drivers.

A. Requirements

  This section defines a set of requirements on the speech synthesizer
  drivers needed to support assistive technologies on free software
  platforms.

  1. Design Criteria

    The Common TTS Driver Interface requirements will be developed
    within the following broad design criteria:

    1.1. Focus on supporting assistive technologies first.  These
      assistive technologies can be written in any programming language
      and may provide specific support for particular environments such
      as KDE or GNOME.

    1.2. Simple and specific requirements win out over complex and
      general requirements.

    1.3. Use existing APIs and specs when possible.

    1.4. All language-dependent functionality should be covered here,
      not in applications.

    1.5. Requirements will be categorized in the following priority
      order: MUST HAVE, SHOULD HAVE, and NICE TO HAVE.

      The priorities have the following meanings:
          
      MUST HAVE: All conforming drivers must satisfy this
        requirement.

      SHOULD HAVE: The driver will be usable without this feature, but
        it is expected that the feature will be implemented in all
        drivers intended for serious use.

      NICE TO HAVE: Optional features.

      Regardless of the priority, the full interface must always be
      provided, even when the given functionality is not actually
      implemented behind the interface.

    1.6. Requirements outside the scope of this document will be
      labelled as OUTSIDE SCOPE.

    OPEN ISSUE:

    - Should an application be able to determine if SHOULD HAVE and NICE
      TO HAVE features are supported or not?


  2. Synthesizer Discovery Requirements

    2.1. MUST HAVE: An application will be able to discover all speech
      synthesizer drivers available to the machine.

    2.2. MUST HAVE: An application will be able to discover all possible
      voices available for a particular speech synthesizer driver.

    2.3. MUST HAVE: An application will be able to determine the
      supported languages, possibly also including a dialect or
      country, for each voice available for a particular speech
      synthesizer driver.

      Rationale: Knowledge about available voices and languages is
      necessary to select the proper driver and to be able to select a
      supported language or different voices in an application.

    2.4. MUST HAVE: Applications may assume their interaction with the
      speech synthesizer driver doesn't affect other operating system
      components in any unexpected way.

    2.5. OUTSIDE SCOPE: Higher level communication interfaces (like IPC
      services or text protocols) to the speech synthesizer drivers.

      Note: It is expected they will be implemented by particular
      projects (gnome-speech, KTTSD, Speech Dispatcher) as wrappers
      around the low-level communication interface defined below.
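
    Purely as an illustration of requirements 2.1-2.3 (this is not the
    interface definition, which belongs to part C; all names below are
    invented), a C binding for driver and voice discovery might look
    like this:

      /* Hypothetical sketch; all identifiers are invented. */

      typedef struct {
          const char *name;       /* voice name, e.g. "kal16" */
          const char *language;   /* RFC 3066 code, e.g. "en-US" */
          const char *dialect;    /* optional dialect or country
                                     refinement, or NULL */
      } tts_voice_t;

      /* Return a NULL-terminated list of the speech synthesizer
         drivers available to the machine (2.1). */
      const char **tts_list_drivers (void);

      /* Return a NULL-terminated list of the voices of DRIVER, each
         carrying its supported language (2.2, 2.3). */
      const tts_voice_t **tts_list_voices (const char *driver);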


  3. Synthesizer Configuration Requirements

    3.1. MUST HAVE: An application will be able to specify the default
      voice to use for a particular synthesizer, and will be able to
      change the default voice in between `speak' requests.

    3.2. SHOULD HAVE: An application will be able to specify the default
      prosody and style elements for a voice.  These elements will match
      those defined in the SSML specification, and the synthesizer may
      choose which attributes it wishes to support.  Note that prosody
      and style elements specified in SSML sent as a `speak' request
      will override the default values.

    3.3. SHOULD HAVE: An application should be able to provide the
      synthesizer with an application-specific pronunciation lexicon
      addendum.  Note that using the `phoneme' element in SSML is
      another way to accomplish this on a very localized basis, and it
      will override any pronunciation lexicon data for the synthesizer.

      Rationale: This feature is necessary so that the application is
      able to speak artificial words or words with explicitly modified
      pronunciation (e.g. "the word ... is often mispronounced as ...
      by foreign speakers").

    3.4. MUST HAVE: Applications may assume they have their own local
      copy of a synthesizer and voice.  That is, one application's
      configuration of a synthesizer or voice should not conflict with
      another application's configuration settings.

    3.5. MUST HAVE: Changing the default voice or voice/prosody element
      attributes does not affect a `speak' in progress.
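
    As a purely illustrative sketch of the configuration requirements
    above (all names are invented; the actual interface is defined in
    part C), the corresponding C calls might look as follows:

      /* Hypothetical sketch; all identifiers are invented. */

      /* Each application gets its own driver instance, so that one
         application's settings cannot conflict with another's (3.4). */
      typedef struct tts_driver tts_driver_t;

      /* Set the default voice for subsequent `speak' requests (3.1).
         A `speak' in progress is not affected (3.5). */
      int tts_set_default_voice (tts_driver_t *driver, const char *voice);

      /* Set a default prosody or style attribute, using the attribute
         names and value syntax of the SSML `prosody' element (3.2). */
      int tts_set_prosody_default (tts_driver_t *driver,
                                   const char *attribute,
                                   const char *value);

      /* Add an application-specific pronunciation lexicon addendum;
         the SSML `phoneme' element overrides such entries (3.3). */
      int tts_add_lexicon_entry (tts_driver_t *driver, const char *word,
                                 const char *pronunciation);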

          
  4. Synthesis Process Requirements

    4.1. MUST HAVE: The speech synthesizer driver is able to process
      plain text (i.e. text that is not marked up via SSML) encoded in
      the UTF-8 character encoding.

    4.2. MUST HAVE: The speech synthesizer driver is able to process
      text formatted using SSML and encoded in UTF-8.  The synthesizer
      may choose to ignore markup it cannot handle or even to ignore all
      markup as long as it is able to process the text inside the
      markup.

    4.3. SHOULD HAVE: The speech synthesizer driver is able to properly
      process the SSML markup subset defined in part B of this
      document.

    4.4. MUST HAVE: An application must be able to cancel a synthesis
      operation in progress.  In case of hardware synthesizers, this
      means cancelling the audio output as well.

    4.5. MUST HAVE: The speech synthesizer driver must be able to
      process long input texts in such a way that the audio output
      starts to be available for playing as soon as possible.  An
      application is not required to split long texts into smaller
      pieces.

    4.6. SHOULD HAVE: The speech synthesizer driver should honor the
      Performance Guidelines described below.

    4.7. NICE TO HAVE: It would be nice if a synthesizer were able to
      support "rewind" and "repeat" functionality for an utterance (see
      related descriptions in the MRCP specification).

      Rationale: This allows moving over long texts without the need to
      synthesize the whole text and without losing context.

    4.8. NICE TO HAVE: It would be nice if a synthesizer were able to
      support multilingual utterances.

    4.9. SHOULD HAVE: A synthesizer should support notification of
      `mark' elements, and the application should be able to align these
      events with the synthesized audio (see the illustrative sketch
      after the open issues below).

    4.10. NICE TO HAVE: It would be nice if a synthesizer supported
      "word started" and "word ended" events and allowed alignment of
      the events similar to that in 4.9.

      Rationale: This is useful to update cursor position as a displayed
      text is spoken.

    4.11. NICE TO HAVE: It would be nice if a synthesizer supported
      timing information at the phoneme level and allowed alignment of
      the events similar to that in 4.9.

      Rationale: This is useful for talking heads.

    4.12. SHOULD HAVE: The application should be able to pause and
      resume a synthesis operation in progress.  In case of hardware
      synthesizers, this means pausing and resuming the audio output
      as well.

    4.13. SHOULD HAVE: The synthesizer should not try to split the
      contents of the `s' SSML element into several independent pieces,
      unless required by markup inside the element.

      Rationale: An application may have better information about the
      synthesized text and perform its own utterance chunking.

    4.14. OUTSIDE SCOPE: Message management (queueing, ordering,
      interleaving, etc.).

    4.15. OUTSIDE SCOPE: Interfacing software synthesis with audio
      output.

    OPEN ISSUES:

    - There is still no clear consensus on how to return the synthesized
      audio data (if at all).  The main issue is how to align marker and
      other time-related events with the audio being played on the audio
      output device.

    - It is not clear how (or whether) to specify the audio format to be
      used by a synthesizer.

    - Implementation issue: Will the interaction with the driver be
      synchronous or asynchronous?  For example, will a call to `speak'
      wait to return until all the audio has been processed?  If not,
      what happens when a call to `speak' is made while the synthesizer
      is still processing a prior call to `speak'?
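
    For illustration of the synthesis process requirements above (all
    names are invented and the open issues remain open; the actual
    interface is defined in part C), the corresponding calls and the
    `mark' notification mentioned in 4.9 might be sketched in C as
    follows:

      /* Hypothetical sketch; all identifiers are invented. */

      typedef struct tts_driver tts_driver_t;

      /* Synthesize the UTF-8 encoded TEXT, either plain or SSML
         (4.1, 4.2).  Whether this call is synchronous is an open
         issue above. */
      int tts_speak (tts_driver_t *driver, const char *text);

      /* Cancel, pause and resume a synthesis operation in progress
         (4.4, 4.12). */
      int tts_cancel (tts_driver_t *driver);
      int tts_pause (tts_driver_t *driver);
      int tts_resume (tts_driver_t *driver);

      /* Callback invoked when an SSML `mark' element is reached.
         SAMPLE_OFFSET locates the mark within the produced audio, so
         that the application can align the event with the playback
         (4.9).  How this alignment interacts with the audio output is
         an open issue above. */
      typedef void (*tts_mark_callback_t) (const char *mark_name,
                                           unsigned long sample_offset,
                                           void *user_data);

      void tts_set_mark_callback (tts_driver_t *driver,
                                  tts_mark_callback_t callback,
                                  void *user_data);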

          
  5. Performance Guidelines

    In order to make the speech synthesizer driver actually usable with
    assistive technologies, it must satisfy certain performance
    expectations.  The following scenarios give driver implementors a
    rough idea of what is needed in practice.

    Typical scenarios when working with a speech-enabled text editor:

    5.1. Typed characters are spoken (echoed).

      Reading the characters and cancelling the synthesis must be very
      fast, to keep up with a fast typist or even with autorepeat.
      Consider a typical autorepeat rate of 25 characters per second:
      ideally, synthesis should begin, produce some audio output and
      stop within each of the resulting 40 ms intervals.  Performing
      all these actions within 100 ms (allowing for a fast typist and
      some overhead of the application and the audio output) on common
      hardware is very desirable.

      Appropriate character reading performance may be difficult to
      achieve with contemporary software speech synthesizers, so it may
      be necessary to use techniques like caching of the synthesized
      characters (a sketch of such a cache follows these scenarios).
      It is also necessary to ensure there is no initial pause
      ("breathing in") within the synthesized character.

    5.2. Moving over words or lines, each of them is spoken.

      The sound sample needn't be available as quickly as in the case of
      typed characters, but it should still be available without clearly
      noticeable delay.  As the user moves over the words or lines, the
      text must be heard immediately.  Cancelling the synthesis of the
      previous word or line must be instant.

    5.3. Reading a large text file.

      In such a case, it is not necessary to start speaking instantly,
      because reading a large text is not a very frequent operation.  A
      one-second delay at the start is acceptable, although not
      comfortable.  Cancelling the speech must still be instant.
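
    A minimal sketch of the character cache suggested in 5.1 follows
    (hypothetical code; a real implementation would also handle cache
    eviction and pre-filling with the most frequent characters):

      #include <string.h>

      #define CACHE_SIZE 256

      typedef struct {
          char character[8];   /* the UTF-8 character, NUL-terminated */
          short *samples;      /* synthesized audio, trimmed of any
                                  initial pause ("breathing in") */
          size_t n_samples;
      } cached_character_t;

      static cached_character_t cache[CACHE_SIZE];
      static size_t cache_used;

      /* Return the cached audio for CHARACTER, or NULL when it still
         has to be synthesized and stored by the caller. */
      static const cached_character_t *
      cache_lookup (const char *character)
      {
          size_t i;
          for (i = 0; i < cache_used; i++)
              if (strcmp (cache[i].character, character) == 0)
                  return &cache[i];
          return NULL;
      }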


  6. Related Specifications

    SSML: http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
          (see requirements at the following URL:
          http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/#ref-reqs)

    MRCP: http://www.ietf.org/html.charters/speechsc-charter.html


B. SSML Subset in Use

  This section defines the subset of the SSML markup and special
  attribute values for use in input texts to the drivers.

  Note: According to available information, SSML is not known to suffer
  from any IP issues.
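
  Purely for illustration (the exact subset is an open issue below), an
  input text using SSML elements discussed in this document might look
  like this:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xml:lang="en-US">
      <voice name="kevin">
        <prosody rate="fast" volume="loud">
          <s>This sentence is spoken quickly and loudly.</s>
          <s>A <mark name="here"/> notification fires inside this
            one.</s>
        </prosody>
      </voice>
    </speak>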

  ...
  
  OPEN ISSUES:

  - Need to specify which SSML elements should (or must) be supported.
  
  - Need to specify which SSML `voice' element attributes should (or
    must) be supported.

  - Need to specify which SSML `prosody' element attributes should (or
    must) be supported.  Especially pitch, rate and volume settings
    should be supported.

  - Definition of supported `say-as' attribute values.  We probably want
    to wait for a special `say-as' W3C specification that is expected to
    come soon.  Especially needed: speaking characters, speaking key
    names, spelling mode, capital letter signalling mode, punctuation
    modes.

  - Perhaps we should identify an ordered priority list of the SSML
    elements that should be supported?

  - ...


C. Interface Description

  This section defines the low-level TTS driver interface for use by all
  assistive technologies on free software platforms.

  1. Speech Synthesis Driver Discovery
   
  ...

  2. Speech Synthesis Driver Interface

  ...

   
D. Copying This Document

  Copyright (C) 2004 ...
  This specification is made available under a BSD-style license ...

