[fdo] Re: TTS API

Bill Haneman Bill.Haneman at Sun.COM
Thu Nov 4 07:51:27 PST 2004


Thanks, Rich, for the helpful clarification.  TV is certainly an
authoritative source.

- Bill

On Thu, 2004-11-04 at 15:46, Richard Schwerdtfeger wrote:
> I checked with TV Raman about the IP issues regarding the Speech
> Markup. Raman was involved in its specification. The W3C requires
> all IP regarding a new specification to be released to the W3C as
> part of the effort. He knows of no patents disclosed to the working
> group on SSML.
> 
> Rich
> 
> 
> Rich Schwerdtfeger
> STSM, Software Group Accessibility Strategist/Master Inventor
> Emerging Internet Technologies
> Chair, IBM Accessibility Architecture Review Board
> schwer at us.ibm.com, Phone: 512-838-4593,T/L: 678-4593
> 
> "Two roads diverged in a wood, and I - 
> I took the one less traveled by, and that has made all the
> difference.", Frost
> 
> Milan Zamazal <pdm at freebsoft.org>
> Sent by: Milan Zamazal <pdm at zamazal.org>
> 11/04/2004 06:15 AM
>
> To: Willie Walker <William.Walker at Sun.COM>, Olaf Jan Schmidt
> <ojschmidt at kde.org>, Gary Cramblitt <garycramblitt at comcast.net>,
> Roger Butenuth <butenuth at online.de>, Janina Sajka
> <janina at rednote.net>, Peter Korn <Peter.Korn at Sun.COM>, Gunnar
> Schmi Dt <gunnar at schmi-dt.de>, Aaron Leventhal
> <aaronleventhal at moonset.net>, Janina Sajka
> <janina at freestandards.org>, Harald Fernengel
> <harald at trolltech.com>, freedesktop at freedesktop.org, George
> Kraft/Austin/IBM at IBMUS, Marc Mulcahy <marc at plbb.net>,
> bill.haneman at Sun.COM, Marco Skambraks <marco at suse.de>, Rich
> Burridge <Rich.Burridge at Sun.COM>, Richard
> Schwerdtfeger/Austin/IBM at IBMUS
> cc:
> Subject: Re: TTS API
> 
> 
> I expanded Olaf's initial document and tried to summarize in it the
> current state of our discussion about the requirements and related
> issues.  Please tell me if anything is missing or unclear, and let's
> try to finish the requirements section by resolving the open
> questions.
> 
> Regards,
> 
> Milan Zamazal
> 
> Common TTS Interface
> ====================
> 
> * Introduction
> 
> The purpose of this document is to define a common interface to
> speech synthesizers.  The common interface is needed to avoid
> duplicate work when implementing speech synthesizer drivers for the
> different free software higher level speech frameworks (like
> gnome-speech, KTTSD or Speech Dispatcher) and to allow uniform access
> to all supported speech synthesizers from those frameworks.
> 
> * Scope of the Document
> 
> The specification covers all the necessary text-to-speech functions
> which are directly related to speech synthesizers.  This especially
> concerns all language dependent text-to-speech functions.  Particular
> areas covered by this specification include:
> 
> - Synthesizing a given piece of text.
> 
> - Stopping the synthesis process.
> 
> - Managing audio output of hardware speech synthesizers.
> 
> - Setting basic speech synthesis parameters like pitch, rate or
> volume.
> 
> - Using different languages, voices and reading modes.
> 
> - Index marking.
> 
> - Configuration interface.
> 
> - Definition of a low level interface to be used to access the
> drivers.
> 
> On the other hand, the following areas are explicitly omitted:
> 
> - Message management (queueing, ordering, interleaving, etc.).  This
>  is a job of the higher level speech frameworks, not of speech
>  synthesizers.
> 
> - Interfacing with audio devices.  Again, this is a job of specialized
>  software (higher level speech frameworks and sound servers), not of
>  speech synthesizers.  But in the case of hardware speech synthesizers
>  using solely their own audio output, it is necessary to manage their
>  audio output too.
> 
> - Higher level interfaces accessing the drivers, like IPC or socket
>  based interfaces.  Different projects are designed around and use
>  different forms of interprocess communication, and given the current
>  state of things it is unlikely that a consensus on a common high
>  level communication interface to the drivers can be found.  So those
>  interfaces are left to be implemented separately by the projects
>  that need them.  They are expected to be written as wrappers around
>  the common low level access interface.
> 
> - Interaction with other components of the operating system.  The
>  speech synthesis process itself is unlikely to mess with other parts
>  of the operating system in any unusual way.  This may not apply to
>  higher level speech frameworks, but these are out of the scope of
>  this document.
> 
> * General Requirements on the TTS Interface
> 
> The synthesis process:
> 
> - Synthesis of a given piece of text expressed in a markup format
>  [unresolved: Which one?  SSML or a reasonable subset of it?  Isn't
>  SSML covered by patents preventing Free Software and Open Source
>  programs from using it?  How about character encoding -- would it
>  suffice to use UTF-8 everywhere?].  (A rough C sketch of such a call
>  follows after this list.)
> 
> - Synthesis of characters and key names [possibly using custom SSML
>  attribute values?].  Rationale: It's not possible to express them as
>  ordinary text without language specific knowledge.
> 
> - Short time before delivering the first playable audio data after
>  initiating a new synthesis, even when another synthesis request
>  issued before needs to be stopped first.
> 
> - No significant performance hits (like long response times or wasted
>  CPU time or memory) when many synthesis requests arrive shortly
>  after each other (each cancelling the previous one).
> 
> - [Unresolved: Should the driver be able to receive the markup text
>  to synthesize in several pieces?  The motivation is to ease
>  processing of texts in KTTSD a bit.  I personally don't think it's a
>  valid reason to complicate the interface, considering it has nothing
>  to do with the speech synthesis process.  But maybe I still miss
>  something.]
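>
> Just to make the discussion more concrete, here is a very rough
> sketch of how the main synthesis calls could look in C.  All names
> and types are placeholders of mine, nothing here is settled:
>
>   /* Illustrative sketch only -- no agreed interface yet. */
>   typedef struct tts_driver tts_driver_t;  /* opaque driver handle */
>
>   typedef enum { TTS_OK = 0, TTS_ERROR = -1 } tts_status_t;
>
>   /* Start synthesis of a UTF-8 encoded markup (e.g. SSML) text.  Any
>      synthesis still in progress would be cancelled first, so that
>      the first audio data of the new request can be delivered
>      quickly. */
>   tts_status_t tts_say_markup (tts_driver_t *driver, const char *text);
>
>   /* Speak a single character or a key name; these cannot be
>      expressed as plain text without language specific knowledge. */
>   tts_status_t tts_say_char (tts_driver_t *driver, const char *character);
>   tts_status_t tts_say_key (tts_driver_t *driver, const char *key_name);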
> 
> Software synthesis:
> 
> - [There is still no clear consensus on how to return the synthesized
>  audio data.  Maybe we could agree it should be written to a given
>  binary stream?  But in which format?  What if the audio data is
>  split into several pieces (see below)?  And how to return index
>  marker positions (see below)?  One callback based possibility is
>  sketched after this list.]
> 
> - [Should it be allowed to return the audio data in several separate
>  pieces?  It complicates returning them, but what if the synthesizer
>  splits a long input text and is unable to merge the resulting
>  waveforms?  Should the driver be responsible for handling this?]
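>
> To illustrate one of the possibilities mentioned above, the audio
> could be handed back in pieces through a callback, reusing the
> placeholder types from the first sketch.  Again, this is only an
> illustration with invented names, and the sample format question
> remains open:
>
>   #include <stddef.h>   /* for size_t */
>
>   /* Called repeatedly with consecutive chunks of synthesized audio;
>      a final call with size_in_bytes == 0 could mark the end of the
>      utterance. */
>   typedef void (*tts_audio_callback_t) (const void *samples,
>                                         size_t size_in_bytes,
>                                         void *user_data);
>
>   tts_status_t tts_set_audio_callback (tts_driver_t *driver,
>                                        tts_audio_callback_t callback,
>                                        void *user_data);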
> 
> Hardware synthesis:
> 
> - Management of the audio output: immediate stopping, pausing and
>  resuming (a possible set of calls is sketched below).
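>
> For hardware synthesizers this could be as simple as three calls;
> once more, the names are only placeholders:
>
>   /* Audio control for hardware synthesizers playing through their
>      own audio output. */
>   tts_status_t tts_stop   (tts_driver_t *driver);  /* stop immediately */
>   tts_status_t tts_pause  (tts_driver_t *driver);
>   tts_status_t tts_resume (tts_driver_t *driver);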
> 
> Index markers:
> 
> - Support for identifying where or when given places (index markers)
>  in the input text are reached.  [Not all synthesizers can support
>  this -- should index marking be optional?]
> 
> - Hardware synthesis must signal reaching index markers via
>  callbacks, invoked when the index marker is actually reached while
>  playing the audio output.
> 
> - Software synthesis must identify the positions of index markers
>  within the returned audio data.  [The question is how to do it.  If
>  we are able to return the audio output in several pieces, then we
>  can think about a linear sequence of audio samples and marker
>  identifiers, where each marker is placed at its position between
>  separate audio samples.  Another possible way is to write the times
>  of reaching the markers in the produced audio data to a separate
>  stream; this works with a single audio output but it requires
>  certain precautions to ensure a marker is not missed on the marker
>  stream when playing data from the audio stream.  A small sketch of
>  both variants follows after this list.]
> 
> - [The KTTSD approach to warnings and messages suggests it could be
>  useful if some sort of index markers could be inserted into the
>  input texts automatically, at breakable places, i.e. places where
>  the audio output can be interrupted without breaking the speech at
>  an unsuitable point (e.g. in the middle of a word or short
>  sentence).  This can be useful for pausing the speech or for
>  speaking unrelated important messages when reading longer pieces of
>  text.  What do you think?]
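>
> To illustrate the two variants of index marker reporting discussed
> above, something along these lines could work; as before, all the
> names are invented for the sake of the example:
>
>   /* Hardware synthesis: invoked when the audio output actually
>      reaches the marker. */
>   typedef void (*tts_marker_callback_t) (const char *marker_name,
>                                          void *user_data);
>
>   /* Software synthesis: one way to report marker positions is a
>      list of (marker name, byte offset into the returned audio)
>      pairs. */
>   typedef struct {
>       const char *name;     /* marker identifier from the input text */
>       size_t audio_offset;  /* position within the synthesized audio */
>   } tts_marker_position_t;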
> 
> Setting speech parameters:
> 
> - It should be possible to set basic speech parameters like language,
>  voice, rate, pitch and volume.  Rationale: The parameters can be set
>  by the input text markup, but there should be a way to set the
>  defaults.  (A sketch of possible setters follows after this list.)
>
> - It should be possible to switch the reading modes of the
>  synthesizer, namely: punctuation mode, capital letter signalization
>  mode and spelling mode.
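>
> A possible shape of the default parameter setters, just for
> illustration (the ranges and units, here relative percentages, would
> still have to be agreed on):
>
>   tts_status_t tts_set_language (tts_driver_t *driver, const char *lang);
>   tts_status_t tts_set_voice    (tts_driver_t *driver, const char *voice);
>   tts_status_t tts_set_rate     (tts_driver_t *driver, int percent);
>   tts_status_t tts_set_pitch    (tts_driver_t *driver, int percent);
>   tts_status_t tts_set_volume   (tts_driver_t *driver, int percent);
>
>   typedef enum {
>       TTS_PUNCTUATION_NONE,
>       TTS_PUNCTUATION_SOME,
>       TTS_PUNCTUATION_ALL
>   } tts_punctuation_mode_t;
>   tts_status_t tts_set_punctuation_mode (tts_driver_t *driver,
>                                          tts_punctuation_mode_t mode);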
> 
> Retrieving available parameter values:
> 
> - It should be possible to return a list of the supported languages
>  and voices (identified by parameters matching the voice selection
>  mechanism of the chosen input text markup; a possible representation
>  is sketched after this list).  [Other parameters?]  [Shouldn't this
>  be optional?  I'm not sure all synthesizers are able to provide this
>  information.]  Rationale: This allows the higher level speech
>  frameworks and/or applications to make decisions about selecting a
>  supported language (when more language alternatives are available
>  on the input) or about selecting a particular supported voice for a
>  given piece of text, without the danger that the voice gets quietly
>  mapped to the same voice as the surrounding text.
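>
> One way the voice list could be represented, with attribute names
> loosely following the SSML voice selection attributes (again only an
> illustration):
>
>   typedef struct {
>       const char *name;      /* synthesizer specific voice name */
>       const char *language;  /* e.g. "en-US", "cs" */
>       const char *gender;    /* "male", "female", "neutral", ... */
>       int age;               /* 0 if unknown */
>   } tts_voice_t;
>
>   /* Stores a pointer to a driver owned array in *voices and returns
>      its length; could return -1 if the driver cannot provide the
>      information. */
>   int tts_list_voices (tts_driver_t *driver, const tts_voice_t **voices);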
>  
> Configuration:
> 
> - Getting and setting configuration parameters of the synthesizer
>  (see the sketch after this list).  [Should this be here or in a
>  standard API for driver configuration libraries?]
> 
> - This facility is optional.
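>
> If we keep this facility here, it could be as simple as a generic
> key/value interface; purely a sketch:
>
>   /* Optional generic configuration interface. */
>   tts_status_t tts_set_config (tts_driver_t *driver, const char *key,
>                                const char *value);
>   const char  *tts_get_config (tts_driver_t *driver, const char *key);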
> 
> * Interface Definition
> 
> [This is to be defined later, after the general requirements are
> settled.  In the meantime, we can think about an appropriate form of
> the low level interface.  Do we agree it should have the form of a
> shared library accompanied by corresponding C header files?]
> 
> [Definition of the interface functions.]
> 
> [How to access the drivers in the operating system environment.]
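>
> As a very first illustration of what such a header could contain,
> combining the placeholder names sketched in the requirements section
> above (none of this is a proposal of the final names):
>
>   /* tts-driver.h -- illustrative outline only */
>   #include <stddef.h>
>
>   typedef struct tts_driver tts_driver_t;
>
>   /* Each driver could be a shared object exporting well known entry
>      points for opening and closing it: */
>   tts_driver_t *tts_driver_open  (const char *configuration);
>   void          tts_driver_close (tts_driver_t *driver);
>
>   /* ... plus the synthesis, audio, index marker, parameter and
>      configuration calls sketched above ... */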
> 
> * Final Remarks
> 
> It might be useful to extend the specification process (in separate
> documents) to higher level speech frameworks and audio output systems
> in the future.
> 
> * Copying
> 
> [Any idea about a good free license for the final document?]
> 


