[fdo] Re: TTS API
Bill Haneman
Bill.Haneman at Sun.COM
Thu Nov 4 07:51:27 PST 2004
Thanks Rich, for the helpful clarification. TV is certainly an
authoritative source.
- Bill
On Thu, 2004-11-04 at 15:46, Richard Schwerdtfeger wrote:
> I checked with TV Raman regarding IP issues regarding the Speech
> Markup. Raman was involved with its specification. The W3C requires
> all IP regarding a new specification to be released to the W3C as part
> of the effort. He knows of no patents disclosed to the working group
> on SSML.
>
> Rich
>
>
> Rich Schwerdtfeger
> STSM, Software Group Accessibility Strategist/Master Inventor
> Emerging Internet Technologies
> Chair, IBM Accessibility Architecture Review Board
> schwer at us.ibm.com, Phone: 512-838-4593,T/L: 678-4593
>
> "Two roads diverged in a wood, and I -
> I took the one less traveled by, and that has made all the
> difference.", Frost
>
> Inactive hide details for Milan Zamazal <pdm at freebsoft.org>Milan
> Zamazal <pdm at freebsoft.org>
>
>
> Milan Zamazal <pdm at freebsoft.org>
> Sent by: Milan Zamazal <pdm at zamazal.org>
>
> 11/04/2004 06:15 AM
>
>
>
>
> To
>
> Willie Walker
> <William.Walker at Sun.COM>, Olaf Jan Schmidt <ojschmidt at kde.org>, Gary Cramblitt <garycramblitt at comcast.net>, Roger Butenuth <butenuth at online.de>, Janina Sajka <janina at rednote.net>, Peter Korn <Peter.Korn at Sun.COM>, Gunnar Schmi Dt <gunnar at schmi-dt.de>, Aaron Leventhal <aaronleventhal at moonset.net>, Janina Sajka <janina at freestandards.org>, Harald Fernengel <harald at trolltech.com>, freedesktop at freedesktop.org, George Kraft/Austin/IBM at IBMUS, Marc Mulcahy <marc at plbb.net>, bill.haneman at Sun.COM, Marco Skambraks <marco at suse.de>, Rich Burridge <Rich.Burridge at Sun.COM>, Richard Schwerdtfeger/Austin/IBM at IBMUS
>
> cc
>
>
> Subject
>
> Re: TTS API
>
>
> I expanded the initial Olaf's document and tried to summarize the
> current state of our discussion about the requirements and related
> issues in it. Please tell me if anything is missing or unclear and
> let's try to finish the requirements section by resolving the open
> questions.
>
> Regards,
>
> Milan Zamazal
>
> Common TTS Interface
> ====================
>
> * Introduction
>
> The purpose of this document is to define a common interface to speech
> synthesizers. The common interface is needed to avoid duplicate work
> when implementing speech synthesizer drivers for different free
> software
> higher level speech frameworks (like gnome-speech, KTTSD or Speech
> Dispatcher) and to allow uniform access to all supported speech
> synthesizers in the frameworks.
>
> * Scope of the Document
>
> The specification covers all the necessary text-to-speech functions
> which are directly related to speech synthesizers. This concerns
> especially all language dependent text-to-speech functions.
> Particular
> areas covered by this specification include:
>
> - Synthesizing given piece of text.
>
> - Stopping the synthesis process.
>
> - Managing audio output of hardware speech synthesizers.
>
> - Setting basic speech synthesis parameters like pitch, rate or
> volume.
>
> - Using different languages, voices and reading modes.
>
> - Index marking.
>
> - Configuration interface.
>
> - Definition of a low level interface to be used to access the
> drivers.
>
> On the other hand, the following areas are explicitly omitted:
>
> - Message management (queueing, ordering, interleaving, etc.). This
> is
> a job of the higher level speech frameworks, not of speech
> synthesizers.
>
> - Interfacing with audio devices. Again, this is a job of specialized
> software (higher level speech frameworks and sound servers), not of
> speech synthesizers. But in case of hardware speech synthesizers
> using solely their own audio output it is necessary to manage their
> audio output too.
>
> - Higher level interfaces accessing the drivers, like IPC or socket
> based interfaces. Different projects are designed and use different
> forms of interprocess communication and with respect to current state
> of things it's unlikely to find a consensus about a common high level
> communication interface to drivers. So those interfaces are left to
> be implemented separately by the projects that need them. They are
> expected to be written as wrappers around the common low level access
> interface.
>
> - Interaction with other components of the operating system. The sole
> process of speech synthesis is unlikely to mess with other parts of
> the operating system in any unusual way. This may not apply to
> higher
> level speech frameworks, but these are out of scope of this document.
>
> * General Requirements on the TTS Interface
>
> The synthesis process:
>
> - Synthesis of a given piece of text expressed in a markup format
> [unresolved: Which one? SSML or its reasonable subset? Isn't SSML
> covered by patents preventing Free Software and Open Source programs
> to use it? How about character encoding -- would it suffice to use
> UTF-8 everywhere?].
>
> - Synthesis of characters and key names [possibly using custom SSML
> attribute values?]. Rationale: It's not possible to express them
> just
> in the form of an ordinary text without language specific knowledge.
>
> - Short time before delivering first playable audio data after
> initiating new synthesis, even when it is needed to stop another
> synthesis request issued before.
>
> - No significant performance hits (like long response time or wasting
> CPU time or memory) when many synthesizing requests come short after
> each other (cancelling the previous one).
>
> - [Unresolved: Should the driver be able to receive the markup text to
> synthesize in several pieces? The motivation is to ease processing
> of
> texts in KTTSD a bit. I personally don't think it's a valid reason
> to
> complicate the interface, considering it has nothing to do with the
> speech synthesis process. But maybe I still miss something.]
>
> Software synthesis:
>
> - [Still not clear consensus on how to return the synthesized audio
> data. Maybe we could agree it should be written to a given binary
> stream? But in which format? What if the audio data is split in
> several pieces (see below)? And how to return index marker positions
> (see below)?]
>
> - [Should it be allowed to return the audio data in several separate
> pieces? It complicates returning them, but what if the synthesizer
> splits a long input text and is unable to merge the resulting wave
> forms? Should the driver be responsible for handling this?]
>
> Hardware synthesis:
>
> - Management of the audio output: immediate stopping, pausing and
> resuming.
>
> Index markers:
>
> - Support for identifying where or when given places (index markers)
> in
> the input text are reached. [Not all synthesizers can support this
> --
> should index marking be optional?]
>
> - Hardware synthesis must signal reaching index markers via callbacks,
> when the index marker is actually reached when playing the audio
> output.
>
> - Software synthesis must identify the positions of index markers
> within
> the returned audio data. [The question is how to do it. If we are
> able to return the audio output in several pieces, then we can think
> about a linear sequence of audio samples and marker identifiers,
> where
> each marker is placed at its position between separate audio samples.
> Another possible way is to write times of reaching the markers in the
> produced audio data to a separate stream; this works with single
> audio
> output but it requires certain precautions to ensure the marker is
> not
> missed on the marker stream when playing data from the audio stream.]
>
> - [The KTTSD approach to warnings and messages suggests it could be
> useful if some sort of index markers could be inserted into the input
> texts automatically, at breakable places. I.e. places, where the
> audio output can be interrupted without breaking the speech at an
> unsuitable place (e.g. in the middle of a word or short sentence).
> This can be useful for pausing the speech or for speaking unrelated
> important messages when reading longer pieces of text. What do you
> think?]
>
> Setting speech parameters:
>
> - It should be possible to set basic speech parameters like language,
> voice, rate, pitch and volume. Rationale: The parameters can be set
> by the input text markup, but there should be a way to set the
> defaults.
>
> - It should be possible to switch reading modes of the synthesizer,
> namely: punctuation mode, capital letter signalization mode, spelling
> mode.
>
> Retrieving available parameter values:
>
> - It should be possible to return a list of supported languages and
> voices (identified by parameters matching voice selection mechanism
> of
> the chosen input text markup). [Other parameters?] [Shouldn't this
> be optional? I'm not sure all synthesizers are able to provide this
> information.] Rationale: This allows the higher level speech
> frameworks and/or applications to make decisions about selecting a
> supported language (when more language alternatives are available on
> the input) or about selecting particular supported voice for a given
> piece of text without the danger the voice gets quietly mapped on the
> same voice as the surrounding text.
>
> Configuration:
>
> - Getting and setting configuration parameters of the synthesizer.
> [Should this be here or in a standard API for driver configuration
> libraries?]
>
> - This facility is optional.
>
> * Interface Definition
>
> [This is to be defined later, after the general requirements are
> settled
> down. In the meantime, we can think about appropriate form of the low
> level interface. Do we agree it should have the form of a shared
> library accompanied with corresponding C header files?]
>
> [Definition of the interface functions.]
>
> [How to access the drivers in the operating system environment.]
>
> * Final Remarks
>
> It might be useful to extend the specification process (in separate
> documents) to higher level speech frameworks and audio output systems
> in
> future.
>
> * Copying
>
> [Any idea about a good free license for the final document?]
>
More information about the freedesktop
mailing list