[fdo] RE: TTS API

Fri Oct 29 14:07:52 PDT 2004

Not all drivers synthesize the entire waveform before sending it to the
audio device.  In fact, those that do present a serious latency
problem.  The right thing to assume is that you'll get audio.  Whether you
get it streamed in parallel with the synthesis process, or only after the
synthesis is finished, is up to the engine.

So registering callbacks becomes a two-step process.  Register a callback
with the engine, and register a callback with the audio backend.  You'll
first get a callback from the engine specifying that a particular position
in the synthesis has been reached.  At this point, set a callback in the
audio backend to notify you when the current audio input position has been
reached.  When you get the callback from the audio backend, notify the
client.
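
A rough sketch of that two-step flow follows.  None of these names come
from an existing API: AudioBackend, Engine, connect, notify_at and
set_marker_callback are all hypothetical, and positions are counted in
samples purely as an assumption for the example.

```python
from typing import Callable, List, Optional, Tuple


class AudioBackend:
    """Hypothetical audio sink that tracks written and played samples."""

    def __init__(self) -> None:
        self.written = 0  # samples queued by the engine so far
        self.played = 0   # samples the device has actually consumed
        self._pending: List[Tuple[int, Callable[[], None]]] = []

    def write(self, n_samples: int) -> None:
        self.written += n_samples

    def notify_at(self, position: int, callback: Callable[[], None]) -> None:
        """Step 2: call `callback` once playback reaches `position`."""
        self._pending.append((position, callback))

    def play(self, n_samples: int) -> None:
        """Simulate the device consuming samples and fire due callbacks."""
        self.played = min(self.played + n_samples, self.written)
        due = [p for p in self._pending if p[0] <= self.played]
        self._pending = [p for p in self._pending if p[0] > self.played]
        for _, callback in sorted(due, key=lambda p: p[0]):
            callback()


class Engine:
    """Hypothetical engine that streams audio and reports markers."""

    def __init__(self, backend: AudioBackend) -> None:
        self.backend = backend
        self._on_marker: Optional[Callable[[str], None]] = None

    def set_marker_callback(self, callback: Callable[[str], None]) -> None:
        """Step 1: be told when synthesis reaches a marker."""
        self._on_marker = callback

    def synthesize(self, pieces: List[Tuple[Optional[str], int]]) -> None:
        """Each piece is (marker name or None, number of samples)."""
        for marker, n_samples in pieces:
            if marker is not None and self._on_marker is not None:
                self._on_marker(marker)
            self.backend.write(n_samples)


def connect(engine: Engine, backend: AudioBackend,
            notify_client: Callable[[str], None]) -> None:
    """Chain the engine callback into an audio-position callback."""

    def on_engine_marker(name: str) -> None:
        position = backend.written  # current audio input position
        backend.notify_at(position, lambda: notify_client(name))

    engine.set_marker_callback(on_engine_marker)
```

The client hears about a marker only when the device has played up to
it, even if the engine finished synthesizing long before:

```python
backend = AudioBackend()
engine = Engine(backend)
events: List[str] = []
connect(engine, backend, events.append)
engine.synthesize([(None, 100), ("m1", 50), ("m2", 50)])
backend.play(100)  # playback reaches "m1" only
backend.play(100)  # playback reaches "m2"
```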

-----Original Message-----
From: Bill Haneman [mailto:Bill.Haneman at Sun.COM]
Sent: Friday, October 29, 2004 7:59 AM
To: Olaf Jan Schmidt
Cc: Milan Zamazal; Peter Korn; Marc Mulcahy; Rich Burridge; Gunnar Schmi
Dt; Richard Schwerdtfeger; Aaron Leventhal; Willie Walker; Janina Sajka;
Marco Skambraks; George Kraft; Gary Cramblitt; Roger Butenuth;
freedesktop at freedesktop.org; Harald Fernengel; Janina Sajka; Marc
Mulcahy; Milan Zamazal
Subject: Re: TTS API

Olaf Jan Schmidt wrote:

>>Or more generally a sequence of audio samples.  Motivation: I think
>>most software synthesizers we are likely to support perform processing
>>of the whole text in several steps, only the last of them being writing the
>>whole produced audio sample somewhere.  When synthesizing long texts,
>>it is desirable to allow the synthesizer to split the input into
>>several pieces so that we don't wait for the first coming audio data
>>too long.
>>
>
>
> KTTSD already does this, and I think it would be duplication of work to do
> it in every driver if the higher speech system can take care of this.
> Doing it before sending the phrases to the engines allows interrupting a
> longer text with warnings, etc.

But this isn't always what you want.

>>    OJS> 2.b) For hardware speech: possibility to set markers and to
>>    OJS> get feedback whenever a marker has been reached.
>>
>>Markers should be available for both software and hardware synthesis.
>>But they differ in their form: While with hardware synthesis feedback
>>should be received whenever the marker is reached in the audio output,
>>with software synthesis positions of the markers in the returned audio
>>sample should be returned.  Or the audio sample can be returned in
>>several pieces as described above, it can be especially split on marker
>>positions and the returned list could contain not only the audio
>>samples, but also the reached markers.

I think you probably do not want to return audio samples
from the TTS driver API in most cases.  It's better to have some
API for connecting the driver with an audio sink.
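
As an illustration only, connecting a driver to a sink might look like
the sketch below.  AudioSink, BufferSink and ToyDriver are hypothetical
names, and the "audio" is just the encoded words of the text, standing
in for real samples.

```python
from typing import Iterator, List, Protocol


class AudioSink(Protocol):
    """Anything that accepts audio chunks (hypothetical interface)."""

    def write(self, chunk: bytes) -> None: ...


class BufferSink:
    """Collects chunks in memory; a real sink might wrap a sound device."""

    def __init__(self) -> None:
        self.chunks: List[bytes] = []

    def write(self, chunk: bytes) -> None:
        self.chunks.append(chunk)


class ToyDriver:
    """Pushes audio into its sink instead of returning samples."""

    def __init__(self, sink: AudioSink) -> None:
        self.sink = sink

    def _synthesize(self, text: str) -> Iterator[bytes]:
        # Stand-in for real synthesis: one fake chunk per word.
        for word in text.split():
            yield word.encode("utf-8")

    def speak(self, text: str) -> None:
        # Chunks reach the sink as they are produced, so the caller
        # never has to hold the whole waveform.
        for chunk in self._synthesize(text):
            self.sink.write(chunk)
```

With this shape the driver is free to stream or to synthesize everything
first; the caller only sees data arriving at the sink.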

>
>
> Is there any advantage to send the whole text at once to the drivers,
> rather than sending it smaller pieces which each return an audio stream?

Yes; some drivers do a lot of semantic/contextual processing, which
can't be done properly with smaller text snippets.

Again, there is a tradeoff between size/latency and quality - but it's
important to allow the client to do this both ways.  The client can then
decide whether to send small chunks or large ones.

The callback API must allow for sending big chunks, and getting
finer-grained notification before the whole request has completed.  Of
course different TTS engines will have different marker capabilities (as
was noted above).

> If sending it in a bigger piece avoids lags, then it might perhaps be
> worth the greater complexity in the API, but if the lags would be
> small anyway, then I would suggest keeping the API simpler.
>
>
>>Good remark.  But if I understand it correctly, this doesn't concern
>>the TTS API directly, it can just receive and process the pieces
>>separately, one by one, so there's no need for the drivers to be able
>>to process a list of strings?
>>
>
>
> If you have markup within a phrase, then we cannot pass parts of the
> phrase independently of each other. So we would need a string list in
> this case.
>
> A driver can easily turn the string list back into a string; it would
> only help those drivers that would parse the string for tags rather
> than passing it on to an xml-supporting engine.
>
>
>>I'd suggest using SSML instead of VoiceXML.  If I'm not mistaken, SSML
>>is what is aimed at TTS, while the purpose of VoiceXML is different.

There are some licensing issues to be careful of here - we must use an
unencumbered XML markup flavor.

>
> I thought that the GSAPI used some extension of VoiceXML, but maybe I am
> misinformed here.

The proposed "GSAPI 1.0" called for some XML markup; I think it's a good
idea.  I will re-check my notes to make sure which version we proposed;
it was at the time the clear winner based on licensing issues and
end-user adoption.

> We should use the same syntax in any case. We can
> discuss the different possibilities on the list once it has been set up.
>
>
>>I'm not sure values other than languages are needed (except for the
>>purpose of configuration as described in C. below).  Application can
>>decide in which language to send the text depending on the available
>>languages, but could available voice names or genders involve the
>>application behavior in any significant way?

I think the voice name should be determined at the higher level API, and
the drivers should operate on a "voice" or "speaker".  I think that
changing speaker within a single marked-up string is an unusual case.

> KTTSD allows the user to select the preferred voices by name, and it needs
> to know which languages and genders are supported by the engines to
> switch to the correct driver if several are installed. Using different
> voices for different purposes (long texts, messages, navigation
> feedback) is also only possible if it is known which voices exist and
> which driver must be used to use them.
>
>
>>5. Other features needed (some of them are included and can be
>>expressed in SSML):
>>
>>- Enabling/disabling spelling mode.

Not sure this makes sense at the low level.

>>
>>- Switching punctuation and capital character signalling modes.
>>
>
>
> I am not sure what exactly you mean by these two.
>
>
>>- Setting rate and pitch.
>>
>
>
> There are xml tags for this, but there should be a way to set a default.

I don't think we should rely _solely_ on XML for this, so I agree with
you.  There should be a way to set the "base" or "current" parameters on
a given voice or speaker (if the voice/speaker supports this).
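
One possible shape for this, as a sketch only: Prosody, Voice, set_base
and effective are hypothetical names, and the 1.0 = "normal" scale for
rate and pitch is an assumption made for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Prosody:
    rate: float = 1.0   # assumed scale: 1.0 is the voice's normal rate
    pitch: float = 1.0  # assumed scale: 1.0 is the voice's normal pitch


class Voice:
    """Hypothetical voice holding "base" parameters set outside markup."""

    def __init__(self, name: str, adjustable: bool = True) -> None:
        self.name = name
        self.adjustable = adjustable  # some voices may not support this
        self.base = Prosody()

    def set_base(self, rate: Optional[float] = None,
                 pitch: Optional[float] = None) -> bool:
        """Change the base parameters; return False if unsupported."""
        if not self.adjustable:
            return False
        changes = {}
        if rate is not None:
            changes["rate"] = rate
        if pitch is not None:
            changes["pitch"] = pitch
        self.base = replace(self.base, **changes)
        return True

    def effective(self, markup_rate: Optional[float] = None,
                  markup_pitch: Optional[float] = None) -> Prosody:
        """Values from markup win; anything unset falls back to base."""
        return Prosody(
            rate=self.base.rate if markup_rate is None else markup_rate,
            pitch=self.base.pitch if markup_pitch is None else markup_pitch,
        )
```

Here the markup only overrides what it mentions, so a default set once
on the voice keeps applying to plain text and to untouched parameters.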

>
>
>>- Reading single characters and key names.
>>
>
>
> Would this make more sense on the driver level, or should the higher
> speech system deal with this to have this consistent for all drivers?

Probably should be the job of the higher speech system.

>>    OJS> We could either add these functions to the driver API, or we
>>    OJS> could define a standard API for driver configuration
>>    OJS> libraries.
>>
>>This functionality would be nice, but it should be optional, not to put
>>more burden on the drivers than absolutely needed.
>>
>
>
> Sure, if a driver has no configuration options to be shown in the kttsd
> configuration module, then this is not needed. I only want to avoid that
> kttsd, gnome-speech, SpeechDispatcher etc. all have to write their own
> configuration functions for the same drivers.
>
>
>>First we should agree on the form of the drivers.  Do we want just some
>>code base providing the defined features or do we want to define some
>>form of a particular API, possibly to be used by alternative APIs?
>>
>
>
> Could you explain the differences between the two options a bit?
>
> Olaf
>
> - --
> Olaf Jan Schmidt, KDE Accessibility Project
> KDEAP co-maintainer, maintainer of http://accessibility.kde.org