[fdo] Re: TTS API
Olaf Jan Schmidt
ojschmidt at kde.org
Fri Oct 29 06:35:51 PDT 2004
Thanks for your comments on the requirements list.
[Milan Zamazal, Tuesday, 26 October 2004 20:58]
> [Since the mailing list apparently hasn't been created yet, I continue
> in private not to freeze the discussion for too long.]
I have just asked David Stone when we can start using the list.
> BTW, this might be subject of another standardization step. I'd like
> to look at kttsd features -- is there some reasonable description or
> documentation of kttsd available?
> Or more generally a sequence of audio samples. Motivation: I think
> most software synthesizers we are likely to support perform processing
> of the whole text in several steps, only last of them being writing the
> whole produced audio sample somewhere. When synthesizing long texts,
> it is desirable to allow the synthesizer to split the input into
> several pieces so that we don't wait for the first coming audio data
> too long.
KTTSD already does this, and I think it would be a duplication of work to do
it in every driver if the higher speech system can take care of it.
Splitting the text before sending the phrases to the engines also makes it
possible to interrupt a longer text with warnings, etc.
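To sketch what I mean, here is a rough illustration of a higher speech
system that splits a text into phrases and lets warnings jump the queue.
All names are invented for illustration; this is not actual KTTSD code.

```python
import re
from collections import deque

class SpeechQueue:
    """Toy model: long texts are split into phrases; warnings preempt."""

    def __init__(self):
        self.phrases = deque()

    def say_text(self, text):
        # Split on sentence boundaries so each phrase can be
        # synthesized (and interrupted) separately.
        for phrase in re.split(r'(?<=[.!?])\s+', text.strip()):
            if phrase:
                self.phrases.append(phrase)

    def say_warning(self, text):
        # A warning is queued ahead of any waiting text phrases.
        self.phrases.appendleft(text)

    def next_phrase(self):
        # An engine driver would be fed one phrase at a time.
        return self.phrases.popleft() if self.phrases else None

q = SpeechQueue()
q.say_text("First sentence. Second sentence.")
q.say_warning("Battery low!")
```

Because the splitting happens above the drivers, a warning can be spoken
between two phrases of a long text without any driver cooperation.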
> OJS> 2.b) For hardware speech: possibility to set markers and to
> OJS> get feedback whenever a marker has been reached.
> Markers should be available for both software and hardware synthesis.
> But they differ in their form: While with hardware synthesis feedback
> should be received whenever the marker is reached in the audio output,
> with software synthesis positions of the markers in the returned audio
> sample should be returned. Or the audio sample can be returned in
> several pieces as described above, it can be especially split on marker
> positions and the returned list could contain not only the audio
> samples, but also the reached markers.
Is there any advantage to sending the whole text at once to the drivers,
rather than sending it in smaller pieces which each return an audio stream?
If sending it in one bigger piece avoids lags, then the added complexity
in the API might be worthwhile, but if the lags would be small anyway,
I would suggest keeping the API simpler.
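To make the two shapes being compared concrete, here is a sketch in
Python. Everything here is hypothetical (including `fake_engine`); it is
only meant to show the difference between the interfaces, not to propose
an actual API.

```python
from typing import Iterator, List, NamedTuple, Optional

class AudioChunk(NamedTuple):
    samples: bytes         # audio data for one piece of the text
    marker: Optional[str]  # marker reached after this chunk, if any

def fake_engine(piece: str) -> bytes:
    # Stand-in for a real software synthesizer.
    return piece.encode("utf-8")

def synthesize_whole(text: str) -> Iterator[AudioChunk]:
    """Whole-text variant: the driver splits internally and yields
    chunks as they become ready, so playback can start early."""
    for piece in text.split(". "):
        yield AudioChunk(samples=fake_engine(piece), marker=None)

def synthesize_piecewise(pieces: List[str]) -> List[bytes]:
    """Piecewise variant: the caller splits the text, and every call
    returns one complete audio stream."""
    return [fake_engine(p) for p in pieces]
```

In the whole-text variant the chunk list can also carry reached markers,
as Milan suggested; in the piecewise variant that bookkeeping stays in
the higher speech system.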
> Good remark. But if I understand it correctly, this doesn't concern
> the TTS API directly, it can just receive and process the pieces
> separately, one by one, so there's no need for the drivers to be able
> to process a list of strings?
If you have markup within a phrase, then we cannot pass parts of the
phrase independently of each other. So we would need a string list in
any case. A driver can easily turn the string list back into a single
string; the list would only help those drivers that parse the string for
tags rather than passing it on to an XML-supporting engine.
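A rough illustration of both driver kinds (the markup content and the
function names are invented for the example):

```python
import re

# A phrase whose markup spans several pieces of the string list.
pieces = ["Weather report: ",
          '<emphasis level="strong">', "storm warning", "</emphasis>",
          " for tonight."]

def to_single_string(pieces):
    # Trivial for drivers whose engine understands the XML itself.
    return "".join(pieces)

def strip_tags(pieces):
    # A tag-parsing driver would instead interpret the markup and
    # send plain text to a non-XML engine.
    return "".join(p for p in pieces if not re.fullmatch(r"</?[^>]+>", p))
```

Either way, the markup inside the phrase stays intact, which is the point
of passing the phrase as one list rather than as unrelated fragments.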
> I'd suggest using SSML instead of VoiceXML. If I'm not mistaken, SSML
> is what is aimed at TTS, while the purpose of VoiceXML is different.
I thought that the GSAPI used some extension of VoiceXML, but maybe I am
misinformed here. We should use the same syntax in any case. We can
discuss the different possibilities on the list once it has been set up.
> I'm not sure values other than languages are needed (except for the
> purpose of configuration as described in C. below). Application can
> decide in which language to send the text depending on the available
> languages, but could available voice names or genders involve the
> application behavior in any significant way?
KTTSD allows the user to select the preferred voices by name, and it needs
to know which languages and genders are supported by the engines in order
to switch to the correct driver if several are installed. Using different
voices for different purposes (long texts, messages, navigation
feedback) is also only possible if it is known which voices exist and
which driver must be used to access them.
> 5. Other features needed (some of them are included and can be
> expressed in SSML):
> - Enabling/disabling spelling mode.
> - Switching punctuation and capital character signalling modes.
I am not sure what exactly you mean by these two.
> - Setting rate and pitch.
There are XML tags for this, but there should also be a way to set a default.
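For illustration: the SSML <prosody> element covers the per-text case,
while a default would be a separate driver call. The helper below is a
sketch, and the `set_default_rate` call at the end is purely hypothetical.

```python
from xml.sax.saxutils import escape

def with_prosody(text, rate=None, pitch=None):
    # Wrap text in an SSML <prosody> element; attributes are only
    # emitted when given, so a driver-wide default can apply otherwise.
    attrs = "".join(f' {name}="{value}"'
                    for name, value in (("rate", rate), ("pitch", pitch))
                    if value is not None)
    return f"<prosody{attrs}>{escape(text)}</prosody>"

print(with_prosody("Hello", rate="slow", pitch="+10%"))
# -> <prosody rate="slow" pitch="+10%">Hello</prosody>

# A default would apply when no tag is present, e.g. (hypothetical):
# driver.set_default_rate("medium")
```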
> - Reading single characters and key names.
Would this make more sense on the driver level, or should the higher
speech system deal with it so that it is consistent across all drivers?
> OJS> We could either add these functions to the driver API, or we
> OJS> could define a standard API for driver configuration
> This functionality would be nice, but it should be optional, not to put
> more burden on the drivers than absolutely needed.
Sure, if a driver has no configuration options to be shown in the kttsd
configuration module, then this is not needed. I only want to avoid
kttsd, gnome-speech, SpeechDispatcher, etc. all having to write their own
configuration functions for the same drivers.
> First we should agree on the form of the drivers. Do we want just some
> code base providing the defined features or do we want to define some
> form of a particular API, possibly to be used by alternative APIs?
Could you explain the differences between the two options a bit?
Olaf Jan Schmidt, KDE Accessibility Project
KDEAP co-maintainer, maintainer of http://accessibility.kde.org