[fdo] Re: TTS API

Milan Zamazal pdm at brailcom.org
Fri Oct 29 13:42:27 PDT 2004


Thanks, Olaf and Bill, for your participation and for keeping the idea
moving forward!

>>>>> "OJS" == Olaf Jan Schmidt <ojschmidt at kde.org> writes:
>>>>> "BH" == Bill Haneman <Bill.Haneman at Sun.COM> writes:

    OJS> I have just asked David Stone when we can start using the list.

Thanks.

    >> Or more generally a sequence of audio samples.  Motivation: I
    >> think most software synthesizers we are likely to support process
    >> the whole text in several steps, only the last of them being
    >> writing the whole produced audio sample somewhere.  When
    >> synthesizing long texts, it is desirable to allow the synthesizer
    >> to split the input into several pieces so that we don't wait too
    >> long for the first audio data.

    OJS> KTTSD already does this, and I think it would be duplication of
    OJS> work to do it in every driver if the higher speech system can
    OJS> take care of this.

I think the higher level speech system can't do this, since it requires
utterance chunking, which is a typical TTS function.  Utterance chunking
must be performed by a low-level TTS system for two main reasons:
1. it is language dependent; 2. only the TTS system knows how large a
piece of text it needs to produce its speech output at the appropriate
quality.

If you care about TTS systems which can't perform what we need here,
then we can make this functionality optional in the drivers.  If the
TTS system/driver doesn't support it, the higher-level speech system
can fall back on some heuristics, as KTTSD does, I guess.

But from the point of view of duplicated effort I think it would still
be better to do it in the drivers rather than in the higher-level speech
system.  If the drivers share a common code base, then the fallback
utterance chunking can be just a library function shared by all the
drivers which don't provide their own version, and there's no need to
implement it in KTTSD, GNOME Speech, Speech Dispatcher, etc.
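
To make the idea concrete, here is a minimal sketch of what such a
shared fallback could look like.  All names are made up for
illustration, and the splitting is deliberately naive; a real
implementation would have to be language aware:

    /* Hypothetical shared fallback: naive utterance chunking for
     * drivers whose engine cannot chunk text by itself.  Splits on
     * sentence-final punctuation only. */
    #include <stdlib.h>
    #include <string.h>

    /* Returns a NULL-terminated array of newly allocated chunks. */
    char **tts_fallback_chunk(const char *text)
    {
        size_t len = strlen(text), n = 0, start = 0;
        char **chunks = malloc((len + 2) * sizeof *chunks);

        for (size_t i = 0; i < len; i++) {
            if (text[i] == '.' || text[i] == '!' || text[i] == '?') {
                chunks[n++] = strndup(text + start, i - start + 1);
                start = i + 1;
                while (start < len && text[start] == ' ')
                    start++;        /* skip the inter-sentence spaces */
                i = start - 1;      /* loop increment moves i to start */
            }
        }
        if (start < len)            /* trailing text without a terminator */
            chunks[n++] = strdup(text + start);
        chunks[n] = NULL;
        return chunks;
    }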

    OJS> Doing it before sending the phrases to the engines allows us to
    OJS> interrupt a longer text with warnings, etc.

I don't understand what you mean here exactly.

    OJS> 2.b) For hardware speech: possibility to set markers and to
    OJS> get feedback whenever a marker has been reached.
    >> 
    >> Markers should be available for both software and hardware
    >> synthesis.  But they differ in their form: While with hardware
    >> synthesis feedback should be received whenever the marker is
    >> reached in the audio output, with software synthesis the
    >> positions of the markers in the returned audio sample should be
    >> returned.  Alternatively, the audio sample can be returned in
    >> several pieces as described above; in particular, it can be split
    >> at marker positions and the returned list could contain not only
    >> the audio samples, but also the reached markers.

    OJS> Is there any advantage to sending the whole text at once to the
    OJS> drivers, rather than sending it in smaller pieces which each
    OJS> return an audio stream?  If sending it in a bigger piece avoids
    OJS> lags, then the bigger complexity in the API might perhaps be
    OJS> worthwhile, but if the lags would be small anyway, then I would
    OJS> suggest keeping the API simpler.
    
    BH> Yes; some drivers do a lot of semantic/contextual processing,
    BH> which can't be done properly with smaller text snippets.

    BH> Again, there is a tradeoff between size/latency and quality -
    BH> but it's important to allow the client to do this both ways.
    BH> The client can then decide whether to send small chunks or large
    BH> ones.

    BH> The callback API must allow for sending big chunks, and getting
    BH> finer-grained notification before the whole request has
    BH> completed.  Of course different TTS engines will have different
    BH> marker capabilities (as was noted above).

See above; I agree with what Bill writes here.
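
For concreteness, a driver-side callback interface along these lines
might look roughly as follows.  This is only a sketch with made-up
names, not a proposal for the final API:

    #include <stddef.h>

    typedef enum {
        TTS_EVENT_MARKER_REACHED, /* hardware synthesis: marker reached */
        TTS_EVENT_AUDIO_CHUNK,    /* software synthesis: next audio piece */
        TTS_EVENT_DONE            /* the whole request has completed */
    } tts_event_type;

    typedef struct {
        tts_event_type type;
        const char *marker_name;  /* valid for TTS_EVENT_MARKER_REACHED */
        const void *audio_data;   /* valid for TTS_EVENT_AUDIO_CHUNK */
        size_t audio_length;      /* number of bytes in audio_data */
    } tts_event;

    typedef void (*tts_callback)(const tts_event *event, void *user_data);

    /* The client can send an arbitrarily large marked-up text and still
     * receive fine-grained notifications through the callback. */
    int tts_driver_say(void *driver, const char *text,
                       tts_callback cb, void *user_data);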

    BH> I think you probably do not want to return audio samples from
    BH> the TTS driver API in most cases.  It's better to have some API
    BH> for connecting the driver with an audio sink.

I agree the right way to return the produced audio data is to write it
to a given stream.  We could probably agree that the API shouldn't
specify which kind of stream it is (whether some kind of audio sink, a
file stream or any other kind of binary stream).
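
One way to keep the API agnostic about the kind of stream is to let the
higher-level system hand the driver a write hook; a hypothetical sketch:

    #include <sys/types.h>    /* size_t, ssize_t */

    /* The higher-level system decides where the audio goes (audio sink,
     * file, socket, ...); the driver just calls the hook. */
    typedef ssize_t (*tts_audio_write)(const void *buf, size_t len,
                                       void *sink_data);

    int tts_driver_synthesize(void *driver, const char *text,
                              tts_audio_write write_fn, void *sink_data);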

    >> Good remark.  But if I understand it correctly, this doesn't
    >> concern the TTS API directly; it can just receive and process the
    >> pieces separately, one by one, so there's no need for the drivers
    >> to be able to process a list of strings?
    >> 

    OJS> If you have markup within a phrase, then we cannot pass parts
    OJS> of the phrase independently of each other.  So we would need a
    OJS> string list in this case.

I see.  I'm not sure it is a clean technique, but I must think about it
more before forming my opinion on it.
    
    >> I'd suggest using SSML instead of VoiceXML.  If I'm not mistaken,
    >> SSML is what is aimed at TTS, while the purpose of VoiceXML is
    >> different.

    BH> There are some licensing issues to be careful of here - we must
    BH> use an unencumbered XML markup flavor.

You probably mean patent issues?  We should definitely avoid them.  Are
you aware of particular problems with SSML or its subset relevant to our
API?

    OJS> I thought that the GSAPI used some extension of VoiceXML, but
    OJS> maybe I am misinformed here.  We should use the same syntax in
    OJS> any case.  We can discuss the different possibilities on the
    OJS> list once it has been set up.

    BH> The proposed "GSAPI 1.0" called for some XML markup; I think
    BH> it's a good idea.  I will re-check my notes to make sure which
    BH> version we proposed; it was at the time the clear winner based
    BH> on licensing issues and end-user adoption.

OK, thanks.

    >> I'm not sure values other than languages are needed (except for
    >> the purpose of configuration as described in C. below).  The
    >> application can decide in which language to send the text
    >> depending on the available languages, but could available voice
    >> names or genders affect the application's behavior in any
    >> significant way?

    OJS> KTTSD allows the user to select the preferred voices by name,
    OJS> and it needs to know which languages and genders are supported
    OJS> by the engines to switch to the correct driver if several are
    OJS> installed.  Using different voices for different purposes (long
    OJS> texts, messages, navigation feedback) is also only possible if
    OJS> it is known which voices exist and which driver must be used to
    OJS> access them.

OK, I understand the purpose now.  We could probably select the exact
parameter set to be consistent with the voice selection features of the
chosen markup?
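
For example, if SSML were chosen, the parameter set could mirror the
voice selection attributes of its voice element.  A hypothetical sketch:

    /* Voice selection parameters mirroring SSML's <voice> attributes
     * (xml:lang, name, gender, age, variant); purely illustrative. */
    typedef enum {
        TTS_GENDER_ANY, TTS_GENDER_MALE,
        TTS_GENDER_FEMALE, TTS_GENDER_NEUTRAL
    } tts_gender;

    typedef struct {
        const char *language;   /* e.g. "en-US", as in xml:lang */
        const char *name;       /* engine-specific name, may be NULL */
        tts_gender gender;
        int age;                /* 0 = unspecified */
        int variant;            /* 0 = unspecified */
    } tts_voice_query;

    /* Returns a NULL-terminated list of names of matching voices. */
    const char **tts_driver_list_voices(void *driver,
                                        const tts_voice_query *query);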

    BH> I think the voice name should be determined at the higher level
    BH> API, and the drivers should operate on a "voice" or "speaker".

I don't understand what you mean here with "voice" or "speaker" exactly.

    BH> I think that changing speaker within a single marked-up string
    BH> is an unusual case.

I can imagine it very easily -- consider faces in Emacs.  For instance,
it may be very convenient to have a comment inside a line of source
code read by a different speaker.

    >> 5. Other features needed (some of them are included and can be
    >> expressed in SSML):
    >> 
    >> - Enabling/disabling spelling mode.

    BH> Not sure this makes sense at the low level.

It does, since spelling is language dependent.

    >> - Switching punctuation and capital character signalling modes.
    >> 

    OJS> I am not sure what exactly you mean by these two.

In spelling mode, the given text is spelled out character by character.

Punctuation modes handle punctuation reading in different ways:
e.g. there may be modes for explicit reading of all punctuation
characters, for not reading any punctuation characters, or for reading
punctuation characters as they would likely be read by a human reader.

Capital character signalling mode signals capital characters within the
text, e.g. by beeping before each of them.
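
If these modes ended up in the driver API, they might be expressed as
simple settings like the following (names invented for illustration):

    typedef enum {
        TTS_PUNCT_NONE,     /* read no punctuation characters */
        TTS_PUNCT_SOME,     /* read punctuation as a human reader would */
        TTS_PUNCT_ALL       /* read every punctuation character aloud */
    } tts_punctuation_mode;

    typedef struct {
        int spelling;                      /* non-zero: spell the text */
        tts_punctuation_mode punctuation;
        int signal_capitals;               /* e.g. beep before capitals */
    } tts_reading_modes;

    int tts_driver_set_modes(void *driver, const tts_reading_modes *modes);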

    >> - Setting rate and pitch.
    >> 

    OJS> There are xml tags for this, but there should be a way to set a
    OJS> default.

    BH> I don't think we should rely _solely_ on XML for this, so I
    BH> agree with you.  There should be a way to set the "base" or
    BH> "current" parameters on a given voice or speaker (if the
    BH> voice/speaker supports this).

Agreed.
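
Such base parameters could be plain setters on a voice, with markup in
the text overriding them per utterance; a tiny hypothetical sketch:

    /* 1.0 means the voice's default; markup in the text can still
     * override these per utterance. */
    int tts_voice_set_rate(void *voice, float rate);
    int tts_voice_set_pitch(void *voice, float pitch);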

    >> - Reading single characters and key names.
    >> 

    OJS> Would this make more sense on the driver level, or should the
    OJS> higher speech system deal with this to keep it consistent
    OJS> across all drivers?

    BH> Probably should be the job of the higher speech system.

Again, it is language dependent.  Moreover, there can be ambiguities
between texts, characters, and keys (e.g. `a' may be a word or a
character in English).  Maybe it could technically be solved in some
way on the higher level using some language dependent tables or so, but
I'd prefer not to mess with any lower level TTS functionality in the
higher level speech systems -- let them just clearly express what they
want to synthesize and let the synthesizer do it.
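
Separate entry points would let the client express the distinction
unambiguously; a possible (purely hypothetical) shape:

    /* The driver then knows whether "a" is meant as a word, a single
     * character, or a key name, and can speak it accordingly. */
    int tts_driver_say_text(void *driver, const char *text);
    int tts_driver_say_char(void *driver, const char *utf8_char);
    int tts_driver_say_key(void *driver, const char *key_name);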

    OJS> Sure, if a driver has no configuration options to be shown in
    OJS> the kttsd configuration module,

... or if the driver author doesn't want to spend his valuable time
designing and writing such functionality ...

    OJS> then this is not needed. I only want to avoid that kttsd,
    OJS> gnome-speech, SpeechDispatcher etc. all have to write their own
    OJS> configuration functions for the same drivers.

Agreed.

    >> First we should agree on the form of the drivers.  Do we want
    >> just some code base providing the defined features or do we want
    >> to define some form of a particular API, possibly to be used by
    >> alternative APIs?
    >> 

    OJS> Could you explain the differences between the two options a
    OJS> bit?

Maybe there's actually none. :-)  But we should agree on the kind of
interface anyway.  Shared library?
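
If we went the shared library route, each driver could export a single
entry point returning a table of operations.  A sketch, reusing the
hypothetical types from the examples above:

    typedef struct {
        const char *driver_name;
        void *(*init)(void);
        void  (*shutdown)(void *driver);
        int   (*say)(void *driver, const char *text,
                     tts_callback cb, void *user_data);
        const char **(*list_voices)(void *driver,
                                    const tts_voice_query *query);
    } tts_driver_ops;

    /* Each driver .so would export just this: */
    const tts_driver_ops *tts_driver_entry(void);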

Regards,

Milan Zamazal

-- 
If we are going to start removing packages because of the quality of the
software, wonderful.  I move to remove all traces of the travesty of editors,
vi, from Debian, since obviously as editors they are less than alpha quality
software.                                   -- Manoj Srivastava in debian-devel

