[fdo] Re: TTS API

Milan Zamazal pdm at brailcom.org
Fri Nov 5 03:46:23 PST 2004


Thanks for your comments and suggestions!

>>>>> "GC" == Gary Cramblitt <garycramblitt at comcast.net> writes:

* SSML

    GC> I also think SSML is probably the best choice as the basis for
    GC> markup format, although we may need to simplify and/or extend.
    GC> For example, SSML provides a number of ways to control volume,
    GC> including relative/absolute numerics, etc.  Within KTTSD we just
    GC> use soft/normal/loud.  (If users want finer control than that,
    GC> they can adjust the volume of their audio device.)  What we want
    GC> is a reasonable subset that provides a rich-enough speech
    GC> environment to meet the needs of applications (especially screen
    GC> readers) without burdening driver authors with having to do a
    GC> full SSML implementation.  

I agree.  Having implemented partial SSML support for Festival, I can
say it's mostly a lot of boring work.  As for prosody attribute
values, there are two basic options:

1. To use absolute numeric values only.  This is simple for the driver
   and any legal value can be given.

2. To use full sets of symbolic values, which can be mapped (and
   possibly configured) in the driver to particular synthesizer
   specific values.  This limits the range of possible values a bit
   and complicates the driver slightly, but it adds flexibility for
   the user, who can change the driver mappings.  For instance, this
   way he could align the differing default rates or pitches of
   various synthesizers or even voices.  (See the illustration below.)
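
For illustration (the `prosody' attribute values below are as defined
by SSML 1.0, while the mapping side is only my invention):

    <!-- option 1: absolute numeric values, passed through as-is -->
    <prosody pitch="120Hz" volume="80.0">numeric control</prosody>

    <!-- option 2: symbolic values, mapped by the driver -->
    <prosody pitch="high" volume="soft">symbolic control</prosody>

A configurable mapping for option 2 might then say, for instance, that
`high' means 1.2 times the default pitch of the particular synthesizer
voice and `soft' means half of its default volume.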

    GC> A synth might want to provide full volume and rate attribute
    GC> support (numerics, etc.) but that would be optional.

Yes.

    >> - Synthesis of characters and key names [possibly using custom
    >> SSML attribute values?].  Rationale: It's not possible to express
    >> them just in the form of an ordinary text without language
    >> specific knowledge.

    GC> SSML provides the "say-as" element for this but does not specify
    GC> the actual attribute values.

Exactly.

    GC> The W3C working group is supposedly working on a separate
    GC> document for these.

Do you know any details?  Is some proposal available or do we have to
invent our own values for now?
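
Until such a document appears, we could use something like the
following (both `interpret-as' values here are my own invention,
nothing official):

    <say-as interpret-as="characters">ls</say-as>
    <say-as interpret-as="key">Control+Alt+F1</say-as>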

    >> Setting speech parameters:
    >> 
    >> - It should be possible to switch reading modes of the
    >> synthesizer, namely: punctuation mode, capital letter
    >> signalization mode, spelling mode.

    GC> SSML "say-as" element.

OK.

We should probably write down a proposal for what subset of SSML to
use and what `say-as' attribute values we define and use.
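
As a starting point, the reading modes could be expressed with
`say-as' values like these (all the attribute values below are mere
suggestions, nothing standardized):

    <!-- spelling mode -->
    <say-as interpret-as="spell-out">fdo</say-as>

    <!-- punctuation mode: speak the punctuation characters aloud -->
    <say-as interpret-as="text" detail="punctuation">Hello, world!</say-as>

    <!-- capital letter signalization, e.g. by a pitch change -->
    <say-as interpret-as="text" detail="capitals">KTTSD</say-as>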

* Utterance chunking and index marking

    GC> I'm not sure where this requirement came from or what is meant
    GC> here.  

Olaf has cleared this up (my fault, I didn't understand it well enough
to describe it precisely); it is commented on below.  You wrote about
a different issue, but that one is important and must be resolved too.

    GC> KTTSD breaks text given to it by applications into smaller
    GC> pieces (sentence parsing) before passing them on to the
    GC> synthesis engine, but it gives each piece one-at-time to the
    GC> engine.  Sentence parsing is done for several reasons, mostly to
    GC> get around the limitations of the currently-available engines:

I can understand it.  We did it in Speech Dispatcher too, but later
decided to move it to its drivers.  The reason was exactly as you say:

    GC> On the other hand, sentence parsing can be a difficult problem,
    GC> and is language and context dependent.  Some would argue that it
    GC> should be done by the synthesis engine, since most engines must
    GC> already do so to some extent.

You make the following suggestion:
    
    GC> If a synth engine were available that had all the capabilities I
    GC> mentioned above (and probably others I haven't mentioned), there
    GC> would be no need to do sentence parsing in KTTSD, but adding
    GC> these capabilities to the low-level driver would greatly
    GC> complicate its interface.  All things considered, I think the
    GC> low-level API should not provide a capability to receive text in
    GC> pieces.  Leave that to higher levels.

But this means that the difficult and language dependent task of
sentence parsing would have to be implemented by all the higher level
frameworks.  Moreover, what if the speech synthesizer is sophisticated
enough to base its synthesis decisions on contexts wider than a
sentence?  I think the suggested simplification just moves the
complications to other places.

Let's try to improve it slightly.  I think all the capabilities you
mention can be available even when utterance chunking is performed by
the drivers, provided the drivers can deliver marking information
about utterance boundaries.  So how about moving the sentence parsing
code from KTTSD into a common driver library, sketched below?  It
would have the following advantages:

- All the higher level tools no longer need to implement utterance
  chunking.

- KTTSD is no longer responsible for it, so in case something is wrong
  with the parsing, you can complain to the common driver library. ;-)

- Sophisticated synthesizers can perform their own utterance chunking.

Synthesizers which can't do it can simply use the library.
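
To make the idea more concrete, here is a rough sketch of what the
library interface could look like in plain C (all the names are
invented just for illustration):

    /* Called by the library for each utterance found in the text. */
    typedef void (*tts_utterance_callback) (const char *utterance,
                                            void *user_data);

    /* Split SSML text into utterances according to the rules of the
       given language and call CALLBACK on each of them in order.  A
       driver for a simple synthesizer uses this function and reports
       an index mark at each utterance boundary; a sophisticated
       synthesizer ignores the library and performs its own chunking. */
    extern void tts_split_utterances (const char *text,
                                      const char *language,
                                      tts_utterance_callback callback,
                                      void *user_data);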

    >> - [The KTTSD approach to warnings and messages suggests it could
    >> be useful if some sort of index markers could be inserted into
    >> the input texts automatically, at breakable places.  I.e. places,
    >> where the audio output can be interrupted without breaking the
    >> speech at an unsuitable place (e.g. in the middle of a word or
    >> short sentence).  This can be useful for pausing the speech or
    >> for speaking unrelated important messages when reading longer
    >> pieces of text.  What do you think?]

    GC> Hmm.  More complication.  Since KTTSD already does sentence
    GC> parsing, 

But other higher level tools don't.  And I guess the support in KTTSD
is only fairly simple and incomplete anyway?

    GC> markers are easy to support, as long as accuracy is only
    GC> required to the sentence level.  

What if word level accuracy is required?  What if I want index marks
on line breaks?  What about text which is not plain text (source code,
e-mails, ...)?  All of this can be added, but doesn't it introduce
complications in inappropriate places?  The idea of avoiding index
markers is tempting, but we should be careful.
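
To illustrate with SSML, the driver (or the common chunking library)
could insert `mark' elements at the breakable places, with names
chosen by us:

    <s>First sentence of a long document.</s>
    <mark name="breakable-1"/>
    <s>Here the speech may be paused or an urgent message spoken.</s>
    <mark name="breakable-2"/>

This gives sentence level accuracy only; word level accuracy would
require many more marks, which is exactly the complication I'm worried
about.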

* Retrieving available parameter values:

    >> - It should be possible to return a list of supported languages
    >> and voices (identified by parameters matching voice selection
    >> mechanism of the chosen input text markup).  [Other parameters?]
    >> [Shouldn't this be optional?  I'm not sure all synthesizers are
    >> able to provide this information.]

    GC> If there is more than one driver available, how else would
    GC> higher-levels decide which to use?  I'm missing something here.

They have to make an arbitrary choice.  If the information is not
available, the driver can't provide it.  But usually there should be a
way to get the information and we should encourage driver authors to
support this feature.
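
For instance, the driver interface could contain an optional query
function along these lines (again just a sketch with invented names):

    typedef struct {
      const char *name;      /* synthesizer specific voice name */
      const char *language;  /* RFC 3066 code, e.g. "en-US" */
      const char *gender;    /* matching the SSML voice selection */
    } tts_voice;

    /* Return a NULL-terminated array of available voices, or NULL if
       the synthesizer cannot provide the information. */
    extern const tts_voice *const *tts_list_voices (void);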

* Lists of strings

>>>>> "OJS" == Olaf Jan Schmidt <ojschmidt at kde.org> writes:

    OJS> My reason for suggesting a string list is that since kttsd
    OJS> already splits the text into tags and text to be spoken, it
    OJS> could pass a list of strings containing either a) exactly one
    OJS> tag or b) text without any tags included.

Oh, I finally understand what you meant when you talked about duplicate
XML parsing.

    OJS> The API would be nicer if the driver is simply given a single
    OJS> string containing mark-up, but my question is whether splitting
    OJS> the string into tags and text several times (first in kttsd,
    OJS> then in the driver, then maybe in the speech synthesiser
    OJS> itself) would slow down the speech synthesis.

I doubt the intended optimization would work in practice.  Gary has
already given some reasons; I can add:

- High level frameworks other than KTTSD may not perform their own SSML
  splitting/parsing, so the driver or synthesizer must support full SSML
  parsing anyway.

- Once SSML is supported in the driver or synthesizer, the driver
  author can hardly be expected to implement a second way of SSML
  processing.  Most likely, he would simply convert the list back into
  full SSML; see the illustration below.

- Although XML is not very easy to parse, the speech synthesis process
  requires much more computation, so there's no strong reason to
  assume a priori that duplicate XML parsing causes significant
  performance problems.
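
To illustrate the last two points: the string list form of

    <speak>Press <say-as interpret-as="key">F1</say-as> now.</speak>

(the `key' value is again only hypothetical) would be something like

    { "<speak>", "Press ", "<say-as interpret-as=\"key\">", "F1",
      "</say-as>", " now.", "</speak>" }

and an SSML capable driver would most likely just concatenate the list
back into the single string and parse it as usual, so nothing would be
saved.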

Regards,

Milan Zamazal

-- 
I think any law that restricts independent use of brainpower is suspect.
                                               -- Kent Pitman in comp.lang.lisp

