[fdo] Re: TTS API
Gary Cramblitt
garycramblitt at comcast.net
Fri Nov 5 07:53:38 PST 2004
On Friday 05 November 2004 06:46 am, Milan Zamazal wrote:
> GC> The W3C working group is supposedly working on a separate
> GC> document for these.
>
> Do you know any details? Is some proposal available or do we have to
> invent our own values for now?
Sorry, no inside information. I was parroting the SSML spec. I guess we
should contact the W3C voice working group.
> You wrote about a different issue,
> but it is important and must be resolved too.
>
> GC> KTTSD breaks text given to it by applications into smaller
> GC> pieces (sentence parsing) before passing them on to the
> GC> synthesis engine, but it gives each piece one-at-time to the
> GC> engine. Sentence parsing is done for several reasons, mostly to
> GC> get around the limitations of the currently-available engines:
>
> I can understand it. We did it in Speech Dispatcher too, but later
> decided to move it to its drivers. The reason was exactly as you say:
>
> GC> On the other hand, sentence parsing can be a difficult problem,
> GC> and is language and context dependent. Some would argue that it
> GC> should be done by the synthesis engine, since most engines must
> GC> already do so to some extent.
>
> You make the following suggestion:
>
> GC> If a synth engine were available that had all the capabilities I
> GC> mentioned above (and probably others I haven't mentioned), there
> GC> would be no need to do sentence parsing in KTTSD, but adding
> GC> these capabilities to the low-level driver would greatly
> GC> complicate its interface. All things considered, I think the
> GC> low-level API should not provide a capability to receive text in
> GC> pieces. Leave that to higher levels.
>
> But this means that the difficult and language dependent task of
> sentence parsing should be implemented by all the higher level
> frameworks. Moreover, what if the speech synthesizer is sophisticated
> enough to make its synthesis decisions based on contexts wider than a
> sentence? I think the suggested simplification brings complications
> into other places.
>
> Let's try to improve it slightly. I think all the capabilities you
> mention can be available even when utterance chunking is performed by
> the drivers, if the drivers can provide marking information about the
> utterance borders. So how about moving the sentence parsing code from
> KTTSD to a common driver library? It would have the following
> advantages:
>
> - All the higher level tools no longer need to implement utterance
> chunking.
>
> - KTTSD is no longer responsible for it, so in case something is wrong
> with the parsing, you can complain to the common driver library. ;-)
>
> - Sophisticated synthesizers can perform their own utterance chunking.
>
> Synthesizers which can't do it can simply use the library.
>
> >> - [The KTTSD approach to warnings and messages suggests it could
> >> be useful if some sort of index markers could be inserted into
> >> the input texts automatically, at breakable places. I.e. places,
> >> where the audio output can be interrupted without breaking the
> >> speech at an unsuitable place (e.g. in the middle of a word or
> >> short sentence). This can be useful for pausing the speech or
> >> for speaking unrelated important messages when reading longer
> >> pieces of text. What do you think?]
>
> GC> Hmm. More complication. Since KTTSD already does sentence
> GC> parsing,
>
> But other higher level tools don't. And the support in KTTSD is
> probably quite simple and incomplete, I guess?
>
> GC> markers are easy to support, as long as accuracy is only
> GC> required to the sentence level.
>
> What if the word level is required? What if I want index marks on line
> breaks? How about text which is not plain text (source code, e-mails,
> ...)? All of it can be added, but doesn't it make complications at
> inappropriate places? The idea of avoiding index markers is tempting,
> but we should be careful.
>
You are right that the sentence parsing in KTTSD is currently quite primitive.
It uses regular expressions and defaults to English semantics for
end-of-sentence delimiters. Improving it is on my TODO list, but because the
range of available languages is somewhat limited right now, it hasn't been a
top priority. Applications can work around the current limitations in two
ways: 1) they can change KTTSD's regular expression, or 2) they can do their
own utterance chunking and pass the chunks to KTTSD via its appendText()
method.
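A regex-based splitter in the spirit of what KTTSD does might look like the
following (a minimal Python sketch; the expression and function name are my
own illustration, not KTTSD's actual code):

```python
import re

# End-of-sentence delimiters with English semantics: split on the
# whitespace that follows a period, question mark, or exclamation
# mark. The exact expression is illustrative only.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def chunk_sentences(text):
    """Split plain text into sentence-sized chunks."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]

chunks = chunk_sentences("Hello world. How are you? Fine!")
# -> ['Hello world.', 'How are you?', 'Fine!']
```

A rule this simple fails on abbreviations ("Dr. Smith"), decimal numbers, and
non-English punctuation, which is exactly why the regex needs to be
replaceable per language or context.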
As you point out, utterance chunking is highly dependent upon the context of
the text source. For example, the Kate text editor already knows that I'm
working with C++ code (it does syntax highlighting for me), so Kate could in
theory do very good chunking.
It would be difficult for a low-level driver to sense the proper context given
only the raw text, although I suppose it could be done. Maybe the answer is
the SSML <say-as> tag with its "interpret-as", "format", and "detail"
attributes. We'd have to come up with a list of standard attribute values.
I like your idea of a separate library, but I would design it so that the
library would be available to both synth drivers and higher levels. That
way, applications could either let the driver do the chunking, or they could
do it themselves. If the higher level does the chunking, it can wrap the
chunked text in <say-as interpret-as="no-chunk"> or something like that to tell
the driver not to break the text into smaller chunks. So if Kate chunked
some C++ code, it might pass the following chunk to the driver:
<say-as interpret-as="no-chunk, spell-unknown-words">if (obj.value ==
Null)</say-as>
In this way, the driver knows not to split the chunk at the period, to speak
the words "if", "value", and "Null", and to spell out "o b j". Or, if Kate
didn't do the chunking and the driver supported C++ semantics, Kate could pass
<say-as interpret-as="c++">if (obj.value == Null) counter++;</say-as>
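To make the division of labor concrete, a higher level that does its own
chunking could wrap each chunk before handing it to the driver. A Python
sketch, where the "no-chunk" value and the wrap_chunk helper are hypothetical,
following the markup suggested above:

```python
from xml.sax.saxutils import escape

def wrap_chunk(text, interpret_as="no-chunk"):
    """Wrap an already-chunked piece of text in the hypothetical
    <say-as> markup so the driver leaves it intact.  XML-escape the
    payload so characters like '<' can't break the markup."""
    return '<say-as interpret-as="%s">%s</say-as>' % (interpret_as, escape(text))

print(wrap_chunk("if (obj.value == Null)", "no-chunk, spell-unknown-words"))
# <say-as interpret-as="no-chunk, spell-unknown-words">if (obj.value == Null)</say-as>
```

The same helper could live in the shared chunking library, so drivers and
applications agree on one markup convention.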
I'd be interested in working on a "chunking" library, since KTTSD needs that
enhancement anyway. I believe synths call this "text normalization"? I'm
open to pointers to existing code we can adopt.
--
Gary Cramblitt (aka PhantomsDad)
KDE Text-to-Speech Maintainer
http://accessibility.kde.org/developer/kttsd/index.php