[Accessibility] Re: Updated requirements document

Milan Zamazal <pdm@brailcom.org>
Sat Jan 8 04:04:14 PST 2005


Thank you, Olaf, for your comments!

>>>>> "OS" == Olaf Schmidt <ojschmidt@kde.org> writes:

    OS> [Milan Zamazal, Monday, 15 November 2004 23:42]
    >> OPEN ISSUE:
    >> 
    >> - Should an application be able to determine if SHOULD HAVE and
    >> NICE TO HAVE features are supported or not?

    OS> Yes, because the higher level speech framework might decide to
    OS> avoid the features otherwise, or to emulate them.

OK.

    >> 3.1. MUST HAVE: An application will be able to specify the
    >> default voice to use for a particular synthesizer, and will be
    >> able to change the default voice in between `speak' requests.

    OS> Selecting a default language here would also be needed, because
    OS> in some rare cases, a voice could be able to speak several
    OS> languages. 

I agree, but maybe it's just a terminology issue.  The term _voice_ must
be defined first in the requirements.  I understand it in a sense close
to its meaning in SSML, i.e. a _voice_ is more a set of certain
properties (language, gender, age, name, etc.) than just a symbolic
name.  In that case, if a named voice can speak two languages, then
these are two different _voices_ by definition.
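
Purely to illustrate what I have in mind (the names below are made up,
nothing here is a proposal for the actual API):

    /* A voice as a set of properties rather than a bare symbolic name. */
    typedef struct {
        const char *name;      /* symbolic name, e.g. "kal" */
        const char *language;  /* e.g. "en", "cs" */
        const char *gender;    /* "male", "female", "neutral" */
        int age;               /* approximate age, 0 = unspecified */
    } tts_voice;

Under such a definition, a named voice that can speak two languages
simply corresponds to two distinct tts_voice values, so no separate
default language setting would be needed.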

I believe that if _voice_ were defined in this sense, it would satisfy
all our requirements.  What do you think?

    OS> Perhaps we could also make the setting of the default voice
    OS> language specific, but I guess this would complicate things too
    OS> much.

I think so.

    >> - Still not clear consensus on how to return the synthesized
    >> audio data (if at all).  The main issue here is mostly with how
    >> to align marker and other time-related events with the audio
    >> being played on the audio output device.

    OS> I see three possibilities here:

    OS> 1. Return a series of raw audio streams (as function result or
    OS> to a callback function). It would be the task of the application
    OS> to play the right stream whenever it wished to jump to a certain
    OS> marker.  

I can see two slightly refined scenarios here:

A. The application calls the TTS API to synthesize a given text and
   then asks the TTS API for the audio streams by calling another TTS
   API function repeatedly until no further stream is available.  On
   each of these calls, the audio data is written to a stream supplied
   by the application.

B. The application calls the TTS API to synthesize a given text and
   gives it a callback function to be called whenever a new audio
   stream is available.
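
A very rough sketch of how the two variants might look in C (all the
identifiers are made up for illustration only):

    typedef struct tts_handle tts_handle;

    /* Variant A (pull): the application repeatedly asks for the next
       audio stream and passes a file descriptor to write it to.
       Returns 1 when a stream was written, 0 when no further stream
       is available, -1 on error. */
    int tts_get_next_stream(tts_handle *h, int out_fd);

    /* Variant B (push): the application registers a callback that is
       invoked whenever a new audio stream becomes available. */
    typedef void (*tts_stream_cb)(int stream_fd, void *user_data);
    int tts_speak(tts_handle *h, const char *text,
                  tts_stream_cb callback, void *user_data);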

    OS> 2. Return a single raw audio stream and information that marker
    OS> A starts at time A1 after a number of A2 bytes (as function
    OS> result or to a callback function).

The A1 and A2 values can't be returned as function results, since that
wouldn't work for long texts (the audio and marker data must generally
be available before the whole synthesis finishes).  So the only option
here is to return them through a callback function.

The callback function must be called and must return before any audio
data following the given marker is written to the audio stream, so that
the application can't miss the marker when playing the audio data.
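
So for possibility 2 the interface might look roughly like this (again
only an illustration, not a proposal):

    /* Called before any audio data following the marker is written to
       the audio stream; byte_offset is the marker position within the
       raw audio data (A2), time_offset_ms the corresponding time (A1). */
    typedef void (*tts_marker_cb)(const char *marker_name,
                                  long byte_offset,
                                  long time_offset_ms,
                                  void *user_data);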

    OS> 3. Use a library like portaudio to handle the playing in speech
    OS> drivers themselves.

This would limit the use of the API -- it should in no way specify how
the synthesized data is used.

I'd exclude 3.  I think all 1.A., 1.B. and 2. are possible, but I'm not
sure about their advantages or disadvantages now.

I'd only like to mention that it is likely we will implement the TTS API
in the form of a shared library, and it would be nice if it could be
easily handled not only by C (and related) programs, but also by
foreign interfaces.  So the following points should be considered:

- It is probably better to use file descriptors rather than C streams
  (but I'm not sure whether using file descriptors is a sufficiently
  portable solution).

- I don't know whether using callback functions can cause any problems
  for foreign interfaces.  AFAIK calling shared library functions from
  non-C programs usually causes no problems as long as the function
  arguments are of simple types and the C function doesn't have
  unwanted side effects, but I'm not sure about the other direction,
  i.e. the library calling back into non-C code.
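
For example (hypothetical names again), a function taking a plain file
descriptor should be much easier to wrap from other languages than one
taking a C stream:

    #include <stdio.h>

    typedef struct tts_handle tts_handle;

    /* A file descriptor is a plain int and crosses language
       boundaries easily... */
    int tts_speak_to_fd(tts_handle *h, const char *text, int out_fd);

    /* ...whereas a C stream (FILE *) is specific to the C runtime. */
    int tts_speak_to_stream(tts_handle *h, const char *text, FILE *out);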

    >> - Not clear on how to (or if we even should) specify the audio
    >> format to be used by a synthesizer.

    OS> A multimedia developer told me that the format of raw,
    OS> uncompressed audio data is recognised by all multimedia
    OS> frameworks, so I don't think we need to pass any special
    OS> information back to the applications.

Another point is that speech synthesizers often use their preferred
audio formats for audio output.  Should the TTS API be required to
return the audio data in some particular format or in a format from a
given set of supported audio formats?
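
If we went for a set of supported formats, each format could probably
be described by a few simple fields, e.g. (illustration only):

    /* A possible description of a raw audio format. */
    typedef struct {
        int sample_rate;      /* samples per second, e.g. 16000, 44100 */
        int channels;         /* 1 = mono, 2 = stereo */
        int bits_per_sample;  /* e.g. 8 or 16 */
        int is_signed;        /* non-zero for signed samples */
        int is_big_endian;    /* byte order of multi-byte samples */
    } tts_audio_format;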

    >> - Implementation issue: Will the interaction with the driver be
    >> synchronous or asynchronous?  For example, will a call to `speak'
    >> wait to return until all the audio has been processed?

    OS> I think both synchronous and asynchronous would be possible. In
    OS> one case, we could use an id for every call and a callback
    OS> function for passing the audio stream. In the other case, the
    OS> speak function could return a pointer to the audio stream.

Do you know about any particular advantages or difficulties (like ease
of implementation on both sides) of one or the other approach?
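
Just to make the comparison concrete, the two styles could look roughly
as follows (illustrative names only):

    typedef struct tts_handle tts_handle;

    /* Synchronous: the call blocks until the whole text has been
       processed and the audio written to out_fd. */
    int tts_speak_sync(tts_handle *h, const char *text, int out_fd);

    /* Asynchronous: the call returns immediately with an id for the
       request; the audio and related events are delivered later
       through the callback. */
    typedef void (*tts_event_cb)(long request_id, int audio_fd,
                                 void *user_data);
    long tts_speak_async(tts_handle *h, const char *text,
                         tts_event_cb callback, void *user_data);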

    >> If not, what happens when a call to "speak" is made while the
    >> synthesizer is still processing a prior call to "speak?"

    OS> This should be up to the driver. 

Let's define at least some necessary properties, e.g.:

- Such a call is allowed.

- It doesn't make the synthesis proceed in unexpected ways.

- Etc. (depending on the chosen way the audio data is returned)

    OS> An SSML tag at the end of the first text snippet might change
    OS> the parameters that are used for the second text snippet, so at
    OS> least the XML parsing of the first call needs to be finished
    OS> before the second is synthesised.

Good remark.

Regards,

Milan Zamazal

-- 
If we are going to start removing packages because of the quality of the
software, wonderful.  I move to remove all traces of the travesty of editors,
vi, from Debian, since obviously as editors they are less than alpha quality
software.                                   -- Manoj Srivastava in debian-devel

