[Accessibility] RE: Multimedia framework requirements

Marc Mulcahy marc at plbb.net
Mon Mar 7 12:11:46 PST 2005


Pete, it's just another case of needing the ability to start and stop speech
immediately-- not a case of needing low-latency audio.  The scenario is as
follows (a code sketch appears after the list):

* Speech is talking
* User presses a key
* Speech is interrupted (could be by halting DMA and/or resetting the sound
card)
* Key is echoed (synthesis is started and audio starts streaming to the
audio device)
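
For concreteness, here's a minimal sketch of the stop-and-restart path using
ALSA's PCM API (assuming an already-opened snd_pcm_t handle; error handling
omitted):

    #include <alsa/asoundlib.h>

    /* Called from the keypress handler: silence speech at once. */
    void interrupt_speech(snd_pcm_t *pcm)
    {
        snd_pcm_drop(pcm);     /* halt the transfer, discard queued samples */
        snd_pcm_prepare(pcm);  /* re-arm the device so the key echo can
                                  start streaming without delay */
    }

The echo path then just synthesizes the character name and queues the
samples with snd_pcm_writei().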

Note that hardware latency isn't a factor in either the start or stop
scenario.  In the stop case, what matters is how quickly the sound card can
be reset (DMA halted), which has nothing to do with hardware latency.

In the start case, the user's perceived response time depends on how quickly
the sound card can begin transferring data from RAM to the hardware audio
buffer, not on how big the transferred chunks are.

When you start mixing in software, the latency of the software mixer does
become a factor, since the stream to the soundcard is then continuous.  But
when characterizing the accessibility requirement, I think specifying low
latency is the wrong terminology-- what we need is quick start and shut-up
times.

Marc
  -----Original Message-----
  From: accessibility-bounces at lists.freedesktop.org
  [mailto:accessibility-bounces at lists.freedesktop.org] On Behalf Of Pete Brunet
  Sent: Monday, March 07, 2005 12:24 AM
  To: accessibility at lists.freedesktop.org
  Subject: [Accessibility] RE: Multimedia framework requirements



  Marc, How does the need for instant echoing of keyed characters when
entering text fit in with this situation?  Thanks, Pete

  =====
  Pete Brunet, (512) 838-4594, TL 678-4594, brunet at us.ibm.com, ws4g
  IBM Accessibility Architecture and Development, 11501 Burnet Road, MS
9026D020, Austin, TX  78758

  ----------------------------------------------------------------------
  Date: Sat, 5 Mar 2005 17:55:27 -0700
  From: "Marc Mulcahy" <marc at plbb.net>
  Subject: RE: [Accessibility] Multimedia framework requirements
  To: "Gary Cramblitt" <garycramblitt at comcast.net>,
                   <accessibility at lists.freedesktop.org>
  Message-ID: <KKEGJCDELINGIGICHANAGEKBEDAA.marc at plbb.net>
  Content-Type: text/plain;                 charset="iso-8859-2"

  Well, for what it's worth, here is my $.02.

  1. We in the accessibility community will never succeed in trying to
  re-invent the multimedia server.  There have been many attempts, even by
  people with multimedia expertise, with varying degrees of success.  So I
  think the right approach is to focus on selecting an existing solution that
  comes closest to what we need, and either living with it or proposing
  changes that will bring it closer to what we need.

  2. The biggest oversight in gnome-speech was that it did not directly
  handle the audio coming out of software synthesizers.  Given my experience
  with several commercial and open source speech engines, I came to the
  conclusion that the speech framework *must* have control over the audio
  samples and where they go.  If we leave it up to the speech engines, they
  will all implement things differently, and we will have much less chance
  of providing a good experience for the end user.  Having control over the
  audio gives us better control over quick startup and stop times, as well
  as the ability to route speech to different destinations-- files, headsets,
  speakers, telephone lines, etc.
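
  A hypothetical sketch of what "the framework owns the samples" could look
  like in C-- none of these names are real gnome-speech APIs, just an
  illustration of the routing idea:

    #include <stddef.h>

    /* The engine never touches a device directly; it hands PCM samples to
       whatever sink the framework selected-- soundcard, file, headset,
       telephone line, etc. */
    typedef struct audio_sink audio_sink;
    struct audio_sink {
        int  (*write)(audio_sink *self, const short *samples, size_t frames);
        int  (*stop)(audio_sink *self);   /* the immediate shut-up */
        void (*close)(audio_sink *self);
        void *impl;                       /* ALSA handle, fd, socket, ... */
    };

  A file sink and an ALSA sink then differ only in their write and stop
  implementations; the engine code is identical for both.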

  3. To my mind, ALSA comes the closest to what we need in an audio
  framework on Linux.  It's now standard, and it provides methods for mixing
  audio streams in software on soundcards which can't do it in hardware.
  The prioritization of audio-- muting the MP3 player when the computer
  needs to speak or when a user receives an Internet phone call, for
  example-- is the only piece which appears to be missing.
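
  (For reference, software mixing in ALSA is configured through its dmix
  plugin; a typical ~/.asoundrc looks something like the following, with
  device names purely illustrative:)

    pcm.dmixer {
        type dmix
        ipc_key 1024          # any key unique to the mixing clients
        slave.pcm "hw:0,0"    # the real soundcard
    }

    pcm.!default {
        type plug
        slave.pcm "dmixer"    # route default output through the mixer
    }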

  Another audio-related aside...  I think there's been some
  mischaracterization of a requirement.  Everyone seems to suggest that what
  we need is low latency in an audio server or environment, and I'm not
  convinced that this is the case.  You need low latency, or at least good
  synchronization, if, for example, you want to animate a character using
  text-to-speech as the voice.  But I think that from an accessibility point
  of view, what we really need is quick start and shut-up times, not
  necessarily low latency, although low latency is better.  For example,
  from a blind usability point of view, I don't care whether the app sends
  the sound card a 128 KB buffer of audio or a 1 KB buffer, as long as the
  sound stops immediately when I press a key and starts immediately when
  there's something to be spoken.
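
  (For scale, assuming 16-bit mono audio at 22.05 kHz-- 44,100 bytes per
  second-- a 1 KB buffer holds about 23 ms of audio, while a 128 KB buffer
  holds roughly 3 seconds.  Without an explicit stop mechanism, the big
  buffer would keep talking for seconds after a keypress; with one, the
  buffer size is invisible to the user.)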

  My experience shows that low latency is in fact not necessarily desirable
  when working with speech.  Presumably speech is a background process which
  goes on while other, more intensive tasks are happening in the foreground--
  copying a file, filtering audio, or something of that sort.  The lower the
  latency, the smaller the audio buffers must be, and the harder it is to
  keep speech happy in the background without underruns, especially during
  periods of high disk activity or network load.

  Rather than having to feed the soundcard 1 KB blocks of data, I'd rather
  synthesize 64 KB of data, dump it to the sound card, and let the DMA
  controller transfer it while the processor does something else.  And as
  long as I can shut it up immediately, the user doesn't know the
  difference.
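
  As a sketch (again assuming an open ALSA handle, with sizes purely
  illustrative): 64 KB of 16-bit mono audio is 32,768 frames, and one
  blocking write queues all of it:

    #include <alsa/asoundlib.h>

    #define CHUNK_FRAMES 32768   /* 64 KB at 2 bytes per frame */

    /* Queue a whole synthesized chunk; DMA drains it while the CPU moves
       on.  snd_pcm_drop(), as above, still silences it instantly. */
    void speak_chunk(snd_pcm_t *pcm, const short *chunk)
    {
        snd_pcm_writei(pcm, chunk, CHUNK_FRAMES);
    }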

  Marc

  -----Original Message-----
  From: accessibility-bounces at lists.freedesktop.org
  [mailto:accessibility-bounces at lists.freedesktop.org] On Behalf Of Gary
  Cramblitt
  Sent: Saturday, March 05, 2005 6:20 AM
  To: accessibility at lists.freedesktop.org
  Subject: Re: [Accessibility] Multimedia framework requirements


  On Friday 04 March 2005 03:43 pm, Hynek Hanke wrote:
  > 2 Audio requirements

  You may want to think about supported audio formats.  Most existing synths
  seem to produce .wav files (Microsoft RIFF) or .au.

  Also, there's the issue of how to deliver the audio to the audio framework.
  Streams could be more efficient than files?  The TTS API discussion has
  this as an unresolved item.
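
  (For what it's worth, the .wav case is simple enough that a framework
  could accept either form: the canonical header for linear PCM is just 44
  bytes, roughly the C struct below-- field layout per the published RIFF
  spec; the packing pragma is compiler-dependent.)

    #include <stdint.h>

    /* Canonical 44-byte RIFF/WAVE header for linear PCM. */
    #pragma pack(push, 1)
    typedef struct {
        char     riff[4];         /* "RIFF" */
        uint32_t chunk_size;      /* file size minus 8 */
        char     wave[4];         /* "WAVE" */
        char     fmt[4];          /* "fmt " */
        uint32_t fmt_size;        /* 16 for PCM */
        uint16_t audio_format;    /* 1 = linear PCM */
        uint16_t channels;
        uint32_t sample_rate;
        uint32_t byte_rate;       /* rate * channels * bits / 8 */
        uint16_t block_align;
        uint16_t bits_per_sample;
        char     data[4];         /* "data" */
        uint32_t data_size;       /* payload bytes that follow */
    } wav_header;
    #pragma pack(pop)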

  --
  Gary Cramblitt (aka PhantomsDad)
  KDE Text-to-Speech Maintainer
  http://accessibility.kde.org/developer/kttsd/index.php
