<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.2523" name=GENERATOR></HEAD>
<BODY>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff size=2>Pete,
it's just another case of needing the ability to start and stop speech
immediately-- not a need for low-latency audio. The scenario
is:</FONT></SPAN></DIV>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff size=2>*
Speech is talking</FONT></SPAN></DIV>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff size=2>* User
presses a key</FONT></SPAN></DIV>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff size=2>*
Speech is interrupted (could be by halting DMA and/or resetting the sound
card)</FONT></SPAN></DIV>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff size=2>* key
is echoed (synthesis is started and audio starts streaming to the audio
device)</FONT></SPAN></DIV>
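<DIV><FONT face=Arial size=2>The interrupt-and-echo flow above can be sketched as a toy model. This is not a real audio API-- the class and method names are invented for illustration; stop() stands in for an immediate device reset (DMA halt) that discards whatever is still queued rather than letting it drain.</FONT></DIV>
<PRE>
```python
# Toy model of the scenario: hypothetical names, no real audio API.
class SpeechChannel:
    """'Hardware' holds queued samples until explicitly stopped."""
    def __init__(self):
        self.queued = []   # samples sitting in the hardware buffer

    def say(self, samples):
        # Synthesis hands a (possibly large) buffer to the device.
        self.queued.extend(samples)

    def stop(self):
        # Immediate reset: discard everything still queued, however
        # large the buffer was.  Nothing waits for the buffer to drain.
        dropped = len(self.queued)
        self.queued.clear()
        return dropped

ch = SpeechChannel()
ch.say(["long", "sentence", "being", "read"])   # speech is talking
dropped = ch.stop()                             # user presses a key
ch.say(["k"])                                   # key is echoed
```
</PRE>
<DIV><FONT face=Arial size=2>The point of the sketch: responsiveness comes from stop() being unconditional, not from the buffers handed to say() being small.</FONT></DIV>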
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff size=2>Note
that hardware latency isn't a factor in either the start or the stop
scenario. In the stop case, what matters is how fast the sound card can be
reset (DMA halted)-- which has nothing to do with hardware
latency.</FONT></SPAN></DIV>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff size=2>In the
start case, the user's perceived response time depends on how fast the sound
card can start transferring data from RAM to the hardware audio buffer, not on
how big the transferred chunks are.</FONT></SPAN></DIV>
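<DIV><FONT face=Arial size=2>A quick back-of-the-envelope calculation shows why buffer size matters only if you must wait for the buffer to drain. Assuming 16-bit mono speech at 22050 Hz (a common TTS output format-- the figures are illustrative, not from the original discussion):</FONT></DIV>
<PRE>
```python
# How much audio does a buffer hold?  Assumed format: 16-bit mono, 22050 Hz.
rate = 22050          # samples per second
bytes_per_sample = 2  # 16-bit mono

def buffer_seconds(nbytes):
    return nbytes / (rate * bytes_per_sample)

small = buffer_seconds(1 * 1024)     # ~0.023 s of audio
large = buffer_seconds(64 * 1024)    # ~1.49 s of audio
```
</PRE>
<DIV><FONT face=Arial size=2>A 64 KB buffer holds almost a second and a half of speech-- tolerable only because an immediate hardware stop discards it, rather than the user waiting for it to play out.</FONT></DIV>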
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff size=2>When
you start mixing in software, the latency of the software mixer does become a
factor, since the stream to the sound card is then continuous. But when
characterizing the accessibility requirement, I think "low latency" is the
wrong terminology-- what we need is quick start and shut-up
times.</FONT></SPAN></DIV>
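<DIV><FONT face=Arial size=2>For reference, the core of a software mixer is just saturating per-sample addition. This minimal sketch is not ALSA's actual code (a real mixer such as the dmix plugin operates on interleaved frames in shared memory); it only shows the arithmetic every write passes through, which is where the mixer's latency enters:</FONT></DIV>
<PRE>
```python
# Minimal software-mixer sketch: sum two 16-bit sample streams with
# saturation.  Illustrative only; not taken from any real mixer.
def mix(a, b):
    out = []
    for x, y in zip(a, b):
        s = x + y
        s = max(-32768, min(32767, s))  # clamp to the int16 range
        out.append(s)
    return out

speech = [1000, -2000, 30000]
music  = [500,  -500,  5000]
mixed = mix(speech, music)   # last sample clips at 32767
```
</PRE>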
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=236190320-07032005><FONT face=Arial color=#0000ff
size=2>Marc</FONT></SPAN></DIV>
<BLOCKQUOTE>
<DIV class=OutlookMessageHeader dir=ltr align=left><FONT face=Tahoma
size=2>-----Original Message-----<BR><B>From:</B>
accessibility-bounces@lists.freedesktop.org
[mailto:accessibility-bounces@lists.freedesktop.org]<B>On Behalf Of </B>Pete
Brunet<BR><B>Sent:</B> Monday, March 07, 2005 12:24 AM<BR><B>To:</B>
accessibility@lists.freedesktop.org<BR><B>Subject:</B> [Accessibility] RE:
Multimedia framework requirements<BR><BR></FONT></DIV><BR><FONT
face=sans-serif size=2>Marc, How does the need for instant echoing of keyed
characters when entering text fit in with this situation? Thanks,
Pete</FONT> <BR><FONT face=sans-serif size=2><BR>=====<BR>Pete Brunet, (512)
838-4594, TL 678-4594, brunet@us.ibm.com, ws4g<BR>IBM Accessibility
Architecture and Development, 11501 Burnet Road, MS 9026D020, Austin, TX
78758</FONT> <BR><BR><FONT face=sans-serif
size=2>----------------------------------------------------------------------</FONT>
<BR><FONT face=sans-serif size=2>Date: Sat, 5 Mar 2005 17:55:27 -0700</FONT>
<BR><FONT face=sans-serif size=2>From: "Marc Mulcahy"
<marc@plbb.net></FONT> <BR><FONT face=sans-serif size=2>Subject: RE:
[Accessibility] Multimedia framework requirements</FONT> <BR><FONT
face=sans-serif size=2>To: "Gary Cramblitt"
<garycramblitt@comcast.net>,</FONT> <BR><FONT face=sans-serif
size=2>
<accessibility@lists.freedesktop.org></FONT> <BR><FONT
face=sans-serif size=2>Message-ID:
<KKEGJCDELINGIGICHANAGEKBEDAA.marc@plbb.net></FONT> <BR><FONT
face=sans-serif size=2>Content-Type: text/plain;
charset="iso-8859-2"</FONT> <BR><BR><FONT
face=sans-serif size=2>Well, for what it's worth, here is my $.02.</FONT>
<BR><BR><FONT face=sans-serif size=2>1. We in the accessibility community will
never succeed in trying to</FONT> <BR><FONT face=sans-serif size=2>re-invent
the multimedia server. There have been many attempts by people</FONT>
<BR><FONT face=sans-serif size=2>with expertise in multimedia with varying
degrees of success. So I think</FONT> <BR><FONT face=sans-serif
size=2>the right approach is to focus on selecting an existing solution that
comes</FONT> <BR><FONT face=sans-serif size=2>closest to what we need, and
either living with it, or proposing changes</FONT> <BR><FONT face=sans-serif
size=2>which will bring it closer to what we need.</FONT> <BR><BR><FONT
face=sans-serif size=2>2. The biggest oversight in gnome-speech was that it
did not directly handle</FONT> <BR><FONT face=sans-serif size=2>the audio
coming out of software synthesizers. Given my experience with</FONT>
<BR><FONT face=sans-serif size=2>several commercial and open source speech
engines, I came to the conclusion</FONT> <BR><FONT face=sans-serif size=2>that
the speech framework *must* have control over the audio samples and</FONT>
<BR><FONT face=sans-serif size=2>where they go. If we leave it up to the
speech engines, they will all</FONT> <BR><FONT face=sans-serif
size=2>implement things differently, and we have much less of a good chance
of</FONT> <BR><FONT face=sans-serif size=2>providing a good experience for the
end user. Having control over the audio</FONT> <BR><FONT face=sans-serif
size=2>gives us better control over quick startup and stop times, as well as
the</FONT> <BR><FONT face=sans-serif size=2>ability to route speech to
diferent destinations-- files, headsets,</FONT> <BR><FONT face=sans-serif
size=2>speakers, telephone lines, etc.</FONT> <BR><BR><FONT face=sans-serif
size=2>3. To my mind, ALSA comes the closest to what we need in an audio
framework</FONT> <BR><FONT face=sans-serif size=2>on Linux. It's now
standard, and provides methods for mixing audio streams</FONT> <BR><FONT
face=sans-serif size=2>on soundcards which can't do it in hardware. The
prioritization of audio--</FONT> <BR><FONT face=sans-serif size=2>i.e., muting
the MP3 player when the computer needs to speak something or</FONT> <BR><FONT
face=sans-serif size=2>when a user receives an internet phone call, is the
only piece which appears</FONT> <BR><FONT face=sans-serif size=2>to be
missing.</FONT> <BR><BR><FONT face=sans-serif size=2>Another audio-related
aside... I think there's been some</FONT> <BR><FONT face=sans-serif
size=2>mischaracterization of a requirement. Everyone seems to suggest
that what</FONT> <BR><FONT face=sans-serif size=2>we need is low-latency in an
audio server or environment, and I'm not</FONT> <BR><FONT face=sans-serif
size=2>convinced that this is the case. You need low-latency, or at
least good</FONT> <BR><FONT face=sans-serif size=2>synchronization, if for
example you want to animate a character using</FONT> <BR><FONT face=sans-serif
size=2>text-to-speech as the voice. But, I think from an accessibility
point of</FONT> <BR><FONT face=sans-serif size=2>view, what we really need is
quick start and shut up times, not necessarily</FONT> <BR><FONT
face=sans-serif size=2>low latency, although low latency is better. For
example, from a blind</FONT> <BR><FONT face=sans-serif size=2>usability point
of view, I don't care if the app sends the sound card a 128</FONT> <BR><FONT
face=sans-serif size=2>KB buffer of audio or a 1 KB buffer of audio, as long
as the sound stops</FONT> <BR><FONT face=sans-serif size=2>immediately when I
press a key, and as long as it starts immediately when</FONT> <BR><FONT
face=sans-serif size=2>there's something to be spoken.</FONT> <BR><BR><FONT
face=sans-serif size=2>My experience shows that low-latency is in fact not
necessarily desirable</FONT> <BR><FONT face=sans-serif size=2>when working
with speech. Presumably speech is a background process which</FONT>
<BR><FONT face=sans-serif size=2>goes on while other more intensive tasks are
happening in the foreground--</FONT> <BR><FONT face=sans-serif size=2>copying
a file, filtering audio, or something of that sort. The lower the</FONT>
<BR><FONT face=sans-serif size=2>latency, the harder it is to keep speech
happy in the background, especially</FONT> <BR><FONT face=sans-serif
size=2>during periods of high disk activity or network load.</FONT>
<BR><BR><FONT face=sans-serif size=2>Rather than having to feed the soundcard
1 K blocks of data, I'd rather</FONT> <BR><FONT face=sans-serif
size=2>synthesize 64 K of data, and dump it to the sound card, and let the
DMA</FONT> <BR><FONT face=sans-serif size=2>controller transfer it while the
processor does something else. And as long</FONT> <BR><FONT
face=sans-serif size=2>as I can shut it up immediately, the user doesn't know
the difference.</FONT> <BR><BR><FONT face=sans-serif size=2>Marc</FONT>
<BR><BR><FONT face=sans-serif size=2>-----Original Message-----</FONT>
<BR><FONT face=sans-serif size=2>From:
accessibility-bounces@lists.freedesktop.org</FONT> <BR><FONT face=sans-serif
size=2>[mailto:accessibility-bounces@lists.freedesktop.org]On Behalf Of
Gary</FONT> <BR><FONT face=sans-serif size=2>Cramblitt</FONT> <BR><FONT
face=sans-serif size=2>Sent: Saturday, March 05, 2005 6:20 AM</FONT> <BR><FONT
face=sans-serif size=2>To: accessibility@lists.freedesktop.org</FONT>
<BR><FONT face=sans-serif size=2>Subject: Re: [Accessibility] Multimedia
framework requirements</FONT> <BR><BR><BR><FONT face=sans-serif size=2>On
Friday 04 March 2005 03:43 pm, Hynek Hanke wrote:</FONT> <BR><FONT
face=sans-serif size=2>> 2 Audio requirements</FONT> <BR><BR><FONT
face=sans-serif size=2>You may want to think about supported audio formats.
Most existing synths</FONT> <BR><FONT face=sans-serif size=2>seem to
produce .wav files (Microsoft RIFF) or .au.</FONT> <BR><BR><FONT
face=sans-serif size=2>Also, there's the issue of how to deliver the audio to
the audio framework.</FONT> <BR><FONT face=sans-serif size=2>Streams could be
more efficient than files? The TTS API discussion has this</FONT>
<BR><FONT face=sans-serif size=2>as an unresolved item.</FONT> <BR><BR><FONT
face=sans-serif size=2>--</FONT> <BR><FONT face=sans-serif size=2>Gary
Cramblitt (aka PhantomsDad)</FONT> <BR><FONT face=sans-serif size=2>KDE
Text-to-Speech Maintainer</FONT> <BR><FONT face=sans-serif
size=2>http://accessibility.kde.org/developer/kttsd/index.php</FONT>
<BR><BR><FONT face=sans-serif
size=2>------------------------------</FONT></BLOCKQUOTE></BODY></HTML>