[gst-devel] Supporting text-to-speech (and other text handling/processing workflows)

Reece Dunn msclrhd at googlemail.com
Wed Nov 11 09:38:41 CET 2009


Hi,

I have been looking at creating a text-to-speech engine and a
supporting GUI. In theory, these can fit nicely into GStreamer, as
they take text
and convert it to audio (which you can then plug into a GStreamer
backend for playback or recording). There are some aspects of
text-to-speech (event notifications; data view; text-to-text
workflows) that I am not sure fit directly into the GStreamer model.

Anyway, here are my current thoughts on the architecture of a
text-to-speech engine (without going into the details of how
text-to-phoneme and phoneme-to-audio conversion are handled).

----- 8< -----

# Data Sources: file; string buffer; stdin
# Data Sinks: file; string buffer; stdout
# Readers: source => stream
# Writers: stream => sink
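
To make the source/sink side concrete, here is a minimal sketch of the
reader abstraction I have in mind (all names are mine and purely
illustrative): a small vtable that file, string-buffer and stdin
sources can all implement, so downstream stages only ever see a byte
stream.

/* Sketch of a reader interface; names are illustrative, not an
 * existing API. Each data source (file, string buffer, stdin)
 * presents the same byte-stream view to downstream stages. */
#include <stdio.h>

typedef struct reader reader_t;
struct reader {
    /* read up to len bytes into buf; returns bytes read, 0 on EOF */
    size_t (*read)(reader_t *self, char *buf, size_t len);
    void   (*close)(reader_t *self);
    void   *data;  /* implementation-specific state */
};

/* stdio-backed implementation: covers files and stdin alike */
static size_t file_read(reader_t *self, char *buf, size_t len) {
    return fread(buf, 1, len, (FILE *)self->data);
}

static void file_close(reader_t *self) {
    if (self->data != (void *)stdin)
        fclose((FILE *)self->data);
}

static reader_t make_file_reader(FILE *fp) {
    reader_t r = { file_read, file_close, fp };
    return r;
}

int main(void) {
    char buf[256];
    size_t n;
    reader_t r = make_file_reader(stdin);  /* Data Source: stdin */
    while ((n = r.read(&r, buf, sizeof buf)) > 0)
        fwrite(buf, 1, n, stdout);         /* Data Sink: stdout */
    r.close(&r);
    return 0;
}
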
# Archives/Compression (Readers): zip; flate; gzip; ...

   1. Archive Offset -- position of the first byte of the specified
file within the archive
   2. File Name -- name of the current file source
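
The compression layer could then be just another implementation of the
same reader interface; a rough sketch on top of zlib's gz* API (gzip
only -- a zip reader would additionally record the per-entry Archive
Offset above):

/* Sketch: a gzip-backed reader using zlib's gz* API (build with
 * -lz). gztell() reports the uncompressed offset, which feeds the
 * offset metadata above; a zip reader would also record where each
 * contained file starts. */
#include <zlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
    char buf[4096];
    int n;
    gzFile gz = gzopen(argc > 1 ? argv[1] : "input.txt.gz", "rb");
    if (gz == NULL) {
        fprintf(stderr, "cannot open input\n");
        return 1;
    }
    while ((n = gzread(gz, buf, sizeof buf)) > 0)
        fwrite(buf, 1, n, stdout);  /* decompressed bytes */
    fprintf(stderr, "uncompressed offset: %ld\n", (long)gztell(gz));
    gzclose(gz);
    return 0;
}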

# Encodings (Readers/Writers): ascii; utf8; ...

   1. Raw Byte Offset -- position in the stream in bytes
   2. Encoded Character Offset -- position in the stream in characters
   3. Need to switch encodings mid-stream -- e.g. the xml encoding
attribute (ascii => utf8; ...) and the html meta/content-type tag
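
To keep both offsets in sync, the encoding layer can count characters
while walking the byte stream; a sketch using GLib's UTF-8 helpers
(GLib because GStreamer already depends on it):

/* Sketch: tracking the Raw Byte Offset and the Encoded Character
 * Offset together while walking a UTF-8 buffer. Build with
 * `pkg-config --cflags --libs glib-2.0`. */
#include <glib.h>
#include <stdio.h>

int main(void) {
    const gchar *text = "naïve café";  /* UTF-8 */
    const gchar *p;
    glong char_offset = 0;

    for (p = text; *p != '\0'; p = g_utf8_next_char(p)) {
        printf("byte offset %2ld, char offset %2ld: U+%04X\n",
               (glong)(p - text), char_offset,
               (unsigned)g_utf8_get_char(p));
        char_offset++;
    }
    return 0;
}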

# File Formats (Readers): text; html; pdf; epub; odf; rtf; ssml; smil; ...

   1. Stream Offset -- byte/character offset in the raw data stream
(what to do when changing encodings?)
   2. Text Offset -- character offset in the text
   3. Viewer -- presenting the file in a text reader (Gtk+; Qt; ncurses; ...)
   4. File formats may change data source (zipped stream; multi-file
format; ...)
   5. File Reader: Data Source => Archive/Compression => Encoding => File Format
   6. Some formats (e.g. SSML) require understanding phoneme sets:
need to pass this as a phoneme stream
   7. Need a meta-format to transform the source to (sketched as a
struct after this list):
         1. text sequence -- offset/file information; language (may be
different languages; pass xml:lang data; ...); text
         2. phoneme sequence -- offset/file information; phoneme set; prosody
         3. additional instructions -- pauses; volume; rate; pitch; ...
         4. audio files/data? -- e.g. from ssml or smil data
   8. Should support reading/writing the wire format from the File
Format Reader/Writer
         1. format identification
         2. versioning
         3. byte order? -- for binary data (audio; anything else?)
         4. meta-data? -- RDF/Turtle?
         5. encoding? -- text; phoneme sequences; audio data
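
As a first stab at what the meta-format could look like in memory (all
names invented for illustration; the wire format would serialise
something equivalent, plus the identification/versioning header from
point 8):

/* Sketch of the meta-format as a tagged union; all names invented.
 * Each chunk keeps the offset/file information so later stages can
 * map events back to the source document. */
#include <stddef.h>

typedef enum {
    CHUNK_TEXT,         /* text sequence */
    CHUNK_PHONEMES,     /* phoneme sequence */
    CHUNK_INSTRUCTION,  /* pause; volume; rate; pitch; ... */
    CHUNK_AUDIO         /* embedded audio, e.g. from ssml/smil */
} chunk_type_t;

typedef struct {
    chunk_type_t type;
    const char  *file;         /* name of the current file source */
    size_t       text_offset;  /* character offset in the text */
    union {
        struct { const char *lang; const char *text; }     text;     /* carries xml:lang */
        struct { const char *set;  const char *phonemes; } phonemes; /* e.g. set = "ipa" */
        struct { const char *name; double value; }         instr;    /* e.g. "rate", 1.2 */
        struct { const void *data; size_t len; }           audio;
    } u;
} chunk_t;

/* example: a text chunk taken from an SSML document */
const chunk_t example = {
    CHUNK_TEXT, "book.ssml", 120,
    { .text = { "en-GB", "Hello world" } }
};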

# Phoneme Sets (Readers/Writers): ipa; sampa; kirshenbaum; cmu-en_US;
festival-en_US; cepstral-[language]; ...

   1. IPA is a Unicode phoneme set -- a U32 (UTF-32) data stream
   2. The other phoneme sets use ascii characters only -- a U8 data stream
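
Transcoding between phoneme sets then reduces to a lookup from one
symbol space into the other. A toy sketch follows; the Kirshenbaum
mappings are from memory and heavily abridged, so treat them as
placeholders:

/* Toy sketch of IPA (U32 codepoints) => Kirshenbaum (ascii)
 * transcoding. The table is abridged and written from memory; a
 * real implementation also needs multi-character symbols,
 * diacritics and stress marks. */
#include <stdio.h>
#include <stdint.h>

typedef struct { uint32_t ipa; const char *ascii; } phoneme_map_t;

static const phoneme_map_t kirshenbaum[] = {
    { 0x0283, "S" },  /* ʃ */
    { 0x026A, "I" },  /* ɪ */
    { 0x03B8, "T" },  /* θ */
    { 0x00F0, "D" },  /* ð */
    { 0x014B, "N" },  /* ŋ */
    { 0x0251, "A" },  /* ɑ */
};

static void transcode(const uint32_t *ipa, size_t len) {
    size_t i, j;
    for (i = 0; i < len; i++) {
        const char *out = NULL;
        for (j = 0; j < sizeof kirshenbaum / sizeof kirshenbaum[0]; j++)
            if (kirshenbaum[j].ipa == ipa[i]) { out = kirshenbaum[j].ascii; break; }
        if (out != NULL)
            fputs(out, stdout);
        else if (ipa[i] < 0x80)
            putchar((int)ipa[i]);  /* ascii symbols pass through */
        else
            putchar('?');          /* unmapped symbol */
    }
    putchar('\n');
}

int main(void) {
    const uint32_t ship[] = { 0x0283, 0x026A, 'p' };  /* /ʃɪp/ */
    transcode(ship, 3);  /* prints "SIp" */
    return 0;
}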

# Workflows:

   1. File Reader => Text => Encoding => Data Sink
         1. Test a file reader (e.g. is it handling SSML data correctly).
   2. File Reader => Text => [Text-to-Phoneme] => Phonemes => Phoneme
Set => Encoding => Data Sink
         1. Record the phoneme sequence to a file.
         2. Useful for testing language rules.
         3. dictionary -- use a dictionary to look up words to give
the phoneme (and possibly parts-of-speech) sequence
         4. letter-to-phoneme -- use letter-to-phoneme rules where
there is no dictionary match.
         5. accent/dialect -- apply accent/dialect phoneme-to-phoneme
transformation rules (e.g. /ɒ/ => /ɑ/ (the father-bother merger) in
General American).
         6. target phoneme set -- the phoneme set being written
(default=ipa+utf8)
         7. encoding -- the target encoding for the phoneme set to be
written out as (ascii; utf8; ...)
   3. Data Source => Encoding => Phoneme Set => Phonemes => Phoneme
Set => Encoding => Data Sink
         1. Phoneme set transcoding (e.g. Unicode IPA to Kirshenbaum).
         2. Useful for testing phoneme set support.
         3. source phoneme set -- the phoneme set being read (encoded
in the file stream? -- better than requiring the user to know this)
         4. target phoneme set -- the phoneme set being written
(default=ipa+utf8)
         5. encoding -- the target encoding for the phoneme set to be
written out as (ascii; utf8; ...)
   4. File Reader => Text => [Text-to-Phoneme] => Phonemes =>
[Phoneme-to-audio] => Raw Audio => GStreamer (see the pipeline sketch
after this list)
         1. Playback to an audio sink (alsa; oss; pulseaudio; jack;
portaudio; ...).
         2. Record to a file (raw pcm; wav; ogg; flac; ...).
         3. Hook into compatible media players (totem; ...).
         4. How to handle text-to-speech events (e.g. for highlighting
the current word being spoken; for playback progress; ...)?
   5. Other combinations/workflows are possible.
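
To make workflow 4 concrete, here is what the application side could
look like if the front of the chain were ordinary GStreamer elements.
Note that ttsparse and ttssynth are hypothetical element names for the
text-to-phoneme and phoneme-to-audio stages; nothing like them exists
yet:

/* Sketch of workflow 4 as a GStreamer pipeline. "ttsparse" and
 * "ttssynth" are hypothetical elements (the text-to-phoneme and
 * phoneme-to-audio stages); the rest is stock GStreamer. */
#include <gst/gst.h>

int main(int argc, char **argv) {
    GstElement *pipeline;
    GError *error = NULL;
    GMainLoop *loop;

    gst_init(&argc, &argv);
    loop = g_main_loop_new(NULL, FALSE);

    pipeline = gst_parse_launch(
        "filesrc location=book.ssml ! ttsparse ! ttssynth ! "
        "audioconvert ! autoaudiosink", &error);
    if (pipeline == NULL) {
        g_printerr("parse error: %s\n", error->message);
        return 1;
    }

    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    g_main_loop_run(loop);  /* a bus watch would quit on EOS/error */
    return 0;
}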

----- >8 -----

Some of this (character encodings, text-based file format readers,
etc.) is shared with other text/document viewers (okular, firefox,
chromium, ...), while other bits are shared with media players
(specifically the audio back end).

There are also other text-to-speech engines (eSpeak, festival,
Cepstral, ...) that take files in (text, ssml, ...) and produce audio
out, covering the 'Text => [Text-to-Phoneme] => Phonemes =>
[Phoneme-to-audio] => Raw Audio' part of the processing chain.

In addition to this, the system above is suited to text file
conversion workflows (e.g. pdf => text, odf => rdf, ...).

This could also be useful for accessibility APIs that make use of
text-to-speech (in gnome, kde and others).

So... can this be supported in GStreamer?

If so, how? My investigation didn't find any useful documentation on
writing your own sources/sinks, or on alternative processing models.
Can it support callbacks/events (e.g. for highlighting words being
read)?
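
On the callbacks/events question, the mechanism I would expect to use
is element messages on the pipeline bus. Inside the (hypothetical)
synthesis element, something like the following would let a GUI
highlight the word being spoken; the "tts-word" structure name and its
fields are invented:

/* Sketch: posting a word-boundary notification from inside a
 * (hypothetical) TTS element, and catching it in the application.
 * The "tts-word" structure name and its fields are invented. */
#include <gst/gst.h>

/* element side: post an element message on the bus */
void post_word_event(GstElement *element, const gchar *word,
                     guint text_offset) {
    GstStructure *s = gst_structure_new("tts-word",
        "word", G_TYPE_STRING, word,
        "text-offset", G_TYPE_UINT, text_offset, NULL);
    gst_element_post_message(element,
        gst_message_new_element(GST_OBJECT(element), s));
}

/* application side: a bus watch receives it in the main loop */
gboolean bus_cb(GstBus *bus, GstMessage *msg, gpointer user_data) {
    if (GST_MESSAGE_TYPE(msg) == GST_MESSAGE_ELEMENT) {
        const GstStructure *s = gst_message_get_structure(msg);
        guint offset = 0;
        if (s != NULL && gst_structure_has_name(s, "tts-word")) {
            gst_structure_get_uint(s, "text-offset", &offset);
            g_print("speaking '%s' at offset %u\n",
                    gst_structure_get_string(s, "word"), offset);
        }
    }
    return TRUE;  /* keep the watch installed */
}

The application would hook bus_cb up with gst_bus_add_watch() on the
pipeline's bus, so word notifications arrive in the main loop
alongside EOS and error messages. Is that the right mechanism for this
kind of out-of-band event, or is there something better suited?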

- Reece



