[gst-devel] Supporting text-to-speech (and other text handling/processing workflows)

Thu Nov 12 22:08:52 CET 2009

Reece Dunn wrote:
> Hi,
> 
> I have been looking at creating a text-to-speech engine and supporting
> GUI. In theory, these can fit nicely into GStreamer, as they take text
> and convert it to audio (which you can then plug into a GStreamer
> backend for playback or recording). There are some aspects of
> text-to-speech (event notifications; data view; text-to-text
> workflows) that I am not sure fit directly into the GStreamer model.
> 
> Anyway, here are my current thoughts on the architecture of a
> text-to-speech engine (without going into the details of how
> text-to-phoneme and phoneme-to-text are handled).

just go ahead and do it :)

text-to-speech : festival (flite would be nice)
speech-to-text : pocketsphinx

Those are not perfect, but a good starting point. Now please write a
google-translate plugin with src-language and target-language parameters, use
sentence events from pocketsphinx to kick off translations of the text via
google-web-service and voila - we have the Star Trek universal translator.

But seriously, all that should generally work. You might also want to look at
the subtitle stuff, which handles sparse text streams.

Stefan

> 
> ----- 8< -----
> 
> # Data Sources: file; string buffer; stdin
> # Data Sinks: file; string buffer; stdout
> # Readers: source => stream
> # Writers: stream => sink
> # Archives/Compression (Readers): zip; flate; gzip; ...
> 
>    1. Archive Offset -- position of the first byte in the specified
> file in the archive
>    2. File Name -- name of the current file source
> 
> # Encodings (Readers/Writers): ascii; utf8; ...
> 
>    1. Raw Byte Offset -- position in the stream in bytes
>    2. Encoded Character Offset -- position in the stream in characters
>    3. Need to change encodings -- e.g. xml encoding attribute (ascii
> => utf8; ...) and html meta/content-type tag
> 
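To make the difference between the raw byte offset and the encoded character offset concrete, here is a minimal Python sketch. The function name char_to_byte_offsets is made up for illustration, and the code assumes an encoding that does not emit a byte-order mark on every call (UTF-8 here).

```python
def char_to_byte_offsets(text: str, encoding: str = "utf-8") -> list:
    """Return a list where entry i is the byte offset at which character i
    starts; the final entry is the total encoded length in bytes.

    Assumption: `encoding` adds no byte-order mark per call (true for
    "utf-8", "ascii", "utf-16-le"; not true for plain "utf-16").
    """
    offsets = [0]
    for ch in text:
        offsets.append(offsets[-1] + len(ch.encode(encoding)))
    return offsets


if __name__ == "__main__":
    # "\u00ef" is a two-byte character in UTF-8, so the offsets diverge after it.
    print(char_to_byte_offsets("na\u00efve"))  # [0, 1, 2, 4, 5, 6]
```

A reader that tracks both offsets can report positions in either unit, which matters once the encoding changes mid-stream and the two no longer advance together.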
> # File Formats (Readers): text; html; pdf; epub; odf; rtf; ssml; smil; ...
> 
>    1. Stream Offset -- byte/character offset in the raw data stream
> (what to do when changing encodings?)
>    2. Text Offset -- character offset in the text
>    3. Viewer -- presenting the file in a text reader (Gtk+; Qt; ncurses; ...)
>    4. File formats may change data source (zipped stream; multi-file
> format; ...)
>    5. File Reader: Data Source => Archive/Compression => Encoding => File Format
>    6. Some formats (e.g. SSML) require understanding phoneme sets:
> need to pass this as a phoneme stream
>    7. Need a meta-format to transform the source to:
>          1. text sequence -- offset/file information; language (may be
> different languages; pass xml:lang data; ...); text
>          2. phoneme sequence -- offset/file information; phoneme set; prosody
>          3. additional instructions -- pauses; volume; rate; pitch; ...
>          4. audio files/data? -- e.g. from ssml or smil data
>    8. Should support reading/writing the wire format from the File
> Format Reader/Writer
>          1. format identification
>          2. versioning
>          3. byte order? -- for binary data (audio; anything else?)
>          4. meta-data? -- RDF/Turtle?
>          5. encoding? -- text; phoneme sequences; audio data
> 
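The reader chain "Data Source => Archive/Compression => Encoding => File Format" could look roughly like the following in plain Python. This is only a sketch using the standard library: gzip stands in for the compression stage and a trivial HTML text extractor stands in for the file format reader. The names TextExtractor and read_text are hypothetical, and nothing here is GStreamer code.

```python
import gzip
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """File Format stage: collect the character data of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    data = gzip.decompress(raw)     # Archive/Compression stage
    markup = data.decode(encoding)  # Encoding stage (bytes to characters)
    parser = TextExtractor()        # File Format stage
    parser.feed(markup)
    parser.close()                  # flush any buffered text
    return "".join(parser.chunks)


if __name__ == "__main__":
    source = gzip.compress("<p>Hello <b>world</b></p>".encode("utf-8"))
    print(read_text(source))  # Hello world
```

Each stage only sees the output of the previous one, which is the property that would let the stages be swapped independently (zip instead of gzip, SSML instead of HTML).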
> # Phoneme Sets (Readers/Writers): ipa; sampa; kirshenbaum; cmu-en_US;
> festival-en_US; cepstral-[language]; ...
> 
>    1. IPA is a Unicode phoneme set -- U32 data stream
>    2. The other phoneme sets use ascii characters only -- U8 data stream
> 
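As an illustration of moving between a Unicode phoneme set and an ASCII-only one, here is a small sketch that transcodes IPA to Kirshenbaum. The table is a deliberately partial subset and should be checked against the Kirshenbaum specification before being relied on; ASCII characters are passed through unchanged, which is a simplification. The names IPA_TO_KIRSHENBAUM and ipa_to_kirshenbaum are made up for this example.

```python
# Partial IPA -> Kirshenbaum table (illustrative subset, not a full mapping).
IPA_TO_KIRSHENBAUM = {
    "\u0259": "@",  # schwa
    "\u00e6": "&",  # ash
    "\u0251": "A",  # open back unrounded vowel
    "\u025b": "E",  # open-mid front unrounded vowel
    "\u026a": "I",  # near-close front unrounded vowel
    "\u0254": "O",  # open-mid back rounded vowel
    "\u028a": "U",  # near-close back rounded vowel
    "\u028c": "V",  # open-mid back unrounded vowel
    "\u0283": "S",  # voiceless postalveolar fricative
    "\u0292": "Z",  # voiced postalveolar fricative
    "\u03b8": "T",  # voiceless dental fricative
    "\u00f0": "D",  # voiced dental fricative
    "\u014b": "N",  # velar nasal
    "\u0261": "g",  # IPA script g
    "\u02d0": ":",  # length mark
}


def ipa_to_kirshenbaum(ipa: str) -> str:
    out = []
    for ch in ipa:
        if ch in IPA_TO_KIRSHENBAUM:
            out.append(IPA_TO_KIRSHENBAUM[ch])
        elif ch.isascii():
            out.append(ch)  # simplification: ASCII symbols pass through
        else:
            raise ValueError("no mapping for %r (U+%04X)" % (ch, ord(ch)))
    return "".join(out)


if __name__ == "__main__":
    result = ipa_to_kirshenbaum("\u0283\u026ap")  # IPA for "ship"
    print(result)                   # SIp
    print(result.encode("ascii"))   # the output fits an 8-bit stream
```

Raising on unmapped symbols, instead of dropping them, makes gaps in the table visible when testing phoneme set support.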
> # Workflows:
> 
>    1. File Reader => Text => Encoding => Data Sink
>          1. Test a file reader (e.g. is it handling SSML data correctly).
>    2. File Reader => Text => [Text-to-Phoneme] => Phonemes => Phoneme
> Set => Encoding => Data Sink
>          1. Record the phoneme sequence to a file.
>          2. Useful for testing language rules.
>          3. dictionary -- use a dictionary to look up words to give
> the phoneme (and possibly parts-of-speech) sequence
>          4. letter-to-phoneme -- use letter-to-phoneme rules for where
> there is no dictionary match.
>          5. accent/dialect -- apply accent/dialect phoneme-to-phoneme
> transformation rules (e.g. /ɒ/ => /ɑ/ (father-bother merger) in General
> American).
>          6. target phoneme set -- the phoneme set being written
> (default=ipa+utf8)
>          7. encoding -- the target encoding for the phoneme set to be
> written out as (ascii; utf8; ...)
>    3. Data Source => Encoding => Phoneme Set => Phonemes => Phoneme
> Set => Encoding => Data Sink
>          1. Phoneme set transcoding (e.g. Unicode IPA to Kirshenbaum).
>          2. Useful for testing phoneme set support.
>          3. source phoneme set -- the phoneme set being read (encode
> in file stream? -- better than asking the user to know this)
>          4. target phoneme set -- the phoneme set being written
> (default=ipa+utf8)
>          5. encoding -- the target encoding for the phoneme set to be
> written out as (ascii; utf8; ...)
>    4. File Reader => Text => [Text-to-Phoneme] => Phonemes =>
> [Phoneme-to-audio] => Raw Audio => GStreamer
>          1. Playback to an audio sink (alsa; oss; pulseaudio; jack;
> portaudio; ...).
>          2. Record to a file (raw pcm; wav; ogg; flac; ...).
>          3. Hook into compatible media players (totem; ...).
>          4. How to handle text-to-speech events (e.g. for highlighting
> the current word being spoken; for playback progress; ...)?
>    5. Other combinations/workflows are possible.
> 
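The text-to-phoneme steps in workflow 2 (dictionary lookup, letter-to-phoneme fallback, then accent/dialect rules) can be sketched as below. All of the data is toy data invented for the example; a real engine would load dictionaries and rules per language, and its letter-to-phoneme rules would be context sensitive instead of one letter at a time.

```python
# Hypothetical toy data, for illustration only.
DICTIONARY = {"cot": ["k", "ɒ", "t"], "dot": ["d", "ɒ", "t"]}
LETTER_RULES = {"a": "æ", "b": "b", "d": "d", "k": "k", "o": "ɒ", "t": "t"}
ACCENT_RULES = {"ɒ": "ɑ"}  # the /ɒ/ => /ɑ/ rule from the list above


def text_to_phonemes(word, accent_rules=None):
    key = word.lower()
    # 1. dictionary -- prefer an exact word match
    phonemes = DICTIONARY.get(key)
    if phonemes is None:
        # 2. letter-to-phoneme -- fall back to per-letter rules;
        #    letters without a rule are skipped in this toy version
        phonemes = [LETTER_RULES[ch] for ch in key if ch in LETTER_RULES]
    if accent_rules:
        # 3. accent/dialect -- phoneme-to-phoneme substitution
        phonemes = [accent_rules.get(p, p) for p in phonemes]
    return list(phonemes)


if __name__ == "__main__":
    print(text_to_phonemes("cot"))                # ['k', 'ɒ', 't']
    print(text_to_phonemes("cot", ACCENT_RULES))  # ['k', 'ɑ', 't']
    print(text_to_phonemes("bat"))                # ['b', 'æ', 't']
```

Keeping the accent rules as a separate pass over the phoneme sequence means the same dictionary can serve several dialects.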
> ----- >8 -----
> 
> Some of this (character encodings, text-based file format readers,
> etc.) is shared with other text/document viewers (okular, firefox,
> chromium, ...), while other bits are shared with media players
> (specifically the audio back end).
> 
> There are also other text-to-speech engines (eSpeak, festival,
> Cepstral, ...) that support file in (text, ssml, ...) and audio out
> for the 'Text => [Text-to-Phoneme] => Phonemes => [Phoneme-to-audio]
> => Raw Audio' part of the processing chain.
> 
> In addition to this, the system above is suited to text file
> conversion workflows (e.g. pdf => text, odf => rdf, ...).
> 
> This could also be useful for accessibility APIs that make use of
> text-to-speech (in gnome, kde and others).
> 
> So... can this be supported in GStreamer?
> 
> If so, how (my investigation didn't find any useful documentation on
> writing your own sources/sinks, or different models)? Can it support
> callbacks/events (e.g. for highlighting words being read)?
> 
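On the question of callbacks for highlighting the word being read: independent of GStreamer, the idea can be sketched as a synthesis loop that reports each word's text offset to a caller-supplied callback. WordEvent and speak are hypothetical names, and no audio is produced here. In GStreamer itself, elements can post messages on the pipeline bus, which looks like a plausible carrier for this kind of notification, but that is an assumption and not something the sketch demonstrates.

```python
import re
from typing import Callable, NamedTuple


class WordEvent(NamedTuple):
    text_offset: int  # character offset of the word in the input text
    length: int       # length of the word in characters
    word: str


def speak(text: str, on_word: Callable[[WordEvent], None]) -> None:
    """Walk the text word by word and notify the listener before each word.

    A real engine would synthesize and queue audio at the marked point and
    would time the notification against playback, not against synthesis.
    """
    for match in re.finditer(r"\S+", text):
        on_word(WordEvent(match.start(), len(match.group()), match.group()))
        # (audio for this word would be generated here)


if __name__ == "__main__":
    speak("hello brave world", lambda ev: print(ev.text_offset, ev.word))
```

A viewer can use text_offset and length to highlight the current word, provided the file reader preserved the mapping from text offsets back to the source document.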
> - Reece