[gst-devel] Supporting text-to-speech (and other text handling/processing workflows)
Stefan Kost
ensonic at hora-obscura.de
Thu Nov 12 22:08:52 CET 2009
Reece Dunn wrote:
> Hi,
>
> I have been looking at creating a text-to-speech engine and supporting
> GUI. In theory, these can fit nicely into GStreamer, as they take text
> and convert it to audio (which you can then plug into a GStreamer
> backend for playback or recording). There are some aspects of
> text-to-speech (event notifications; data view; text-to-text
> workflows) that I am not sure fit directly into the GStreamer model.
>
> Anyway, here are my current thoughts on the architecture of a
> text-to-speech engine (without going into the details of how
> text-to-phoneme and phoneme-to-audio are handled).
just go ahead and do it :)
text-to-speech : festival (flite would be nice)
speech-to-text : pocketsphinx
Those are not perfect, but they are a good starting point. Now please write a
google-translate plugin with src-language and target-language parameters, use
sentence events from pocketsphinx to kick off translations of the text via
google-web-service and voila - we have the star-trek universal translator.
But seriously, all that should generally work. You might also want to look at
the subtitle stuff, which handles sparse text streams.
Stefan
>
> ----- 8< -----
>
> # Data Sources: file; string buffer; stdin
> # Data Sinks: file; string buffer; stdout
> # Readers: source => stream
> # Writers: stream => sink
> # Archives/Compression (Readers): zip; flate; gzip; ...
>
> 1. Archive Offset -- position of the first byte in the specified
> file in the archive
> 2. File Name -- name of the current file source
>
> # Encodings (Readers/Writers): ascii; utf8; ...
>
> 1. Raw Byte Offset -- position in the stream in bytes
> 2. Encoded Character Offset -- position in the stream in characters
> 3. Need to change encodings -- e.g. xml encoding attribute (ascii
> => utf8; ...) and html meta/content-type tag
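The difference between the raw byte offset and the encoded character offset can be illustrated with a minimal sketch. The helper name `char_to_byte_offset` is made up for this example, and it assumes an encoding that does not prepend a byte-order mark (utf-8, ascii, utf-16-le, ...).

```python
def char_to_byte_offset(text: str, char_offset: int, encoding: str = "utf-8") -> int:
    """Return the byte offset of the character at char_offset once text is encoded.

    Assumes the codec adds no byte-order mark (true for utf-8 and ascii).
    """
    return len(text[:char_offset].encode(encoding))


sample = "naïve café"
# 'ï' takes two bytes in UTF-8, so the two offsets drift apart after it.
char_pos = sample.index("café")
byte_pos = char_to_byte_offset(sample, char_pos)
print(char_pos, byte_pos)
```

A reader that switches encodings mid-stream (item 3) would have to restart this bookkeeping from the point of the switch, which is probably why both offsets need to be tracked.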
>
> # File Formats (Readers): text; html; pdf; epub; odf; rtf; ssml; smil; ...
>
> 1. Stream Offset -- byte/character offset in the raw data stream
> (what to do when changing encodings?)
> 2. Text Offset -- character offset in the text
> 3. Viewer -- presenting the file in a text reader (Gtk+; Qt; ncurses; ...)
> 4. File formats may change data source (zipped stream; multi-file
> format; ...)
> 5. File Reader: Data Source => Archive/Compression => Encoding => File Format
> 6. Some formats (e.g. SSML) require understanding phoneme sets:
> need to pass this as a phoneme stream
> 7. Need a meta-format to transform the source to:
>    1. text sequence -- offset/file information; language (may be
>       different languages; pass xml:lang data; ...); text
>    2. phoneme sequence -- offset/file information; phoneme set; prosody
>    3. additional instructions -- pauses; volume; rate; pitch; ...
>    4. audio files/data? -- e.g. from ssml or smil data
> 8. Should support reading/writing the wire format from the File
>    Format Reader/Writer
>    1. format identification
>    2. versioning
>    3. byte order? -- for binary data (audio; anything else?)
>    4. meta-data? -- RDF/Turtle?
>    5. encoding? -- text; phoneme sequences; audio data
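The "Data Source => Archive/Compression => Encoding => File Format" chain from item 5 can be sketched with the Python standard library alone. This is only an illustration of the staging, not a proposed design: `TextExtractor` and `read_text` are hypothetical names, gzip stands in for the compression stage and HTML for the file format.

```python
import gzip
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """File-format stage: collect the character data of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    decompressed = gzip.decompress(raw)      # archive/compression stage
    markup = decompressed.decode(encoding)   # encoding stage
    parser = TextExtractor()                 # file-format stage
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# The data source here is an in-memory buffer rather than a file or stdin.
source = gzip.compress("<p>Hello <b>world</b></p>".encode("utf-8"))
print(read_text(source))
```

Each stage only sees the output of the previous one, so the offsets listed above (archive offset, byte offset, text offset) would have to be passed along explicitly if they are to survive to the end of the chain.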
>
> # Phoneme Sets (Readers/Writers): ipa; sampa; kirshenbaum; cmu-en_US;
> festival-en_US; cepstral-[language]; ...
>
> 1. IPA is a Unicode phoneme set -- U32 data stream
> 2. The other phoneme sets use ascii characters only -- U8 data stream
>
> # Workflows:
>
> 1. File Reader => Text => Encoding => Data Sink
>    1. Test a file reader (e.g. is it handling SSML data correctly).
> 2. File Reader => Text => [Text-to-Phoneme] => Phonemes => Phoneme
>    Set => Encoding => Data Sink
>    1. Record the phoneme sequence to a file.
>    2. Useful for testing language rules.
>    3. dictionary -- use a dictionary to look up words to give
>       the phoneme (and possibly parts-of-speech) sequence
>    4. letter-to-phoneme -- use letter-to-phoneme rules where
>       there is no dictionary match.
>    5. accent/dialect -- apply accent/dialect phoneme-to-phoneme
>       transformation rules (e.g. /ɒ/ => /ɑ/ (father-bother merger) in
>       General American).
>    6. target phoneme set -- the phoneme set being written
>       (default=ipa+utf8)
>    7. encoding -- the target encoding for the phoneme set to be
>       written out as (ascii; utf8; ...)
> 3. Data Source => Encoding => Phoneme Set => Phonemes => Phoneme
>    Set => Encoding => Data Sink
>    1. Phoneme set transcoding (e.g. Unicode IPA to Kirshenbaum).
>    2. Useful for testing phoneme set support.
>    3. source phoneme set -- the phoneme set being read (encode
>       in file stream? -- better than asking the user to know this)
>    4. target phoneme set -- the phoneme set being written
>       (default=ipa+utf8)
>    5. encoding -- the target encoding for the phoneme set to be
>       written out as (ascii; utf8; ...)
> 4. File Reader => Text => [Text-to-Phoneme] => Phonemes =>
>    [Phoneme-to-audio] => Raw Audio => GStreamer
>    1. Playback to an audio sink (alsa; oss; pulseaudio; jack;
>       portaudio; ...).
>    2. Record to a file (raw pcm; wav; ogg; flac; ...).
>    3. Hook into compatible media players (totem; ...).
>    4. How to handle text-to-speech events (e.g. for highlighting
>       the current word being spoken; for playback progress; ...)?
> 5. Other combinations/workflows are possible.
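The [Text-to-Phoneme] step of workflow 2 (dictionary lookup first, letter-to-phoneme rules as the fallback) could look roughly like the sketch below. The dictionary entries, the one-letter rules and the function names are all hypothetical toy data; real rules are context-sensitive and far larger.

```python
from typing import List

# Hypothetical toy data: a real engine would load these from language files.
DICTIONARY = {"the": "ðə", "cat": "kæt"}
LETTER_RULES = {"a": "æ", "c": "k", "e": "ɛ", "h": "h", "m": "m", "s": "s", "t": "t"}


def word_to_phonemes(word: str) -> str:
    """Look the word up in the dictionary, else fall back to per-letter rules."""
    key = word.lower()
    if key in DICTIONARY:
        return DICTIONARY[key]
    # Fallback: one letter at a time; letters without a rule are dropped here.
    return "".join(LETTER_RULES.get(letter, "") for letter in key)


def text_to_phonemes(text: str) -> List[str]:
    """Split on whitespace and convert each word to an IPA phoneme string."""
    return [word_to_phonemes(word) for word in text.split()]


print(text_to_phonemes("The cat sat"))   # 'sat' has no dictionary entry
```

An accent/dialect stage would then be one more phoneme-to-phoneme mapping applied to this output before it is written in the target phoneme set.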
>
> ----- >8 -----
>
> Some of this (character encodings, text-based file format readers,
> etc.) is shared with other text/document viewers (okular, firefox,
> chromium, ...), while other bits are shared with media players
> (specifically the audio back end).
>
> There are also other text-to-speech engines (eSpeak, festival,
> Cepstral, ...) that support file in (text, ssml, ...) and audio out
> for the 'Text => [Text-to-Phoneme] => Phonemes => [Phoneme-to-audio]
> => Raw Audio' part of the processing chain.
>
> In addition to this, the system above is suited to text file
> conversion workflows (e.g. pdf => text, odf => rdf, ...).
>
> This could also be useful for accessibility APIs that make use of
> text-to-speech (in gnome, kde and others).
>
> So... can this be supported in GStreamer?
>
> If so, how (my investigation didn't find any useful documentation on
> writing your own sources/sinks, or different models)? Can it support
> callbacks/events (e.g. for highlighting words being read)?
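On the callback question: independent of GStreamer, the event side can be sketched as a stream of (offset, word) notifications delivered to a listener before each word is synthesised. Everything named here (`iter_words`, `speak`, the `on_word` callback) is made up for illustration and is not GStreamer API; in GStreamer itself, elements can post messages on the pipeline bus, which is likely where such notifications would travel.

```python
import re
from typing import Callable, Iterator, Tuple


def iter_words(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (character offset, word) for each whitespace-delimited word."""
    for match in re.finditer(r"\S+", text):
        yield match.start(), match.group()


def speak(text: str, on_word: Callable[[int, str], None]) -> None:
    """Notify the listener about each word; synthesis itself is left out."""
    for offset, word in iter_words(text):
        on_word(offset, word)   # e.g. a GUI would highlight text[offset:offset+len(word)]
        # Placeholder: a real engine would generate audio for the word here.


events = []
speak("just go ahead", lambda offset, word: events.append((offset, word)))
print(events)
```

In a real pipeline the notification would need to be tied to the audio timestamp of the word, not emitted at synthesis time, so that highlighting stays in step with playback.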
>
> - Reece