[Accessibility] Sentence Boundary Detection (Chunking)

Fri Dec 17 08:30:34 PST 2004

In a recent discussion on a common speech API, we mentioned that we ought to 
develop a standard library for "chunking" text into sentences, which could be 
used by speech synthesis authors.  This message is to explain what my little 
bit of research on the subject has revealed.  I apologize if you already know 
this, but thought it might be informative to some.

Sentence "chunking" is more properly called "Sentence Boundary Detection", 
which I'll abbreviate to SBD.  Strictly speaking, the term "chunking" refers 
to breaking sentences up into smaller fragments -- noun phrases, verb 
phrases, etc.

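By far the hardest part of SBD is dealing with periods in English, because a 
period can also mark an abbreviation or a decimal point rather than the end 
of a sentence.

Three techniques are commonly used for SBD: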

1.  Regular expressions based on punctuation.  This is what KTTSD uses.
2.  Classification and regression trees (CART).  These are decision trees, 
which work very much like a state machine.  The Festival speech synthesizer 
has a fairly simple CART built in (see info:/festival/Utterance chunking).  
In my testing, this technique is about as effective as regular expressions.
3.  Part of speech (POS) analysis.  In this technique, the text is analyzed 
and a part of speech is assigned to each word -- noun, adjective, verb, 
adverb, etc.  Periods are then examined in the context of the surrounding 
words.  A Viterbi algorithm is then used to estimate the probability that 
each period is an end of sentence.  A period preceded by an object phrase 
and followed by a noun phrase, for instance, is very likely an end of 
sentence, but a period followed by a verb is not.  Note that this technique 
requires a probability database for the target language, both for the POS 
analysis and the Viterbi algorithm.  In my testing, this method is only 
marginally better than 1 or 2.  (I apologize if I've mangled the description 
of this method.)
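
To make technique 1 concrete, here is a minimal Python sketch of 
punctuation-based SBD.  It is not the expression KTTSD actually uses; the 
pattern, the tiny abbreviation list and the function name split_sentences are 
all made up for illustration.  It treats ".", "!" or "?" followed by 
whitespace and an uppercase letter as a boundary, unless the word before a 
period is a known abbreviation.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real one would be
# much larger and language-specific.
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "etc", "vs", "e.g", "i.e"}

# Candidate boundary: ., ! or ? followed by whitespace, then (optionally a
# quote and) an uppercase letter.
BOUNDARY = re.compile(r'([.!?])\s+(?=["\']?[A-Z])')

def split_sentences(text):
    """Split text into sentences using punctuation rules only."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Look at the word immediately before the punctuation mark.
        words = text[start:match.start(1)].split()
        last_word = words[-1].lower() if words else ""
        if match.group(1) == "." and last_word in ABBREVIATIONS:
            continue  # period belongs to an abbreviation, not a boundary
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
```

A decimal such as "3.50" is left alone because no whitespace follows the 
period.  An abbreviation that really does end a sentence is still split 
wrongly, which is the kind of ambiguity that makes periods hard.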

The good news is that both 2 and 3 are already available in the 
freely-distributable Edinburgh Speech Tools library, which is part of 
Festival.  The bad news is that the databases are only available for English 
and one or two other languages.

If you'd like to play with the SBD capabilities of Festival, I've attached a 
Scheme script.  It is a modified version of the text2pos script that comes 
with Festival in the examples directory.  Use it like this:

echo "This is a sentence.  This is the second sentence." | ./text2sen

The output looks like this:

"BB"/0
"This"/"dt"
"is"/"vbz"
"a"/"dt"
"sentence"/"nn"
"."/"punc"
"BB"/0
"This"/"dt"
"is"/"vbz"
"the"/"dt"
"second"/"jj"
"sentence"/"nn"
"."/"punc"

Each "BB" stands for "big break", i.e., a sentence boundary.
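
If you want the sentences back as plain text, the output can be regrouped 
with a few lines of Python.  This is only a sketch: it assumes every line 
looks exactly like the sample above ("name"/"pos", or "BB"/0 for a break) and 
that punctuation is tagged "punc".  The function name sentences_from_output 
is made up for illustration.

```python
import re

# One output line: "name"/"pos", or "name"/0 when the tag is not a string.
LINE = re.compile(r'^"(.*)"/(?:"(.*)"|\S+)$')

def sentences_from_output(lines):
    """Group name/pos lines into sentence strings, starting a new one at BB."""
    sentences = []
    for line in lines:
        m = LINE.match(line.strip())
        if m is None:
            continue  # skip blank or unrecognized lines
        name, pos = m.group(1), m.group(2)
        if name == "BB" and pos is None:
            sentences.append("")  # big break: start a new sentence
            continue
        if not sentences:
            sentences.append("")  # tolerate words before the first BB
        # Put a space before words, but not before punctuation.
        if sentences[-1] and pos != "punc":
            sentences[-1] += " "
        sentences[-1] += name
    return [s for s in sentences if s]

SAMPLE = [
    '"BB"/0', '"This"/"dt"', '"is"/"vbz"', '"a"/"dt"',
    '"sentence"/"nn"', '"."/"punc"',
    '"BB"/0', '"This"/"dt"', '"is"/"vbz"', '"the"/"dt"',
    '"second"/"jj"', '"sentence"/"nn"', '"."/"punc"',
]
print(sentences_from_output(SAMPLE))
```

To use it on real output, pass sys.stdin to sentences_from_output and pipe 
the output of text2sen into the script.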

-- 
Gary Cramblitt (aka PhantomsDad)
KDE Text-to-Speech Maintainer
http://accessibility.kde.org/developer/kttsd/index.php
-------------- next part --------------
#!/usr/bin/festival --script
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;-*-mode:scheme-*-
;;                                                                       ;;
;;                Centre for Speech Technology Research                  ;;
;;                     University of Edinburgh, UK                       ;;
;;                       Copyright (c) 1996,1997                         ;;
;;                        All Rights Reserved.                           ;;
;;                                                                       ;;
;;  Permission is hereby granted, free of charge, to use and distribute  ;;
;;  this software and its documentation without restriction, including   ;;
;;  without limitation the rights to use, copy, modify, merge, publish,  ;;
;;  distribute, sublicense, and/or sell copies of this work, and to      ;;
;;  permit persons to whom this work is furnished to do so, subject to   ;;
;;  the following conditions:                                            ;;
;;   1. The code must retain the above copyright notice, this list of    ;;
;;      conditions and the following disclaimer.                         ;;
;;   2. Any modifications must be clearly marked as such.                ;;
;;   3. Original authors' names are not deleted.                         ;;
;;   4. The authors' names are not used to endorse or promote products   ;;
;;      derived from this software without specific prior written        ;;
;;      permission.                                                      ;;
;;                                                                       ;;
;;  THE UNIVERSITY OF EDINBURGH AND THE CONTRIBUTORS TO THIS WORK        ;;
;;  DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING      ;;
;;  ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT   ;;
;;  SHALL THE UNIVERSITY OF EDINBURGH NOR THE CONTRIBUTORS BE LIABLE     ;;
;;  FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES    ;;
;;  WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN   ;;
;;  AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,          ;;
;;  ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF       ;;
;;  THIS SOFTWARE.                                                       ;;
;;                                                                       ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;           Author:  Alan W Black
;;;           Date:    August 1996
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;
;;;  Reads in text from stdin and outputs text/pos on stdout
;;;
;;;  Designed to show how simple filters can be written in Festival
;;;
;;;  First we define a function that processes an utterance enough
;;;  to predict part of speech, namely, tokenize it, find the words
;;;  and then run the POS tagger on it.
;;;  Then we define a function to extract the word and pos tag itself
;;;
;;;  We redefine the basic functions run on utterances during text to
;;;  speech to be our two newly-defined functions and then simply
;;;  run tts on standard input.
;;;

;;; Because this is a --script type file it has to explicitly
;;; load the initfiles: init.scm and user's .festivalrc
(load (path-append datadir "init.scm"))

;;; Uncomment one of the following to select the phrase break method
;;;(set! Phrase_Method 'cart_tree)
;;;(set! Phrase_Method 'prob_models)

(define (find-pos utt)
"Main function for processing TTS utterances.  Tokenizes the text,
predicts POS and then predicts phrase breaks"
  (Token utt)
  (POS utt)
  (Phrasify utt)
)

(define (output-pos utt)
"Output the name/pos for each item in the Phrase relation of utt"
 (mapcar
  (lambda (pair)
    (format t "%l/%l\n" (car pair) (car (cdr pair)) ))
;;;    (print (car pair)))
  (utt.features utt 'Phrase '(name pos))))

;;;
;;; Redefine what happens to utterances during text to speech 
;;;
(set! tts_hooks (list find-pos output-pos))

;;; Stop those GC messages
(gc-status nil)

;;; Do the work
(tts_file "-")