[HarfBuzz] Word Segmentation at 2008 Text Layout Meeting ...

Thu Jan 10 07:19:29 PST 2008

Hi, mpsuzuki,

On Jan 9, 2008 9:11 PM,  <mpsuzuki at hiroshima-u.ac.jp> wrote:
> Hi,
>
> Text Layout 2007 covered font and Unicode/i18n text layout
> technologies, and Text Layout 2008 will cover more
> script-specific issues. Is word segmentation/line breaking the
> central theme of Text Layout 2008?
>

Many issues in OpenType- and Graphite-based text layout on the Free
Desktop still remain to be discussed, debated, resolved, and
implemented in production-quality code.  Therefore, I think it is too
early to make word segmentation/line breaking a "central theme".
However, I do think there is much interest in this area, and rightly
so.

Roughly speaking, there are two main groups of people interested in
"word segmentation/line breaking":

   (1) People with a background and interest in high-quality Western
typography.  Most of the Scribus developers who will be at LGM fall
into this group.  For Western languages, "word segmentation" is
obviously not a problem, but high-quality "line breaking" is: choosing
hyphenation points by syllabifying words ("soft" hyphens), justifying
lines optimally, and preventing a single word from "hanging" at the
end of an otherwise well-laid-out paragraph are all non-trivial
problems of great interest.

   (2) People interested in word segmentation of so-called "spaceless"
scripts.  These are primarily scripts of Asia, and we can distinguish
two sub-categories:

       2.1. People interested in the Indic-derived spaceless scripts
of South and Southeast Asia.  Thai, Lao, Myanmar, and Khmer top the
list, but there are of course a number of other related scripts too.

       2.2. People interested in CJK as used in East Asia.  For CJK,
line breaking is not a big problem because you can generally break a
line of text at any CJK character.  However, word segmentation is
still a huge area of interest in Natural Language Processing (NLP):
word counting, text-to-speech, spell checking, OCR, and so on.  The
range of applications is practically endless.
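
To make the point about CJK line breaking concrete, here is a minimal
C++ sketch.  It is only an illustration under stated assumptions: the
function names are made up for this example, only the CJK Unified
Ideographs block (U+4E00..U+9FFF) is treated as "CJK", and it ignores
the punctuation restrictions that a real line breaker (for example one
following Unicode UAX #14) has to honor.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: only the basic CJK Unified Ideographs block counts as CJK.
inline bool is_cjk_ideograph(char32_t c) {
    return c >= 0x4E00 && c <= 0x9FFF;
}

// Returns every index i such that a line break is allowed just before
// text[i].  A break is allowed wherever either neighbor is an ideograph.
// Hypothetical helper, not part of any existing library.
std::vector<std::size_t> naive_break_opportunities(const std::u32string& text) {
    std::vector<std::size_t> breaks;
    for (std::size_t i = 1; i < text.size(); ++i) {
        if (is_cjk_ideograph(text[i - 1]) || is_cjk_ideograph(text[i])) {
            breaks.push_back(i);
        }
    }
    return breaks;
}
```

Note that this says nothing about where the *words* are, which is
exactly why word segmentation remains a separate problem for CJK.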

Word segmentation is an area of great interest to me personally
because I studied both Thai and Chinese.  So I belong to camp #2.
Certainly I want to make word segmentation an area of high priority at
Text Layout 2008.  To the extent that I can influence things, that is
what I am going to do :-) .

When we look at the Free Desktop, it is clear that efforts to deal
with word segmentation for the spaceless scripts of Southeast Asia
(Thai, Lao, Myanmar, Khmer, inter alia) are still extremely
fragmented.  I want to change that by fostering a team which will
create a set of unified library classes to handle syllable and word
segmentation, especially for the spaceless Indic-derived scripts of
South and Southeast Asia, but for CJK as well.  Those interested in
syllabification of Western or other languages will benefit too.

In both the Indic and CJK cases, proper word segmentation needs
dictionary corpora.  Usually "hybrid" algorithms that combine
language-specific rules with dictionary lookups give the best results.
Note that it may be necessary to distinguish between *orthographic*
syllabification and *phonetic* syllabification.  It should be possible
to implement common functionality (such as loading dictionary corpora
into cache-based trie structures, or whatever structure turns out to
work best) in base classes.  Derived classes can then be tailored for
specific scripts and languages as necessary.  Since it will be
extremely important to manage complexity properly and to keep the
flexibility to change implementations as NLP research reveals better
ways of doing things, I believe C++ would be a good choice for
creating a reference implementation library that could serve as a
foundation for future work.
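
A rough sketch of that base-class/derived-class split might look like
the following.  Everything here is hypothetical: the class names
(DictionaryTrie, WordSegmenter, ThaiSegmenter) are invented for
illustration, the trie is a plain in-memory one with no caching, and
greedy longest-match is just one simple baseline, not the hybrid
algorithm a real library would want.  The one script-specific rule
shown keeps a Thai leading vowel (U+0E40..U+0E44) attached to the
character that follows it.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Common functionality: a dictionary stored in a trie (hypothetical class).
class DictionaryTrie {
    struct Node {
        bool is_word = false;
        std::map<char32_t, std::unique_ptr<Node>> children;
    };
    Node root_;

public:
    void insert(const std::u32string& word) {
        Node* node = &root_;
        for (char32_t c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->is_word = true;
    }

    // Length of the longest dictionary word starting at text[pos], or 0.
    std::size_t longest_match(const std::u32string& text, std::size_t pos) const {
        const Node* node = &root_;
        std::size_t best = 0;
        for (std::size_t i = pos; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->is_word) best = i - pos + 1;
        }
        return best;
    }
};

// Base class: greedy longest-match segmentation shared by all scripts.
class WordSegmenter {
public:
    virtual ~WordSegmenter() = default;

    void add_word(const std::u32string& word) { trie_.insert(word); }

    std::vector<std::u32string> segment(const std::u32string& text) const {
        std::vector<std::u32string> words;
        std::size_t pos = 0;
        while (pos < text.size()) {
            std::size_t len = trie_.longest_match(text, pos);
            if (len == 0) len = fallback_length(text, pos);
            if (len == 0) len = 1;  // always make progress
            words.push_back(text.substr(pos, len));
            pos += len;
        }
        return words;
    }

protected:
    // Script-specific hook, used when no dictionary word matches at pos.
    virtual std::size_t fallback_length(const std::u32string& /*text*/,
                                        std::size_t /*pos*/) const {
        return 1;
    }

private:
    DictionaryTrie trie_;
};

// Derived class: one illustrative orthographic rule for Thai.
class ThaiSegmenter : public WordSegmenter {
protected:
    std::size_t fallback_length(const std::u32string& text,
                                std::size_t pos) const override {
        const char32_t c = text[pos];
        const bool leading_vowel = c >= 0x0E40 && c <= 0x0E44;
        return (leading_vowel && pos + 1 < text.size()) ? 2 : 1;
    }
};
```

A caller would construct a ThaiSegmenter, feed it dictionary entries
with add_word(), and call segment() on a std::u32string.  Swapping in
a better algorithm later would mean changing segment() or the hook
without touching the callers, which is the kind of flexibility argued
for above.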

If nothing else, now you know what is currently on my mind as I think
about Text Layout 2008.

Best Wishes -- Ed

> Regards,
> mpsuzuki
>