[Uim] uim-py: Adding idioms to PY.scm

Jon Babcock jon at kanji.com
Sun Apr 4 20:07:33 EEST 2004


[UTF-8 encoded Unicode.]

Yukiko Bando wrote:

> If someone is interested in giving it a try, please download it from here:
> http://www.h4.dion.ne.jp/~apricots/files/CEDICT_PY.scm.tar.bz2

Adding this is a big step, thanks!

> - With 17,000 entries, lots of common words seem to be still missing!
> "Chocolate" is not included...

You may want to add tsi.src, which contains about 130,000 entries. Check 
the file 'COPYING' in the libtabe-0.2.3.tgz package for the terms of use.
<http://sourceforge.net/project/showfiles.php?group_id=1519>

The first step is to convert the Zhuyin readings to Pinyin, and then to 
eliminate entries that duplicate those in your CEDICT file. I plan to do 
this eventually, but would be *delighted* if I didn't have to. <g> After 
that, the file must be converted to UTF-8 Unicode, I assume.
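
The steps above could be sketched roughly as follows in Python. This is 
only an illustration under several assumptions: that each tsi.src line has 
the layout 'phrase frequency zhuyin-syllables...', that the file is 
Big5-encoded, and that tone 1 is unmarked. The syllable table is a 
hypothetical three-entry stub (just enough for 巧克力, "chocolate"), and 
the function names are invented here; they are not part of uim or libtabe.

```python
# Hypothetical, deliberately tiny syllable table; a real converter
# needs an entry for every Mandarin syllable.
ZHUYIN_TO_PINYIN = {"ㄑㄧㄠ": "qiao", "ㄎㄜ": "ke", "ㄌㄧ": "li"}

# Zhuyin tone marks mapped to tone numbers (written as escapes to avoid
# look-alike characters). An unmarked syllable is assumed to be tone 1.
TONE_MARKS = {"\u02ca": "2", "\u02c7": "3", "\u02cb": "4", "\u02d9": "5"}


def syllable_to_pinyin(zhuyin):
    """Convert one Zhuyin syllable to numbered Pinyin via the stub table."""
    tone = "1"
    base = ""
    for ch in zhuyin:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            base += ch
    return ZHUYIN_TO_PINYIN[base] + tone  # KeyError if syllable unknown


def merge_entries(tsi_lines, known_words):
    """Return (pinyin, word) pairs for words not already in known_words.

    Assumed line layout: word, frequency, then one Zhuyin syllable per
    whitespace-separated field.
    """
    seen = set(known_words)
    new_entries = []
    for line in tsi_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # not an entry line under the assumed layout
        word, syllables = fields[0], fields[2:]
        if word in seen:
            continue  # duplicate of an existing (e.g. CEDICT) entry
        try:
            pinyin = " ".join(syllable_to_pinyin(s) for s in syllables)
        except KeyError:
            continue  # syllable missing from the stub table
        seen.add(word)
        new_entries.append((pinyin, word))
    return new_entries


def big5_to_utf8(raw):
    """Re-encode Big5 bytes as UTF-8 bytes (Big5 input is an assumption)."""
    return raw.decode("big5").encode("utf-8")


sample = ["巧克力 10 ㄑㄧㄠ\u02c7 ㄎㄜ\u02cb ㄌㄧ\u02cb"]
print(merge_entries(sample, known_words={"克力"}))
# prints [('qiao3 ke4 li4', '巧克力')]
```

Writing the result out in the actual PY.scm entry format is left aside 
here, since that depends on how the existing file is laid out.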

Jon



[Somewhat OT]

Regarding the number of words... actually, there are more than 200,000 
words missing. <g> This may be a bit OT, but it is still useful for 
understanding the eventual size of this project. (Also, non-native 
students/scholars/translators of Chinese like myself and native writers 
of Chinese may have somewhat different needs and expectations regarding 
an input method. I'll try to clarify this later.)

* The _Hanyu Da Cidian_ (汉语大词典) has 347,426 multisyllabic entries. 
[Notice I didn't say 'words'. There are many expert opinions on what 
exactly constitutes a 'word' in Chinese. To avoid this problem, they are 
often called 'compounds' or 'binoms', 'trinoms', etc.; 熟語 (jukugo) in 
Japanese. The 國語辭典 Chinese dictionary (see below) explanation of 
this is: <quote>語言中已定型的固定詞組或句子。包括成語、諺語、格言、
歇後語等。是在人們長期使用語言的過程中逐漸形成的。</quote> (Roughly: 
"Fixed phrases or sentences whose form has become set in the language, 
including set idioms, proverbs, maxims, two-part allegorical sayings, 
etc.; they took shape gradually through people's long use of the 
language.") A big subject which, for practical purposes, we can skip.]

* The new and extraordinarily useful _Grand dictionnaire Ricci de la 
langue chinoise_ has about 300,000 multisyllabic entries.

* The 広漢和辞典, a very appealing 3-volume + Index volume edition of 
the original 12-volume + Index volume 大漢和辭典 by the venerable 
諸橋轍次 (Morohashi Tetsuji) has approximately 200,000 multisyllabic 
entries.

* I don't know how many entries are in the big 國語辭典 compiled by the 
教育部國語推行委員會 in Taiwan, because I don't yet own a copy. (I use 
it daily through its web interface: 
<http://140.111.1.22/mandr/clc/dict/> and find it to be the most useful 
overall reference for Chinese words within arm's reach.) I'd guess it's 
about the same size as _Grand dictionnaire Ricci_.

* The popular but hard-to-use _Far-East Chinese-English Dictionary_ 
(from Taiwan) has about 120,000.

* That old standby of 20th century students of Chinese, _Mathews' 
Chinese-English Dictionary_, has 104,000 or more bisyllabic entries.

* The venerable and still useful _A Chinese-English Dictionary_ by 
Herbert A. Giles (1st edition 1892) has, hmm, I don't see a number in 
the front matter of my copy.

Nevertheless, a professional C-E translator encounters many words that 
are not included in any of the above 7 or 8 major dictionaries of 
Chinese words. (Just check the daily traffic on the fanyi mailing list 
for an endless stream of examples.)

Jon
--
Jon Babcock <jon at kanji.com>



