[HarfBuzz] Question regarding the use of HB_SCRIPT_KATAKANA for "regular" Japanese

Behdad Esfahbod behdad at behdad.org
Wed Jan 8 18:02:39 PST 2014


On 14-01-09 01:55 AM, Ariel Malka wrote:
> 
> https://github.com/arielm/Unicode/blob/master/Projects/ScriptDetector

This is awesome!  Thank you.

behdad


> Feedback is welcome,
> Ariel
> 
> P.S. the next step is to mix script/lang items with BIDI items (the Mapnik
> project should be very helpful here...)
> 
> 
> On Mon, Dec 23, 2013 at 4:46 AM, Behdad Esfahbod <behdad at behdad.org>
> wrote:
> 
>     On 13-12-22 08:51 PM, Ariel Malka wrote:
>     > Thanks Behdad, the info on how it works in Pango is indeed super useful.
>     >
>     >
>     > An attempt to recap using my original Japanese example:
>     >
>     > ユニコードは、すべての文字に固有の番号を付与します
>     >
>     > ICU's "scrptrun" is detecting Katakana, Hiragana and Han scripts.
>     >
>     >
>     > Case 1: no "input list of languages" is provided.
>     >
>     > a) For Katakana and Hiragana items, "ja" will be selected, with the help
>     > of http://goo.gl/mpD9Fg
>     > In turn, MTLmr3m.ttf (default for "ja" in my system) will be used.
> 
>     So far so good.
> 
> 
>     > b) For Han items, no language will be selected because of
>     > http://goo.gl/xusqwn
>     > At this stage, we still need to pick a font, so I guess we
>     > choose DroidSansFallback.ttf (default for Han in my system), unless...
>     >
>     > Some additional strategy could be used, like: observing the surrounding
>     > items?
> 
>     Yes.  All itemization issues can use surrounding context when in doubt...
>     It's just about managing complexity...
> 
> 
>     > Case 2: we use "ja" (say, collected from the locale) as "input language"
>     >
>     > For all the items, "ja" will be selected because the 3 scripts are valid for
>     > writing this language, as defined in http://goo.gl/hwQri5
>     >
>     > By the way, I wonder why Korean is not including Han
>     > (see http://goo.gl/bI5BLj), in contradiction to the explanations
>     > in http://goo.gl/xusqwn?
> 
>     Great point.  The way the script-per-language was put together is using
>     fontconfig's orth files, which basically only list Hangul characters for
>     Korean.  It definitely can be improved upon and I'm willing to hear from
>     roozbeh and others whether we have better data somewhere.
> 
>     behdad
> 
> 
>     >
>     >
>     > On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <behdad at behdad.org>
>     > wrote:
>     >
>     >     On 13-12-22 06:17 PM, Ariel Malka wrote:
>     >     >> As it happens, those three scripts are all considered "simple", so
>     >     >> the shaping logic in HarfBuzz is the same for all three.
>     >     >
>     >     > Good to know. For the record, there's a function for checking if a
>     >     > script is complex in the recent Harfbuzz-flavored Android OS:
>     >     > http://goo.gl/KL1KUi
>     >
>     >     Please NEVER use something like that.  It's broken by design.  It
>     >     exists in Android for legacy reasons, and will eventually be removed.
>     >
>     >
>     >     >> Where it does make a difference
>     >     >> is if the font has ligatures, kerning, etc for those.  OpenType
>     >     >> organizes those features by script, and if you request the wrong
>     >     >> script you will miss out on the features.
>     >     >
>     >     > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was a bit
>     >     > surprised to find out that LATN was also a complex script.
>     >
>     >     LATN uses the "generic" shaper, so it's not complex, no.
>     >
>     >
>     >     > So for instance, if I shaped some text containing Hebrew and
>     >     > English solely using the HEBR script, I would probably lose kerning
>     >     > and ffi-like ligatures for the English part
>     >
>     >     Correct.
>     >
>     >
>     >     > (this is what I'm actually doing currently in
>     >     > my "simple" BIDI implementation...)
>     >
>     >     Then fix it.  BIDI and script itemization are two separate issues.
>     >
>     >
>     >     >> How you do font selection and what script you pass to HarfBuzz are
>     >     >> two completely separate issues.  Font fallback stack should be
>     >     >> per-language.
>     >     >
>     >     > I understand that the best scenario will always be to take decisions
>     >     > based on "language" rather than solely on "script", but it creates a
>     >     > problem:
>     >     >
>     >     > Say you work on an API for Unicode text rendering: you can't promise
>     >     > your users a solution where they would use arbitrary text without
>     >     > providing language-context per span.
>     >
>     >     These are very good questions.  And we have answers to all.
>     >     Unfortunately there's no single location with all this information.
>     >     I'm working on documenting them, but it looks like replying to you and
>     >     letting you document is better.
>     >
>     >     What Pango does is: it takes an input list of languages (through
>     >     $LANGUAGE for example), and whenever there's an item of text with
>     >     script X, it assigns a language to the item in this manner:
>     >
>     >       - If a language L is set on the item (through xml:lang, or whatever
>     >         else the user can use to set a language), and script X may be used
>     >         to write language L, then resolve to language L and return,
>     >
>     >       - For each language L in the list of default languages $LANGUAGE, if
>     >         script X may be used to write language L, then resolve to language
>     >         L and return,
>     >
>     >       - If there's a predominant language L that is likely for script X,
>     >         resolve to language L and return,
>     >
>     >       - Assign no language.
>     >
>     >     This algorithm needs two tables of data:
>     >
>     >       - List of scripts a language tag may possibly use.  This is for
>     >         example available in pango-script-lang-table.h.  It's generated
>     >         from fontconfig orth files using
>     >         pango/tools/gen-script-for-lang.c.  Feel free to copy it.
>     >
>     >       - List of most likely language for each script.  This is available
>     >         in CLDR:
>     >
>     >         http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html
>     >
>     >     Pango has its own manually compiled list in pango-language.c
>     >
>     >     Again, all these are on my plate for the next library I'm going to
>     >     design.  It will take a while though...
>     >
>     >
>     >     behdad
>     >
>     >     > Or, to come back to the origin of the message: solutions like ICU's
>     >     > "scrptrun" which are doing script detection are not appropriate
>     >     > (because they won't help you find the right font due to the lack of
>     >     > language context...)
>     >     >
>     >     > I guess the problem is even more generic, like with utf8-encoded html
>     >     > pages rendered in modern browsers, as demonstrated by the creator of
>     >     > liblinebreak:
>     >     > http://wyw.dcweb.cn/lang_utf8.htm
>     >     >
>     >     > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod
>     >     > <behdad at behdad.org> wrote:
>     >     >
>     >     >     On 13-12-22 10:10 AM, Ariel Malka wrote:
>     >     >     > I'm trying to render "regular" (i.e. modern, horizontal)
>     >     >     > Japanese with Harfbuzz.
>     >     >     >
>     >     >     > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar
>     >     >     > to what is rendered via browsers.
>     >     >     >
>     >     >     > But after examining other rendering solutions I can see that
>     >     >     > "automatic script detection" can often take place.
>     >     >     >
>     >     >     > For instance, the Mapnik project is using ICU's "scrptrun",
>     >     >     > which, given the following sentence:
>     >     >     >
>     >     >     > ユニコードは、すべての文字に固有の番号を付与します
>     >     >     >
>     >     >     > would detect a mix of Katakana, Hiragana and Han scripts.
>     >     >     >
>     >     >     > But for instance, it would not change anything if I'd render the
>     >     >     > sentence by mixing the 3 different scripts (i.e. instead of using
>     >     >     > only HB_SCRIPT_KATAKANA.)
>     >     >     >
>     >     >     > Or are there situations where it would make a difference?
>     >     >
>     >     >     As it happens, those three scripts are all considered "simple", so
>     >     >     the shaping logic in HarfBuzz is the same for all three.  Where it
>     >     >     does make a difference is if the font has ligatures, kerning, etc
>     >     >     for those.  OpenType organizes those features by script, and if you
>     >     >     request the wrong script you will miss out on the features.
>     >     >
>     >     >
>     >     >     > I'm asking that because I suspect a catch-22 situation here. For
>     >     >     > example, the word "diameter" in Japanese is 直径 which, given to
>     >     >     > "scrptrun" would be detected as Han script.
>     >     >     >
>     >     >     > As far as I understand, it could be a problem on systems where
>     >     >     > DroidSansFallback.ttf is used, because the word would look like
>     >     >     > in Simplified Chinese.
>     >     >     >
>     >     >     > Now, if we were using MTLmr3m.ttf, which is preferred for
>     >     >     > Japanese, the word would have been rendered as intended.
>     >     >
>     >     >     How you do font selection and what script you pass to HarfBuzz are
>     >     >     two completely separate issues.  Font fallback stack should be
>     >     >     per-language.
>     >     >
>     >     >     > Reference:
>     >     >     > https://code.google.com/p/chromium/issues/detail?id=183830
>     >     >     >
>     >     >     > Any feedback would be appreciated. Note that the wisdom
>     >     >     > accumulated here will be translated into tangible info and code
>     >     >     > samples (see https://github.com/arielm/Unicode)
>     >     >     >
>     >     >     > Thanks!
>     >     >     > Ariel
>     >     >     >
>     >     >     >
>     >     >     > _______________________________________________
>     >     >     > HarfBuzz mailing list
>     >     >     > HarfBuzz at lists.freedesktop.org
>     >     >     > http://lists.freedesktop.org/mailman/listinfo/harfbuzz
>     >     >     >
>     >     >
>     >     >     --
>     >     >     behdad
>     >     >     http://behdad.org/
>     >     >
>     >     >
>     >
>     >     --
>     >     behdad
>     >     http://behdad.org/
>     >
>     >
> 
>     --
>     behdad
>     http://behdad.org/
> 
> 

-- 
behdad
http://behdad.org/
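
For readers of the archive, here is a rough sketch of the language-resolution
steps described for Pango in the quoted thread, applied to Ariel's example
sentence. It is only an illustration, not Pango's or ICU's actual code: the
names script_of, itemize, resolve_language and both tables are hypothetical,
the tables hold only what the thread itself states (the real data is in
pango-script-lang-table.h and CLDR), and the script guess uses crude code
point ranges instead of the Unicode Script property.

```python
# Hypothetical stand-ins for the two tables named in the thread.
# Assumption: only the entries the thread mentions are listed here.
SCRIPTS_FOR_LANG = {
    "ja": {"Katakana", "Hiragana", "Han"},
    "ko": {"Hangul"},  # the orth-derived data lists only Hangul for Korean
}
# Most likely language per script; Han deliberately has no entry.
LIKELY_LANG_FOR_SCRIPT = {"Katakana": "ja", "Hiragana": "ja"}


def script_of(ch):
    """Crude block-based guess (assumption: real code uses the Script property)."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "Katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    return "Common"


def itemize(text):
    """Split text into (script, substring) runs.

    Common characters (punctuation etc.) are attached to the preceding run,
    a simplified form of using surrounding context.
    """
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and script in ("Common", runs[-1][0]):
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


def resolve_language(script, item_lang=None, default_langs=()):
    """Follow the four steps from the thread, in order."""
    # 1. A language set on the item wins if the script can write it.
    if item_lang and script in SCRIPTS_FOR_LANG.get(item_lang, ()):
        return item_lang
    # 2. Otherwise, the first default language the script can write.
    for lang in default_langs:
        if script in SCRIPTS_FOR_LANG.get(lang, ()):
            return lang
    # 3. Otherwise, the predominant language for the script, if any.
    # 4. Otherwise, no language (None).
    return LIKELY_LANG_FOR_SCRIPT.get(script)


text = "ユニコードは、すべての文字に固有の番号を付与します"
for script, chunk in itemize(text):
    case1 = resolve_language(script)                         # no input languages
    case2 = resolve_language(script, default_langs=["ja"])   # "ja" from the locale
    print(script, chunk, case1, case2)
```

With these toy tables, the Han runs get no language when no input list is
given (Case 1 in the recap) and resolve to "ja" once "ja" is supplied as a
default language (Case 2). Font selection would then be driven by the resolved
language, while the script of each run is what gets passed to the shaper.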

