[HarfBuzz] Question regarding the use of HB_SCRIPT_KATAKANA for "regular" Japanese

Sun Dec 22 17:51:51 PST 2013

Thanks Behdad, the info on how it works in Pango is indeed super useful.

An attempt to recap using my original Japanese example:

ユニコードは、すべての文字に固有の番号を付与します

ICU's "scrptrun" detects Katakana, Hiragana and Han scripts.
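To make the itemization step concrete, here is a rough sketch of script-run splitting. It is not ICU's "scrptrun" algorithm: it guesses the script from the Unicode character name (an assumption made only to keep the example free of dependencies) and lets "Common" characters such as punctuation join the current run. The function names script_of and script_runs are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Rough script guess from the Unicode character name (heuristic only)."""
    name = unicodedata.name(ch, "")
    if name.startswith("KATAKANA"):
        return "Katakana"
    if name.startswith("HIRAGANA"):
        return "Hiragana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    return "Common"


def script_runs(text):
    """Split text into (script, substring) runs.

    Common characters (punctuation, spaces...) are appended to the
    current run instead of starting a new one.
    """
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "Common" or script == runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] == "Common":
            # Leading Common characters adopt the first real script.
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


if __name__ == "__main__":
    for script, chunk in script_runs("ユニコードは、すべての文字に固有の番号を付与します"):
        print(script, chunk)
```

With this heuristic the sentence splits into alternating Katakana, Hiragana and Han runs, which is the kind of mix described above. A real implementation should rely on the Unicode Script property instead of character names.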

Case 1: no "input list of languages" is provided.

a) For Katakana and Hiragana items, "ja" will be selected, with the help of
http://goo.gl/mpD9Fg
In turn, MTLmr3m.ttf (default for "ja" in my system) will be used.

b) For Han items, no language will be selected because of
http://goo.gl/xusqwn
At this stage, we still need to pick a font, so I guess we choose
DroidSansFallback.ttf (default for Han in my system), unless...

Some additional strategy could be used, such as looking at the languages
resolved for the surrounding items?
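A minimal sketch of what such a "look at the neighbours" strategy might be. This is purely my own guess at a heuristic, not something Pango or HarfBuzz is known to do; the function name inherit_language and the (script, language) item layout are hypothetical.

```python
def inherit_language(items):
    """Fill in missing languages from the nearest item that has one.

    items: list of (script, language-or-None) tuples, in text order.
    Only languages that were known on input are propagated, and the
    left neighbour wins when two neighbours are equally close.
    """
    resolved = [lang for _, lang in items]
    for i, lang in enumerate(resolved):
        if lang is not None:
            continue
        for dist in range(1, len(items)):
            for j in (i - dist, i + dist):
                if 0 <= j < len(items) and items[j][1] is not None:
                    resolved[i] = items[j][1]
                    break
            if resolved[i] is not None:
                break
    return [(script, lang) for (script, _), lang in zip(items, resolved)]


if __name__ == "__main__":
    items = [("Katakana", "ja"), ("Hiragana", "ja"), ("Han", None), ("Hiragana", "ja")]
    print(inherit_language(items))
```

A Han item sitting between kana items would then inherit "ja", while an isolated Han item (such as a lone 直径) would still end up with no language.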

Case 2: we use "ja" (say, collected from the locale) as "input language"

For all the items, "ja" will be selected because the 3 scripts are valid
for writing this language, as defined in http://goo.gl/hwQri5
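The two cases can be sketched in code, following the resolution steps Behdad describes in the quoted message below. The two tables are tiny illustrative excerpts that I made up for this example (they are not copied from pango-script-lang-table.h or CLDR), the font names are simply the defaults on my system, and resolve_language and pick_font are hypothetical names.

```python
# Scripts that may be used to write each language (illustrative excerpt).
SCRIPTS_FOR_LANG = {
    "ja": {"Katakana", "Hiragana", "Han"},
    "zh": {"Han"},
    "ko": {"Hangul"},
}

# Most likely language per script (illustrative excerpt).
# Han is deliberately absent: no single language is predominant.
LIKELY_LANG_FOR_SCRIPT = {
    "Katakana": "ja",
    "Hiragana": "ja",
    "Hangul": "ko",
}

# Per-language and per-script font defaults (assumption: my system).
FONT_FOR_LANG = {"ja": "MTLmr3m.ttf"}
FONT_FOR_SCRIPT = {"Han": "DroidSansFallback.ttf"}


def resolve_language(script, item_lang=None, default_langs=()):
    """Return a language tag for an item of the given script, or None."""

    def can_write(lang):
        return script in SCRIPTS_FOR_LANG.get(lang, set())

    # 1. A language explicitly set on the item wins if it fits the script.
    if item_lang is not None and can_write(item_lang):
        return item_lang
    # 2. Otherwise, the first default language that can use the script.
    for lang in default_langs:
        if can_write(lang):
            return lang
    # 3. Otherwise, the predominant language for the script, if any.
    # 4. Otherwise, no language (None).
    return LIKELY_LANG_FOR_SCRIPT.get(script)


def pick_font(script, lang):
    """Prefer the per-language font, then fall back to a per-script one."""
    if lang in FONT_FOR_LANG:
        return FONT_FOR_LANG[lang]
    return FONT_FOR_SCRIPT.get(script)


if __name__ == "__main__":
    for script in ("Katakana", "Hiragana", "Han"):
        case1 = resolve_language(script)
        case2 = resolve_language(script, default_langs=["ja"])
        print(script, case1, pick_font(script, case1), case2, pick_font(script, case2))
```

In Case 1 the Han items resolve to no language and fall through to the per-script font, whereas in Case 2 all three scripts resolve to "ja" and share the same font.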

By the way, I wonder why Korean does not include Han (see
http://goo.gl/bI5BLj), which seems to contradict the explanations in
http://goo.gl/xusqwn?

On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <behdad at behdad.org> wrote:

> On 13-12-22 06:17 PM, Ariel Malka wrote:
> >> As it happens, those three scripts are all considered "simple", so the
> >> shaping logic in HarfBuzz is the same for all three.
> >
> > Good to know. For the record, there's a function for checking if a script
> > is complex in the recent Harfbuzz-flavored Android OS: http://goo.gl/KL1KUi
>
> Please NEVER use something like that.  It's broken by design.  It exists in
> Android for legacy reasons, and will eventually be removed.
>
>
> >> Where it does make a difference
> >> is if the font has ligatures, kerning, etc for those.  OpenType organizes
> >> those features by script, and if you request the wrong script you will
> >> miss out on the features.
> >
> > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was a bit surprised
> > to find out that LATN was also a complex script.
>
> LATN uses the "generic" shaper, so it's not complex, no.
>
>
> > So for instance, if I shaped some text containing Hebrew and English
> > solely using the HEBR script, I would probably lose kerning and ffi-like
> > ligatures for the English part
>
> Correct.
>
>
> > (this is what I'm actually doing currently in
> > my "simple" BIDI implementation...)
>
> Then fix it.  BIDI and script itemization are two separate issues.
>
>
> >> How you do font selection and what script you pass to HarfBuzz are two
> >> completely separate issues.  Font fallback stack should be per-language.
> >
> > I understand that the best scenario will always be to take decisions based
> > on "language" rather than solely on "script", but it creates a problem:
> >
> > Say you work on an API for Unicode text rendering: you can't promise your
> > users a solution where they would use arbitrary text without providing
> > language-context per span.
>
> These are very good questions.  And we have answers to all.  Unfortunately
> there's no single location with all this information.  I'm working on
> documenting them, but looks like replying to you and letting you document is
> better.
>
> What Pango does is: it takes an input list of languages (through $LANGUAGE,
> for example), and whenever there's an item of text with script X, it assigns
> a language to the item in this manner:
>
>   - If a language L is set on the item (through xml:lang, or whatever else
>     the user can use to set a language), and script X may be used to write
>     language L, then resolve to language L and return,
>
>   - for each language L in the list of default languages $LANGUAGE, if
>     script X may be used to write language L, then resolve to language L and
>     return,
>
>   - If there's a predominant language L that is likely for script X, resolve
>     to language L and return,
>
>   - Assign no language.
>
> This algorithm needs two tables of data:
>
>   - List of scripts a language tag may possibly use.  This is for example
>     available in pango-script-lang-table.h.  It's generated from fontconfig
>     orth files using pango/tools/gen-script-for-lang.c.  Feel free to copy it.
>
>   - List of most likely language for each script.  This is available in CLDR:
>
>     http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html
>
> Pango has its own manually compiled list in pango-language.c
>
> Again, all these are on my plate for the next library I'm going to design.
> It will take a while though...
>
>
> behdad
>
> > Or, to come back to the origin of the message: solutions like ICU's
> > "scrptrun" which are doing script detection are not appropriate (because
> > they won't help you find the right font due to the lack of language
> > context...)
> >
> > I guess the problem is even more generic, like with utf8-encoded html pages
> > rendered in modern browsers, as demonstrated by the creator of liblinebreak:
> > http://wyw.dcweb.cn/lang_utf8.htm
> >
> > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod <behdad at behdad.org> wrote:
> >
> >     On 13-12-22 10:10 AM, Ariel Malka wrote:
> >     > I'm trying to render "regular" (i.e. modern, horizontal) Japanese with
> >     > Harfbuzz.
> >     >
> >     > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar to
> >     > what is rendered via browsers.
> >     >
> >     > But after examining other rendering solutions I can see that "automatic
> >     > script detection" can often take place.
> >     >
> >     > For instance, the Mapnik project is using ICU's "scrptrun", which, given
> >     > the following sentence:
> >     >
> >     > ユニコードは、すべての文字に固有の番号を付与します
> >     >
> >     > would detect a mix of Katakana, Hiragana and Han scripts.
> >     >
> >     > But for instance, it would not change anything if I'd render the
> >     > sentence by mixing the 3 different scripts (i.e. instead of using only
> >     > HB_SCRIPT_KATAKANA.)
> >     >
> >     > Or are there situations where it would make a difference?
> >
> >     As it happens, those three scripts are all considered "simple", so the
> >     shaping logic in HarfBuzz is the same for all three.  Where it does make
> >     a difference is if the font has ligatures, kerning, etc for those.
> >     OpenType organizes those features by script, and if you request the
> >     wrong script you will miss out on the features.
> >
> >
> >     > I'm asking that because I suspect a catch-22 situation here. For
> >     > example, the word "diameter" in Japanese is 直径 which, given to
> >     > "scrptrun", would be detected as Han script.
> >     >
> >     > As far as I understand, it could be a problem on systems where
> >     > DroidSansFallback.ttf is used, because the word would look as it does
> >     > in Simplified Chinese.
> >     >
> >     > Now, if we were using MTLmr3m.ttf, which is preferred for Japanese, the
> >     > word would have been rendered as intended.
> >
> >     How you do font selection and what script you pass to HarfBuzz are two
> >     completely separate issues.  Font fallback stack should be per-language.
> >
> >     > Reference: https://code.google.com/p/chromium/issues/detail?id=183830
> >     >
> >     > Any feedback would be appreciated. Note that the wisdom accumulated
> >     > here will be translated into tangible info and code samples (see
> >     > https://github.com/arielm/Unicode)
> >     >
> >     > Thanks!
> >     > Ariel
> >     >
> >
> >     --
> >     behdad
> >     http://behdad.org/
> >
> >
>
> --
> behdad
> http://behdad.org/
>