[HarfBuzz] Question regarding the use of HB_SCRIPT_KATAKANA for "regular" Japanese
Ariel Malka
ariel at chronotext.org
Sun Dec 22 17:51:51 PST 2013
Thanks Behdad, the info on how it works in Pango is indeed super useful.
An attempt to recap using my original Japanese example:
ユニコードは、すべての文字に固有の番号を付与します
ICU's "scrptrun" detects Katakana, Hiragana and Han scripts.
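To make the itemization concrete, here is a rough Python sketch of what a script-run detector does with this sentence. It is not ICU's "scrptrun": as an assumption of mine, it guesses the script from the Unicode character name (good enough for this sentence only), and characters it cannot classify, such as 、, simply join the preceding run. The names guess_script and script_runs are hypothetical.

```python
import unicodedata

SENTENCE = "ユニコードは、すべての文字に固有の番号を付与します"

# Hypothetical name-prefix table. A real detector (e.g. ICU) uses the
# Unicode Script property instead of character names.
NAME_PREFIX_TO_SCRIPT = (
    ("KATAKANA", "Katakana"),
    ("HIRAGANA", "Hiragana"),
    ("CJK UNIFIED IDEOGRAPH", "Han"),
)


def guess_script(ch):
    """Return a script name for ch, or None if it is treated as 'common'."""
    name = unicodedata.name(ch, "")
    for prefix, script in NAME_PREFIX_TO_SCRIPT:
        if name.startswith(prefix):
            return script
    return None


def script_runs(text):
    """Split text into (script, substring) runs."""
    runs = []
    for ch in text:
        script = guess_script(ch)
        same_run = runs and (
            script is None or runs[-1][0] is None or script == runs[-1][0]
        )
        if same_run:
            # A run that started with unclassified characters adopts the
            # first real script that follows.
            if runs[-1][0] is None:
                runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


for script, chunk in script_runs(SENTENCE):
    print(script, chunk)
```

Each run would then be shaped with its own script, which is separate from the question of which font to pick for it.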
Case 1: no "input list of languages" is provided.
a) For Katakana and Hiragana items, "ja" will be selected, with the help of
http://goo.gl/mpD9Fg
In turn, MTLmr3m.ttf (the default for "ja" on my system) will be used.
b) For Han items, no language will be selected because of
http://goo.gl/xusqwn
At this stage we still need to pick a font, so I guess we fall back to
DroidSansFallback.ttf (the default for Han on my system), unless some
additional strategy is used, such as looking at the surrounding items?
Case 2: we use "ja" (say, collected from the locale) as the "input language".
For all the items, "ja" will be selected because all 3 scripts are valid
for writing this language, as defined in http://goo.gl/hwQri5
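The two cases above can be sketched in a few lines of Python, following the resolution steps Behdad describes in the quoted message below. This is only my reading of that description, not Pango's code: the function name resolve_language is hypothetical, and the two tables contain a handful of illustrative entries that I assumed for this example (the real data lives in pango-script-lang-table.h and pango-language.c).

```python
# Illustrative tables only; the entries are assumptions made for this example.
# Scripts that may be used to write each language:
SCRIPTS_FOR_LANGUAGE = {
    "ja": {"Katakana", "Hiragana", "Han"},
    "zh": {"Han"},
}

# Predominant language for a script, when there is one. There is
# deliberately no entry for "Han" here.
LIKELY_LANGUAGE_FOR_SCRIPT = {
    "Katakana": "ja",
    "Hiragana": "ja",
}


def can_write(language, script):
    """True if the script may be used to write the language."""
    return script in SCRIPTS_FOR_LANGUAGE.get(language, set())


def resolve_language(script, item_language=None, default_languages=()):
    """Pick a language for an item of text with the given script, or None."""
    # 1. A language set explicitly on the item wins, if it fits the script.
    if item_language is not None and can_write(item_language, script):
        return item_language
    # 2. Otherwise, the first default language that fits the script.
    for language in default_languages:
        if can_write(language, script):
            return language
    # 3. Otherwise, the predominant language for the script, if any.
    # 4. Otherwise, no language (None).
    return LIKELY_LANGUAGE_FOR_SCRIPT.get(script)


# Case 1: no input list of languages.
print(resolve_language("Katakana"))
print(resolve_language("Han"))
# Case 2: "ja" collected from the locale.
print(resolve_language("Han", default_languages=["ja"]))
```

With these tables, Case 1 leaves Han items without a language, while Case 2 resolves all three scripts to "ja", so a per-language font fallback stack can be used for the whole sentence.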
By the way, I wonder why Korean does not include Han (see
http://goo.gl/bI5BLj), which seems to contradict the explanations in
http://goo.gl/xusqwn?
On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
> On 13-12-22 06:17 PM, Ariel Malka wrote:
> >> As it happens, those three scripts are all considered "simple", so the
> >> shaping logic in HarfBuzz is the same for all three.
> >
> > Good to know. For the record, there's a function for checking if a
> > script is complex in the recent Harfbuzz-flavored Android OS:
> > http://goo.gl/KL1KUi
>
> Please NEVER use something like that. It's broken by design. It exists in
> Android for legacy reasons, and will eventually be removed.
>
>
> >> Where it does make a difference
> >> is if the font has ligatures, kerning, etc for those. OpenType organizes
> >> those features by script, and if you request the wrong script you will
> >> miss out on the features.
> >
> > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was a bit
> > surprised to find out that LATN was also a complex script.
>
> LATN uses the "generic" shaper, so it's not complex, no.
>
>
> > So for instance, if I would shape some text containing Hebrew and English
> > solely using the HEBR script, I would probably lose kerning and ffi-like
> > ligatures for the English part
>
> Correct.
>
>
> > (this is what I'm actually doing currently in
> > my "simple" BIDI implementation...)
>
> Then fix it. BIDI and script itemization are two separate issues.
>
>
> >> How you do font selection and what script you pass to HarfBuzz are two
> >> completely separate issues. Font fallback stack should be per-language.
> >
> > I understand that the best scenario will always be to take decisions
> > based on "language" rather than solely on "script", but it creates a
> > problem:
> >
> > Say you work on an API for Unicode text rendering: you can't promise your
> > users a solution where they would use arbitrary text without providing
> > language-context per span.
>
> These are very good questions. And we have answers to all. Unfortunately
> there's no single location with all this information. I'm working on
> documenting them, but looks like replying to you and letting you document
> is better.
>
> What Pango does is: it takes an input list of languages (through $LANGUAGE
> for example), and whenever there's an item of text with script X, it assigns
> a language to the item in this manner:
>
>   - If a language L is set on the item (through xml:lang, or whatever else
>     the user can use to set a language), and script X may be used to write
>     language L, then resolve to language L and return,
>
>   - For each language L in the list of default languages $LANGUAGE, if
>     script X may be used to write language L, then resolve to language L
>     and return,
>
>   - If there's a predominant language L that is likely for script X,
>     resolve to language L and return,
>
>   - Assign no language.
>
> This algorithm needs two tables of data:
>
>   - List of scripts a language tag may possibly use. This is for example
>     available in pango-script-lang-table.h. It's generated from fontconfig
>     orth files using pango/tools/gen-script-for-lang.c. Feel free to copy
>     it.
>
>   - List of most likely language for each script. This is available in
>     CLDR:
>
>     http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html
>
>     Pango has its own manually compiled list in pango-language.c
>
> Again, all these are on my plate for the next library I'm going to design.
> It will take a while though...
>
>
> behdad
>
> > Or, to come back to the origin of the message: solutions like ICU's
> > "scrptrun" which are doing script detection are not appropriate (because
> > they won't help you find the right font due to the lack of language
> > context...)
> >
> > I guess the problem is even more generic, like with utf8-encoded html
> > pages rendered in modern browsers, as demonstrated by the creator of
> > liblinebreak:
> > http://wyw.dcweb.cn/lang_utf8.htm
> >
> > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod <behdad at behdad.org>
> > wrote:
> >
> > On 13-12-22 10:10 AM, Ariel Malka wrote:
> > > I'm trying to render "regular" (i.e. modern, horizontal) Japanese with
> > > Harfbuzz.
> > >
> > > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar to
> > > what is rendered via browsers.
> > >
> > > But after examining other rendering solutions I can see that "automatic
> > > script detection" can often take place.
> > >
> > > For instance, the Mapnik project is using ICU's "scrptrun", which, given
> > > the following sentence:
> > >
> > > ユニコードは、すべての文字に固有の番号を付与します
> > >
> > > would detect a mix of Katakana, Hiragana and Han scripts.
> > >
> > > But for instance, it would not change anything if I'd render the
> > > sentence by mixing the 3 different scripts (i.e. instead of using only
> > > HB_SCRIPT_KATAKANA.)
> > >
> > > Or are there situations where it would make a difference?
> >
> > As it happens, those three scripts are all considered "simple", so the
> > shaping logic in HarfBuzz is the same for all three. Where it does make a
> > difference is if the font has ligatures, kerning, etc for those. OpenType
> > organizes those features by script, and if you request the wrong script
> > you will miss out on the features.
> >
> >
> > > I'm asking that because I suspect a catch-22 situation here. For
> > > example, the word "diameter" in Japanese is 直径 which, given to
> > > "scrptrun", would be detected as Han script.
> > >
> > > As far as I understand, it could be a problem on systems where
> > > DroidSansFallback.ttf is used, because the word would look as it does
> > > in Simplified Chinese.
> > >
> > > Now, if we were using MTLmr3m.ttf, which is preferred for Japanese, the
> > > word would have been rendered as intended.
> >
> > How you do font selection and what script you pass to HarfBuzz are two
> > completely separate issues. Font fallback stack should be per-language.
> >
> > > Reference: https://code.google.com/p/chromium/issues/detail?id=183830
> > >
> > > Any feedback would be appreciated. Note that the wisdom accumulated
> > > here will be translated into tangible info and code samples (see
> > > https://github.com/arielm/Unicode)
> > >
> > > Thanks!
> > > Ariel
> > >
> > >
> > >
> >
> > --
> > behdad
> > http://behdad.org/
> >
> >
>
> --
> behdad
> http://behdad.org/
>