[HarfBuzz] Question regarding the use of HB_SCRIPT_KATAKANA for "regular" Japanese

Wed Jan 8 09:55:51 PST 2014

Hi,

As promised, I have synthesized the wisdom accumulated in this thread into
some code and documentation:

https://github.com/arielm/Unicode/blob/master/Projects/ScriptDetector

Feedback is welcome,
Ariel

P.S. the next step is to mix script/lang items with BIDI items (the Mapnik
project should be very helpful here...)
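
To make that next step a bit more concrete, here is a rough Python sketch of
one way script items and BIDI items could be combined: both are treated as
lists of (start, end, value) runs over the same logical text, and the text is
cut at every boundary of either list. The function name and the run format are
my own assumptions for illustration; this is not taken from Mapnik, and visual
reordering of the resulting items would still be a separate step.

```python
def intersect_runs(length, script_runs, bidi_runs):
    """Cut [0, length) at every script or direction boundary.

    Both inputs are lists of (start, end, value) tuples that cover
    [0, length) in logical order. Returns a list of
    (start, end, script, direction) items.
    """
    # Every run start is a potential cut point; add the two text ends.
    cuts = sorted({0, length}
                  | {start for start, _, _ in script_runs}
                  | {start for start, _, _ in bidi_runs})

    def value_at(runs, pos):
        for start, end, value in runs:
            if start <= pos < end:
                return value
        raise ValueError("position %d is not covered by any run" % pos)

    return [(a, b, value_at(script_runs, a), value_at(bidi_runs, a))
            for a, b in zip(cuts, cuts[1:])]


# Hypothetical input: 8 code units, Latin / Hebrew / Latin.
items = intersect_runs(
    8,
    [(0, 3, "Latin"), (3, 6, "Hebrew"), (6, 8, "Latin")],
    [(0, 3, "LTR"), (3, 6, "RTL"), (6, 8, "LTR")],
)
```

Each resulting item carries exactly one script and one direction, so it can be
shaped on its own.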

On Mon, Dec 23, 2013 at 4:46 AM, Behdad Esfahbod <behdad at behdad.org> wrote:

> On 13-12-22 08:51 PM, Ariel Malka wrote:
> > Thanks Behdad, the info on how it works in Pango is indeed super useful.
> >
> >
> > An attempt to recap using my original Japanese example:
> >
> > ユニコードは、すべての文字に固有の番号を付与します
> >
> > ICU's "scrptrun" detects Katakana, Hiragana and Han scripts.
> >
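
For readers without ICU at hand, the kind of run detection discussed here can
be approximated with a small Python sketch. The code-point ranges are a
simplification I am assuming for this one example (the real Script property
lives in the Unicode Character Database), everything outside them is lumped
into "Common", and Common characters are simply folded into the neighbouring
run. Refinements such as paired punctuation and Inherited characters are
ignored.

```python
COMMON = "Common"

# Code points inside the Katakana block whose Script property is Common
# (e.g. U+30FC, the prolonged sound mark shared by Hiragana and Katakana).
_COMMON_IN_KANA = {0x30A0, 0x30FB, 0x30FC}


def script_of(ch):
    """Very rough script lookup, adequate only for this Japanese example."""
    cp = ord(ch)
    if cp in _COMMON_IN_KANA:
        return COMMON
    if 0x3041 <= cp <= 0x3096 or 0x309D <= cp <= 0x309F:
        return "Hiragana"
    if 0x30A1 <= cp <= 0x30FF:
        return "Katakana"
    if 0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    return COMMON


def script_runs(text):
    """Split text into (script, substring) runs, folding Common into neighbours."""
    runs = []  # each entry is [script, start, end)
    for i, ch in enumerate(text):
        script = script_of(ch)
        if runs and (script == COMMON
                     or runs[-1][0] == COMMON
                     or script == runs[-1][0]):
            # A run that so far holds only Common text adopts the first
            # real script that follows it.
            if runs[-1][0] == COMMON:
                runs[-1][0] = script
            runs[-1][2] = i + 1
        else:
            runs.append([script, i, i + 1])
    return [(script, text[a:b]) for script, a, b in runs]


sentence = "ユニコードは、すべての文字に固有の番号を付与します"
runs = script_runs(sentence)
```

With this sketch the sentence starts with a Katakana run ("ユニコード"),
followed by a Hiragana run that absorbs the ideographic comma, and then
alternates between Han and Hiragana.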
> >
> > Case 1: no "input list of languages" is provided.
> >
> > a) For Katakana and Hiragana items, "ja" will be selected, with the help
> > of http://goo.gl/mpD9Fg
> > In turn, MTLmr3m.ttf (default for "ja" in my system) will be used.
>
> So far so good.
>
>
> > b) For Han items, no language will be selected because of
> > http://goo.gl/xusqwn
> > At this stage, we still need to pick a font, so I guess we
> > choose DroidSansFallback.ttf (default for Han in my system), unless...
> >
> > Some additional strategy could be used, like: observing the surrounding
> > items?
>
> Yes.  All itemization issues can use surrounding context when in doubt...
> It's just about managing complexity...
>
>
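
A minimal Python sketch of the "surrounding items" idea, under my own
assumptions: items are (script, language) pairs in text order, a language of
None means "unresolved", and an unresolved item borrows the language of the
nearest resolved neighbour, preferring the preceding one on ties. The function
name is hypothetical, and a real implementation would probably also check that
the borrowed language can actually be written in that script.

```python
def fill_from_context(items):
    """Return a copy of items where unresolved languages are borrowed
    from the nearest item that already has one."""
    known = [lang for _, lang in items]
    result = []
    for i, (script, lang) in enumerate(items):
        if lang is None:
            # Look outwards, one step at a time, before and after item i.
            for distance in range(1, len(items)):
                before, after = i - distance, i + distance
                if before >= 0 and known[before] is not None:
                    lang = known[before]
                    break
                if after < len(items) and known[after] is not None:
                    lang = known[after]
                    break
        result.append((script, lang))
    return result


# The Han items of the Japanese sentence have no language of their own,
# but their Kana neighbours resolve to "ja".
items = [("Katakana", "ja"), ("Hiragana", "ja"), ("Han", None), ("Hiragana", "ja")]
resolved = fill_from_context(items)
```

An isolated Han item with no resolved neighbour stays unresolved, which is the
case where a generic Han fallback font would still be needed.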
> > Case 2: we use "ja" (say, collected from the locale) as "input language"
> >
> > For all the items, "ja" will be selected because the 3 scripts are valid
> > for writing this language, as defined in http://goo.gl/hwQri5
> >
> > By the way, I wonder why Korean does not include Han
> > (see http://goo.gl/bI5BLj), which contradicts the explanations
> > in http://goo.gl/xusqwn?
>
> Great point.  The way the script-per-language was put together is using
> fontconfig's orth files, which basically only list Hangul characters for
> Korean.  It definitely can be improved upon and I'm willing to hear from
> roozbeh and others whether we have better data somewhere.
>
> behdad
>
>
> >
> >
> > On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
> >
> >     On 13-12-22 06:17 PM, Ariel Malka wrote:
> >     >> As it happens, those three scripts are all considered "simple", so the
> >     >> shaping logic in HarfBuzz is the same for all three.
> >     >
> >     > Good to know. For the record, there's a function for checking if a
> >     > script is complex in the recent HarfBuzz-flavored Android OS:
> >     > http://goo.gl/KL1KUi
> >
> >     Please NEVER use something like that.  It's broken by design.  It exists
> >     in Android for legacy reasons, and will eventually be removed.
> >
> >
> >     >> Where it does make a difference
> >     >> is if the font has ligatures, kerning, etc for those.  OpenType organizes
> >     >> those features by script, and if you request the wrong script you will
> >     >> miss out on the features.
> >     >
> >     > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was a bit
> >     > surprised to find out that LATN was also a complex script.
> >
> >     LATN uses the "generic" shaper, so it's not complex, no.
> >
> >
> >     > So for instance, if I would shape some text containing Hebrew and
> >     > English solely using the HEBR script, I would probably lose kerning and
> >     > ffi-like ligatures for the English part
> >
> >     Correct.
> >
> >
> >     > (this is what I'm actually doing currently in
> >     > my "simple" BIDI implementation...)
> >
> >     Then fix it.  BIDI and script itemization are two separate issues.
> >
> >
> >     >> How you do font selection and what script you pass to HarfBuzz are two
> >     >> completely separate issues.  Font fallback stack should be per-language.
> >     >
> >     > I understand that the best scenario will always be to take decisions
> >     > based on "language" rather than solely on "script", but it creates a
> >     > problem:
> >     >
> >     > Say you work on an API for Unicode text rendering: you can't promise your
> >     > users a solution where they would use arbitrary text without providing
> >     > language-context per span.
> >
> >     These are very good questions.  And we have answers to all.  Unfortunately
> >     there's no single location with all this information.  I'm working on
> >     documenting them, but looks like replying to you and letting you document
> >     is better.
> >
> >     What Pango does is: it takes an input list of languages (through $LANGUAGE
> >     for example), and whenever there's an item of text with script X, it
> >     assigns a language to the item in this manner:
> >
> >       - If a language L is set on the item (through xml:lang, or whatever else
> >     the user can use to set a language), and script X may be used to write
> >     language L, then resolve to language L and return,
> >
> >       - for each language L in the list of default languages $LANGUAGE, if
> >     script X may be used to write language L, then resolve to language L and
> >     return,
> >
> >       - If there's a predominant language L that is likely for script X,
> >     resolve to language L and return,
> >
> >       - Assign no language.
> >
> >     This algorithm needs two tables of data:
> >
> >       - List of scripts a language tag may possibly use.  This is for example
> >     available in pango-script-lang-table.h.  It's generated from fontconfig
> >     orth files using pango/tools/gen-script-for-lang.c.  Feel free to copy it.
> >
> >       - List of most likely language for each script.  This is available in
> >     CLDR:
> >
> >     http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html
> >
> >     Pango has its own manually compiled list in pango-language.c
> >
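
The four resolution steps and the two tables described above can be sketched
in a few lines of Python. The table contents below are tiny stand-ins that I
made up for illustration (they are not copies of Pango's or CLDR's data), and
the function and variable names are hypothetical. Han is deliberately left out
of the "likely language" table, and Korean only lists Hangul, to mirror the
behaviour discussed in this thread.

```python
# Table 1 (assumed sample data): scripts that may be used to write a language.
SCRIPTS_FOR_LANG = {
    "ja": {"Hiragana", "Katakana", "Han"},
    "zh-cn": {"Han"},
    "ko": {"Hangul"},
    "en": {"Latin"},
    "he": {"Hebrew"},
}

# Table 2 (assumed sample data): predominant language for a script.
LIKELY_LANG_FOR_SCRIPT = {
    "Hiragana": "ja",
    "Katakana": "ja",
    "Hangul": "ko",
    "Hebrew": "he",
}


def resolve_language(script, item_lang=None, default_langs=()):
    """Return a language tag for an item of the given script, or None."""

    def can_write(lang):
        return script in SCRIPTS_FOR_LANG.get(lang, set())

    # 1. A language explicitly set on the item wins if it fits the script.
    if item_lang is not None and can_write(item_lang):
        return item_lang
    # 2. Otherwise the first default language that fits the script.
    for lang in default_langs:
        if can_write(lang):
            return lang
    # 3. Otherwise the predominant language for the script, if any.
    # 4. Otherwise no language (None).
    return LIKELY_LANG_FOR_SCRIPT.get(script)
```

For instance, resolve_language("Han") gives None with these tables, while
resolve_language("Han", default_langs=["ja"]) gives "ja", which matches
Case 1 and Case 2 of the recap.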
> >     Again, all these are on my plate for the next library I'm going to design.
> >     It will take a while though...
> >
> >
> >     behdad
> >
> >     > Or, to come back to the origin of the message: solutions like ICU's
> >     > "scrptrun" which are doing script detection are not appropriate (because
> >     > they won't help you find the right font due to the lack of language
> >     > context...)
> >     >
> >     > I guess the problem is even more generic, like with UTF-8-encoded HTML
> >     > pages rendered in modern browsers, as demonstrated by the creator of
> >     > liblinebreak:
> >     > http://wyw.dcweb.cn/lang_utf8.htm
> >     >
> >     > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod <behdad at behdad.org> wrote:
> >     >
> >     >     On 13-12-22 10:10 AM, Ariel Malka wrote:
> >     >     > I'm trying to render "regular" (i.e. modern, horizontal) Japanese
> >     >     > with HarfBuzz.
> >     >     >
> >     >     > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar to
> >     >     > what is rendered via browsers.
> >     >     >
> >     >     > But after examining other rendering solutions I can see that
> >     >     > "automatic script detection" can often take place.
> >     >     >
> >     >     > For instance, the Mapnik project is using ICU's "scrptrun", which,
> >     >     > given the following sentence:
> >     >     >
> >     >     > ユニコードは、すべての文字に固有の番号を付与します
> >     >     >
> >     >     > would detect a mix of Katakana, Hiragana and Han scripts.
> >     >     >
> >     >     > But for instance, it would not change anything if I'd render the
> >     >     > sentence by mixing the 3 different scripts (i.e. instead of using
> >     >     > only HB_SCRIPT_KATAKANA.)
> >     >     >
> >     >     > Or are there situations where it would make a difference?
> >     >
> >     >     As it happens, those three scripts are all considered "simple", so
> >     >     the shaping logic in HarfBuzz is the same for all three.  Where it
> >     >     does make a difference is if the font has ligatures, kerning, etc for
> >     >     those.  OpenType organizes those features by script, and if you
> >     >     request the wrong script you will miss out on the features.
> >     >
> >     >
> >     >     > I'm asking that because I suspect a catch-22 situation here. For
> >     >     > example, the word "diameter" in Japanese is 直径 which, given to
> >     >     > "scrptrun" would be detected as Han script.
> >     >     >
> >     >     > As far as I understand, it could be a problem on systems where
> >     >     > DroidSansFallback.ttf is used, because the word would look like in
> >     >     > Simplified Chinese.
> >     >     >
> >     >     > Now, if we were using MTLmr3m.ttf, which is preferred for Japanese,
> >     >     > the word would have been rendered as intended.
> >     >
> >     >     How you do font selection and what script you pass to HarfBuzz are
> >     >     two completely separate issues.  Font fallback stack should be
> >     >     per-language.
> >     >
> >     >     > Reference: https://code.google.com/p/chromium/issues/detail?id=183830
> >     >     >
> >     >     > Any feedback would be appreciated. Note that the wisdom accumulated
> >     >     > here will be translated into tangible info and code samples (see
> >     >     > https://github.com/arielm/Unicode)
> >     >     >
> >     >     > Thanks!
> >     >     > Ariel
> >     >     >
> >     >     >
> >     >     >
> >     >
> >     >     --
> >     >     behdad
> >     >     http://behdad.org/
> >     >
> >     >
> >
> >
> >
>
>