[HarfBuzz] Question regarding the use of HB_SCRIPT_KATAKANA for "regular" Japanese
Ariel Malka
ariel at chronotext.org
Wed Jan 8 09:55:51 PST 2014
Hi,
As promised, I have synthesized the wisdom accumulated in this thread into
some code and documentation:
https://github.com/arielm/Unicode/blob/master/Projects/ScriptDetector
Feedback is welcome,
Ariel
P.S. the next step is to mix script/lang items with BIDI items (the Mapnik
project should be very helpful here...)
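
For readers skimming the quoted thread below, here is a rough sketch of the kind of script itemization it discusses (what ICU's "scrptrun" reports for the sample sentence). It is not the code from the ScriptDetector project linked above, and it does not use real Unicode Script property data: the block ranges, the "Common" label and the function names are assumptions made for illustration. Characters outside the three ranges (such as the comma 、) are simply attached to the preceding run.

```python
# Block-based approximation of script itemization.
# Assumption: these three code point ranges stand in for real Script data.
RANGES = [
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
]


def script_of(ch):
    """Return the approximate script of one character, or "Common"."""
    cp = ord(ch)
    for lo, hi, name in RANGES:
        if lo <= cp <= hi:
            return name
    return "Common"


def script_runs(text):
    """Split text into (script, substring) runs."""
    runs = []  # each entry is a mutable [script, substring] pair
    for ch in text:
        script = script_of(ch)
        if runs and (script == "Common" or runs[-1][0] == script):
            # Same script, or a common character: extend the current run.
            runs[-1][1] += ch
        elif runs and runs[-1][0] == "Common":
            # Leading common characters adopt the first real script seen.
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


for script, chunk in script_runs("ユニコードは、すべての文字に固有の番号を付与します"):
    print(script, chunk)
```

Each run would then be shaped with its own script, while the font is chosen separately, per language, as explained in the thread.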
On Mon, Dec 23, 2013 at 4:46 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
> On 13-12-22 08:51 PM, Ariel Malka wrote:
> > Thanks Behdad, the info on how it works in Pango is indeed super useful.
> >
> >
> > An attempt to recap using my original Japanese example:
> >
> > ユニコードは、すべての文字に固有の番号を付与します
> >
> > ICU's "scrptrun" is detecting Katakana, Hiragana and Han scripts.
> >
> >
> > Case 1: no "input list of languages" is provided.
> >
> > a) For Katakana and Hiragana items, "ja" will be selected, with the help
> > of http://goo.gl/mpD9Fg
> > In turn, MTLmr3m.ttf (default for "ja" in my system) will be used.
>
> So far so good.
>
>
> > b) For Han items, no language will be selected because of
> > http://goo.gl/xusqwn
> > At this stage, we still need to pick a font, so I guess we
> > choose DroidSansFallback.ttf (default for Han in my system), unless...
> >
> > Some additional strategy could be used, like: observing the surrounding
> > items?
>
> Yes. All itemization issues can use surrounding context when in doubt...
> It's just about managing complexity...
>
>
> > Case 2: we use "ja" (say, collected from the locale) as "input language"
> >
> > For all the items, "ja" will be selected because the 3 scripts are valid
> > for writing this language, as defined in http://goo.gl/hwQri5
> >
> > By the way, I wonder why Korean does not include Han
> > (see http://goo.gl/bI5BLj), in contradiction to the explanations
> > in http://goo.gl/xusqwn?
>
> Great point. The way the script-per-language was put together is using
> fontconfig's orth files, which basically only list Hangul characters for
> Korean. It definitely can be improved upon and I'm willing to hear from
> roozbeh and others whether we have better data somewhere.
>
> behdad
>
>
> >
> >
> > On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
> >
> > On 13-12-22 06:17 PM, Ariel Malka wrote:
> > >> As it happens, those three scripts are all considered "simple", so the
> > >> shaping logic in HarfBuzz is the same for all three.
> > >
> > > Good to know. For the record, there's a function for checking if a
> > > script is complex in the recent Harfbuzz-flavored Android OS:
> > > http://goo.gl/KL1KUi
> >
> > Please NEVER use something like that. It's broken by design. It exists in
> > Android for legacy reasons, and will eventually be removed.
> >
> >
> > >> Where it does make a difference
> > >> is if the font has ligatures, kerning, etc for those. OpenType organizes
> > >> those features by script, and if you request the wrong script you will
> > >> miss out on the features.
> > >
> > > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was a bit
> > > surprised to find out that LATN was also a complex script.
> >
> > LATN uses the "generic" shaper, so it's not complex, no.
> >
> >
> > > So for instance, if I would shape some text containing Hebrew and
> > > English solely using the HEBR script, I would probably lose kerning and
> > > ffi-like ligatures for the English part
> >
> > Correct.
> >
> >
> > > (this is what I'm actually doing currently in
> > > my "simple" BIDI implementation...)
> >
> > Then fix it. BIDI and script itemization are two separate issues.
> >
> >
> > >> How you do font selection and what script you pass to HarfBuzz are two
> > >> completely separate issues. Font fallback stack should be per-language.
> > >
> > > I understand that the best scenario will always be to take decisions
> > > based on "language" rather than solely on "script", but it creates a
> > > problem:
> > >
> > > Say you work on an API for Unicode text rendering: you can't promise your
> > > users a solution where they would use arbitrary text without providing
> > > language-context per span.
> >
> > These are very good questions. And we have answers to all. Unfortunately
> > there's no single location with all this information. I'm working on
> > documenting them, but looks like replying to you and letting you document
> > is better.
> >
> > What Pango does is: it takes an input list of languages (through $LANGUAGE
> > for example), and whenever there's an item of text with script X, it
> > assigns a language to the item in this manner:
> >
> > - If a language L is set on the item (through xml:lang, or whatever else
> > the user can use to set a language), and script X may be used to write
> > language L, then resolve to language L and return,
> >
> > - for each language L in the list of default languages $LANGUAGE, if
> > script X may be used to write language L, then resolve to language L and
> > return,
> >
> > - If there's a predominant language L that is likely for script X, resolve
> > to language L and return,
> >
> > - Assign no language.
> >
> > This algorithm needs two tables of data:
> >
> > - List of scripts a language tag may possibly use. This is for example
> > available in pango-script-lang-table.h. It's generated from fontconfig
> > orth files using pango/tools/gen-script-for-lang.c. Feel free to copy it.
> >
> > - List of most likely language for each script. This is available in CLDR:
> >
> > http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html
> >
> > Pango has its own manually compiled list in pango-language.c
> >
> > Again, all these are on my plate for the next library I'm going to design.
> > It will take a while though...
> >
> >
> > behdad
> >
> > > Or, to come back to the origin of the message: solutions like ICU's
> > > "scrptrun" which are doing script detection are not appropriate (because
> > > they won't help you find the right font due to the lack of language
> > > context...)
> > >
> > > I guess the problem is even more generic, like with utf8-encoded html
> > > pages rendered in modern browsers, as demonstrated by the creator of
> > > liblinebreak:
> > > http://wyw.dcweb.cn/lang_utf8.htm
> > >
> > > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod <behdad at behdad.org>
> > > wrote:
> > >
> > > On 13-12-22 10:10 AM, Ariel Malka wrote:
> > > > I'm trying to render "regular" (i.e. modern, horizontal) Japanese with
> > > > Harfbuzz.
> > > >
> > > > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar to
> > > > what is rendered via browsers.
> > > >
> > > > But after examining other rendering solutions I can see that "automatic
> > > > script detection" can often take place.
> > > >
> > > > For instance, the Mapnik project is using ICU's "scrptrun", which, given
> > > > the following sentence:
> > > >
> > > > ユニコードは、すべての文字に固有の番号を付与します
> > > >
> > > > would detect a mix of Katakana, Hiragana and Han scripts.
> > > >
> > > > But for instance, it would not change anything if I'd render the
> > > > sentence by mixing the 3 different scripts (i.e. instead of using only
> > > > HB_SCRIPT_KATAKANA.)
> > > >
> > > > Or are there situations where it would make a difference?
> > >
> > > As it happens, those three scripts are all considered "simple", so the
> > > shaping logic in HarfBuzz is the same for all three. Where it does make a
> > > difference is if the font has ligatures, kerning, etc for those. OpenType
> > > organizes those features by script, and if you request the wrong script
> > > you will miss out on the features.
> > >
> > >
> > > > I'm asking that because I suspect a catch-22 situation here. For
> > > > example, the word "diameter" in Japanese is 直径 which, given to
> > > > "scrptrun" would be detected as Han script.
> > > >
> > > > As far as I understand, it could be a problem on systems where
> > > > DroidSansFallback.ttf is used, because the word would look like in
> > > > Simplified Chinese.
> > > >
> > > > Now, if we were using MTLmr3m.ttf, which is preferred for Japanese, the
> > > > word would have been rendered as intended.
> > >
> > > How you do font selection and what script you pass to HarfBuzz are two
> > > completely separate issues. Font fallback stack should be per-language.
> > >
> > > > Reference: https://code.google.com/p/chromium/issues/detail?id=183830
> > > >
> > > > Any feedback would be appreciated. Note that the wisdom accumulated
> > > > here will be translated into tangible info and code samples (see
> > > > https://github.com/arielm/Unicode)
> > > >
> > > > Thanks!
> > > > Ariel
> > > >
> > > >
> > > > _______________________________________________
> > > > HarfBuzz mailing list
> > > > HarfBuzz at lists.freedesktop.org
> > > > http://lists.freedesktop.org/mailman/listinfo/harfbuzz
> > > >
> > >
> > > --
> > > behdad
> > > http://behdad.org/
> > >
> > >
> >
> > --
> > behdad
> > http://behdad.org/
> >
> >
>
> --
> behdad
> http://behdad.org/
>
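
The language-resolution steps described for Pango in the quoted thread (language set on the item, then the default languages, then the most likely language for the script, else no language) can be sketched roughly as below. This is only an illustration, not Pango's code: the function name, the script names and the two tiny tables are hypothetical stand-ins for pango-script-lang-table.h and the likely-language list, filled in just enough to replay Case 1 and Case 2 from the thread.

```python
# Hypothetical stand-in for the "scripts a language may use" table.
SCRIPTS_FOR_LANG = {
    "ja": {"Hiragana", "Katakana", "Han"},
    "zh": {"Han"},
    "ko": {"Hangul"},  # mirrors the thread: the data lists only Hangul
    "en": {"Latin"},
}

# Hypothetical stand-in for the "most likely language per script" table.
# Han is deliberately absent, so Han items may end up with no language.
LIKELY_LANG_FOR_SCRIPT = {
    "Hiragana": "ja",
    "Katakana": "ja",
    "Hangul": "ko",
    "Latin": "en",
}


def resolve_language(script, item_lang=None, default_langs=()):
    """Pick a language for an item of text written in `script`."""

    def can_write(lang):
        return script in SCRIPTS_FOR_LANG.get(lang, set())

    # 1. A language explicitly set on the item wins if it uses this script.
    if item_lang is not None and can_write(item_lang):
        return item_lang
    # 2. Otherwise, the first default language that uses this script.
    for lang in default_langs:
        if can_write(lang):
            return lang
    # 3. Otherwise, the predominant language for the script, if any.
    # 4. Otherwise, no language (None).
    return LIKELY_LANG_FOR_SCRIPT.get(script)


# Case 1: no input languages.
print(resolve_language("Katakana"))
print(resolve_language("Han"))
# Case 2: "ja" collected from the locale.
print(resolve_language("Han", default_langs=["ja"]))
```

The resolved language (or its absence) would then drive the per-language font fallback stack, independently of the script passed to HarfBuzz.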