[HarfBuzz] Question regarding the use of HB_SCRIPT_KATAKANA for "regular" Japanese

Fri Jan 10 05:12:05 PST 2014

Is it too much to expect minority language users to specify the language
they are using? Inconveniencing the 99% who are using Thai script to write
Thai in order to help the 1% who are using Thai script to write minority
languages doesn't seem like a good trade-off.

On Thu, Jan 9, 2014 at 12:01 PM, Martin Hosken <mhosken at gmail.com> wrote:

> Dear All,
>
> > > https://github.com/arielm/Unicode/blob/master/Projects/ScriptDetector
> >
> > This is awesome!  Thank you.
>
> As I work with minority languages, automatic language detectors make me
> shudder and cry. Please do not assume that because something is in, say
> Thai script, that it is in Thai language. This is true for nearly every
> script there is.
>
> Yours,
> Martin
>
> >
> > behdad
> >
> >
> > > Feedback is welcome,
> > > Ariel
> > >
> > > P.S. the next step is to mix script/lang items with BIDI items (the Mapnik
> > > project should be very helpful here...)
> > >
> > >
> > > On Mon, Dec 23, 2013 at 4:46 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
> > >
> > >     On 13-12-22 08:51 PM, Ariel Malka wrote:
> > >     > Thanks Behdad, the info on how it works in Pango is indeed super useful.
> > >     >
> > >     >
> > >     > An attempt to recap using my original Japanese example:
> > >     >
> > >     > ユニコードは、すべての文字に固有の番号を付与します
> > >     >
> > >     > ICU's "scrptrun" is detecting Katakana, Hiragana and Han scripts.
> > >     >
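A rough sketch of the kind of run splitting described above, in Python. It is not ICU's "scrptrun": it classifies characters by three hard-coded block ranges (an assumption made for brevity, where real code would use the Unicode Script property) and attaches anything else, such as the comma 、, to the run that is already open. The names script_of and script_runs are made up for this illustration.

```python
def script_of(ch):
    """Very rough script guess from block ranges (illustrative only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        # Simplification: the whole block counts as Katakana, including
        # the prolonged sound mark U+30FC.
        return "Katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    return None  # treated as "common": no script of its own


def script_runs(text):
    """Split text into (script, substring) runs."""
    runs = []  # list of [script, substring]
    for ch in text:
        script = script_of(ch)
        # Extend the open run if this character is common, or if the
        # run has no script yet, or if the scripts match.
        if runs and (script is None or runs[-1][0] in (None, script)):
            if runs[-1][0] is None:
                runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


if __name__ == "__main__":
    for script, chunk in script_runs("ユニコードは、すべての文字に固有の番号を付与します"):
        print(script, chunk)
```

Each run would then be shaped with its own script value, while font selection stays a separate, language-driven decision.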
> > >     >
> > >     > Case 1: no "input list of languages" is provided.
> > >     >
> > >     > a) For Katakana and Hiragana items, "ja" will be selected, with the help
> > >     > of http://goo.gl/mpD9Fg
> > >     > In turn, MTLmr3m.ttf (default for "ja" in my system) will be used.
> > >
> > >     So far so good.
> > >
> > >
> > >     > b) For Han items, no language will be selected because of
> > >     > http://goo.gl/xusqwn
> > >     > At this stage, we still need to pick a font, so I guess we
> > >     > choose DroidSansFallback.ttf (default for Han in my system), unless...
> > >     >
> > >     > Some additional strategy could be used, like: observing the surrounding
> > >     > items?
> > >
> > >     Yes.  All itemization issues can use surrounding context when in doubt...
> > >     It's just about managing complexity...
> > >
> > >
> > >     > Case 2: we use "ja" (say, collected from the locale) as "input language"
> > >     >
> > >     > For all the items, "ja" will be selected because the 3 scripts are valid
> > >     > for writing this language, as defined in http://goo.gl/hwQri5
> > >     >
> > >     > By the way, I wonder why Korean does not include Han
> > >     > (see http://goo.gl/bI5BLj), in contradiction to the explanations
> > >     > in http://goo.gl/xusqwn?
> > >
> > >     Great point.  The script-per-language data was put together using
> > >     fontconfig's orth files, which basically only list Hangul characters for
> > >     Korean.  It definitely can be improved upon and I'm willing to hear from
> > >     roozbeh and others whether we have better data somewhere.
> > >
> > >     behdad
> > >
> > >
> > >     >
> > >     >
> > >     > On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
> > >     >
> > >     >     On 13-12-22 06:17 PM, Ariel Malka wrote:
> > >     >     >> As it happens, those three scripts are all considered "simple", so
> > >     >     >> the shaping logic in HarfBuzz is the same for all three.
> > >     >     >
> > >     >     > Good to know. For the record, there's a function for checking if a
> > >     >     > script is complex in the recent Harfbuzz-flavored Android OS:
> > >     >     > http://goo.gl/KL1KUi
> > >     >
> > >     >     Please NEVER use something like that.  It's broken by design.  It
> > >     >     exists in Android for legacy reasons, and will eventually be removed.
> > >     >
> > >     >
> > >     >     >> Where it does make a difference
> > >     >     >> is if the font has ligatures, kerning, etc for those.  OpenType
> > >     >     >> organizes those features by script, and if you request the wrong
> > >     >     >> script you will miss out on the features.
> > >     >     >
> > >     >     > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was a bit
> > >     >     > surprised to find out that LATN was also a complex script.
> > >     >
> > >     >     LATN uses the "generic" shaper, so it's not complex, no.
> > >     >
> > >     >
> > >     >     > So for instance, if I would shape some text containing Hebrew and
> > >     >     > English solely using the HEBR script, I would probably lose kerning
> > >     >     > and ffi-like ligatures for the English part
> > >     >
> > >     >     Correct.
> > >     >
> > >     >
> > >     >     > (this is what I'm actually doing currently in
> > >     >     > my "simple" BIDI implementation...)
> > >     >
> > >     >     Then fix it.  BIDI and script itemization are two separate issues.
> > >     >
> > >     >
> > >     >     >> How you do font selection and what script you pass to HarfBuzz
> > >     >     >> are two completely separate issues.  Font fallback stack should be
> > >     >     >> per-language.
> > >     >     >
> > >     >     > I understand that the best scenario will always be to take decisions
> > >     >     > based on "language" rather than solely on "script", but it creates
> > >     >     > a problem:
> > >     >     >
> > >     >     > Say you work on an API for Unicode text rendering: you can't promise
> > >     >     > your users a solution where they would use arbitrary text without
> > >     >     > providing language-context per span.
> > >     >
> > >     >     These are very good questions.  And we have answers to all.
> > >     >     Unfortunately there's no single location with all this information.
> > >     >     I'm working on documenting them, but looks like replying to you and
> > >     >     letting you document is better.
> > >     >
> > >     >     What Pango does is: it takes an input list of languages (through
> > >     >     $LANGUAGE, for example), and whenever there's an item of text with
> > >     >     script X, it assigns a language to the item in this manner:
> > >     >
> > >     >       - If a language L is set on the item (through xml:lang, or
> > >     >         whatever else the user can use to set a language), and script X
> > >     >         may be used to write language L, then resolve to language L and
> > >     >         return,
> > >     >
> > >     >       - For each language L in the list of default languages $LANGUAGE,
> > >     >         if script X may be used to write language L, then resolve to
> > >     >         language L and return,
> > >     >
> > >     >       - If there's a predominant language L that is likely for script X,
> > >     >         resolve to language L and return,
> > >     >
> > >     >       - Assign no language.
> > >     >
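The four steps above can be sketched in a few lines of Python. This is only an illustration of the resolution order, not Pango's code: the two tables are tiny hand-written stand-ins for the real data (pango-script-lang-table.h and the CLDR likely-subtags list), and the names resolve_language, SCRIPTS_FOR_LANG and LIKELY_LANG_FOR_SCRIPT, as well as the private-use tag "x-minority", are hypothetical.

```python
# Hypothetical stand-in for "list of scripts a language may use".
SCRIPTS_FOR_LANG = {
    "ja": {"Hira", "Kana", "Hani"},
    "zh": {"Hani"},
    "ko": {"Hang"},
    "th": {"Thai"},
    "x-minority": {"Thai"},  # placeholder for a minority language in Thai script
}

# Hypothetical stand-in for "most likely language for each script".
# There is deliberately no entry for "Hani": Han is shared by several languages.
LIKELY_LANG_FOR_SCRIPT = {
    "Hira": "ja",
    "Kana": "ja",
    "Hang": "ko",
    "Thai": "th",
}


def resolve_language(script, item_lang=None, default_langs=()):
    """Return a language tag for an item of the given script, or None."""
    # Step 1: a language set on the item wins if it can use this script.
    if item_lang and script in SCRIPTS_FOR_LANG.get(item_lang, ()):
        return item_lang
    # Step 2: first default language (e.g. from $LANGUAGE) that fits.
    for lang in default_langs:
        if script in SCRIPTS_FOR_LANG.get(lang, ()):
            return lang
    # Steps 3 and 4: predominant language for the script, else no language.
    return LIKELY_LANG_FOR_SCRIPT.get(script)


if __name__ == "__main__":
    print(resolve_language("Hani"))                          # no language
    print(resolve_language("Hani", default_langs=["ja"]))    # ja
    print(resolve_language("Thai"))                          # th
    print(resolve_language("Thai", item_lang="x-minority"))  # x-minority
```

With these tables, untagged Thai-script text resolves to "th", while text explicitly tagged with another language that uses Thai script keeps its own tag.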
> > >     >     This algorithm needs two tables of data:
> > >     >
> > >     >       - List of scripts a language tag may possibly use.  This is, for
> > >     >         example, available in pango-script-lang-table.h.  It's generated
> > >     >         from fontconfig orth files using
> > >     >         pango/tools/gen-script-for-lang.c.  Feel free to copy it.
> > >     >
> > >     >       - List of most likely language for each script.  This is available
> > >     >         in CLDR:
> > >     >
> > >     >         http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html
> > >     >
> > >     >     Pango has its own manually compiled list in pango-language.c
> > >     >
> > >     >     Again, all these are on my plate for the next library I'm going to
> > >     >     design.  It will take a while though...
> > >     >
> > >     >
> > >     >     behdad
> > >     >
> > >     >     > Or, to come back to the origin of the message: solutions like ICU's
> > >     >     > "scrptrun" which are doing script detection are not appropriate
> > >     >     > (because they won't help you find the right font due to the lack
> > >     >     > of language context...)
> > >     >     >
> > >     >     > I guess the problem is even more generic, like with utf8-encoded
> > >     >     > html pages rendered in modern browsers, as demonstrated by the
> > >     >     > creator of liblinebreak:
> > >     >     > http://wyw.dcweb.cn/lang_utf8.htm
> > >     >     >
> > >     >     > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod <behdad at behdad.org> wrote:
> > >     >     >
> > >     >     >     On 13-12-22 10:10 AM, Ariel Malka wrote:
> > >     >     >     > I'm trying to render "regular" (i.e. modern, horizontal) Japanese
> > >     >     >     > with Harfbuzz.
> > >     >     >     >
> > >     >     >     > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar
> > >     >     >     > to what is rendered via browsers.
> > >     >     >     >
> > >     >     >     > But after examining other rendering solutions I can see that
> > >     >     >     > "automatic script detection" can often take place.
> > >     >     >     >
> > >     >     >     > For instance, the Mapnik project is using ICU's "scrptrun", which,
> > >     >     >     > given the following sentence:
> > >     >     >     >
> > >     >     >     > ユニコードは、すべての文字に固有の番号を付与します
> > >     >     >     >
> > >     >     >     > would detect a mix of Katakana, Hiragana and Han scripts.
> > >     >     >     >
> > >     >     >     > But for instance, it would not change anything if I'd render the
> > >     >     >     > sentence by mixing the 3 different scripts (i.e. instead of using
> > >     >     >     > only HB_SCRIPT_KATAKANA.)
> > >     >     >     >
> > >     >     >     > Or are there situations where it would make a difference?
> > >     >     >
> > >     >     >     As it happens, those three scripts are all considered "simple", so
> > >     >     >     the shaping logic in HarfBuzz is the same for all three.  Where it
> > >     >     >     does make a difference is if the font has ligatures, kerning, etc
> > >     >     >     for those.  OpenType organizes those features by script, and if you
> > >     >     >     request the wrong script you will miss out on the features.
> > >     >     >
> > >     >     >
> > >     >     >     > I'm asking that because I suspect a catch-22 situation here. For
> > >     >     >     > example, the word "diameter" in Japanese is 直径 which, given to
> > >     >     >     > "scrptrun" would be detected as Han script.
> > >     >     >     >
> > >     >     >     > As far as I understand, it could be a problem on systems where
> > >     >     >     > DroidSansFallback.ttf is used, because the word would look like in
> > >     >     >     > Simplified Chinese.
> > >     >     >     >
> > >     >     >     > Now, if we were using MTLmr3m.ttf, which is preferred for Japanese,
> > >     >     >     > the word would have been rendered as intended.
> > >     >     >
> > >     >     >     How you do font selection and what script you pass to HarfBuzz are
> > >     >     >     two completely separate issues.  Font fallback stack should be
> > >     >     >     per-language.
> > >     >     >
> > >     >     >     > Reference: https://code.google.com/p/chromium/issues/detail?id=183830
> > >     >     >     >
> > >     >     >     > Any feedback would be appreciated. Note that the wisdom accumulated
> > >     >     >     > here will be translated into tangible info and code samples (see
> > >     >     >     > https://github.com/arielm/Unicode)
> > >     >     >     >
> > >     >     >     > Thanks!
> > >     >     >     > Ariel
> > >     >     >     >
> > >     >     >     >
> > >     >     >     > _______________________________________________
> > >     >     >     > HarfBuzz mailing list
> > >     >     >     > HarfBuzz at lists.freedesktop.org
> > >     >     >     > http://lists.freedesktop.org/mailman/listinfo/harfbuzz
> > >     >     >     >
> > >     >     >
> > >     >     >     --
> > >     >     >     behdad
> > >     >     >     http://behdad.org/
> > >     >     >
> > >     >     >
> > >     >
> > >     >     --
> > >     >     behdad
> > >     >     http://behdad.org/
> > >     >
> > >     >
> > >
> > >     --
> > >     behdad
> > >     http://behdad.org/
> > >
> > >
> >
> > --
> > behdad
> > http://behdad.org/