[HarfBuzz] Question regarding the use of HB_SCRIPT_KATAKANA for "regular" Japanese
James Clark
jjc at jclark.com
Fri Jan 10 05:12:05 PST 2014
Is it too much to expect minority language users to specify the language
they are using? Inconveniencing the 99% who are using Thai script to write
Thai in order to help the 1% who are using Thai script to write minority
languages doesn't seem like a good trade-off.
On Thu, Jan 9, 2014 at 12:01 PM, Martin Hosken <mhosken at gmail.com> wrote:
> Dear All,
>
> > > https://github.com/arielm/Unicode/blob/master/Projects/ScriptDetector
> >
> > This is awesome! Thank you.
>
> As I work with minority languages, automatic language detectors make me
> shudder and cry. Please do not assume that because something is in, say,
> Thai script, it is in the Thai language. This is true for nearly every
> script there is.
>
> Yours,
> Martin
>
> >
> > behdad
> >
> >
> > > Feedback is welcome,
> > > Ariel
> > >
> > > P.S. the next step is to mix script/lang items with BIDI items (the Mapnik
> > > project should be very helpful here...)
> > >
> > >
> > > On Mon, Dec 23, 2013 at 4:46 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
> > >
> > > On 13-12-22 08:51 PM, Ariel Malka wrote:
> > > > Thanks Behdad, the info on how it works in Pango is indeed super useful.
> > > >
> > > >
> > > > An attempt to recap using my original Japanese example:
> > > >
> > > > ユニコードは、すべての文字に固有の番号を付与します
> > > >
> > > > ICU's "scrptrun" is detecting Katakana, Hiragana and Han scripts.
> > > >
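For illustration, a minimal sketch of the kind of script-run splitting that "scrptrun" performs on the sentence above. This is not ICU's implementation: the `SCRIPT_RANGES` table and the `script_of` / `script_runs` functions are hypothetical, the ranges are a coarse stand-in for the Unicode Script property, and any character outside them (such as the ideographic comma) is simply attached to the neighbouring run.

```python
# Coarse, hypothetical subset of script ranges (assumption: block ranges
# stand in for the real Unicode Script property that ICU looks up).
SCRIPT_RANGES = [
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
]


def script_of(ch):
    """Return the script name for ch, or None if it is not in the table."""
    cp = ord(ch)
    for first, last, name in SCRIPT_RANGES:
        if first <= cp <= last:
            return name
    return None


def script_runs(text):
    """Split text into (script, substring) runs.

    Characters with no script of their own (punctuation, spaces) join the
    run that precedes them; leading ones join the first run that follows.
    """
    runs = []  # each entry is [script, list_of_chars]
    for ch in text:
        script = script_of(ch)
        same_run = runs and (
            script is None or script == runs[-1][0] or runs[-1][0] is None
        )
        if same_run:
            if runs[-1][0] is None:
                # The run so far had no script; adopt this character's.
                runs[-1][0] = script
            runs[-1][1].append(ch)
        else:
            runs.append([script, [ch]])
    return [(script, "".join(chars)) for script, chars in runs]


if __name__ == "__main__":
    for script, chunk in script_runs("ユニコードは、すべての文字に固有の番号を付与します"):
        print(script, chunk)
```

Each run would then be shaped with its own script tag; which language and font go with a run is a separate decision.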
> > > >
> > > > Case 1: no "input list of languages" is provided.
> > > >
> > > > a) For Katakana and Hiragana items, "ja" will be selected, with the help
> > > > of http://goo.gl/mpD9Fg
> > > > In turn, MTLmr3m.ttf (default for "ja" in my system) will be used.
> > >
> > > So far so good.
> > >
> > >
> > > > b) For Han items, no language will be selected because of http://goo.gl/xusqwn
> > > > At this stage, we still need to pick a font, so I guess we
> > > > choose DroidSansFallback.ttf (default for Han in my system), unless...
> > > >
> > > > Some additional strategy could be used, like: observing the surrounding
> > > > items?
> > >
> > > Yes. All itemization issues can use surrounding context when in doubt...
> > > It's just about managing complexity...
> > >
> > >
> > > > Case 2: we use "ja" (say, collected from the locale) as "input language"
> > > >
> > > > For all the items, "ja" will be selected because the 3 scripts are valid for
> > > > writing this language, as defined in http://goo.gl/hwQri5
> > > >
> > > > By the way, I wonder why Korean does not include Han
> > > > (see http://goo.gl/bI5BLj), in contradiction to the explanations
> > > > in http://goo.gl/xusqwn?
> > >
> > > Great point. The script-per-language table was put together using
> > > fontconfig's orth files, which basically only list Hangul characters for
> > > Korean. It definitely can be improved upon and I'm willing to hear from
> > > roozbeh and others whether we have better data somewhere.
> > >
> > > behdad
> > >
> > >
> > > >
> > > >
> > > > On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
> > > >
> > > > On 13-12-22 06:17 PM, Ariel Malka wrote:
> > > > >> As it happens, those three scripts are all considered "simple", so the shaping
> > > > >> logic in HarfBuzz is the same for all three.
> > > > >
> > > > > Good to know. For the record, there's a function for checking if a script is
> > > > > complex in the recent Harfbuzz-flavored Android OS: http://goo.gl/KL1KUi
> > > >
> > > > Please NEVER use something like that. It's broken by design. It exists in
> > > > Android for legacy reasons, and will eventually be removed.
> > > >
> > > >
> > > > >> Where it does make a difference
> > > > >> is if the font has ligatures, kerning, etc for those. OpenType organizes
> > > > >> those features by script, and if you request the wrong script you will miss
> > > > >> out on the features.
> > > > >
> > > > > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was a bit surprised to
> > > > > find out that LATN was also a complex script.
> > > >
> > > > LATN uses the "generic" shaper, so it's not complex, no.
> > > >
> > > >
> > > > > So for instance, if I would shape some text containing Hebrew and English
> > > > > solely using the HEBR script, I would probably lose kerning and ffi-like
> > > > > ligatures for the English part
> > > >
> > > > Correct.
> > > >
> > > >
> > > > > (this is what I'm actually doing currently in
> > > > > my "simple" BIDI implementation...)
> > > >
> > > > Then fix it. BIDI and script itemization are two separate issues.
> > > >
> > > >
> > > > >> How you do font selection and what script you pass to HarfBuzz are two
> > > > >> completely separate issues. Font fallback stack should be per-language.
> > > > >
> > > > > I understand that the best scenario will always be to take decisions based on
> > > > > "language" rather than solely on "script", but it creates a problem:
> > > > >
> > > > > Say you work on an API for Unicode text rendering: you can't promise your
> > > > > users a solution where they would use arbitrary text without providing
> > > > > language-context per span.
> > > >
> > > > These are very good questions. And we have answers to all. Unfortunately
> > > > there's no single location with all this information. I'm working on
> > > > documenting them, but looks like replying to you and letting you document is
> > > > better.
> > > >
> > > > What Pango does is: it takes an input list of languages (through $LANGUAGE for
> > > > example), and whenever there's an item of text with script X, it assigns a
> > > > language to the item in this manner:
> > > >
> > > > - If a language L is set on the item (through xml:lang, or whatever else the
> > > > user can use to set a language), and script X may be used to write language L,
> > > > then resolve to language L and return,
> > > >
> > > > - for each language L in the list of default languages $LANGUAGE, if script
> > > > X may be used to write language L, then resolve to language L and return,
> > > >
> > > > - If there's a predominant language L that is likely for script X, resolve
> > > > to language L and return,
> > > >
> > > > - Assign no language.
> > > >
> > > > This algorithm needs two tables of data:
> > > >
> > > > - List of scripts a language tag may possibly use. This is for example
> > > > available in pango-script-lang-table.h. It's generated from fontconfig orth
> > > > files using pango/tools/gen-script-for-lang.c. Feel free to copy it.
> > > >
> > > > - List of most likely language for each script. This is available in CLDR:
> > > >
> > > > http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html
> > > >
> > > > Pango has its own manually compiled list in pango-language.c
> > > >
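The resolution order and the two tables described above can be sketched roughly as follows. This is a minimal illustration, not Pango's code: `resolve_language` and both dictionaries are hypothetical names, and the table contents are tiny hand-written excerpts (real data would come from pango-script-lang-table.h and the CLDR likely-subtags list). Han is left without a predominant language and Korean lists only Hangul, mirroring what is reported elsewhere in this thread.

```python
# Hypothetical excerpt: scripts that may be used to write each language.
SCRIPTS_FOR_LANGUAGE = {
    "ja": {"Hiragana", "Katakana", "Han"},
    "zh": {"Han"},
    "ko": {"Hangul"},
    "th": {"Thai"},
    "en": {"Latin"},
}

# Hypothetical excerpt: most likely language for a script.
# "Han" is deliberately absent (assumption taken from case 1 b) above).
LIKELY_LANGUAGE_FOR_SCRIPT = {
    "Hiragana": "ja",
    "Katakana": "ja",
    "Hangul": "ko",
    "Thai": "th",
}


def resolve_language(script, item_language=None, default_languages=()):
    """Assign a language to a text item of the given script, or None."""
    # Step 1: a language set explicitly on the item wins, provided the
    # script may be used to write it.
    if item_language is not None:
        if script in SCRIPTS_FOR_LANGUAGE.get(item_language, ()):
            return item_language
    # Step 2: first default language (e.g. from $LANGUAGE) that may be
    # written in this script.
    for language in default_languages:
        if script in SCRIPTS_FOR_LANGUAGE.get(language, ()):
            return language
    # Step 3: predominant language for the script, if there is one.
    # Step 4: otherwise no language is assigned (None).
    return LIKELY_LANGUAGE_FOR_SCRIPT.get(script)


if __name__ == "__main__":
    print(resolve_language("Han"))                            # no language
    print(resolve_language("Han", default_languages=["ja"]))  # default list
```

With these sample tables, a Han item gets no language when nothing is supplied, and "ja" once "ja" appears in the default list, which matches cases 1 and 2 discussed in the thread.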
> > > > Again, all these are on my plate for the next library I'm going to design. It
> > > > will take a while though...
> > > >
> > > >
> > > > behdad
> > > >
> > > > > Or, to come back to the origin of the message: solutions like ICU's "scrptrun"
> > > > > which are doing script detection are not appropriate (because they won't help
> > > > > you find the right font due to the lack of language context...)
> > > > >
> > > > > I guess the problem is even more generic, like with utf8-encoded html pages
> > > > > rendered in modern browsers, as demonstrated by the creator of liblinebreak:
> > > > > http://wyw.dcweb.cn/lang_utf8.htm
> > > > >
> > > > > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod <behdad at behdad.org> wrote:
> > > > >
> > > > > On 13-12-22 10:10 AM, Ariel Malka wrote:
> > > > > > I'm trying to render "regular" (i.e. modern, horizontal) Japanese with
> > > > > > Harfbuzz.
> > > > > >
> > > > > > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar to what is
> > > > > > rendered via browsers.
> > > > > >
> > > > > > But after examining other rendering solutions I can see that "automatic script
> > > > > > detection" can often take place.
> > > > > >
> > > > > > For instance, the Mapnik project is using ICU's "scrptrun", which, given the
> > > > > > following sentence:
> > > > > >
> > > > > > ユニコードは、すべての文字に固有の番号を付与します
> > > > > >
> > > > > > would detect a mix of Katakana, Hiragana and Han scripts.
> > > > > >
> > > > > > But for instance, it would not change anything if I'd render the sentence by
> > > > > > mixing the 3 different scripts (i.e. instead of using only HB_SCRIPT_KATAKANA.)
> > > > > >
> > > > > > Or are there situations where it would make a difference?
> > > > >
> > > > > As it happens, those three scripts are all considered "simple", so the shaping
> > > > > logic in HarfBuzz is the same for all three. Where it does make a difference
> > > > > is if the font has ligatures, kerning, etc for those. OpenType organizes
> > > > > those features by script, and if you request the wrong script you will miss
> > > > > out on the features.
> > > > >
> > > > >
> > > > > > I'm asking that because I suspect a catch-22 situation here. For example, the
> > > > > > word "diameter" in Japanese is 直径 which, given to "scrptrun", would be
> > > > > > detected as Han script.
> > > > > >
> > > > > > As far as I understand, it could be a problem on systems where
> > > > > > DroidSansFallback.ttf is used, because the word would look as it does in
> > > > > > Simplified Chinese.
> > > > > >
> > > > > > Now, if we were using MTLmr3m.ttf, which is preferred for Japanese, the word
> > > > > > would have been rendered as intended.
> > > > >
> > > > > How you do font selection and what script you pass to HarfBuzz are two
> > > > > completely separate issues. Font fallback stack should be per-language.
> > > > >
> > > > > > Reference: https://code.google.com/p/chromium/issues/detail?id=183830
> > > > > >
> > > > > > Any feedback would be appreciated. Note that the wisdom accumulated here will
> > > > > > be translated into tangible info and code samples (see
> > > > > > https://github.com/arielm/Unicode)
> > > > > >
> > > > > > Thanks!
> > > > > > Ariel
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > HarfBuzz mailing list
> > > > > > HarfBuzz at lists.freedesktop.org
> > > > > > http://lists.freedesktop.org/mailman/listinfo/harfbuzz
> > > > > >
> > > > >
> > > > > --
> > > > > behdad
> > > > > http://behdad.org/
> > > > >
> > > > >