[HarfBuzz] 'vert' substitutions in CJK fonts

Behdad Esfahbod behdad at behdad.org
Wed Feb 6 12:50:41 PST 2013


On 13-02-04 12:34 PM, Grigori Goronzy wrote:
> 
> The GSUB table of problematic fonts typically looks a bit too...
> minimal. Here's an example, that's the "MOTOYA LMaru" font, Android's
> standard CJK font:
...
> So substitutions are only used if the run that is shaped has Katakana
> (kana) script and language is set to Japanese (JAN). It works if I
> explicitly set the language and force the script to Katakana.
> 
> But, in practice, that's of course not true! First, it breaks as soon as
> the system language is not Japanese, unless the language has been
> overridden. Second, not only Katakana characters have vertical variants.
> Punctuation might or might not be substituted depending on context,
> because punctuation characters have common script and assume the script
> of characters around them. If they're next to Kanji characters, it will
> break.

Grigori, welcome to the darker sides of text rendering :).

This is what Pango does, and what I eventually want to make easier to do with
HarfBuzz:

  - Say the system language is en.  Upon detecting Katakana, Pango proceeds to
resolve the language to assign to that run of text.  Pango knows which scripts
each language tag (locale) uses, so it correctly detects that English doesn't
use Katakana and that this run therefore can't be in English.  It then goes
searching for a better language tag for the run:

    * If env vars $LANGUAGE and/or $PANGO_LANGUAGE are set, it looks in the
languages listed there (those are each a list of language tags), and picks the
first one that uses Katakana,

    * If that fails, it knows that the most likely language tag for Katakana is
"ja", so it uses that.

This is really useful.  For example, by default when Pango sees text in Arabic
script, it behaves as if it's in Arabic language.  But if I set LANGUAGE=en,fa
in my system, then Pango will attribute untagged Arabic script text to Persian
instead of Arabic language.

This is all in pango-language.c.  Check it out.
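
The lookup order described above can be sketched roughly as follows.  This is
not Pango's code: the tables, the function names and the exact-match handling
of tags are all made up for illustration, and the separator set is an
assumption.  It only shows the three steps: keep the system language if it
uses the script, else take the first preferred language that does, else fall
back to the most likely language for the script.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily abbreviated data: scripts (ISO 15924 codes) used by
 * each language.  Pango's real tables are much larger. */
typedef struct {
    const char *lang;
    const char *scripts[4];          /* NULL-terminated */
} lang_scripts_t;

static const lang_scripts_t LANG_SCRIPTS[] = {
    { "en", { "Latn", NULL } },
    { "ja", { "Hani", "Hira", "Kana", NULL } },
    { "ar", { "Arab", NULL } },
    { "fa", { "Arab", NULL } },
};

/* Hypothetical data: the most likely language for a script. */
typedef struct { const char *script; const char *lang; } script_default_t;

static const script_default_t SCRIPT_DEFAULTS[] = {
    { "Kana", "ja" }, { "Hira", "ja" }, { "Arab", "ar" }, { "Latn", "en" },
};

#define N_ELEMENTS(a) (sizeof (a) / sizeof ((a)[0]))
#define SEPARATORS ";:, \t"          /* assumed list separators */

/* Return the table's copy of the tag if `lang` (of length `len`) is a known
 * language that uses `script`, else NULL.  Tags are matched exactly. */
static const char *lang_if_uses_script(const char *lang, size_t len,
                                       const char *script)
{
    size_t i, j;
    for (i = 0; i < N_ELEMENTS(LANG_SCRIPTS); i++) {
        const lang_scripts_t *e = &LANG_SCRIPTS[i];
        if (strlen(e->lang) != len || strncmp(e->lang, lang, len) != 0)
            continue;
        for (j = 0; j < 4 && e->scripts[j] != NULL; j++)
            if (strcmp(e->scripts[j], script) == 0)
                return e->lang;
    }
    return NULL;
}

/* Pick a language tag for a run of text in `script`.
 * `preferred` models the contents of $LANGUAGE / $PANGO_LANGUAGE (may be
 * NULL).  Returns NULL if nothing in the tables matches. */
const char *resolve_run_language(const char *system_lang,
                                 const char *preferred,
                                 const char *script)
{
    const char *hit;
    size_t i;

    assert(script != NULL);

    /* 1. The system language is fine if it actually uses this script. */
    if (system_lang != NULL) {
        hit = lang_if_uses_script(system_lang, strlen(system_lang), script);
        if (hit != NULL)
            return hit;
    }

    /* 2. Otherwise take the first preferred language that uses it. */
    if (preferred != NULL) {
        const char *p = preferred + strspn(preferred, SEPARATORS);
        while (*p != '\0') {
            size_t len = strcspn(p, SEPARATORS);
            hit = lang_if_uses_script(p, len, script);
            if (hit != NULL)
                return hit;
            p += len;
            p += strspn(p, SEPARATORS);
        }
    }

    /* 3. Otherwise fall back to the most likely language for the script. */
    for (i = 0; i < N_ELEMENTS(SCRIPT_DEFAULTS); i++)
        if (strcmp(SCRIPT_DEFAULTS[i].script, script) == 0)
            return SCRIPT_DEFAULTS[i].lang;

    return NULL;
}
```

With these toy tables, a Katakana run under system language "en" resolves to
"ja", and an Arabic-script run resolves to "fa" when the preferred list is
"en,fa" but to "ar" when no list is given.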

The other problem you point out is also handled by resolving
Script=Common/Inherited characters to their neighboring scripts.  So in this
case, even the punctuation will be marked 'kana'.

That said, it is a known shortcoming of OpenType, that a lone punctuation
character cannot hit any script tables other than DFLT...
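
A minimal sketch of that kind of neighbor resolution, assuming per-character
script values are already available.  The enum and function names are
hypothetical, and a real itemizer (Pango's included) does more than this, so
treat it as an illustration of the idea only.  Note that a run consisting
solely of Common characters stays Common, which is the lone-punctuation case
mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, tiny script enumeration for the sketch. */
typedef enum {
    SCRIPT_COMMON,
    SCRIPT_INHERITED,
    SCRIPT_HAN,
    SCRIPT_KATAKANA,
    SCRIPT_LATIN
} script_t;

static int is_neutral(script_t s)
{
    return s == SCRIPT_COMMON || s == SCRIPT_INHERITED;
}

/* Resolve Common/Inherited entries in place:
 *  - a neutral character takes the script of the nearest preceding
 *    character that has a real script;
 *  - leading neutrals take the script of the first real character;
 *  - if every character is neutral, nothing changes. */
void resolve_neutral_scripts(script_t *scripts, size_t n)
{
    size_t i;
    script_t last = SCRIPT_COMMON;
    int have_last = 0;

    assert(scripts != NULL || n == 0);

    /* Forward pass: propagate the last real script onto neutrals. */
    for (i = 0; i < n; i++) {
        if (!is_neutral(scripts[i])) {
            last = scripts[i];
            have_last = 1;
        } else if (have_last) {
            scripts[i] = last;
        }
    }

    /* Only leading neutrals can remain; give them the first real script. */
    for (i = 0; i < n && is_neutral(scripts[i]); i++)
        ;
    if (i < n) {
        script_t first = scripts[i];
        while (i > 0)
            scripts[--i] = first;
    }
}
```

In this simplified scheme, punctuation between Katakana characters ends up
Katakana, while punctuation that follows a Kanji character ends up Han, so
with a font that only has a 'kana' script table the latter would still miss
the substitution.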


> Should fonts with GSUB tables like that be considered broken?

Yes, such a font should define a default language system.  But then, many
fonts in common use are broken one way or another...


> What does Uniscribe do to make this work?

Don't know.

> And lastly, can I force HarfBuzz to just
> use the first 'vert' substitution lookup in case there's none to be
> found with matching or DFLT script/language system?

Not really / easily.  You can use the hb-ot.h API to detect that, and find the
OT LangSys tag that *does* have the substitution, then use
hb_ot_tag_to_language() to get a language tag that when passed back to
HarfBuzz, will choose that substitution.

Cheers,

-- 
behdad
http://behdad.org/


