<div dir="ltr">Hi,<div><br></div><div>As promised, I have synthesized the wisdom accumulated in this thread into some code and documentation:</div><div><br></div><div><a href="https://github.com/arielm/Unicode/blob/master/Projects/ScriptDetector">https://github.com/arielm/Unicode/blob/master/Projects/ScriptDetector</a><br>
</div><div><br></div><div>Feedback is welcome,</div><div>Ariel</div><div><br></div><div>P.S. the next step is to mix script/lang items with BIDI items (the Mapnik project should be very helpful here...)</div></div><div class="gmail_extra">
<br><br><div class="gmail_quote">On Mon, Dec 23, 2013 at 4:46 AM, Behdad Esfahbod <span dir="ltr"><<a href="mailto:behdad@behdad.org" target="_blank">behdad@behdad.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">On 13-12-22 08:51 PM, Ariel Malka wrote:<br>
> Thanks Behdad, the info on how it works in Pango is indeed super useful.<br>
><br>
><br>
> An attempt to recap using my original Japanese example:<br>
><br>
> ユニコードは、すべての文字に固有の番号を付与します<br>
><br>
> ICU's "scrptrun" is detecting Katakana, Hiragana and Han scripts.<br>
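<br>
For illustration, a minimal sketch of script itemization on that sentence. This is not ICU's "scrptrun": it assumes three hand-picked code point ranges (the hypothetical RANGES table) instead of the real Unicode Script property, and it simply attaches Common characters such as "、" to the preceding run.<br>

```python
from typing import List, Tuple

# ASSUMPTION: simplified block ranges standing in for the Unicode Script
# property. A real itemizer would query ICU or equivalent data instead.
RANGES = [
    (0x3040, 0x309F, "Hira"),
    (0x30A0, 0x30FF, "Kana"),
    (0x4E00, 0x9FFF, "Hani"),
]


def script_of(ch: str) -> str:
    """Return an ISO 15924-style tag for one character."""
    cp = ord(ch)
    for lo, hi, script in RANGES:
        if cp in range(lo, hi + 1):
            return script
    return "Zyyy"  # Common: punctuation, spaces, anything not listed above


def script_runs(text: str) -> List[Tuple[str, str]]:
    """Split text into (script, substring) runs."""
    runs: List[Tuple[str, str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "Zyyy" or script == runs[-1][0]):
            # Same script, or a Common character: extend the current run.
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        elif runs and runs[-1][0] == "Zyyy":
            # Text started with Common characters: adopt the first real script.
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs


if __name__ == "__main__":
    for script, chunk in script_runs("ユニコードは、すべての文字に固有の番号を付与します"):
        print(script, chunk)
```

With these simplified ranges the sample sentence comes out as alternating Kana, Hira and Hani runs, which is the mix described above.<br>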
><br>
><br>
> Case 1: no "input list of languages" is provided.<br>
><br>
> a) For Katakana and Hiragana items, "ja" will be selected, with the help<br>
> of <a href="http://goo.gl/mpD9Fg" target="_blank">http://goo.gl/mpD9Fg</a><br>
> In turn, MTLmr3m.ttf (default for "ja" in my system) will be used.<br>
<br>
</div>So far so good.<br>
<div class="im"><br>
<br>
> b) For Han items, no language will be selected because of <a href="http://goo.gl/xusqwn" target="_blank">http://goo.gl/xusqwn</a><br>
> At this stage, we still need to pick a font, so I guess we<br>
> choose DroidSansFallback.ttf (default for Han in my system), unless...<br>
><br>
> Some additional strategy could be used, like: observing the surrounding items?<br>
<br>
</div>Yes. All itemization issues can use surrounding context when in doubt...<br>
It's just about managing complexity...<br>
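<br>
One possible way to use surrounding context, sketched under assumptions: items that ended up with no language borrow the language of the nearest resolved neighbor, but only if that language can be written in the item's script. The names borrow_language and demo_can_write are hypothetical, and the one-entry table is only there to make the example runnable.<br>

```python
from typing import Callable, List, Optional, Tuple

Item = Tuple[str, Optional[str]]  # (script, language or None)


def borrow_language(items: List[Item],
                    can_write: Callable[[str, str], bool]) -> List[Item]:
    """Give unresolved items the language of the nearest resolved neighbor."""
    resolved = list(items)
    for i, (script, lang) in enumerate(items):
        if lang is not None:
            continue
        # Look backwards first, then forwards, in the ORIGINAL items only,
        # so a borrowed language never propagates further on its own.
        before = [l for _, l in reversed(items[:i]) if l is not None]
        after = [l for _, l in items[i + 1:] if l is not None]
        for candidate in before[:1] + after[:1]:
            if can_write(candidate, script):
                resolved[i] = (script, candidate)
                break
    return resolved


def demo_can_write(lang: str, script: str) -> bool:
    # ASSUMPTION: toy table with one entry, Japanese may use these scripts.
    return lang == "ja" and script in {"Hani", "Hira", "Kana"}


if __name__ == "__main__":
    items = [("Kana", "ja"), ("Hira", "ja"), ("Hani", None), ("Hira", "ja")]
    print(borrow_language(items, demo_can_write))
```

Preferring the preceding neighbor is an arbitrary choice here; a real itemizer might weigh both sides or run length.<br>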
<div class="im"><br>
<br>
> Case 2: we use "ja" (say, collected from the locale) as "input language"<br>
><br>
> For all the items, "ja" will be selected because the 3 scripts are valid for<br>
> writing this language, as defined in <a href="http://goo.gl/hwQri5" target="_blank">http://goo.gl/hwQri5</a><br>
><br>
> By the way, I wonder why Korean is not including Han<br>
> (see <a href="http://goo.gl/bI5BLj" target="_blank">http://goo.gl/bI5BLj</a>), in contradiction to the explanations<br>
> in <a href="http://goo.gl/xusqwn" target="_blank">http://goo.gl/xusqwn</a>?<br>
<br>
</div>Great point. The way the script-per-language was put together is using<br>
fontconfig's orth files, which basically only list Hangul characters for<br>
Korean. It definitely can be improved upon and I'm willing to hear from<br>
roozbeh and others whether we have better data somewhere.<br>
<br>
behdad<br>
<div class="im"><br>
<br>
><br>
><br>
> On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <<a href="mailto:behdad@behdad.org">behdad@behdad.org</a><br>
</div><div><div class="h5">> <mailto:<a href="mailto:behdad@behdad.org">behdad@behdad.org</a>>> wrote:<br>
><br>
> On 13-12-22 06:17 PM, Ariel Malka wrote:<br>
> >> As it happens, those three scripts are all considered "simple", so the<br>
> shaping<br>
> >> logic in HarfBuzz is the same for all three.<br>
> ><br>
> > Good to know. For the record, there's a function for checking if a script is<br>
> > complex in the recent Harfbuzz-flavored Android OS: <a href="http://goo.gl/KL1KUi" target="_blank">http://goo.gl/KL1KUi</a><br>
><br>
> Please NEVER use something like that. It's broken by design. It exists in<br>
> Android for legacy reasons, and will eventually be removed.<br>
><br>
><br>
> >> Where it does make a difference<br>
> >> is if the font has ligatures, kerning, etc for those. OpenType organizes<br>
> >> those features by script, and if you request the wrong script you will miss<br>
> >> out on the features.<br>
> ><br>
> > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was a bit surprised to<br>
> > find out that LATN was also a complex script.<br>
><br>
> LATN uses the "generic" shaper, so it's not complex, no.<br>
><br>
><br>
> > So for instance, if I would shape some text containing Hebrew and English<br>
> > solely using the HEBR script, I would probably lose kerning and ffi-like<br>
> > ligatures for the English part<br>
><br>
> Correct.<br>
><br>
><br>
> > (this is what I'm actually doing currently in<br>
> > my "simple" BIDI implementation...)<br>
><br>
> Then fix it. BIDI and script itemization are two separate issues.<br>
><br>
><br>
> >> How you do font selection and what script you pass to HarfBuzz are two<br>
> >> completely separate issues. Font fallback stack should be per-language.<br>
> ><br>
> > I understand that the best scenario will always be to take decisions<br>
> based on<br>
> > "language" rather than solely on "script", but it creates a problem:<br>
> ><br>
> > Say you work on an API for Unicode text rendering: you can't promise your<br>
> > users a solution where they would use arbitrary text without providing<br>
> > language-context per span.<br>
><br>
> These are very good questions. And we have answers to all. Unfortunately<br>
> there's no single location with all this information. I'm working on<br>
> documenting them, but looks like replying to you and letting you document is<br>
> better.<br>
><br>
> What Pango does is: it takes an input list of languages (through $LANGUAGE for<br>
> example), and whenever there's an item of text with script X, it assigns a<br>
> language to the item in this manner:<br>
><br>
> - If a language L is set on the item (through xml:lang, or whatever else the<br>
> user can use to set a language), and script X may be used to write language L,<br>
> then resolve to language L and return,<br>
><br>
> - for each language L in the list of default languages $LANGUAGE, if script<br>
> X may be used to write language L, then resolve to language L and return,<br>
><br>
> - If there's a predominant language L that is likely for script X, resolve<br>
> to language L and return,<br>
><br>
> - Assign no language.<br>
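<br>
The four steps above can be sketched as follows. This is an illustration and not Pango's code: both tables are tiny hypothetical stand-ins for the real data, Han deliberately has no "likely language" entry, and Korean lists Hangul only.<br>

```python
from typing import Dict, Optional, Sequence, Set

# ASSUMPTION: abbreviated stand-in for a "scripts a language may use" table.
SCRIPTS_FOR_LANG: Dict[str, Set[str]] = {
    "ja": {"Hani", "Hira", "Kana"},
    "ko": {"Hang"},
    "zh-cn": {"Hani"},
    "en": {"Latn"},
    "he": {"Hebr"},
}

# ASSUMPTION: abbreviated stand-in for a "most likely language per script"
# table. Han is left out on purpose, so it resolves to no language.
LIKELY_LANG_FOR_SCRIPT: Dict[str, str] = {
    "Hira": "ja",
    "Kana": "ja",
    "Hang": "ko",
    "Latn": "en",
    "Hebr": "he",
}


def can_write(lang: str, script: str) -> bool:
    """True if the script may be used to write the language."""
    return script in SCRIPTS_FOR_LANG.get(lang, set())


def resolve_language(script: str,
                     item_lang: Optional[str] = None,
                     default_langs: Sequence[str] = ()) -> Optional[str]:
    # Step 1: a language set explicitly on the item wins if it fits the script.
    if item_lang is not None and can_write(item_lang, script):
        return item_lang
    # Step 2: first default language (e.g. from $LANGUAGE) that fits.
    for lang in default_langs:
        if can_write(lang, script):
            return lang
    # Step 3: predominant language for the script, if any.
    # Step 4: otherwise no language (dict.get returns None).
    return LIKELY_LANG_FOR_SCRIPT.get(script)


if __name__ == "__main__":
    print(resolve_language("Kana"))
    print(resolve_language("Hani"))
    print(resolve_language("Hani", default_langs=["ja"]))
```

Keeping the two tables as plain data makes it easy to swap in better sources later without touching the resolution order.<br>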
><br>
> This algorithm needs two tables of data:<br>
><br>
> - List of scripts a language tag may possibly use. This is for example<br>
> available in pango-script-lang-table.h. It's generated from fontconfig orth<br>
> files using pango/tools/gen-script-for-lang.c. Feel free to copy it.<br>
><br>
> - List of most likely language for each script. This is available in CLDR:<br>
><br>
><br>
> <a href="http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html" target="_blank">http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html</a><br>
><br>
> Pango has its own manually compiled list in pango-language.c<br>
><br>
> Again, all these are on my plate for the next library I'm going to design. It<br>
> will take a while though...<br>
><br>
><br>
> behdad<br>
><br>
> > Or, to come back to the origin of the message: solutions like ICU's "scrptrun"<br>
> > which are doing script detection are not appropriate (because they won't help<br>
> > you find the right font due to the lack of language context...)<br>
> ><br>
> > I guess the problem is even more generic, like with utf8-encoded html pages<br>
> > rendered in modern browsers, as demonstrated by the creator of liblinebreak:<br>
> > <a href="http://wyw.dcweb.cn/lang_utf8.htm" target="_blank">http://wyw.dcweb.cn/lang_utf8.htm</a><br>
> ><br>
> > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod <<a href="mailto:behdad@behdad.org">behdad@behdad.org</a><br>
> <mailto:<a href="mailto:behdad@behdad.org">behdad@behdad.org</a>><br>
</div></div><div><div class="h5">> > <mailto:<a href="mailto:behdad@behdad.org">behdad@behdad.org</a> <mailto:<a href="mailto:behdad@behdad.org">behdad@behdad.org</a>>>> wrote:<br>
> ><br>
> > On 13-12-22 10:10 AM, Ariel Malka wrote:<br>
> > > I'm trying to render "regular" (i.e. modern, horizontal) Japanese with<br>
> > Harfbuzz.<br>
> > ><br>
> > > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar<br>
> to what is<br>
> > > rendered via browsers.<br>
> > ><br>
> > > But after examining other rendering solutions I can see that<br>
> "automatic<br>
> > script<br>
> > > detection" can often take place.<br>
> > ><br>
> > > For instance, the Mapnik project is using ICU's "scrptrun", which,<br>
> given the<br>
> > > following sentence:<br>
> > ><br>
> > > ユニコードは、すべての文字に固有の番号を付与します<br>
> > ><br>
> > > would detect a mix of Katakana, Hiragana and Han scripts.<br>
> > ><br>
> > > But for instance, it would not change anything if I'd render the<br>
> sentence by<br>
> > > mixing the 3 different scripts (i.e. instead of using only<br>
> > HB_SCRIPT_KATAKANA.)<br>
> > ><br>
> > > Or are there situations where it would make a difference?<br>
> ><br>
> > As it happens, those three scripts are all considered "simple", so<br>
> the shaping<br>
> > logic in HarfBuzz is the same for all three. Where it does make a<br>
> difference<br>
> > is if the font has ligatures, kerning, etc for those. OpenType<br>
> organizes<br>
> > those features by script, and if you request the wrong script you<br>
> will miss<br>
> > out on the features.<br>
> ><br>
> ><br>
> > > I'm asking that because I suspect a catch-22 situation here. For<br>
> > example, the<br>
> > > word "diameter" in Japanese is 直径 which, given to "scrptrun"<br>
> would be<br>
> > > detected as Han script.<br>
> > ><br>
> > > As far as I understand, it could be a problem on systems where<br>
> > > DroidSansFallback.ttf is used, because the word would look like in<br>
> > Simplified<br>
> > > Chinese.<br>
> > ><br>
> > > Now, if we were using MTLmr3m.ttf, which is preferred for<br>
> Japanese, the word<br>
> > > would have been rendered as intended.<br>
> ><br>
> > How you do font selection and what script you pass to HarfBuzz are two<br>
> > completely separate issues. Font fallback stack should be per-language.<br>
> ><br>
> > > Reference: <a href="https://code.google.com/p/chromium/issues/detail?id=183830" target="_blank">https://code.google.com/p/chromium/issues/detail?id=183830</a><br>
> > ><br>
> > > Any feedback would be appreciated. Note that the wisdom<br>
> accumulated here<br>
> > will<br>
> > > be translated into tangible info and code samples (see<br>
> > > <a href="https://github.com/arielm/Unicode" target="_blank">https://github.com/arielm/Unicode</a>)<br>
> > ><br>
> > > Thanks!<br>
> > > Ariel<br>
> > ><br>
> > ><br>
</div></div>
<div class="HOEnZb"><div class="h5">
> > ><br>
> ><br>
> > --<br>
> > behdad<br>
> > <a href="http://behdad.org/" target="_blank">http://behdad.org/</a><br>
> ><br>
> ><br>
><br>
> --<br>
> behdad<br>
> <a href="http://behdad.org/" target="_blank">http://behdad.org/</a><br>
><br>
><br>
<br>
</div></div><span class="HOEnZb"><font color="#888888">--<br>
behdad<br>
<a href="http://behdad.org/" target="_blank">http://behdad.org/</a><br>
</font></span></blockquote></div><br></div>