[Fontconfig] Improving Latin font selection for CJK locales

Tue Jan 29 17:56:31 PST 2008

Hi, Qianqian,

Latin digits are basically treated as "neutral" characters in a run of
text. I think that is pretty much "standard Unicode operating procedure"
if you look at how the digits are categorized in the UCD (their Script
property is "Common", not "Latin").
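
As a quick way to look at that categorization, here is a small Python
sketch using the standard unicodedata module. Note that this module
exposes the General_Category but not the Script property, so it only
shows that digits are classed as decimal numbers rather than letters.
The sample characters are my own choice.

```python
import unicodedata

# Print the General_Category and name of a few sample characters.
# The samples are illustrative only; Script is not available here.
for ch in "1A我":
    print(repr(ch), unicodedata.category(ch), unicodedata.name(ch))
```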

I don't know the internal details of how Pango itemizes a string of
text, but using your "pngsBGtUJxMgD.png" as an example, we can see what
is most likely occurring. First, it appears that Pango treats "1234A" as
a run of "latn" text because of the presence of the letter "A": all
characters preceding the "A" are "neutrals" which presumably don't
influence the itemizer, but the letter "A" tells the itemizer that the
current run of text is Latin script. Then the "我" starts a new run of
text which gets classified as Han ("hani" if using the ISO 15924 code)
script, and the following neutrals "123" remain a part of that 2nd text
segment. The final "ABC", however, causes the itemizer to break out a
3rd segment, and it is "latn".
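
To make that guess concrete, here is a minimal Python sketch of an
itemizer that behaves the way I described. It is not Pango's code. The
rough_script() helper is a hypothetical stand-in that guesses the script
from the Unicode character name, and anything it does not recognize is
treated as a neutral.

```python
import unicodedata

def rough_script(ch):
    """Hypothetical helper: guess a script tag from the character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "latn"
    if name.startswith("CJK"):
        return "hani"
    return None  # digits, spaces, punctuation: treated as neutral

def itemize(text):
    """Split text into (script, substring) runs; neutrals join a neighbor."""
    runs = []
    for ch in text:
        script = rough_script(ch)
        if runs and (script is None
                     or runs[-1][0] is None
                     or runs[-1][0] == script):
            if runs[-1][0] is None:
                # Leading neutrals take the script of the first strong char.
                runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]

print(itemize("1234A我123ABC"))
```

With the string from the screenshot this yields three runs: "1234A" as
"latn", "我123" as "hani", and "ABC" as "latn".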

Pango presumably then talks to fontconfig to get the font assignments
for each of the three segments.
Behdad can confirm if this is in fact how the itemizer works or not.

So fixing this kind of "bug" or "feature" may require changing how the
itemizer works.
For example, what if digits were not categorized as "neutrals" but
were instead assigned their own category of "Latin Digits"?

Then a text itemizer could break out "latin digits" into separate segments.
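
A rough Python sketch of that idea follows. The "latin-digits" label and
the classify() helper are names I made up for illustration, and only the
ASCII digits 0-9 are counted as Latin digits here.

```python
import unicodedata

def classify(ch):
    """Hypothetical classifier with a separate category for ASCII digits."""
    if "0" <= ch <= "9":
        return "latin-digits"
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "latn"
    if name.startswith("CJK"):
        return "hani"
    return None  # other neutrals (spaces, punctuation)

def itemize_with_digits(text):
    """Split text into runs, breaking digits out into their own segments."""
    runs = []
    for ch in text:
        cat = classify(ch)
        # Remaining neutrals simply stay with the run that precedes them;
        # leading neutrals end up in a run labelled None.
        if runs and (cat is None or cat == runs[-1][0]):
            runs[-1][1] += ch
        else:
            runs.append([cat, ch])
    return [(cat, chunk) for cat, chunk in runs]

print(itemize_with_digits("1234A我123ABC"))
```

Here the same test string splits into five segments, with "1234" and
"123" each standing on their own.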

For a document with Latin script, maybe these "latin digit" segments
eventually get merged back into the "latn" segments because it is not
necessary to treat them any differently from how the "latn" segments
are treated.
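
The merge step could be as simple as the following Python sketch. The
run format, a list of (script, text) pairs, and the "latin-digits" label
are assumptions made for the example.

```python
def merge_digits_into_latin(runs):
    """Relabel digit segments as "latn" and join adjacent same-script runs."""
    merged = []
    for script, text in runs:
        if script == "latin-digits":
            script = "latn"
        if merged and merged[-1][0] == script:
            merged[-1] = (script, merged[-1][1] + text)
        else:
            merged.append((script, text))
    return merged

runs = [("latn", "Room "), ("latin-digits", "101"), ("latn", " is open")]
print(merge_digits_into_latin(runs))
```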

But if the main script is not Latin, then there may be some advantage
to treating "latin digits" segments separately.

For example, it would allow your Chinese text to have latin digits
rendered in DejaVu Sans because the "latin digits" segments could
simply be treated as another special kind of "latn" segment.
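
As a sketch of what that buys you, the Python below picks a font family
per segment from a plain lookup table. This is not the fontconfig API.
The table, the "latin-digits" label and the "Example CJK Font" name are
placeholders for illustration; only DejaVu Sans comes from the
discussion above.

```python
# Hypothetical per-segment font table (not a fontconfig structure).
FONT_FOR_SEGMENT = {
    "latn": "DejaVu Sans",
    "latin-digits": "DejaVu Sans",  # digits treated as a kind of "latn"
    "hani": "Example CJK Font",     # placeholder name
}

def assign_fonts(runs, default="Example CJK Font"):
    """Pair each run's text with the font chosen for its segment type."""
    return [(text, FONT_FOR_SEGMENT.get(script, default))
            for script, text in runs]

runs = [("hani", "我"), ("latin-digits", "123"), ("latn", "ABC")]
for text, font in assign_fonts(runs):
    print(text, "->", font)
```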

There might also be some benefit to doing this in Arabic texts since
the "latin digits" and even the "Arabic digits" need to be rendered as
runs of LTR text embedded in surrounding RTL text.
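
For the Arabic case, the UCD already gives digits their own bidi
classes: EN (European Number) for the ASCII digits and AN (Arabic
Number) for the Arabic-Indic digits. Here is a small Python sketch that
pulls out such digit runs. The digit_runs() helper and the sample string
are my own, and I am assuming "Arabic digits" means U+0660 to U+0669.

```python
import unicodedata
from itertools import groupby

def digit_runs(text):
    """Return the maximal runs of characters whose bidi class is EN or AN."""
    def is_digit(ch):
        return unicodedata.bidirectional(ch) in ("EN", "AN")
    return ["".join(group)
            for key, group in groupby(text, key=is_digit) if key]

# Two Arabic letters, ASCII digits, then Arabic-Indic digits one and two.
sample = "\u0627\u0628 123 \u0661\u0662"
print(digit_runs(sample))
```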

Of course there may be other issues and cases which I have not thought
of yet, but this is not the first time that I have thought about
treating segments of "latin digits" as some non-neutral category for
the purposes of enhanced itemization.

(I am actually currently working on writing some C++ UnicodeText
classes of my own -- and just recently was playing around with these
issues of text itemization, so I am very interested to learn what
people *really* want to have).  Is it possible that what people really
want may *differ* in some details from the status-quo standard Unicode
practices?

Best Wishes - Ed

>
> the second point currently is not possible, because Pango labels the Common
> scripts (digits) near Chinese text as Chinese, and in fontconfig, we never
> know if it is a common-script or Chinese Hanzi. This caused problems
> like this:
>
> https://www.redhat.com/archives/fedora-fonts-list/2007-December/pngsBGtUJxMgD.png
>
> Seems to me that the proposed methods will still assign lang=zh for Common
> scripts between Chinese Hanzi if locale=zh. So it is still unlikely
> that we can force the use of smooth Latin fonts for Common via fontconfig.
> Is my understanding correct?