[HarfBuzz] Fwd: harfbuzz work

Wed Aug 5 14:27:58 PDT 2009

On 08/03/2009 11:16 PM, Martin Hosken wrote:
> Dear Jonathan,
>
>>> * A script-run itemizer based on ICU's, but adapted to support text
>>> in any of UTF-8, 16, or 32 (not actually tested with them all yet,
>>> though).

Hi Martin,

> I have a few struggles with script itemization. My primary struggle is the length of time it takes for a block allocation to get from an ISO meeting (or even earlier) into a release of harfbuzz or whatever application is using it. I'm not sure what can be done about that, but perhaps a solution to my other struggle might help.

As of now, I'm not quite sure whether we want to have script itemization in
harfbuzz to begin with.  *If* we do, however, it will use a callback to get
the script for a character, so that higher levels can control what script is
returned for unencoded characters.
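
A minimal sketch of what such a callback could look like, in C. Nothing here
is an existing HarfBuzz API: script_t, script_func_t, default_script and
client_script are hypothetical names, and the code point ranges are toy data
rather than real Unicode script properties. The client callback claims an
assumed slice of the PUA as Arabic and defers to the default for everything
else.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical script codes for illustration; not HarfBuzz's enum. */
typedef enum {
    SCRIPT_COMMON,
    SCRIPT_INHERITED,
    SCRIPT_UNKNOWN,
    SCRIPT_LATIN,
    SCRIPT_ARABIC
} script_t;

/* Hypothetical callback type: the itemizer would call this once per
 * character instead of consulting a built-in table directly. */
typedef script_t (*script_func_t)(uint32_t codepoint, void *user_data);

/* Toy default lookup. Real script data is per character, not per range. */
script_t default_script(uint32_t cp, void *user_data)
{
    (void) user_data;
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'))
        return SCRIPT_LATIN;
    if (cp >= 0x0600 && cp <= 0x06FF)
        return SCRIPT_ARABIC;    /* simplification: whole block as Arabic */
    if (cp >= 0xE000 && cp <= 0xF8FF)
        return SCRIPT_UNKNOWN;   /* BMP Private Use Area */
    return SCRIPT_COMMON;
}

/* Client override: this application is assumed to keep Arabic letters
 * in U+E000..U+E0FF; all other characters use the default lookup. */
script_t client_script(uint32_t cp, void *user_data)
{
    if (cp >= 0xE000 && cp <= 0xE0FF)
        return SCRIPT_ARABIC;
    return default_script(cp, user_data);
}
```

An itemizer written against script_func_t never needs to know about the
override; the caller just passes client_script instead of default_script.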

> PUA characters are currently defined, very sensibly, as unknown script. But they can turn up in all sorts of places, for example as Arabic characters. I am assuming we don't want to return unknown script if we can help it, and therefore wonder whether, if the unknown script code were changed to be < SCRIPT_INHERITED, it might resolve both many PUA issues and also issues of new character allocations within a block or even new block allocations.
>
> An alternative is to have special handling for unknown characters: unknown characters inherit the script of the block they are in.

The problem is, Unicode currently doesn't assign a script to blocks.  That
would indeed be a useful addition.  In the meantime though, the itemizer (at
least the one in Pango) treats UNKNOWN like COMMON, that is, such characters
inherit the script of the neighboring characters.
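
The inheritance rule can be sketched roughly as follows, in C. This is a
simplified illustration and not Pango's actual code (which, among other
things, also pairs up brackets); script_t, script_run_t, is_weak and itemize
are hypothetical names. COMMON, INHERITED and UNKNOWN characters never start
a new run; they join whichever run they fall into, and the run takes the
script of its first "strong" character.

```c
#include <stddef.h>

/* Hypothetical script codes for illustration. */
typedef enum {
    SCRIPT_COMMON,
    SCRIPT_INHERITED,
    SCRIPT_UNKNOWN,
    SCRIPT_LATIN,
    SCRIPT_ARABIC
} script_t;

typedef struct {
    size_t   start;   /* index of the first character in the run */
    size_t   length;  /* number of characters in the run */
    script_t script;  /* resolved script of the run */
} script_run_t;

/* UNKNOWN is treated exactly like COMMON and INHERITED. */
static int is_weak(script_t s)
{
    return s == SCRIPT_COMMON || s == SCRIPT_INHERITED || s == SCRIPT_UNKNOWN;
}

/* Split per-character scripts into runs; returns the number of runs
 * written, at most max_runs. */
size_t itemize(const script_t *scripts, size_t n,
               script_run_t *runs, size_t max_runs)
{
    size_t n_runs = 0;
    size_t i = 0;

    while (i < n && n_runs < max_runs) {
        script_run_t run = { i, 0, scripts[i] };

        for (; i < n; i++) {
            script_t s = scripts[i];
            if (is_weak(s)) {
                /* weak character: stays in the current run */
            } else if (is_weak(run.script)) {
                run.script = s;   /* first strong character sets the script */
            } else if (s != run.script) {
                break;            /* different strong script: new run */
            }
            run.length++;
        }
        runs[n_runs++] = run;
    }
    return n_runs;
}
```

With this rule, a PUA (UNKNOWN) character between two Latin letters stays in
the Latin run, and one at the start of the text is absorbed by the run that
follows it.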

> Knowledge of the font (if known) can help in itemization too. Odds are, if an unknown character is in the same font then it is in the same script. But perhaps you have already done font based run breaking before the itemization occurs here.

Script itemization is always done before font selection.

> In the case of bidi for PUA, I would suggest that PUA (and therefore unknown) characters be given a neutral bidi property rather than the default L that the UTC throws in there.

That kind of consideration really belongs to UTC, not any conforming 
implementation.

behdad

> Yours,
> Martin
>