[HarfBuzz] HarfBuzz 1.0 API; the message you were hoping would never come

Sun Dec 22 21:41:24 PST 2013

I started harfbuzz-ng during Xmas holidays back in the end of 2006.  Seven
years later I hope we all agree that we are in good shape to call it 1.0 soon.
 As with every 1.0 release there are issues we need to figure out first, and
while I have been putting this off for way too long, I think it's time to talk
about it...

I was hoping that no long-term-support distro would ship with harfbuzz-0.x, but
it looks like the RHEL 7 train has already left the station so to speak.
Notwithstanding that, here are my thoughts for API changes for 1.0.

What I'd like to suggest is that we form a core team to follow through with this
and make sure this happens in a matter of weeks.  My initial list would be for
myself (Pango, Android, Chrome), Jonathan (Firefox), Khaled (LibreOffice /
XeTeX), Konstantin Ritt (Qt), Roozbeh (Unicode), Ubuntu/Debian (Ahmed?), and
Fedora/Red Hat maintainers (Parag, Matthias?) to take on this task.  Please
respond if you are willing to help with this / other suggestions.

It would have been great if we could maintain ABI / API compatibility with
previous releases, but I think we should balance that with API satisfaction
moving forward.  True that it has been seven years already, but I think most
of the work, from a maintenance point of view, lies ahead of us.  So, keep that
in mind when discussing.  I'm hoping that people wouldn't have to #ifdef their
way around keeping compat with both versions for long, but I imagine some
amount of that is inevitable, so I'd like to keep that possible, if not
desirable.

With no further comments, let's get to the list of changes I'd like to discuss:

API ITEM: Destroy Func:

The hb_destroy_func_t type in HarfBuzz currently only takes one argument: the
user-provided data.  This works for most uses of the function, except for the
use in hb_blob_create().  If you use, eg, mmap() to load data to pass to
hb_blob_create(), upon destruction all you need is the blob data address and
length to munmap().  Both of those you can get from the blob itself.  But
currently one is forced to allocate a piece of memory to record the blob data
address / length so they can unmap it.

One way to address this would be to add a new destroy type, hb_blob_destroy_t,
that also takes the blob as input.  Depending on which way we order the
arguments to that function it may or may not be ABI-compatible with previous
version.  At any rate, updating source code to new API will be trivial.
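
To make the idea concrete, here is a minimal sketch of a destroy callback that
also receives the blob.  Everything prefixed toy_ is a hypothetical stand-in
written for illustration, not HarfBuzz API, and the argument order is just one
of the possible choices.  The callback only records the address and length it
would hand to munmap(), so it runs without a real mapping:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for hb_blob_t (the real type is opaque). */
typedef struct toy_blob_t toy_blob_t;

/* Sketch of the proposed callback shape: the blob comes along with user_data. */
typedef void (*toy_blob_destroy_t) (toy_blob_t *blob, void *user_data);

struct toy_blob_t {
  const char         *data;
  size_t              length;
  void               *user_data;
  toy_blob_destroy_t  destroy;
};

static toy_blob_t *
toy_blob_create (const char *data, size_t length,
                 void *user_data, toy_blob_destroy_t destroy)
{
  toy_blob_t *blob;
  assert (data != NULL || length == 0);
  blob = (toy_blob_t *) malloc (sizeof (*blob));
  if (!blob)
    return NULL;
  blob->data = data;
  blob->length = length;
  blob->user_data = user_data;
  blob->destroy = destroy;
  return blob;
}

static void
toy_blob_destroy (toy_blob_t *blob)
{
  if (!blob)
    return;
  if (blob->destroy)
    blob->destroy (blob, blob->user_data); /* blob is still valid here */
  free (blob);
}

/* What a real callback would pass to munmap(); recorded here for inspection. */
static const char *toy_released_addr;
static size_t      toy_released_len;

static void
toy_release_mapping (toy_blob_t *blob, void *user_data)
{
  (void) user_data; /* no side allocation needed to remember addr / length */
  toy_released_addr = blob->data;
  toy_released_len = blob->length;
}
```

With this shape a caller can pass NULL as user_data and still have everything
needed to release the mapping.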

API ITEM: User Data:

Currently the set_user_data API for various objects looks like this:

hb_bool_t
hb_face_set_user_data (hb_face_t          *face,
                       hb_user_data_key_t *key,
                       void *              data,
                       hb_destroy_func_t   destroy,
                       hb_bool_t           replace);

The "replace" argument was added to allow limited use for thread-safe object
life-cycle management.  Reality-check suggests that it's not enough.  For
example, when making Pango thread-safe I had to request
g_object_replace_data() to be added.  g_object_replace_data() is essentially a
compare-and-exchange version of g_object_set_data():

gboolean    g_object_replace_data             (GObject        *object,
                                               const gchar    *key,
                                               gpointer        oldval,
                                               gpointer        newval,
                                               GDestroyNotify  destroy,
                                               GDestroyNotify *old_destroy);

I'm afraid we may need something similar in HarfBuzz.  Ie. we want to allow
all of these:

  - Set user data unconditionally, destroying previous value if any,

  - Set user data if not previously set,

  - Set user data if previous value is equal to a certain value.  Either
hand me the previous destroy function if setting was successful, or destroy old
value.  This is useful, eg, for using user-data to cache a linked-list of items.

Unfortunately supporting these requires adding three / four new arguments to
all of those functions.  I'm thinking about adding these:

  * replace_mode.  An enum that has values REPLACE_ALWAYS, REPLACE_IF_NOT_SET,
and REPLACE_IF_MATCHES,

  * prev_value.  If REPLACE_IF_MATCHES used, only replace if previous value is
equal to prev_value,

  * prev_destroy.  Out value for previous destroy function.  If passed in as
NULL, previous destroy function will be called.  Otherwise, caller is
responsible for destroying previous value,

  * reference_func.  Kinda independent, but possibly add this, which will be
called every time get_user_data() is called, to make sure value we are
returning is referenced properly?  This will be called while the user-data
mutex is held so we know the object cannot disappear before we reference it.
If NULL, current behavior will be retained.

Again, it's trivial to update client code to new API.  ABI will be broken.
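
The semantics of replace_mode, prev_value and prev_destroy described above can
be sketched on a single user-data slot.  This is an illustration under
assumptions, not the proposed implementation: all toy_ names are hypothetical,
"not set" is approximated as a NULL data pointer, reference_func is left out,
and the mutex a real object would hold around the update is omitted:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*toy_destroy_func_t) (void *data);

typedef enum {
  TOY_REPLACE_ALWAYS,
  TOY_REPLACE_IF_NOT_SET,
  TOY_REPLACE_IF_MATCHES
} toy_replace_mode_t;

/* One user-data slot; a real object would keep one per key. */
typedef struct {
  void               *data;
  toy_destroy_func_t  destroy;
} toy_slot_t;

/* Returns 1 if the slot was updated, 0 if the mode's condition failed. */
static int
toy_slot_set (toy_slot_t *slot, void *data, toy_destroy_func_t destroy,
              toy_replace_mode_t mode, void *prev_value,
              toy_destroy_func_t *prev_destroy)
{
  assert (slot != NULL);

  if (mode == TOY_REPLACE_IF_NOT_SET && slot->data != NULL)
    return 0;
  if (mode == TOY_REPLACE_IF_MATCHES && slot->data != prev_value)
    return 0;

  if (prev_destroy)
    *prev_destroy = slot->destroy;   /* caller takes over the old value */
  else if (slot->destroy)
    slot->destroy (slot->data);      /* nobody asked for it: destroy it */

  slot->data = data;
  slot->destroy = destroy;
  return 1;
}

/* Example destroy callback that only counts how often it was called. */
static int toy_destroy_count;

static void
toy_count_destroy (void *data)
{
  (void) data;
  toy_destroy_count++;
}
```

Note that with REPLACE_IF_MATCHES the caller already knows the old value (it
passed it in as prev_value), so handing back only the destroy function is
enough to dispose of it later.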

API ITEM: unsigned int vs uintptr_t / size_t:

Use uintptr_t / size_t instead of unsigned int throughout the API?  Which one?
 This has ABI implications on 64-bit architectures.  My current thinking is
that we should do this.  I'm unsure to what extent to do this though.  Should
the, eg, "number of lookups" type change?  I'm leaning towards no.  Updating
client code is trivial.

API ITEM: hb-ot-shape.h stuff:

Change hb_ot_shape_glyphs_closure() to use hb_shape_plan_t / hb_set_t.  Client
code needs to add a few more lines of code, but this API is barely used.

API ITEM: Deprecated Stuff:

Remove the stuff in hb-deprecated.h.  Trivial stuff.

API ITEM: Glyph Variants:

get_glyph() currently takes unicode codepoint as well as variation selector.
The current semantics is that if variation selector is not 0, you are supposed
to load the correct glyph for the variation selector, and return FALSE if that
fails, at which point we call get_glyph() again with variation selector set to
0.  It has been suggested that we move to two separate callbacks: get_glyph()
and get_glyph_variant().  It would be slightly faster to do so, but that would
spread the get_glyph logic into two callbacks instead of one, which would be
more error-prone in implementations.  So I'm not sure if it's worth it.
Client code update is small.
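
For reference, the current single-callback semantics amount to the following
fallback, sketched with hypothetical toy_ names and a toy font that knows one
character and one variation sequence.  It is not the library's code, only the
calling pattern described above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t toy_codepoint_t;

/* Same shape as the single callback: returns 1 and sets *glyph on success. */
typedef int (*toy_get_glyph_func_t) (toy_codepoint_t  unicode,
                                     toy_codepoint_t  variation_selector,
                                     toy_codepoint_t *glyph);

/* Try the variation sequence first, then fall back to the plain character. */
static int
toy_lookup_glyph (toy_get_glyph_func_t  get_glyph,
                  toy_codepoint_t       unicode,
                  toy_codepoint_t       variation_selector,
                  toy_codepoint_t      *glyph)
{
  assert (get_glyph != NULL && glyph != NULL);
  if (variation_selector && get_glyph (unicode, variation_selector, glyph))
    return 1;
  return get_glyph (unicode, 0, glyph);
}

/* Toy font: 'A' maps to glyph 10, and 'A' + VS1 (U+FE00) maps to glyph 11. */
static int
toy_font_get_glyph (toy_codepoint_t  unicode,
                    toy_codepoint_t  variation_selector,
                    toy_codepoint_t *glyph)
{
  if (unicode != 0x41)
    return 0;
  if (variation_selector == 0)      { *glyph = 10; return 1; }
  if (variation_selector == 0xFE00) { *glyph = 11; return 1; }
  return 0;
}
```

Splitting this into get_glyph() and get_glyph_variant() would save the second
call in the fallback path, at the cost of the font having to implement both
branches of toy_font_get_glyph() as separate callbacks.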

API ITEM: ScriptExtensions considerations:

Currently hb_unicode_script() returns the Unicode Script property and nothing
else.  Currently we only use this for guessing segment properties, which is
considered toy API.  But I've seen quite a few users of the library use
hb_unicode_t for their itemization needs, and I support that.

In recent years Unicode added the ScriptExtensions property.  This property
presents a set of scripts that may be used with a character.  For most
characters it only includes the script in the character's Script property, but
there are exceptions, and those exceptions may have Script=Common,
Script=Inherited, or any other script.

Since OpenType shaping relies on the correct script to be chosen, this is
important for us.  A trivial example is a sequence of ARABIC TATWEEL,ARABIC
FATHATAN.  We want to choose (and let clients choose) Arabic script for this
sequence, even though the script property of those are Common,Inherited
respectively.

How exactly itemization / script resolution should work is a whole other
question.  But I'd like us to provide the right data to make it possible at
least.  My current thinking is to:

  - Change hb_unicode_script to return an out param of
script_extensions_situation that can be set to DOESNT_HAVE, HAS, MAY_HAVE.
Client code can be updated trivially to return MAY_HAVE,

  - Add a new callback for script extensions to be called if HAS or MAY_HAVE
is returned above.  I haven't thought through this one yet, but since it's new
API, we can flesh it out separately / later.
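
A rough sketch of what the "trivial update" for a client without
ScriptExtensions data could look like.  The enum and function names are
hypothetical (nothing here is existing API), and the three-entry table only
covers the characters from the example above plus one Arabic letter:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
  TOY_SCRIPT_COMMON,
  TOY_SCRIPT_INHERITED,
  TOY_SCRIPT_ARABIC,
  TOY_SCRIPT_UNKNOWN
} toy_script_t;

typedef enum {
  TOY_SCRIPT_EXTENSIONS_DOESNT_HAVE,
  TOY_SCRIPT_EXTENSIONS_HAS,
  TOY_SCRIPT_EXTENSIONS_MAY_HAVE
} toy_script_extensions_situation_t;

/* The client's existing Script property lookup, unchanged. */
static toy_script_t
toy_legacy_script (uint32_t unicode)
{
  switch (unicode) {
    case 0x0640: return TOY_SCRIPT_COMMON;    /* ARABIC TATWEEL */
    case 0x064B: return TOY_SCRIPT_INHERITED; /* ARABIC FATHATAN */
    case 0x0628: return TOY_SCRIPT_ARABIC;    /* ARABIC LETTER BEH */
    default:     return TOY_SCRIPT_UNKNOWN;
  }
}

/* Updated callback: same return value, plus the new out param.  A client
 * with no ScriptExtensions data cannot rule anything out, so it always
 * answers MAY_HAVE. */
static toy_script_t
toy_unicode_script (uint32_t                           unicode,
                    toy_script_extensions_situation_t *situation)
{
  assert (situation != NULL);
  *situation = TOY_SCRIPT_EXTENSIONS_MAY_HAVE;
  return toy_legacy_script (unicode);
}
```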

API ITEM: Vertical Orientation:

The hb_unicode_eastasian_width() callback was added for future functionality
of an itemizer deciding whether to use up-right or rotated vertical text.
That is what Pango does currently IIRC.  In the meantime UTR#50 has been
developed:

  http://www.unicode.org/reports/tr50/

I haven't studied that in detail yet (or, I have, but I don't remember the
details right now).  So, do we need to remove the old callback and add
something around the new Vertical_Orientation property?  I don't know.  Client
code can simply remove support for old property and may or may not add support
for new property.

API ITEM: Compatibility type in decomposition:

This one is a new one I thought about.  James recently brought up the fact
that the new automatic-fractions feature doesn't work nicely with
compatibility decompositions of VULGAR FRACTIONS.  At the root of the issue is
that those characters have a compatibility decomposition type of <fraction>.
We currently ignore the compatibility type in decompositions.  Should we add a
compat_type enum to decompose_compatibility callback?  I understand that many
clients don't like that callback to begin with, and most providers of it don't
have the type data currently.  That said, this also comes in handy in Arabic
compatibility decomposition characters, so we can wrap the <initial> /
<medial> / <isolated> / <final> decompositions with correct ZWJ/ZWNJ pairs.
Definitely not high priority, but something to consider.  We can definitely
add DECOMPOSITION_TYPE_UNKNOWN...  Client code update is trivial if not adding
support for new functionality.

API ITEM: get_glyph in face:

Currently, in a few places (space, dottedcircle, viramas) we can benefit from
the get_glyph() callback to be called on a face instead of a font.  So I'm
considering moving it to the face object.  The big problem is: we have
hb_font_funcs_t, but no hb_face_funcs_t.  I'm not sure this is worth the
trouble for clients.  We seem to be doing ok with the current setup.  But I
wanted to put it out there for discussion.  Client code may need relatively
major shuffling.

API ITEM: hb-ft and load-flags

We've known this: I need to fix hb-ft.h to take in load-flags.  Nothing
controversial here I suppose.  Client code may need trivial update.

API ITEM: Accept a list of languages:

I'm not sure about this myself, but it looks like it may be helpful to clients to
be able to pass in a list of languages, instead of one language, and that we
walk through them while matching LangSys.  This has a lot to do with itemization,
but it looks like just taking in a list of languages will work regardless since
most fonts only have LangSys for relevant languages.  Client code should be
unaffected if not needing the new functionality.
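
The matching itself could be as simple as the sketch below.  It assumes the
languages have already been converted to OpenType language-system tags and are
given in preference order; the toy_ names are hypothetical and the real API
would presumably take language objects rather than raw tags:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t toy_tag_t;

#define TOY_TAG(a,b,c,d) ((toy_tag_t) (((uint32_t) (a) << 24) | \
                                       ((uint32_t) (b) << 16) | \
                                       ((uint32_t) (c) <<  8) | \
                                        (uint32_t) (d)))
#define TOY_NO_LANGSYS ((size_t) -1)

/* Returns the index into `langsys` of the first preferred language the font
 * has a LangSys for, or TOY_NO_LANGSYS if none of them matches (in which
 * case the caller would use the default LangSys). */
static size_t
toy_select_langsys (const toy_tag_t *langsys,   size_t langsys_count,
                    const toy_tag_t *languages, size_t language_count)
{
  size_t i, j;
  assert (langsys != NULL || langsys_count == 0);
  assert (languages != NULL || language_count == 0);
  for (i = 0; i < language_count; i++)   /* preference order wins */
    for (j = 0; j < langsys_count; j++)
      if (langsys[j] == languages[i])
        return j;
  return TOY_NO_LANGSYS;
}
```

For a font with LangSys records for 'TRK ' and 'URD ' and a request of
{'FAR ', 'URD '}, the walk skips 'FAR ' and settles on the 'URD ' record.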

API ITEM: Add init_func to font_funcs:

This can simplify hb-ft a bit, and will be generally useful.  Can come up with
a patch for further discussion.  Client code should be unaffected if not
needing the new functionality.

API ITEM: Add pkg-config files for glue codes (harfbuzz-glib, etc)

Again, not controversial, but has to be done before we burn the library API
in stone.  I'm still leaning towards keeping hb-glib and hb-ft in
libharfbuzz.so, while hb-gobject and hb-icue are separated already.  Most
client code shouldn't be affected.

API ITEM: 'const' for getter APIs:

This sounds like a good idea in principle, but is tricky to implement with C.
Ie. what should hb_face_reference() take and what should it return?  There is
no right answer for the return type as referencing a const face is certainly
permitted.  Maybe just do it for the trivial parts of the API?  Most client
code should be unaffected, API- or ABI-wise.

API ITEM: hb_feature_t breakdown:

This was discussed a couple of months ago.  Currently hb_feature_t is defined as:

typedef struct hb_feature_t {
  hb_tag_t      tag;
  uint32_t      value;
  unsigned int  start;
  unsigned int  end;
} hb_feature_t;

Ideally I'd like to break that down to:

typedef struct hb_feature_t {
  hb_tag_t      tag;
  uint32_t      value;
} hb_feature_t;

typedef struct hb_range_t {
  unsigned int  start;
  unsigned int  end;
} hb_range_t;

And either define hb_feature_range_t that has a hb_feature_t and a hb_range_t
inside, or change hb_shape() (and variants) to take an array of hb_feature_t
as well as hb_range_t.  Both approaches have their benefits, though I'm more
interested to know whether such a big change is considered possible or too
much of a change.  Updating client code is trivial for the most part,
especially with hb_feature_range_t, though it is harder to #ifdef.
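
To illustrate the hb_feature_range_t variant, here is a sketch of the combined
struct and the mechanical conversion client code would need.  The toy_ types
mirror the definitions quoted above but are stand-ins, and the field names
inside toy_feature_range_t ("feature", "range") are my assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t toy_tag_t;

/* Current layout, as quoted above. */
typedef struct {
  toy_tag_t     tag;
  uint32_t      value;
  unsigned int  start;
  unsigned int  end;
} toy_old_feature_t;

/* Proposed breakdown. */
typedef struct {
  toy_tag_t     tag;
  uint32_t      value;
} toy_feature_t;

typedef struct {
  unsigned int  start;
  unsigned int  end;
} toy_range_t;

/* One of the two options: a struct wrapping both pieces. */
typedef struct {
  toy_feature_t feature;
  toy_range_t   range;
} toy_feature_range_t;

/* Field-by-field conversion from the old struct to the wrapped form. */
static toy_feature_range_t
toy_feature_range_from_old (const toy_old_feature_t *old)
{
  toy_feature_range_t fr;
  assert (old != NULL);
  fr.feature.tag   = old->tag;
  fr.feature.value = old->value;
  fr.range.start   = old->start;
  fr.range.end     = old->end;
  return fr;
}
```

The conversion is a one-liner per feature, but code that has to build against
both versions would need an #ifdef around every place that touches .start or
.end directly.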

Woah.  That's it for now.  Please discuss!
-- 
behdad
http://behdad.org/