[FriBidi] Fribidi 2.0 Status?

Tue Sep 20 20:58:55 PDT 2005

On Wed, 31 Aug 2005, Shachar Shemesh wrote:

> At present Fribidi doesn't do UTF-16. I put forward a plan to support
> UTF-16, and Behdad promised he will integrate the necessary interface
> changes into the next version. As I have not seen the relevant changes,
> I don't know whether that happened or not.....

The main problem, as I've been saying all along, is my limited
time.  The current API is not completely implemented and tested
yet, and I don't want to rush into shipping yet another
incomplete API.  I want to get it right.

Anyway, see my other message.  With that view of the world, to
support UTF-16, only these functions need a 16-bit equivalent:

  fribidi_get_bidi_types()
  fribidi_get_joining_types()
  fribidi_shape()
  fribidi_reorder_line()

The first two are simple for loops over fribidi_get_bidi_type()
and fribidi_get_joining_type(), so it's feasible to have a UTF-16
version around.  fribidi_shape() calls fribidi_shape_arabic() and
fribidi_shape_mirroring(), both of which are again for loops
around per-character functions.  fribidi_reorder_line() is the
only function that really works with the string, and it's not
feasible to duplicate that.  But it may make sense to change
fribidi_reorder_line() to only output the l2v list, and then
construct visual_str from that list, which is easy.  That's
something I will think about.  It definitely reduces the
complexity of reorder_line.  The problem with all these
abstraction layers is that the temporary memory consumption is
multiplied by a constant factor, since you have to keep all those
bidi_types, embedding_levels, joining_types, and now the l2v
array around, whereas in the currently released fribidi each of
them is freed before the next one is allocated...
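
To illustrate the l2v idea, here is a minimal sketch of building
the visual string from a logical-to-visual list over UTF-16 code
units.  The function name and signature are hypothetical and not
part of any FriBidi release; the only assumption is that l2v[i]
gives the visual position of logical code unit i.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the FriBidi API.
 * Assumption: l2v[i] is the visual position of logical code unit i,
 * and l2v is a permutation of 0..len-1. */
void
visual_from_l2v_utf16 (const uint16_t *logical_str,
                       const size_t *l2v,
                       size_t len,
                       uint16_t *visual_str)
{
  size_t i;

  for (i = 0; i < len; i++)
    {
      /* Guard against an out-of-range entry in the map. */
      assert (l2v[i] < len);
      /* Place each logical code unit at its visual position. */
      visual_str[l2v[i]] = logical_str[i];
    }
}
```

The same loop works for any code unit width, since it never looks
at the string contents.  That is why only the list-producing part
would need to know about the encoding.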

Now this raises the question: should we have a UTF-8
implementation too?  That's certainly more useful than UTF-16
generally.  Or to ask it more generally, should we integrate with
all that charset conversion code we have, so that we don't have
to convert to UTF-32 at all?  That in turn raises the question:
what would the benefit be?  We are already allocating buffers for
several things (bidi types, embedding levels, etc.), so what's
different about allocating one for the UTF-32 string too?
Shachar?

> I'll mention that the Wine project is literally eager to dump the
> current solution in favor of Fribidi, as soon as it is supporting UTF-16.

Are you more interested in embedding a copy of FriBidi, or in
linking against the one installed on the system?

I have this crazy idea for how to build an approximate
UTF-16-only FriBidi:  All you need is to change the
FriBidiCharType to be 16-bit, then rebuild the bidi types, and
there you go.  This works for UCS-2, but not for surrogate pairs.
Now we change it so that it produces good-enough results for
UTF-16, but not perfect ones:  non-BMP characters are represented
in UTF-16 as a surrogate pair, which is a high surrogate
character (U+D800..U+DBFF) followed by a low surrogate character
(U+DC00..U+DFFF).  First, we want to make sure that after
reordering, the low surrogate always follows the high surrogate,
and not the other way around.  To achieve this, we change the
bidi type of the low surrogates to NSM.  This ensures what we
want, given that fribidi_nsm_reorder() is true.  With a little
hacking we can assign an NSM-like but different type to them,
such that we can handle them even with fribidi_nsm_reorder being
false.  The bitmask nature of bidi types makes it pretty easy.
Now the bidi type of each high surrogate actually determines the
bidi type of all 1024 characters that start with that high
surrogate in UTF-16.  What we do is assign the majority bidi
type of these 1024 chars to the high surrogate.
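
A rough sketch of the surrogate trick described above, assuming a
per-character lookup is available.  Everything here is
hypothetical: the ToyBidiType enum, the function names, and
toy_lookup() are stand-ins for illustration, not FriBidi types or
real Unicode data.  In practice the majority would be computed
once, when the 16-bit type tables are rebuilt.

```c
#include <assert.h>
#include <stdint.h>

/* Toy bidi classes, for illustration only (not FriBidi's types). */
typedef enum
{
  TOY_LTR, TOY_RTL, TOY_AL, TOY_NSM, TOY_ON, TOY_NUM_TYPES
} ToyBidiType;

/* Hypothetical per-character lookup taking a full code point. */
typedef ToyBidiType (*ToyTypeLookup) (uint32_t ch);

/* First of the 1024 code points that start with high surrogate hi. */
uint32_t
first_code_point_for_high_surrogate (uint16_t hi)
{
  assert (hi >= 0xD800 && hi <= 0xDBFF);
  return 0x10000u + ((uint32_t) (hi - 0xD800) << 10);
}

/* Majority type among the 1024 code points behind one high surrogate.
 * Ties go to the type with the lowest enum value. */
ToyBidiType
majority_type_for_high_surrogate (uint16_t hi, ToyTypeLookup lookup)
{
  unsigned counts[TOY_NUM_TYPES] = { 0 };
  uint32_t base = first_code_point_for_high_surrogate (hi);
  uint32_t i;
  int t, best = 0;

  for (i = 0; i < 1024; i++)
    counts[lookup (base + i)]++;
  for (t = 1; t < TOY_NUM_TYPES; t++)
    if (counts[t] > counts[best])
      best = t;
  return (ToyBidiType) best;
}

/* Approximate type of a single UTF-16 code unit. */
ToyBidiType
type_for_utf16_unit (uint16_t u, ToyTypeLookup lookup)
{
  if (u >= 0xDC00 && u <= 0xDFFF)
    return TOY_NSM;             /* low surrogate sticks to its high half */
  if (u >= 0xD800 && u <= 0xDBFF)
    return majority_type_for_high_surrogate (u, lookup);
  return lookup (u);            /* plain BMP character */
}

/* Crude stand-in for real data: U+10800..U+10FFF is RTL, the rest LTR. */
ToyBidiType
toy_lookup (uint32_t ch)
{
  return (ch >= 0x10800 && ch <= 0x10FFF) ? TOY_RTL : TOY_LTR;
}
```

With this toy data, the high surrogate U+D802 covers
U+10800..U+10BFF and so comes out as RTL, while U+D800 covers
U+10000..U+103FF and comes out as LTR.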

How's that?  Pretty easy, eh?

> (actual work will, of course, depend on what Behdad did, and my free
> resources).
>
>           Shachar

--behdad
http://behdad.org/