[Fontconfig] Can we use base 16, and not 85, for ASCII charset representations?

Sat Sep 21 17:26:13 PDT 2013

I worked up the last two patches [1,2] on the road toward
understanding fontconfig's view of charsets, with the goal being:

  Which installed fonts contain code point 0xXXXX?

Now I understand the (base-code-point, bitmap) structure (as
documented in [2]), and I can use this:

$ fc-list -v 'URW Chancery L:style=Medium Italic'
…
        charset:
        0000: 00000000 ffffffff ffffffff 7fffffff 00000000 ffffffff ffffffff ffffffff
        0001: ffffffff ffffffff fffff3ff ffffffff 00040000 00000000 00000000 00000000
        0002: 03000000 00000000 00000000 00000000 00000000 00000000 3f0002c0 00000000
        0003: 00000000 00000000 00000000 00000000 00100000 10000000 00000000 00000000
        0004: ffffffff ffffffff ffffffff 00000000 00000000 0c00c000 faff0007 033ffffc
        0020: 77180000 06010047 00000010 00000000 00000000 00001000 00000000 00000000
        0021: 00400000 00000004 00000000 00000000 00000000 00000000 00000000 00000000
        0022: 46260044 00000000 00000000 00000031 00000000 00000000 00000000 00000000
        0025: 00000000 00000000 00000000 00000000 00000000 00000000 00000400 00000000
        00f6: 00000000 00000000 00000000 00000000 00000000 00000000 000001f8 00000000
        00fb: 00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000
(s)

However, I'm still stuck on the base-85 formatting for the user-facing
charsets (and I'm not alone: [3]):

$ fc-list 'URW Chancery L:style=Medium Italic' charset
:charset=  |>^1!|>^1!P0oWQ |>^1!|>^1!|>^1!!!!%#|>^1!|>^1!|>]fs|>^1!!!K?&   !!!)$!{{B%     9;*l$ !!!.%    !#f05(1+e5  !!!1&|>^1!|>^1!|>^1!  %rw)IzbyU$#%lqi!!#0GM>RAd#y#fx!!!!5  !!!W5  !!#3H!)pSj!!!!&      !!#6I<UG/)  !!!!X    !!#AL      !!!1& !!+fv      !!!(y !!+u{!!!!)

Is code point 0x2202 in the first listing?  Yes:

  * 0x2202 / 0x100 = 0x22, so it's in the "0022:" row, with a remainder
    of 0x2202 & 0xff = 0x02
  * 0x02 / 32 = 0, so it's in the first block (map[0] = 0x46260044),
    with a remainder of 0x02 % 32 = 2
  * 2 / 4 = 0 (four bits per hex digit), so it's in the least
    significant digit of the block (map[0] & 0xf = 4), with a
    remainder of 2 % 4 = 2
  * The remainder-2 entry is the third bit (2+1) in the digit, because
    the remainder-0 entry gets the first bit.  The third bit is in the
    4s column, and that's set in the digit 4 ;).
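
The steps above can be sketched in Python.  This is only an
illustration, assuming 256 code points per row and eight 32-bit words
per row with bit 0 of each word holding the lowest code point; the
function name and the dict layout are made up here and are not
fontconfig API.

```python
# Row "0022:" copied from the fc-list -v output above.
PAGE_0022 = [0x46260044, 0x00000000, 0x00000000, 0x00000031,
             0x00000000, 0x00000000, 0x00000000, 0x00000000]


def has_code_point(pages, code_point):
    """Return True if code_point is set in pages.

    pages is a hypothetical dict mapping a row number (code point
    divided by 0x100) to that row's list of eight 32-bit words.
    """
    page, offset = divmod(code_point, 0x100)  # 256 code points per row
    words = pages.get(page)
    if words is None:
        return False  # no row listed, so nothing in it is covered
    word, bit = divmod(offset, 32)  # 32 code points per word
    return bool((words[word] >> bit) & 1)


pages = {0x22: PAGE_0022}
print(has_code_point(pages, 0x2202))  # bit 2 of 0x46260044
print(has_code_point(pages, 0x2200))  # bit 0 of 0x46260044
```

Shifting by the bit index does the digit and column steps in one go.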

To do the same with the second format, I had to fiddle with the
valueToChar table and Python to determine that 0x2200 is 0:0:0x1:0x11:0x22
in base 85, which should be represented by '!!#6I'.  The next five
characters are '<UG/)', which decodes to 0x16:0x2e:0x20:0xa:0x6 in
base 85, which is indeed 0x46260044.
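
The digit arithmetic can be checked with a few lines of Python.  This
sketch only handles the numeric base-85 digits, most significant
first; it does not include fontconfig's digit-to-character table, and
the helper names are mine.

```python
def to_base85_digits(value, width=5):
    """Split value into `width` base-85 digits, most significant first."""
    digits = []
    for _ in range(width):
        value, digit = divmod(value, 85)
        digits.append(digit)
    return digits[::-1]


def from_base85_digits(digits):
    """Combine base-85 digits (most significant first) into an integer."""
    value = 0
    for digit in digits:
        value = value * 85 + digit
    return value


# Row base 0x2200, then the first bitmap word of that row.
print([hex(d) for d in to_base85_digits(0x2200)])
print(hex(from_base85_digits([0x16, 0x2e, 0x20, 0x0a, 0x06])))
```

Five digits are enough for any 32-bit word, since 85**5 > 2**32.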

I don't think saving three characters (37.5%) is worth the hassle of
learning a fontconfig-specific set of digits for base 85.  If I
convert the parse/unparse code in fccharset.c to use hex, would that
be mergeable?  The only problem I can see would be for folks scripting
fc-list who have already written parsers for the current format (a
null set?).

Alternatively, perhaps there is another way to look up fonts containing
a character, and I've just missed it.  In that case I don't care how
ugly the charset serialization is :p.

Cheers,
Trevor

[1]: http://thread.gmane.org/gmane.comp.fonts.fontconfig/4914
[2]: http://thread.gmane.org/gmane.comp.fonts.fontconfig/4915
[3]: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498039#5
