[HarfBuzz] Indic Test Suite :: "Indie"

Ed Trager ed.trager at gmail.com
Wed Aug 26 11:17:13 PDT 2009


Hi, Everyone,

I've started to put together a program (written in C++ and called
"Indie" as in "Indie band", "Indie film", etc.) that can be easily
customized to generate Indic test suite data.  I would like to get
everyone's feedback to see if my approach is on track and to solicit
additional ideas and help.  So please provide feedback:

(1) The heart of the program is a "Markup Language Reporter" (MLR)
base class.  Derived non-virtual report classes include TEXTR, XMLR,
XHTMLR, and JSONR.  This means that the same test data "report" can be
produced in text, XML, XHTML, or JSON formats.  This is exceedingly
convenient for both automated and human-based processing. :-)

(2) Secondly, the program takes advantage of the fact that the
ordering of vowels and consonants across the Unicode blocks for the
major Indic scripts is consistent.  The program has (or when finished
will have) header files containing metadata for each script, and an
important item of metadata in the header files is the "offset" value.
Script offsets are relative to Devanagari: the offset for Bengali,
for instance, is 0x0080.  So once you have:

     <hex>0x0915, 0x093f</hex>
     <utf8>कि</utf8>

... as a test case for testing Devanagari KA + I, you can just add the
offset for Bengali (0x0080) to produce the equivalent test case for
Bengali:

    <hex>0x0995, 0x09bf</hex>
    <utf8>কি</utf8>

... and obviously you can continue in this manner to cover all of the
major Indic scripts.
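
As a rough illustration of the offset idea (a sketch only, not code
from Indie; the names shiftScript, toUtf8 and kBengaliOffset are made
up here, and only the Bengali offset value is taken from the text
above), shifting a Devanagari test case into another block and
re-encoding it as UTF-8 might look like this:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Offset of the Bengali block relative to Devanagari (U+0900), as stated above.
// The constant name is hypothetical.
constexpr char32_t kBengaliOffset = 0x0080;

// Shift every code point of a Devanagari test case into another script's block.
std::vector<char32_t> shiftScript(const std::vector<char32_t>& devanagari,
                                  char32_t offset) {
    std::vector<char32_t> out;
    out.reserve(devanagari.size());
    for (char32_t cp : devanagari) {
        // This sketch only accepts code points inside the Devanagari block.
        assert(cp >= 0x0900 && cp <= 0x097F);
        out.push_back(static_cast<char32_t>(cp + offset));
    }
    return out;
}

// Encode code points as UTF-8. No validation of surrogates or of values
// above U+10FFFF is attempted in this sketch.
std::string toUtf8(const std::vector<char32_t>& cps) {
    std::string out;
    for (char32_t cp : cps) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Calling shiftScript({0x0915, 0x093F}, kBengaliOffset) would give the
Bengali KA + I pair, and toUtf8 turns that into the string that goes
in the "utf8" tag.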

There are of course differences among the Indic scripts -- some of you
on this list hopefully know a lot more about this than I do!
Therefore, I'm sure that for some specific scripts there will need to
be specific tests that don't generalize across all other Indic
scripts.  On the other hand, there also exist classes of test cases
that *do* generalize across all the Indic scripts -- for example tests
of dependent vowels, tests of the new ZWJ+HALANT behaviors, etc.
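
To make the "generalizes across scripts" point concrete, here is a
hedged sketch of how one class of tests (dependent vowels on KA) could
be generated for any script from a single Devanagari table.  The
TestCase struct, the function name and the running-ID scheme are
assumptions for illustration, not Indie's actual design; the vowel
signs are the standard Devanagari ones, and the image name follows the
"case_<id>.png" pattern mentioned in this message:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one generated test case.
struct TestCase {
    int id;
    std::vector<char32_t> codePoints;
    std::string imageName;
};

// Devanagari dependent vowel signs (AA, I, II, U, UU, vocalic R/RR,
// vocalic L/LL, E, AI, O, AU).
const std::vector<char32_t> kVowelSigns = {
    0x093E, 0x093F, 0x0940, 0x0941, 0x0942, 0x0943, 0x0944,
    0x0962, 0x0963, 0x0947, 0x0948, 0x094B, 0x094C};

// Build "bare KA" plus "KA + each vowel sign" for the script at `offset`
// from Devanagari. `nextId` is advanced so IDs keep running across scripts.
std::vector<TestCase> dependentVowelCases(char32_t offset, int& nextId) {
    // The major Indic blocks are 0x80 code points apart.
    assert(offset % 0x80 == 0);
    const char32_t ka = static_cast<char32_t>(0x0915 + offset);
    std::vector<TestCase> cases;
    auto add = [&](std::vector<char32_t> cps) {
        const int id = nextId++;
        cases.push_back(
            {id, std::move(cps), "case_" + std::to_string(id) + ".png"});
    };
    add({ka});
    for (char32_t sign : kVowelSigns) {
        add({ka, static_cast<char32_t>(sign + offset)});
    }
    return cases;
}
```

Calling this once with offset 0x0000 and once with 0x0080, sharing the
same nextId counter, would yield 14 Devanagari cases followed by 14
Bengali cases.  Script-specific tests that do not generalize would
need their own tables.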

Below I provide an example of what the XML output for dependent vowels
currently looks like.  NOTE that I don't yet have any Pango or Cairo code
in the program, so the "glyphIds" and "renderedImage" tags are
empty.  As I develop the program further, I can either (i) add the
necessary Pango-Cairo code directly in the program or (ii) have the
program call Behdad's "PangoView" to get the glyphIDs and render PNG
images.  (I'm leaning toward adding the Pango-Cairo calls directly
into my program because I don't think it will be too much more work,
but we'll see how things go).  (Recall that Pango will use Uniscribe
on Windows, ATSUI/AAT on Mac, so all bases will be covered).

Anyway, here's what the XML currently looks like for the dependent
vowel tests for Devanagari and Bengali.  In the future, the
"renderedImage" tag would contain a PNG file name based on the test
case ID, e.g., "case_11.png", "case_12.png", etc.:

===============

<?xml version="1.0" encoding="UTF-8" ?>
<report>
<scripts>
 <script>
  <commonName>Devanagari</commonName>
  <nativeName>देवनागरी</nativeName>
  <dependentVowels>
   <testCase>
    <id>1</id>
    <hex>0x0915</hex>
    <utf8>क</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>2</id>
    <hex>0x0915, 0x093e</hex>
    <utf8>का</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>3</id>
    <hex>0x0915, 0x093f</hex>
    <utf8>कि</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>4</id>
    <hex>0x0915, 0x0940</hex>
    <utf8>की</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>5</id>
    <hex>0x0915, 0x0941</hex>
    <utf8>कु</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>6</id>
    <hex>0x0915, 0x0942</hex>
    <utf8>कू</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>7</id>
    <hex>0x0915, 0x0943</hex>
    <utf8>कृ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>8</id>
    <hex>0x0915, 0x0944</hex>
    <utf8>कॄ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>9</id>
    <hex>0x0915, 0x0962</hex>
    <utf8>कॢ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>10</id>
    <hex>0x0915, 0x0963</hex>
    <utf8>कॣ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>11</id>
    <hex>0x0915, 0x0947</hex>
    <utf8>के</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>12</id>
    <hex>0x0915, 0x0948</hex>
    <utf8>कै</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>13</id>
    <hex>0x0915, 0x094b</hex>
    <utf8>को</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>14</id>
    <hex>0x0915, 0x094c</hex>
    <utf8>कौ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
  </dependentVowels>
 </script>
 <script>
  <commonName>Bengali</commonName>
  <nativeName>বাংলা</nativeName>
  <dependentVowels>
   <testCase>
    <id>15</id>
    <hex>0x0995</hex>
    <utf8>ক</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>16</id>
    <hex>0x0995, 0x09be</hex>
    <utf8>কা</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>17</id>
    <hex>0x0995, 0x09bf</hex>
    <utf8>কি</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>18</id>
    <hex>0x0995, 0x09c0</hex>
    <utf8>কী</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>19</id>
    <hex>0x0995, 0x09c1</hex>
    <utf8>কু</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>20</id>
    <hex>0x0995, 0x09c2</hex>
    <utf8>কূ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>21</id>
    <hex>0x0995, 0x09c3</hex>
    <utf8>কৃ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>22</id>
    <hex>0x0995, 0x09c4</hex>
    <utf8>কৄ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>23</id>
    <hex>0x0995, 0x09e2</hex>
    <utf8>কৢ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>24</id>
    <hex>0x0995, 0x09e3</hex>
    <utf8>কৣ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>25</id>
    <hex>0x0995, 0x09c7</hex>
    <utf8>কে</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>26</id>
    <hex>0x0995, 0x09c8</hex>
    <utf8>কৈ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>27</id>
    <hex>0x0995, 0x09cb</hex>
    <utf8>কো</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
   <testCase>
    <id>28</id>
    <hex>0x0995, 0x09cc</hex>
    <utf8>কৌ</utf8>
    <glyphIds>...</glyphIds>
    <renderedImage>...</renderedImage>
   </testCase>
  </dependentVowels>
 </script>
</scripts>
</report>

=================
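
For readers who want a feel for how one report can come out in several
formats, here is a minimal sketch of a reporter base class with text
and XML variants.  This is an illustration only and not the Indie
source: the class names, the field() method and the escaping rules are
all assumptions.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical base class: each derived reporter decides how a
// (name, value) pair is written out.
class MarkupReporter {
public:
    virtual ~MarkupReporter() = default;
    virtual void field(const std::string& name, const std::string& value) = 0;
    std::string str() const { return out_.str(); }

protected:
    std::ostringstream out_;
};

// Plain-text flavor: "name: value" lines.
class TextReporter : public MarkupReporter {
public:
    void field(const std::string& name, const std::string& value) override {
        out_ << name << ": " << value << "\n";
    }
};

// XML flavor: "<name>value</name>" with minimal escaping of the value.
class XmlReporter : public MarkupReporter {
public:
    void field(const std::string& name, const std::string& value) override {
        assert(!name.empty());  // an XML element needs a name
        out_ << "<" << name << ">" << escape(value) << "</" << name << ">\n";
    }

private:
    static std::string escape(const std::string& s) {
        std::string out;
        for (char c : s) {
            switch (c) {
                case '&': out += "&amp;"; break;
                case '<': out += "&lt;"; break;
                case '>': out += "&gt;"; break;
                default:  out += c;
            }
        }
        return out;
    }
};

// Report-writing code is written once against the base class.
void reportKaI(MarkupReporter& r) {
    r.field("id", "3");
    r.field("hex", "0x0915, 0x093f");
}
```

Passing a TextReporter or an XmlReporter to reportKaI() produces the
same data in two formats; XHTML or JSON variants would follow the same
pattern.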

Finally, it should be obvious that this approach, once the Pango-Cairo
stuff is incorporated one way or the other, should make it trivially
easy to test and compare different fonts ("Mangal.ttf" on Windows
Vista / 7, "lohit" for Linux, etc.)

 -- Ed


