[poppler] Working with Asian languages

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Mon Sep 14 06:22:03 PDT 2015


Dear Rob,

Poppler extracts text from a PDF by walking the series of glyphs
in the order they are drawn. For scripts that Unicode encodes in
visible order, this first step of text extraction works as is.

However, some Asian scripts, especially the Brahmic-derived ones,
have very complicated layout rules, so the encoding order in
Unicode text is phonetic and differs from the visible order
(e.g. the coded characters are in consonant-then-vowel order,
but the displayed characters are in vowel-then-consonant order).

In such cases, the character series recovered from the glyph
series is not correctly coded text.
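
To illustrate the mismatch, here is a minimal Python sketch of the
reordering a text extractor would need. It is not what Poppler does;
the function name and the small set of vowel signs are my own
assumptions for illustration, and real syllables (medials, subscript
consonants, stacked marks) need far more than a pairwise swap.

```python
import unicodedata

# Vowel signs that are drawn to the LEFT of the consonant but are
# encoded AFTER it in Unicode. Illustrative subset only (assumption).
PREBASE_VOWELS = {
    "\u1031",  # MYANMAR VOWEL SIGN E
    "\u17C1",  # KHMER VOWEL SIGN E
    "\u17C2",  # KHMER VOWEL SIGN AE
    "\u17C3",  # KHMER VOWEL SIGN AI
}


def visual_to_logical(glyph_order: str) -> str:
    """Swap each pre-base vowel sign with the letter that follows it.

    Hypothetical helper: it handles only the simplest case of one
    vowel sign directly followed by one base letter.
    """
    chars = list(glyph_order)
    i = 0
    while i < len(chars) - 1:
        follows_letter = unicodedata.category(chars[i + 1]) == "Lo"
        if chars[i] in PREBASE_VOWELS and follows_letter:
            # Move the vowel sign behind its consonant.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair we just fixed
        else:
            i += 1
    return "".join(chars)


# Glyph (visible) order: vowel sign E first, then the letter KA.
burmese_visual = "\u1031\u1000"
burmese_logical = visual_to_logical(burmese_visual)
print([unicodedata.name(c) for c in burmese_logical])
```

Text that is already in logical order passes through unchanged,
because the vowel sign is then never followed by a base letter.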

I'm not sure which script you assume for Indonesian (Latin?
Javanese? Balinese?), but among the Thai, Burmese and Khmer
scripts, only Thai is coded in visible order. The others have
the vowel-then-consonant ordering issue, so it is not easy for
Poppler to extract the text as correct Unicode text.
Therefore, the result you have (Thai is OK, others are not)
sounds reasonable.
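
The difference between the three scripts can be seen by listing the
code points of one comparable syllable (the letter KA/KO KAI with the
vowel E, which is drawn to the left of the consonant in all three).
This is only a small sketch using Python's standard unicodedata
module; the helper name and the sample strings are my own choices.

```python
import unicodedata


def code_point_names(text: str) -> list[str]:
    """Return the Unicode character name of each code point, in stored order."""
    return [unicodedata.name(ch) for ch in text]


# Each sample is the correctly encoded syllable for that script.
samples = {
    "Thai": "\u0E40\u0E01",     # vowel stored first: same as visible order
    "Burmese": "\u1000\u1031",  # consonant stored first: differs from visible order
    "Khmer": "\u1780\u17C1",    # consonant stored first: differs from visible order
}

for script, syllable in samples.items():
    print(script, code_point_names(syllable))
```

For Thai the stored order matches what is drawn, so a glyph-by-glyph
extraction already yields the right sequence; for Burmese and Khmer
the same approach would put the vowel sign before the consonant.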

I'm unfamiliar with how the latest PDF technology deals with
such complex scripts (I guess PDF developers are willing to
support them), but PDFs made by old PDF production software
may have similar problems.

I hope some Adobe experts will comment on the situation for
complex scripts in the latest PDF :-)

Regards,
mpsuzuki

Rob Hawkins wrote:
> Greetings all,
> 
> Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
> Vietnamese?  I didn't see a language pack for any except Thai, and that one
> doesn't produce properly formatted characters for my source files.  They're
> missing the vowel marks.  The other languages fail completely on my setup.
> I've tried on OS X and Ubuntu 12.
> 
> My source files are here:
> https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
> 
> Chinese seems to work fine.
> 
> I found out that PDF.js will produce good output, though I already have
> code based on pdftohtml output and would rather not switch if not
> necessary.  I wonder if there is something wrong with my setup.
> 
> Thanks for any help even if it's just a "nope, that's not possible" kind of
> reply =)
> 
> Rob
> 
> 
> 


