[poppler] Working with Asian languages

Jonathan Kew jfkthame at gmail.com
Mon Sep 14 08:53:10 PDT 2015


On 14/9/15 16:40, Rob Hawkins wrote:
> Thank you all for these great replies.  I find the stuff about the
> unicode encoding order really interesting.  And I too wish we could find
> more information about the as-yet unmapped Asian scripts.
>
> I was mistaken about the output of PDF.js.  I thought I had viewed the
> HTML source and seen good data, how exciting!  Yet now that I double
> check, I see it is just the viewer that is correct, and the source text
> is garbled just like pdftotext etc.
>
> I'm bummed there is no magic solution here as I thought I had found, but
> glad to see people are still interested in this.  If I find out how to
> implement these languages, I will try.

I think what you're looking for is the ActualText feature in PDF. If 
this is present, a viewer or text-extraction tool can use it to provide 
the correct text, instead of trying to reconstruct the text from the 
stream of glyphs in the PDF -- which, while it often works OK for 
European languages and similar "simple" writing systems, is pretty much 
doomed to failure for complex South/Southeast Asian scripts, etc.

But this is dependent on the PDF-generating tool or workflow including 
the correct ActualText attributes in the first place. In my (very 
limited) experience, this is pretty rare.
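
As a rough illustration of the difference (a sketch only, not how Poppler 
or any real tool is implemented): the `Span` class and `extract_text` 
function below are hypothetical names invented for this example. They 
model a span of page content that carries the text recovered glyph by 
glyph and, optionally, an ActualText value. The Khmer pair KA + VOWEL 
SIGN E is used because the vowel is drawn to the left of the consonant 
but comes after it in Unicode text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical model of one marked-content span on a page."""
    glyph_text: str                    # characters recovered glyph by glyph (visual order)
    actual_text: Optional[str] = None  # the span's ActualText value, if the PDF has one


def extract_text(spans: List[Span]) -> str:
    """Prefer ActualText where present; otherwise fall back to glyph-order text."""
    return "".join(
        s.actual_text if s.actual_text is not None else s.glyph_text
        for s in spans
    )


KA = "\u1780"      # KHMER LETTER KA
SIGN_E = "\u17C1"  # KHMER VOWEL SIGN E, drawn to the left of its consonant

logical = KA + SIGN_E  # order expected in Unicode text (consonant, then vowel)
visual = SIGN_E + KA   # order in which the glyphs appear on the page

untagged = [Span(glyph_text=visual)]                    # no ActualText in the PDF
tagged = [Span(glyph_text=visual, actual_text=logical)] # generator supplied ActualText

print(extract_text(untagged) == logical)  # False: glyph order is not Unicode order
print(extract_text(tagged) == logical)    # True: ActualText supplies the right text
```

Without ActualText, an extractor would have to know each script's 
reordering rules to get from `visual` back to `logical`; with it, no 
script knowledge is needed at all.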

JK

> Alternatively, can we band
> together to destroy PDFs everywhere?  If we work in concert it may be
> possible. =)
>
> Thanks again,
>
> Rob
>
> On Mon, Sep 14, 2015 at 9:22 PM, suzuki toshiya
> <mpsuzuki at hiroshima-u.ac.jp <mailto:mpsuzuki at hiroshima-u.ac.jp>> wrote:
>
>     Dear Rob,
>
>     Poppler extracts the text from a PDF via the series of glyphs.
>     Therefore, for scripts whose characters Unicode encodes in
>     visible order, the first step of the text extraction is
>     possible.
>
>     However, some Asian scripts, especially Brahmic-based scripts,
>     have very complicated layout rules, so, the encoding order
>     in Unicode text is phonetic and different from the visible
>     order (e.g. coded characters are in consonant-then-vowel order,
>     but the displayed characters are in vowel-then-consonant order).
>
>     In such cases, the character series extracted via the glyph
>     series is not correctly coded text.
>
>     I'm not sure which script you assume for Indonesian (Latin?
>     Javanese? Balinese?), but, among the Thai, Burmese and Khmer
>     scripts, only Thai is coded in visible order. The other scripts
>     have the vowel-then-consonant ordering issue, so it is not easy
>     for Poppler to extract the text as correct "Unicode" text.
>     Therefore, the result you have (Thai is OK, others are not)
>     sounds reasonable.
>
>     I'm unfamiliar with the bleeding-edge technology in the latest
>     PDF about how to deal with such complex scripts (I guess PDF
>     developers are willing to support them), but the PDFs made
>     by old PDF production software may have similar problems.
>
>     I wish some Adobe experts would comment on the situation in the
>     latest PDF for complex scripts :-)
>
>     Regards,
>     mpsuzuki
>
>     Rob Hawkins wrote:
>      > Greetings all,
>      >
>      > Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
>      > Vietnamese?  I didn't see a language pack for any except Thai, and
>      > that one doesn't produce properly formatted characters for my source
>      > files.  They're missing the vowel marks.  The other languages fail
>      > completely on my setup.
>      > I've tried on OS X and Ubuntu 12.
>      >
>      > My source files are here:
>      > https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
>      >
>      > Chinese seems to work fine.
>      >
>      > I found out that PDF.js will produce good output, though I already
>      > have code based on pdftohtml output and would rather not switch if not
>      > necessary.  I wonder if there is something wrong with my setup.
>      >
>      > Thanks for any help even if it's just a "nope, that's not possible"
>      > kind of reply =)
>      >
>      > Rob
>      >
>      >
>      >
>      >
>     ------------------------------------------------------------------------
>      >
>      > _______________________________________________
>      > poppler mailing list
>      > poppler at lists.freedesktop.org <mailto:poppler at lists.freedesktop.org>
>      > http://lists.freedesktop.org/mailman/listinfo/poppler
>


