[poppler] Working with Asian languages

Mon Sep 14 16:34:41 PDT 2015

On 15/09/15 01:23, Jonathan Kew wrote:
> On 14/9/15 16:40, Rob Hawkins wrote:
>> Thank you all for these great replies.  I find the stuff about the
>> unicode encoding order really interesting.  And I too wish we could find
>> more information about the as-yet unmapped Asian scripts.
>>
>> I was mistaken about the output of PDF.js.  I thought I had viewed the
>> HTML source and seen good data, how exciting!  Yet now that I double
>> check, I see it is just the viewer that is correct, and the source text
>> is garbled just like pdftotext etc.
>>
>> I'm bummed there is no magic solution here as I thought I had found, but
>> glad to see people are still interested in this.  If I find out how to
>> implement these languages, I will try.
> 
> I think what you're looking for is the ActualText feature in PDF. If
> this is present, a viewer or text-extraction tool can use it to provide
> the correct text, instead of trying to reconstruct the text from the
> stream of glyphs in the PDF -- which, while it often works OK for
> European languages and similar "simple" writing systems, is pretty much
> doomed to failure for complex South/Southeast Asian scripts, etc.
> 
> But this is dependent on the PDF-generating tool or workflow including
> the correct ActualText attributes in the first place. In my (very
> limited) experience, this is pretty rare.

Poppler has supported ActualText when extracting text since 2008. I
added this to poppler when I added ActualText generation to cairo.
Application support for this appears to be rare.  I'm not aware of any
cairo application that uses the cairo_show_text_glyphs() API for
generating ActualText entries.
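
To illustrate the ordering problem described in the quoted message from
mpsuzuki further down, here is a minimal Python sketch. It only mimics
what a naive glyph-stream extraction would see; it is not how poppler
works internally. The PREBASE_VOWELS table and the visual_order() helper
are hypothetical names made up for this example, and the table covers
just three vowel signs. Real shaping also involves consonant clusters,
split vowels and more, which this ignores.

```python
import unicodedata

# Vowel signs that are encoded AFTER the consonant but drawn to its LEFT.
# Illustrative subset only (an assumption of this sketch), not a full list.
PREBASE_VOWELS = {
    "\u093F",  # DEVANAGARI VOWEL SIGN I
    "\u17C1",  # KHMER VOWEL SIGN E
    "\u1031",  # MYANMAR VOWEL SIGN E
}


def visual_order(logical: str) -> str:
    """Return the characters in the order their glyphs appear on the page.

    Each pre-base vowel sign is moved in front of the single character
    that precedes it in the Unicode (logical) string.
    """
    out = []
    for ch in logical:
        if ch in PREBASE_VOWELS and out:
            out.insert(len(out) - 1, ch)
        else:
            out.append(ch)
    return "".join(out)


def describe(text: str) -> list:
    """List code point and Unicode name for each character."""
    return ["U+%04X %s" % (ord(c), unicodedata.name(c)) for c in text]


if __name__ == "__main__":
    samples = {
        "Khmer": "\u1780\u17C1",    # consonant, then vowel sign (logical order)
        "Burmese": "\u1000\u1031",  # consonant, then vowel sign (logical order)
        "Thai": "\u0E40\u0E01",     # vowel, then consonant (already visual order)
    }
    for script, logical in samples.items():
        seen = visual_order(logical)
        status = "same as Unicode text" if seen == logical else "differs from Unicode text"
        print(script, "glyph order", status)
        for line in describe(seen):
            print("   ", line)
```

For Khmer and Burmese the glyph order comes out reversed relative to the
Unicode text, while the Thai sample is unchanged.  An ActualText entry
avoids the guesswork by storing the logical-order string alongside the
glyphs.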

> 
> JK
> 
>> Alternatively, can we band
>> together to destroy PDFs everywhere?  If we work in concert it may be
>> possible. =)
>>
>> Thanks again,
>>
>> Rob
>>
>> On Mon, Sep 14, 2015 at 9:22 PM, suzuki toshiya
>> <mpsuzuki at hiroshima-u.ac.jp> wrote:
>>
>>     Dear Rob,
>>
>>     Poppler extracts the text from PDF via the series of glyphs.
>>     Therefore, for scripts whose characters Unicode encodes in
>>     visible order, the first step of the text extraction is
>>     possible.
>>
>>     However, some Asian scripts, especially Brahmic-based scripts,
>>     have very complicated layout rules, so, the encoding order
>>     in Unicode text is phonetic and different from the visible
>>     order (e.g. coded characters are in consonant-then-vowel order,
>>     but the displayed characters are in vowel-then-consonant order).
>>
>>     In such a case, the character series extracted via the glyph
>>     series is not correctly ordered coded text.
>>
>>     I'm not sure which script you assume for Indonesian (Latin?
>>     Javanese? Balinese?), but, among the Thai, Burmese and Khmer
>>     scripts, only the Thai script is coded in visible order. The
>>     other scripts have the vowel-then-consonant encoding issue, so
>>     it is not easy for Poppler to extract correct "Unicode" text.
>>     Therefore, the result you have (Thai is OK, others are not)
>>     sounds reasonable.
>>
>>     I'm unfamiliar with the bleeding-edge technology in the latest
>>     PDF about how to deal with such complex scripts (I guess PDF
>>     developers are willing to support them), but the PDFs made
>>     by older PDF production software may have similar problems.
>>
>>     I wish some Adobe experts would comment on the situation in the
>>     latest PDF for complex scripts :-)
>>
>>     Regards,
>>     mpsuzuki
>>
>>     Rob Hawkins wrote:
>>      > Greetings all,
>>      >
>>      > Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
>>      > Vietnamese?  I didn't see a language pack for any except Thai, and that
>>      > one doesn't produce properly formatted characters for my source files.
>>      > They're missing the vowel marks.  The other languages fail completely on
>>      > my setup.  I've tried on OS X and Ubuntu 12.
>>      >
>>      > My source files are here:
>>      > https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
>>      >
>>      > Chinese seems to work fine.
>>      >
>>      > I found out that PDF.js will produce good output, though I already have
>>      > code based on pdftohtml output and would rather not switch if not
>>      > necessary.  I wonder if there is something wrong with my setup.
>>      >
>>      > Thanks for any help even if it's just a "nope, that's not possible" kind
>>      > of reply =)
>>      >
>>      > Rob
>>      >
>>      > ------------------------------------------------------------------------
>>      >
>>      > _______________________________________________
>>      > poppler mailing list
>>      > poppler at lists.freedesktop.org
>>      > http://lists.freedesktop.org/mailman/listinfo/poppler