[poppler] Working with Asian languages
Adrian Johnson
ajohnson at redneon.com
Mon Sep 14 16:34:41 PDT 2015
On 15/09/15 01:23, Jonathan Kew wrote:
> On 14/9/15 16:40, Rob Hawkins wrote:
>> Thank you all for these great replies. I find the stuff about the
>> unicode encoding order really interesting. And I too wish we could find
>> more information about the as-yet unmapped Asian scripts.
>>
>> I was mistaken about the output of PDF.js. I thought I had viewed the
>> HTML source and seen good data, how exciting! Yet now that I double
>> check, I see it is just the viewer that is correct, and the source text
>> is garbled just like pdftotext etc.
>>
>> I'm bummed there is no magic solution here as I thought I had found, but
>> glad to see people are still interested in this. If I find out how to
>> implement these languages, I will try.
>
> I think what you're looking for is the ActualText feature in PDF. If
> this is present, a viewer or text-extraction tool can use it to provide
> the correct text, instead of trying to reconstruct the text from the
> stream of glyphs in the PDF -- which, while it often works OK for
> European languages and similar "simple" writing systems, is pretty much
> doomed to failure for complex South/Southeast Asian scripts, etc.
>
> But this is dependent on the PDF-generating tool or workflow including
> the correct ActualText attributes in the first place. In my (very
> limited) experience, this is pretty rare.
Poppler has supported ActualText when extracting text since 2008. I
added this to poppler when I added ActualText generation to cairo.
Application support for this appears to be rare. I'm not aware of any
cairo application that uses the cairo_show_text_glyphs() API for
generating ActualText entries.
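
To make the ordering problem and the ActualText remedy concrete, here is
a minimal Python sketch. It is not poppler's code: the Span class and
the extract() function are hypothetical names invented for illustration.
Each span models a run of glyphs (already mapped to Unicode, in drawing
order) plus an optional ActualText string, and the extractor simply
prefers ActualText when the producer supplied one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical model of one marked-content span on a page."""
    glyph_text: str                    # per-glyph Unicode, in drawing (visual) order
    actual_text: Optional[str] = None  # ActualText string, if the producer wrote one


def extract(spans: List[Span]) -> str:
    """Prefer ActualText when present; otherwise fall back to glyph order."""
    return "".join(
        s.actual_text if s.actual_text is not None else s.glyph_text
        for s in spans
    )


# Khmer syllable "ke": Unicode encodes KA (U+1780) then VOWEL SIGN E (U+17C1),
# but the vowel glyph is drawn to the left of the consonant.
khmer_logical = "\u1780\u17c1"
khmer_visual = "\u17c1\u1780"

# Thai syllable "ke": Unicode already encodes SARA E (U+0E40) before
# KO KAI (U+0E01), which matches the drawing order.
thai = "\u0e40\u0e01"

without_actual_text = extract([Span(khmer_visual)])
with_actual_text = extract([Span(khmer_visual, actual_text=khmer_logical)])
thai_text = extract([Span(thai)])

# Glyph order alone does not reproduce the Khmer code point sequence;
# the span carrying ActualText does, and Thai needs no help.
print(without_actual_text == khmer_logical)
print(with_actual_text == khmer_logical)
print(thai_text == thai)
```

A real extractor has much more to do (glyph-to-Unicode mapping, ligatures,
reordering of split vowels), so treat this only as an illustration of why
a producer-supplied ActualText string sidesteps the guesswork.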
>
> JK
>
>> Alternatively, can we band
>> together to destroy PDFs everywhere? If we work in concert it may be
>> possible. =)
>>
>> Thanks again,
>>
>> Rob
>>
>> On Mon, Sep 14, 2015 at 9:22 PM, suzuki toshiya
>> <mpsuzuki at hiroshima-u.ac.jp> wrote:
>>
>> Dear Rob,
>>
>> Poppler extracts the text from a PDF via the series of glyphs.
>> Therefore, for scripts where Unicode encodes the characters
>> in visible order, the first step of text extraction is
>> possible.
>>
>> However, some Asian scripts, especially Brahmic-based scripts,
>> have very complicated layout rules, so, the encoding order
>> in Unicode text is phonetic and different from the visible
>> order (e.g. coded characters are in consonant-then-vowel order,
>> but the displayed characters are in vowel-then-consonant order).
>>
>> In such cases, the character series extracted via the glyph series
>> is not correctly coded text.
>>
>> I'm not sure which script you assume for Indonesian (Latin?
>> Javanese? Balinese?), but, among Thai, Burmese, Khmer scripts,
>> only Thai script is coded in visible order. Other scripts
>> have the vowel-then-consonant ordering issue, so it is not easy
>> for Poppler to extract the text as correct "Unicode" text.
>> Therefore, the result you have (Thai is OK, others are not)
>> sounds reasonable.
>>
>> I'm unfamiliar with the bleeding-edge technology in the latest
>> PDF about how to deal with such complex scripts (I guess PDF
>> developers are willing to support them), but the PDFs made
>> by old PDF production software may have a similar problem.
>>
>> I wish some Adobe experts would comment on the situation in the
>> latest PDF for complex scripts :-)
>>
>> Regards,
>> mpsuzuki
>>
>> Rob Hawkins wrote:
>> > Greetings all,
>> >
>> > Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
>> > Vietnamese? I didn't see a language pack for any except Thai, and that one
>> > doesn't produce properly formatted characters for my source files. They're
>> > missing the vowel marks. The other languages fail completely on my setup.
>> > I've tried on OS X and Ubuntu 12.
>> >
>> > My source files are here:
>> > https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
>> >
>> > Chinese seems to work fine.
>> >
>> > I found out that PDF.js will produce good output, though I already have
>> > code based on pdftohtml output and would rather not switch if not
>> > necessary. I wonder if there is something wrong with my setup.
>> >
>> > Thanks for any help even if it's just a "nope, that's not possible" kind of
>> > reply =)
>> >
>> > Rob
>> >
>> >
>> >
>> >
>>
>> ------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > poppler mailing list
>> > poppler at lists.freedesktop.org
>> > http://lists.freedesktop.org/mailman/listinfo/poppler
>>