[poppler] Working with Asian languages
Jonathan Kew
jfkthame at gmail.com
Mon Sep 14 08:53:10 PDT 2015
On 14/9/15 16:40, Rob Hawkins wrote:
> Thank you all for these great replies. I find the stuff about the
> unicode encoding order really interesting. And I too wish we could find
> more information about the as-yet unmapped Asian scripts.
>
> I was mistaken about the output of PDF.js. I thought I had viewed the
> HTML source and seen good data, how exciting! Yet now that I double
> check, I see it is just the viewer that is correct, and the source text
> is garbled just like pdftotext etc.
>
> I'm bummed there is no magic solution here as I thought I had found, but
> glad to see people are still interested in this. If I find out how to
> implement these languages, I will try.
I think what you're looking for is the ActualText feature in PDF. If
this is present, a viewer or text-extraction tool can use it to provide
the correct text, instead of trying to reconstruct the text from the
stream of glyphs in the PDF -- which, while it often works OK for
European languages and similar "simple" writing systems, is pretty much
doomed to failure for complex South/Southeast Asian scripts, etc.
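To illustrate why the glyph stream is not enough, here is a minimal Python
sketch. The function names and the two-entry vowel table are made up for
this example and nothing in it comes from Poppler. It only handles a single
consonant followed by one pre-base vowel; real scripts also have conjuncts,
stacked consonants and split vowels, which this ignores.

```python
# Vowel signs that are drawn to the LEFT of the consonant but are encoded
# AFTER it in Unicode. Illustrative subset only (an assumption of this sketch).
PREBASE_VOWELS = {
    "\u1031",  # MYANMAR VOWEL SIGN E
    "\u17C1",  # KHMER VOWEL SIGN E
}


def naive_glyph_order(logical: str) -> str:
    """Simulate a glyph-by-glyph extractor: return the characters in the
    order they appear on the page, left to right."""
    out = []
    for ch in logical:
        if ch in PREBASE_VOWELS and out:
            # The vowel is painted before the character it follows in memory.
            out.insert(len(out) - 1, ch)
        else:
            out.append(ch)
    return "".join(out)


def reorder_to_logical(visual: str) -> str:
    """Toy repair: swap each pre-base vowel with the one character after it."""
    chars = list(visual)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREBASE_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair we just fixed
        else:
            i += 1
    return "".join(chars)


if __name__ == "__main__":
    khmer = "\u1780\u17C1"  # KHMER LETTER KA + VOWEL SIGN E, logical order
    thai = "\u0E40\u0E01"   # THAI SARA E + KO KAI, already in visible order
    for text in (khmer, thai):
        seen = naive_glyph_order(text)
        print([hex(ord(c)) for c in text], "->", [hex(ord(c)) for c in seen])
```

The Thai string comes out unchanged because Thai is encoded in visible
order, while the Khmer pair comes out reversed. The toy repair function
works for this one pattern, but I would not expect a fixed swap rule like
this to survive contact with real documents.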
But ActualText only helps if the PDF-generating tool or workflow includes
the correct ActualText attributes in the first place. In my (very
limited) experience, that is pretty rare.
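For anyone who wants to check whether their files carry ActualText at all,
a rough Python sketch follows. It assumes you already have an uncompressed
page content stream as bytes, and that ActualText is written inline as a
hex string in the marked-content property list (e.g.
"/Span <</ActualText <FEFF...>>> BDC"). Literal "(...)" strings and
property lists referenced by name are not handled, and the function name
and sample stream are invented for illustration.

```python
import re
from typing import List

# Matches "/ActualText <hexdigits>" but not "/ActualText <<...>>".
ACTUAL_TEXT_HEX = re.compile(rb"/ActualText\s*<([0-9A-Fa-f\s]+)>")


def actual_texts(content_stream: bytes) -> List[str]:
    """Return the ActualText values found as hex strings in a decoded
    content stream."""
    results = []
    for match in ACTUAL_TEXT_HEX.finditer(content_stream):
        digits = re.sub(rb"\s+", b"", match.group(1)).decode("ascii")
        if len(digits) % 2:
            digits += "0"  # PDF pads an odd-length hex string with a 0
        raw = bytes.fromhex(digits)
        if raw.startswith(b"\xfe\xff"):
            # Byte order mark: the rest is UTF-16BE.
            results.append(raw[2:].decode("utf-16-be"))
        else:
            # Otherwise PDFDocEncoding; Latin-1 is only an approximation.
            results.append(raw.decode("latin-1"))
    return results


if __name__ == "__main__":
    # Hypothetical stream: glyph IDs in visual order, wrapped in a span
    # whose ActualText carries the logical-order Khmer text.
    sample = (
        b"/Span <</ActualText <FEFF178017C1>>> BDC\n"
        b"BT /F1 12 Tf <00020001> Tj ET\n"
        b"EMC\n"
    )
    print([hex(ord(c)) for c in actual_texts(sample)[0]])
```

If this finds nothing in your files, then the producer did not write
ActualText (or wrote it in a form this sketch does not cover), and an
extractor has only the glyph stream to go on.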
JK
> Alternatively, can we band
> together to destroy PDFs everywhere? If we work in concert it may be
> possible. =)
>
> Thanks again,
>
> Rob
>
> On Mon, Sep 14, 2015 at 9:22 PM, suzuki toshiya
> <mpsuzuki at hiroshima-u.ac.jp> wrote:
>
> Dear Rob,
>
> Poppler extracts text from a PDF via the series of glyphs.
> Therefore, for scripts whose characters Unicode encodes in
> visible order, the first step of text extraction is
> possible.
>
> However, some Asian scripts, especially Brahmic-based scripts,
> have very complicated layout rules, so, the encoding order
> in Unicode text is phonetic and different from the visible
> order (e.g. coded characters are in consonant-then-vowel order,
> but the displayed characters are in vowel-then-consonant order).
>
> In such cases, the character series extracted via the glyph series
> is not well-formed coded text.
>
> I'm not sure which script you assume for Indonesian (Latin?
> Javanese? Balinese?), but, among Thai, Burmese, Khmer scripts,
> only Thai script is coded in visible order. The other scripts
> have the vowel-then-consonant ordering issue, so it is not easy
> for Poppler to extract the text as correct "Unicode" text.
> Therefore, the result you have (Thai is OK, others are not)
> sounds reasonable.
>
> I'm unfamiliar with the bleeding-edge technology in the latest
> PDF for dealing with such complex scripts (I guess PDF
> developers are willing to support them), but PDFs made
> by old PDF production software may have similar problems.
>
> I wish some Adobe experts would comment on the situation in the
> latest PDF for complex scripts :-)
>
> Regards,
> mpsuzuki
>
> Rob Hawkins wrote:
> > Greetings all,
> >
> > Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
> > Vietnamese? I didn't see a language pack for any except Thai, and that
> > one doesn't produce properly formatted characters for my source files.
> > They're missing the vowel marks. The other languages fail completely on
> > my setup. I've tried on OS X and Ubuntu 12.
> >
> > My source files are here:
> > https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
> >
> > Chinese seems to work fine.
> >
> > I found out that PDF.js will produce good output, though I already have
> > code based on pdftohtml output and would rather not switch if not
> > necessary. I wonder if there is something wrong with my setup.
> >
> > Thanks for any help even if it's just a "nope, that's not possible"
> > kind of reply =)
> >
> > Rob
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > poppler mailing list
> > poppler at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/poppler