[poppler] Working with Asian languages

Mon Sep 14 08:40:25 PDT 2015

Thank you all for these great replies.  I find the stuff about the Unicode
encoding order really interesting.  And I too wish we could find more
information about the as-yet unmapped Asian scripts.

I was mistaken about the output of PDF.js.  I thought I had viewed the HTML
source and seen good data, how exciting!  Yet now that I double-check, I
see that only the viewer's rendering is correct, and the source text is
garbled just like the output of pdftotext etc.

I'm bummed there is no magic solution here as I thought I had found, but
glad to see people are still interested in this.  If I find out how to
implement these languages, I will try.  Alternatively, can we band together
to destroy PDFs everywhere?  If we work in concert it may be possible. =)

Thanks again,

Rob

On Mon, Sep 14, 2015 at 9:22 PM, suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>
wrote:

> Dear Rob,
>
> Poppler extracts the text from a PDF via the series of glyphs.
> Therefore, for scripts whose characters Unicode encodes in
> visible order, the first step of the text extraction is
> possible.
>
> However, some Asian scripts, especially Brahmic-based scripts,
> have very complicated layout rules, so the encoding order
> in Unicode text is phonetic and differs from the visible
> order (e.g. coded characters are in consonant-then-vowel order,
> but the displayed characters are in vowel-then-consonant order).
>
> In such a case, the character series extracted via the glyph series
> is not correctly ordered coded text.
>
> I'm not sure which script you assume for Indonesian (Latin?
> Javanese? Balinese?), but among the Thai, Burmese, and Khmer scripts,
> only the Thai script is coded in visible order. The other scripts
> have the vowel-then-consonant ordering issue, so it is not easy
> for Poppler to extract the text as correct "Unicode" text.
> Therefore, the result you have (Thai is OK, others are not)
> sounds reasonable.
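
The ordering issue described above can be sketched in a few lines of Python. This is only an illustration, not what Poppler does: the function name `visual_to_logical` and the small character tables are assumptions made for the example, and it models just the simplest case, where one pre-base vowel glyph is drawn immediately to the left of one base consonant. Real Khmer and Burmese text also has subscript consonants and medials, which this ignores.

```python
# Hypothetical sketch: turn glyph (visual) order into Unicode (logical) order
# for the simplest pre-base vowel case. Not taken from Poppler.

# Vowel signs that are drawn to the LEFT of the consonant but are encoded
# AFTER it in Unicode text.
PREBASE_VOWELS = {
    "\u17C1", "\u17C2", "\u17C3",  # Khmer vowel signs E, AE, AI
    "\u1031",                      # Myanmar (Burmese) vowel sign E
}


def is_base_consonant(ch):
    """Return True for Khmer (U+1780..U+17A2) or Myanmar (U+1000..U+1021) base letters."""
    cp = ord(ch)
    return 0x1780 <= cp <= 0x17A2 or 0x1000 <= cp <= 0x1021


def visual_to_logical(glyph_order):
    """Move each pre-base vowel behind the consonant that follows it.

    Assumption: every pre-base vowel belongs to the single consonant
    directly after it in glyph order. Anything else is copied unchanged.
    """
    out = []
    i = 0
    n = len(glyph_order)
    while i < n:
        ch = glyph_order[i]
        if ch in PREBASE_VOWELS and i + 1 < n and is_base_consonant(glyph_order[i + 1]):
            out.append(glyph_order[i + 1])  # consonant first (logical order)
            out.append(ch)                  # then the vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, `visual_to_logical("\u17C1\u1780")` (Khmer vowel sign E drawn before KA) gives `"\u1780\u17C1"`, whereas the Thai sequence `"\u0E40\u0E01"` (SARA E, KO KAI) comes back unchanged, because Thai is already encoded in visible order.
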
>
> I'm unfamiliar with the bleeding-edge technology in the latest
> PDF for dealing with such complex scripts (I guess PDF
> developers are willing to support them), but PDFs made
> by old PDF production software may have a similar problem.
>
> I wish some Adobe experts would comment on the situation in the
> latest PDF for complex scripts :-)
>
> Regards,
> mpsuzuki
>
> Rob Hawkins wrote:
> > Greetings all,
> >
> > Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
> > Vietnamese?  I didn't see a language pack for any except Thai, and that
> > one doesn't produce properly formatted characters for my source files.
> > They're missing the vowel marks.  The other languages fail completely on
> > my setup.  I've tried on OS X and Ubuntu 12.
> >
> > My source files are here:
> > https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
> >
> > Chinese seems to work fine.
> >
> > I found out that PDF.js will produce good output, though I already have
> > code based on pdftohtml output and would rather not switch if not
> > necessary.  I wonder if there is something wrong with my setup.
> >
> > Thanks for any help even if it's just a "nope, that's not possible" kind
> > of reply =)
> >
> > Rob