<div dir="ltr">Thank you all for these great replies. I find the stuff about the Unicode encoding order really interesting. And I too wish we could find more information about the as-yet unmapped Asian scripts.<div><br></div><div>I was mistaken about the output of PDF.js. I thought I had viewed the HTML source and seen good data, how exciting! Yet now that I double-check, I see it is just the viewer that is correct, and the source text is garbled just like the output of pdftotext etc.</div><div><br></div><div>I'm bummed there is no magic solution here, as I thought I had found, but glad to see people are still interested in this. If I find out how to implement these languages, I will try. Alternatively, can we band together to destroy PDFs everywhere? If we work in concert it may be possible. =)</div><div><br></div><div>Thanks again,</div><div><br></div><div class="gmail_extra"><div><div class="gmail_signature"><div dir="ltr">Rob<br></div></div></div>
<br><div class="gmail_quote">On Mon, Sep 14, 2015 at 9:22 PM, suzuki toshiya <span dir="ltr"><<a href="mailto:mpsuzuki@hiroshima-u.ac.jp" target="_blank">mpsuzuki@hiroshima-u.ac.jp</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dear Rob,<br>
<br>
Poppler extracts the text from a PDF via the series of glyphs.<br>
Therefore, for scripts whose characters Unicode encodes in<br>
visible order, the first step of the text extraction is<br>
possible.<br>
<br>
However, some Asian scripts, especially Brahmic-based scripts,<br>
have very complicated layout rules, so the encoding order<br>
in Unicode text is phonetic and different from the visible<br>
order (e.g. coded characters are in consonant-then-vowel order,<br>
but the displayed characters are in vowel-then-consonant order).<br>
<br>
In such cases, the character series extracted via the glyph series<br>
is not correctly coded text.<br>
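<br>
As a rough illustration of the reordering that would be needed (a<br>
simplified sketch only, not Poppler's code; the function and constant<br>
names are hypothetical, and it handles just three Khmer pre-base vowels,<br>
ignoring subscripts and split vowels):<br>
<pre>
```python
# Khmer vowel signs that are drawn to the LEFT of their consonant
# (simplified set: VOWEL SIGN E, AE, AI). Assumption for illustration.
KHMER_PREBASE_VOWELS = {"\u17c1", "\u17c2", "\u17c3"}


def is_khmer_consonant(ch):
    # Khmer consonants occupy U+1780 (KA) through U+17A2 (QA).
    return len(ch) == 1 and ord(ch) in range(0x1780, 0x17A3)


def visual_to_logical(chars):
    """Move each pre-base vowel after the consonant that follows it.

    Input is the character series in glyph (visible) order; output is
    the consonant-then-vowel order that Unicode expects for Khmer.
    """
    out = []
    i = 0
    while i != len(chars):
        ch = chars[i]
        nxt = chars[i + 1] if i + 1 != len(chars) else ""
        if ch in KHMER_PREBASE_VOWELS and is_khmer_consonant(nxt):
            # Visible order: vowel, consonant. Coded order: consonant, vowel.
            out.append(nxt)
            out.append(ch)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```
</pre>
Thai text passes through this function unchanged, because Thai<br>
pre-base vowels are already coded before the consonant.<br>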
<br>
I'm not sure which script you assume for Indonesian (Latin?<br>
Javanese? Balinese?), but, among Thai, Burmese, Khmer scripts,<br>
only the Thai script is coded in visible order. The other scripts<br>
have the vowel-then-consonant ordering issue, so it is not easy<br>
for Poppler to extract the text as correct "Unicode" text.<br>
Therefore, the result you have (Thai is OK, others are not)<br>
sounds reasonable.<br>
<br>
I'm unfamiliar with the bleeding-edge technology in the latest<br>
PDF for dealing with such complex scripts (I guess PDF<br>
developers are willing to support them), but PDFs made<br>
by old PDF production software may have a similar problem.<br>
<br>
I wish some Adobe experts would comment on the situation in the<br>
latest PDF for complex scripts :-)<br>
<br>
Regards,<br>
mpsuzuki<br>
<div><div class="h5"><br>
Rob Hawkins wrote:<br>
> Greetings all,<br>
><br>
> Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and<br>
> Vietnamese? I didn't see a language pack for any except Thai, and that one<br>
> doesn't produce properly formatted characters for my source files. They're<br>
> missing the vowel marks. The other languages fail completely on my setup.<br>
> I've tried on OS X and Ubuntu 12.<br>
><br>
> My source files are here:<br>
> <a href="https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf" rel="noreferrer" target="_blank">https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf</a><br>
><br>
> Chinese seems to work fine.<br>
><br>
> I found out that PDF.js will produce good output, though I already have<br>
> code based on pdftohtml output and would rather not switch if not<br>
> necessary. I wonder if there is something wrong with my setup.<br>
><br>
> Thanks for any help even if it's just a "nope, that's not possible" kind of<br>
> reply =)<br>
><br>
> Rob<br>
><br>
><br>
><br>
</div></div>> ------------------------------------------------------------------------<br>
<div class="HOEnZb"><div class="h5">><br>
> _______________________________________________<br>
> poppler mailing list<br>
> <a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a><br>
> <a href="http://lists.freedesktop.org/mailman/listinfo/poppler" rel="noreferrer" target="_blank">http://lists.freedesktop.org/mailman/listinfo/poppler</a><br>
<br>
</div></div></blockquote></div><br></div></div>