[poppler] Working with Asian languages

Jason Crain jason at aquaticape.us
Mon Sep 14 06:12:45 PDT 2015


On 2015-09-13 20:06, Rob Hawkins wrote:
> Greetings all,
> 
> Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
> Vietnamese?  I didn't see a language pack for any except Thai, and
> that one doesn't produce properly formatted characters for my source
> files.  They're missing the vowel marks.  The other languages fail
> completely on my setup.  I've tried on OS X and Ubuntu 12.
> 
> My source files are here:
> https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
> 
> Chinese seems to work fine.
> 
> I found out that PDF.js will produce good output, though I already
> have code based on pdftohtml output and would rather not switch if not
> necessary.  I wonder if there is something wrong with my setup.
> 
> Thanks for any help even if it's just a "nope, that's not possible"
> kind of reply =)
> 
> Rob

pdftohtml can work with those languages, but it depends on the ability to
extract the plain text from the document.  The couple of PDFs I've
looked at have problems with text extraction.  Possibly poppler could
do a better job, but since several applications I tried have problems
extracting text from those documents, it's probably just a problem with
the documents themselves.
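
One way to check whether text extraction is the culprit is to look at
what pdftotext returns for a file.  The following is only a rough
sketch, not something poppler ships: the function names extract_text
and summarize are made up for illustration, and it assumes pdftotext is
on the PATH.  It counts letters, combining marks (which include the
Thai upper vowels and tone marks) and characters that usually indicate
a failed mapping to Unicode.

```python
import subprocess
import unicodedata
from collections import Counter


def extract_text(pdf_path):
    """Run poppler's pdftotext and return the extracted text.

    Assumes the pdftotext binary is installed and on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],  # "-" writes to stdout
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def summarize(text):
    """Count letters, combining marks, and suspicious characters in text."""
    counts = Counter()
    for ch in text:
        category = unicodedata.category(ch)
        if ch == "\ufffd" or category == "Co":
            # U+FFFD or a private-use code point: the font probably has
            # no usable mapping from glyphs to Unicode.
            counts["suspicious"] += 1
        elif category.startswith("M"):
            # Combining marks, e.g. Thai SARA I (U+0E34).
            counts["marks"] += 1
        elif category.startswith("L"):
            counts["letters"] += 1
    return counts
```

Calling summarize(extract_text("some.pdf")) on a Thai document (the file
name is a placeholder) and getting many letters but no marks at all, or
a lot of suspicious characters, would suggest the text is already
damaged before pdftohtml formats anything.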

I assume that PDF.js just works in a different way and doesn't require
the extracted text to be correct.

