[poppler] How does poppler deal with multiple charsets?

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Tue Nov 1 02:37:37 PDT 2011


Hi,

Please look at GfxCIDFont::getNextChar in GfxFont.cc. For non-8-bit
strings, it shows how poppler translates a byte stream into a Unicode
string. Note that text in a PDF is always tied to a font in that PDF,
so the encoding information is determined by the font.

Also, please check the poppler-data package for the mapping table resources.
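
To illustrate the idea only, here is a minimal sketch of font-driven
decoding. It is not poppler's code: ToyFontEncoding, makeSampleEncoding
and decode are hypothetical names, the GBK-like lead-byte rule is an
assumption chosen for the example, and the single lookup table collapses
whatever intermediate steps the real implementation performs.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the encoding data attached to one font.
// None of these names come from poppler.
struct ToyFontEncoding {
    // Character code -> Unicode code point (a tiny hand-made table).
    std::map<std::uint32_t, char32_t> toUnicode;

    // Assumption: a GBK-like scheme where lead bytes 0x81..0xFE start a
    // two-byte code and every other byte is a one-byte code.
    static std::size_t codeLength(unsigned char lead) {
        return (lead >= 0x81 && lead <= 0xFE) ? 2 : 1;
    }
};

// Builds a three-entry sample table: two GBK codes and one ASCII letter.
ToyFontEncoding makeSampleEncoding() {
    ToyFontEncoding enc;
    enc.toUnicode[0xD6D0] = U'\u4E2D';  // GBK D6 D0
    enc.toUnicode[0xCEC4] = U'\u6587';  // GBK CE C4
    enc.toUnicode[0x41] = U'A';
    return enc;
}

// Walks the byte string one character code at a time and looks each
// code up in the table that belongs to the font.
std::u32string decode(const std::string& bytes, const ToyFontEncoding& enc) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t n = ToyFontEncoding::codeLength(lead);
        if (i + n > bytes.size()) {
            n = bytes.size() - i;  // truncated final code
        }
        std::uint32_t code = 0;
        for (std::size_t k = 0; k < n; ++k) {
            code = (code << 8) | static_cast<unsigned char>(bytes[i + k]);
        }
        auto it = enc.toUnicode.find(code);
        // Unmapped codes become U+FFFD (replacement character).
        out.push_back(it != enc.toUnicode.end() ? it->second : U'\uFFFD');
        i += n;
    }
    return out;
}
```

The same bytes decoded with a different font's table would give a
different result, which is why the encoding cannot be guessed from the
byte string alone.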

Regards,
mpsuzuki

杨辉强 wrote:
> Hi, all:
>      I am new to poppler. I want to extract text from PDF files
> that contain Chinese GBK or other charsets.
>      Can poppler handle this situation, and if so, how does it do it?
> I am currently reading through the source code,
> so I would like to know which parts of it deal
> with multiple charsets.
> 
> 
> 
> Thank you very much.
