The main problem with your files is that the producer is doing a S**TTY job of font encoding... I don't know what type of font(s) you are starting with, but the final PDF uses dynamically generated, quite poor, Type 1 fonts.  In addition, there are _NO_ ToUnicode tables (ISO 32000-1, 9.6.1), which is what prevents the proper extraction of the content by a tool that follows the requirements of section 9.10 of ISO 32000-1.<div> </div><div>Adobe Acrobat/Reader is able to properly extract the text from the ActualText version, since the "ActualText" is provided in the tagged content as defined in section 14.9.4 of ISO 32000-1.  However, if you read the preceding section (14.9.3), which describes the Alt tag, you will see that it is not what you think it is, which is why that doesn't work.</div> <div> </div><div>It is unfortunate that only Adobe's tools correctly support Tagged PDF and use those features to provide richer semantic extraction of PDF content.  I would love to see someone add such support to Poppler...</div> <div> </div><div>Leonard Rosenthol</div><div>PDF Standards Architect</div><div>Adobe Systems</div><div><div> <div class="gmail_quote">On Sun, Feb 8, 2009 at 3:16 PM, Ross Moore <<a href="mailto:ross@ics.mq.edu.au">ross@ics.mq.edu.au</a>> wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hi Adamson, and Albert.<div class="Ih2E3d"> On 09/02/2009, at 3:14 AM, Adamson H wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Yes, I have poppler-data 0.2.0-2. 
Please take a look at these two screenshots <a href="http://launchpadlibrarian.net/21701829/Screenshot-" target="_blank">http://launchpadlibrarian.net/21701829/Screenshot-</a>evince.png and <a href="http://launchpadlibrarian.net/21701834/Screenshot-foxit.png" target="_blank">http://launchpadlibrarian.net/21701834/Screenshot-foxit.png</a> for </blockquote> </div> I see broken characters too, using Apple's Preview to read dell440.pdf, but not with Adobe Reader v9.x on Mac OS X (see the attached images of a portion of your PDF). This suggests that it is a problem within the fonts themselves, allowing different possible interpretations by font-rendering software. Alternatively, Adobe is using some information that other PDF renderers, or text-extraction tools, are not using. To back up this statement, suppose you select the last of the blue dot-points from dell440.pdf and Copy/Paste to UTF-8 text. Adobe gives: 以电子邮件发送您的订单 Now use pdftotext -raw dell440.pdf and find the appropriate portion; you'll get: 以以以以电电电电子子子子邮邮邮邮件件件件发发发发送送送送您您您您的的的的订单订单订单订单 in which each ideograph (or, for 订单, the pair) is repeated 4 times over. This repetition seems to be a common way to get bold-face, rather than using a separate font. Copy/Paste from Apple's Preview gives the same result as pdftotext. Perhaps it is this multiple overstriking that causes the bad display? If so, how does Adobe know how to get it correct? What extra information is Adobe using? 
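To make the repetition concrete, here is a minimal Python sketch that collapses text in which each glyph group is emitted a fixed number of times in a row, as in the pdftotext output quoted above. The name collapse_overstrike and its parameters are hypothetical, chosen only for illustration. This is not how Adobe Reader or Poppler work internally; a real extractor would more likely compare glyph positions on the page than post-process the extracted string.

```python
def collapse_overstrike(text, repeats, max_unit=4):
    """Collapse runs where a short unit is repeated `repeats` times in a row.

    Hypothetical helper for illustration; `max_unit` is the longest unit
    (in characters) that is tried at each position.
    """
    out = []
    i = 0
    while i < len(text):
        for unit_len in range(1, max_unit + 1):
            unit = text[i:i + unit_len]
            # A unit counts as overstruck if the next `repeats` copies match.
            if text[i:i + unit_len * repeats] == unit * repeats:
                out.append(unit)
                i += unit_len * repeats
                break
        else:
            # No repeated unit starts here: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Applied with repeats=4 to the string reported for dell440.pdf, this yields the same text that Adobe Reader gives. Text that legitimately repeats a character four times would also be collapsed, so it is only a heuristic.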
I have another example of this kind of thing, at <a href="http://www.maths.mq.edu.au/~ross/poppler/Big5/" target="_blank">http://www.maths.mq.edu.au/~ross/poppler/Big5/</a>: Big5-actual.pdf (170k, has /ActualText tagging) with Big5-actual.txt (97 bytes); Big5-alt.pdf (169k, has /Alt tagging) with Big5-alt.txt (434 bytes); Big5-notags.pdf (157k, no special tagging) with Big5-notags.txt (432 bytes). The corresponding .txt files were obtained using pdftotext -raw with the following Poppler version: [GlenMorangie:~/PDFTeX/test-PDFs] rossmoor% pdftotext --help pdftotext version 0.10.3 Copyright 2005-2009 The Poppler Developers - http://<a href="http://poppler.freedesktop.org" target="_blank">poppler.freedesktop.org</a> Copyright 1996-2004 Glyph & Cog, LLC. It is clear just from the file size of Big5-actual.txt that Poppler isn't extracting the /ActualText in this case. Also, if you look at the contents of Big5-notags.txt you'll see the same kind of "multiple-striking" to get the bold effect. With Big5-alt.pdf (and Big5-actual.pdf) this triple-striking is meant to be mapped to a single Unicode character. But Poppler has no support for /Alt tagging, which is why Big5-alt.txt is practically the same size as Big5-notags.txt. With these three PDFs, Adobe Reader cannot extract the Chinese characters from Big5-notags.pdf, whereas it can do so from Big5-actual.pdf and Big5-alt.pdf due to the extra tagging. Apple's Preview and Poppler, on the other hand, can identify the characters (presumably from information in the fonts or their encoding arrays --- a CMap is not applicable). But both extract three copies when the multiple striking occurs, so they are not dealing with the /Alt or /ActualText tags. Furthermore, Poppler gives nothing for the ideographs marked with /ActualText tagging. 
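The size comparison above can be repeated with a short Python sketch along the following lines. It assumes pdftotext is on the PATH and writes to standard output when the output file is given as "-"; the helper names extract_raw, count_cjk and summarize are hypothetical, and counting only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is a simplification.

```python
import subprocess


def extract_raw(pdf_path):
    """Run `pdftotext -raw PDF -` and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", "-raw", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def count_cjk(text):
    """Count characters in the basic CJK Unified Ideographs block."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def summarize(pdf_paths, extractor=extract_raw):
    """Return (path, UTF-8 byte size, ideograph count) for each PDF."""
    rows = []
    for path in pdf_paths:
        text = extractor(path)
        rows.append((path, len(text.encode("utf-8")), count_cjk(text)))
    return rows
```

Calling summarize(["Big5-actual.pdf", "Big5-alt.pdf", "Big5-notags.pdf"]) in the directory holding the three files would give one row per file, which makes a missing or tripled run of ideographs easy to spot.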
I've been looking at this kind of thing for some time now, with tagging and Chinese/Korean/Japanese documents (produced using pdfTeX) and the result of Text-extraction using different tools.  It seems that no-one gets it right all the time, which makes it really hard to prepare a bug-report --- which software is the one which is buggy, when all appear to neglect available information, or process it incorrectly in different ways? For certain I can say that Poppler has (at least) two bugs:  1.  /ActualText  doesn't work properly for the content in these                   Big5-*.pdf  documents;  2.  /Alt  isn't even recognised by Poppler;       (there is no coding to support it in either          TextOutputDev.cc   or   Gfx.cc ) Could the PDF property streams in my PDFs be malformed in some way? Yes, I've looked at that, and have tried different ways to place the tagging in them --- these made no difference whatsoever to the result of text-extraction with the different software tools that I've tried. <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> comparison. I don't have any problems viewing other Chinese PDF files on my system. I use pdffonts to check fonts used by the file, and I have simsun.ttf installed. Adamson H </blockquote> Hope this helps someone identify the problems, and where/how to fix them. 
Cheers,        Ross ------------------------------------------------------------------------ Ross Moore                                       <a href="mailto:ross@maths.mq.edu.au" target="_blank">ross@maths.mq.edu.au</a> Mathematics Department                           office: E7A-419 Macquarie University                             tel: +61 (0)2 9850 8955 Sydney, Australia  2109                          fax: +61 (0)2 9850 8114 ------------------------------------------------------------------------ _______________________________________________ poppler mailing list <a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a> <a href="http://lists.freedesktop.org/mailman/listinfo/poppler" target="_blank">http://lists.freedesktop.org/mailman/listinfo/poppler</a> </blockquote></div> </div></div>
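As a rough way to check the point made at the top of this message about missing ToUnicode tables, the following Python sketch counts occurrences of a few PDF name tokens in the raw bytes of a file. The function name count_pdf_names is hypothetical. It is only a crude first look: names inside compressed object streams or compressed content streams are not visible to a byte scan, so a count of zero does not prove that the entry is absent.

```python
import re

# A PDF name ends at whitespace, a delimiter character, or end of data.
_NAME_END = rb"(?=[\s()<>\[\]{}/%]|\Z)"


def count_pdf_names(data, names=(b"/ToUnicode", b"/ActualText", b"/Alt")):
    """Count whole-name occurrences of each PDF name in raw bytes."""
    counts = {}
    for name in names:
        pattern = re.compile(re.escape(name) + _NAME_END)
        counts[name.decode("ascii")] = len(pattern.findall(data))
    return counts
```

For example, reading a file with open("Big5-actual.pdf", "rb") and passing its contents to count_pdf_names gives a quick indication of whether uncompressed /ToUnicode or /ActualText entries are present.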