The main problem with your files is that the producer is doing a S**TTY job of font encoding... I don't know what type of font(s) you are starting with, but the final PDF uses dynamically generated, quite poor, Type 1 fonts. In addition, there are _NO_ ToUnicode tables (ISO 32000-1, 9.6.1), which is what prevents the proper extraction of the content by a tool that follows the requirements of section 9.10 of ISO 32000-1.<div>
<br></div><div>Adobe Acrobat/Reader is able to properly extract the text from the ActualText version, since the "ActualText" is provided in the tagged content as defined in section 14.9.4 of ISO 32000-1. However, if you read the preceding section (14.9.3), which describes the Alt tag, you will see that it is not what you think it is, which is why that doesn't work.</div>
<div><br></div><div>It is unfortunate that only Adobe's tools correctly support Tagged PDF and use those features to provide richer semantic extraction of PDF content. I would love to see someone add such support to Poppler...</div>
<div><br></div><div>Leonard Rosenthol</div><div>PDF Standards Architect</div><div>Adobe Systems</div><div><div><br><br><div class="gmail_quote">On Sun, Feb 8, 2009 at 3:16 PM, Ross Moore <span dir="ltr">&lt;<a href="mailto:ross@ics.mq.edu.au">ross@ics.mq.edu.au</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hi Adamson, and Albert.<div class="Ih2E3d"><br>
<br>
On 09/02/2009, at 3:14 AM, Adamson H wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Yes, I have poppler-data 0.2.0-2. Please take a look at these two<br>
screenshots <a href="http://launchpadlibrarian.net/21701829/Screenshot-evince.png" target="_blank">http://launchpadlibrarian.net/21701829/Screenshot-evince.png</a><br>
and <a href="http://launchpadlibrarian.net/21701834/Screenshot-foxit.png" target="_blank">http://launchpadlibrarian.net/21701834/Screenshot-foxit.png</a> for<br>
</blockquote>
<br></div>
I see broken characters too, using Apple's Preview to read dell440.pdf,<br>
but not with Adobe Reader v9.x on MacOS X.<br>
<br>
(see attached images, of a portion of your PDF).<br>
<br>
<br> <br><br>
<br>
This suggests that it is a problem within the fonts themselves,<br>
allowing different possible interpretations by font-rendering<br>
software.<br>
<br>
Alternatively, Adobe is using some information that other PDF<br>
renderers, or text-extraction tools, are not using.<br>
To back up this statement, suppose you select the last of<br>
the blue dot-points from dell440.pdf and Copy/Paste to UTF8 text.<br>
<br>
Adobe gives:<br>
<br>
以电子邮件发送您的订单<br>
<br>
<br>
Now use pdftotext -raw dell440.pdf<br>
and find the appropriate portion; you'll get:<br>
<br>
以以以以电电电电子子子子邮邮邮邮件件件件发发发发送送送送您您您您的的的的订单订单订单订单<br>
<br>
in which each ideograph is repeated 4 times over.<br>
This repetition seems to be a common way to get bold-face,<br>
rather than using a separate font.<br>
Copy/Paste from Apple's Preview gives the same result as pdftotext.<br>
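<br>
A minimal sketch of undoing this repetition after extraction, assuming the repeat count is known and fixed (4 for the line above). The function name collapse_overstrike is hypothetical; this is only a post-processing heuristic on the extracted text, not what Adobe Reader or Poppler does internally. It treats any unit repeated 4 times in a row as one unit, which also covers the trailing pair that is repeated as a group.<br>
<br>
```python
import re


def collapse_overstrike(text: str, repeat: int = 4) -> str:
    """Collapse any unit repeated `repeat` times in a row into one copy."""
    if repeat < 2:
        return text
    # (.+?) lazily captures the shortest candidate unit; the backreference
    # then requires the remaining (repeat - 1) copies to follow immediately.
    pattern = re.compile(r"(.+?)\1{%d}" % (repeat - 1))
    return pattern.sub(r"\1", text)


# The pdftotext -raw output quoted above, with every unit struck four times.
raw = "以以以以电电电电子子子子邮邮邮邮件件件件发发发发送送送送您您您您的的的的订单订单订单订单"
print(collapse_overstrike(raw))
```
<br>
The obvious weakness is that text which is genuinely repeated that many times would be collapsed as well, so this cannot replace proper use of the tagging.<br>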
<br>
Perhaps it is this multiple overstriking that causes the bad display?<br>
If so, how does Adobe know how to get it correct?<br>
What extra information is Adobe using?<br>
<br>
<br>
I have another example of this kind of thing:<br>
<br>
<a href="http://www.maths.mq.edu.au/~ross/poppler/Big5/" target="_blank">http://www.maths.mq.edu.au/~ross/poppler/Big5/</a><br>
<br>
Big5-actual.pdf 170k --- has /ActualText tagging<br>
Big5-actual.txt 97 bytes<br>
Big5-alt.pdf 169k --- has /Alt tagging<br>
Big5-alt.txt 434 bytes<br>
Big5-notags.pdf 157k --- no special tagging<br>
Big5-notags.txt 432 bytes<br>
<br>
The corresponding .txt files were obtained using pdftotext -raw<br>
with Poppler version as follows:<br>
<br>
[GlenMorangie:~/PDFTeX/test-PDFs] rossmoor% pdftotext --help<br>
pdftotext version 0.10.3<br>
Copyright 2005-2009 The Poppler Developers - <a href="http://poppler.freedesktop.org" target="_blank">http://poppler.freedesktop.org</a><br>
Copyright 1996-2004 Glyph &amp; Cog, LLC<br>
<br>
<br>
It is clear just from the file-size of Big5-actual.txt that<br>
Poppler isn't extracting the /ActualText in this case.<br>
Also, if you look at the contents of Big5-notags.txt you'll<br>
see the same kind of "multiple-striking" to get the bold effect.<br>
<br>
With Big5-alt.pdf (and Big5-actual.pdf) this triple-striking<br>
is meant to be mapped to a single Unicode character.<br>
But Poppler has no support for /Alt tagging, which is why<br>
Big5-alt.txt is practically the same size as Big5-notags.txt .<br>
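<br>
For illustration, a minimal sketch of the kind of marked-content sequence meant here, assuming the /Span ... BDC ... EMC form with an /ActualText entry and a UTF-16BE text string with a leading byte-order mark. The helper name actualtext_span and the glyph-showing operators are hypothetical and are not copied from the Big5-*.pdf files.<br>
<br>
```python
def actualtext_span(text: str, show_ops: str) -> str:
    """Wrap glyph-showing operators in a /Span that carries /ActualText."""
    # Unicode PDF text strings are UTF-16BE, prefixed with the BOM FE FF.
    payload = b"\xfe\xff" + text.encode("utf-16-be")
    hex_string = payload.hex().upper()
    return f"/Span << /ActualText <{hex_string}> >> BDC\n{show_ops}\nEMC"


# Hypothetical operators striking one glyph three times at small offsets;
# the glyph code <0A1B> is made up for this example.
ops = "<0A1B> Tj\n0.3 0 Td <0A1B> Tj\n0.3 0 Td <0A1B> Tj"
print(actualtext_span("以", ops))
```
<br>
A tool that honours the tag would output the single character from /ActualText for the whole span, instead of one character per strike.<br>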
<br>
<br>
With these three PDFs, Adobe Reader cannot extract<br>
the Chinese characters from Big5-notags.pdf<br>
whereas it can do so from Big5-actual.pdf and Big5-alt.pdf<br>
due to the extra tagging.<br>
<br>
Apple's Preview and Poppler, on the other hand, can identify<br>
the characters (presumably from information in the fonts or<br>
their encoding arrays --- a CMap is not applicable).<br>
But both extract three copies when the multiple striking occurs,<br>
so are not dealing with the /Alt or /ActualText tags.<br>
Furthermore, Poppler gives nothing for the ideographs<br>
marked with /ActualText tagging.<br>
<br>
<br>
I've been looking at this kind of thing for some time now,<br>
with tagging and Chinese/Korean/Japanese documents (produced<br>
using pdfTeX) and the results of text extraction using different<br>
tools.<br>
It seems that no-one gets it right all the time, which makes<br>
it really hard to prepare a bug-report --- which software is<br>
the one which is buggy, when all appear to neglect available<br>
information, or process it incorrectly in different ways?<br>
<br>
For certain I can say that Poppler has (at least) two bugs:<br>
<br>
1. /ActualText doesn't work properly for the content in these<br>
Big5-*.pdf documents;<br>
<br>
2. /Alt isn't even recognised by Poppler;<br>
(there is no coding to support it in either<br>
TextOutputDev.cc or Gfx.cc )<br>
<br>
Could the PDF property streams in my PDFs be malformed in some way?<br>
Yes, I've looked at that, and have tried different ways to place<br>
the tagging in them --- these made no difference whatsoever to<br>
the result of text-extraction with the different software tools<br>
that I've tried.<br>
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
comparison. I don't have any problems viewing other Chinese PDF files on<br>
my system. I use pdffonts to check fonts used by the file, and I have<br>
simsun.ttf installed.<br>
<br>
Adamson H<br>
</blockquote>
<br>
<br>
Hope this helps someone identify the problems,<br>
and where/how to fix them.<br>
<br>
Cheers,<br>
<br>
Ross<br>
<br>
<br>
------------------------------------------------------------------------<br>
Ross Moore <a href="mailto:ross@maths.mq.edu.au" target="_blank">ross@maths.mq.edu.au</a><br>
Mathematics Department office: E7A-419<br>
Macquarie University tel: +61 (0)2 9850 8955<br>
Sydney, Australia 2109 fax: +61 (0)2 9850 8114<br>
------------------------------------------------------------------------<br>
<br>
<br>
<br>
<br>_______________________________________________<br>
poppler mailing list<br>
<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a><br>
<a href="http://lists.freedesktop.org/mailman/listinfo/poppler" target="_blank">http://lists.freedesktop.org/mailman/listinfo/poppler</a><br>
<br></blockquote></div><br></div></div>