[Poppler-bugs] [Bug 20013] New: pdftotext doesn't support /Alt nor /ActualText with octal content

bugzilla-daemon at freedesktop.org
Sun Feb 8 21:55:06 PST 2009


http://bugs.freedesktop.org/show_bug.cgi?id=20013

           Summary: pdftotext doesn't support /Alt nor /ActualText with
                    octal content
           Product: poppler
           Version: unspecified
          Platform: All
               URL: http://www.maths.mq.edu.au/~ross/poppler/Big5/
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: general
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: ross at maths.mq.edu.au


Created an attachment (id=22702)
 --> (http://bugs.freedesktop.org/attachment.cgi?id=22702)
Similar to the PDF with /ActualText tagging at the second location given in
the description. The internal PDF coding is slightly different, but the effect
should be the same.

Trying to extract text from PDFs constructed using pdfTeX and containing
/ActualText or /Alt tags does not give the desired results.

 1.  /Alt tagging is not supported at all.
 2.  /ActualText tagging is recognised but no content is extracted.

Here are some examples from
     http://www.maths.mq.edu.au/~ross/poppler/Big5/

Big5-actual.pdf  (170 kB)    --- has /ActualText tagging
Big5-actual.txt  (97 bytes)
Big5-alt.pdf     (169 kB)    --- has /Alt tagging
Big5-alt.txt     (434 bytes)
Big5-notags.pdf  (157 kB)    --- no special tagging
Big5-notags.txt  (432 bytes)

The corresponding .txt files were produced with pdftotext -raw,
using the following pdftotext/Poppler version:

[GlenMorangie:~/PDFTeX/test-PDFs] rossmoor% pdftotext --help
pdftotext version 0.10.3

Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC


It is clear just from the file size of Big5-actual.txt (97 bytes, against
432 bytes for Big5-notags.txt) that Poppler isn't extracting the /ActualText
in this case.
Also, if you look at the contents of Big5-notags.txt you'll see the
"multiple-striking" used to get the bold effect: each struck character is
extracted more than once.

In Big5-alt.pdf and Big5-actual.pdf, each triple-struck glyph is tagged so
that it is meant to be mapped to a single Unicode character.
But Poppler has no support for /Alt tagging, which is why
Big5-alt.txt is practically the same size as Big5-notags.txt.


With these three PDFs, Adobe Reader cannot extract the Chinese characters from
Big5-notags.pdf, whereas it can do so from Big5-actual.pdf and Big5-alt.pdf
thanks to the extra tagging.

Apple's Preview and Poppler, on the other hand, can identify the characters
(presumably from information in the fonts or their encoding arrays --- a CMap
is not applicable). But both extract three copies wherever the multiple
striking occurs, so neither is honouring the /Alt or /ActualText tags.
Furthermore, Poppler outputs nothing at all for the ideographs marked with
/ActualText tagging.


Speculation: is Poppler failing to extract the information in the tagging
strings because they contain octal character codes?  For example, the tagging
looks like this:

        /Span<</ActualText(\376\377\307\164)>> BDC
            ... Chinese/Korean ideograph ...
        EMC

whereas the coding in  TextOutputDev.cc  that handles this is:

         actualText = obj.getString();

and

      if (!actualText->hasUnicodeMarker()) {
        if (actualText->getLength() > 0) {
          //non-unicode string -- assume pdfDocEncoding and
          //try to convert to UTF16BE
          uniString = pdfDocEncodingToUTF16(actualText, &length);
        } else {
          length = 0;
        }
      } else {
        uniString = actualText->getCString();
        length = actualText->getLength();
      }
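
To make the two branches concrete, here is a minimal, Poppler-independent
sketch of what the "has Unicode marker" path amounts to: check for the
UTF-16BE byte-order mark (bytes FE FF) and, if present, read the rest as
big-endian 16-bit units. The names startsWithUtf16BeBom and
utf16BeToCodePoints are hypothetical and are not Poppler functions; this only
illustrates the encoding, not how TextOutputDev.cc actually does it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a BOM test: true if the bytes start with FE FF.
bool startsWithUtf16BeBom(const std::string &s) {
    return s.size() >= 2 &&
           static_cast<unsigned char>(s[0]) == 0xFE &&
           static_cast<unsigned char>(s[1]) == 0xFF;
}

// Hypothetical helper: decode UTF-16BE bytes into Unicode code points,
// skipping a leading BOM if there is one. A trailing odd byte is ignored.
std::vector<std::uint32_t> utf16BeToCodePoints(const std::string &s) {
    std::vector<std::uint32_t> out;
    std::size_t i = startsWithUtf16BeBom(s) ? 2 : 0;
    while (i + 1 < s.size()) {
        std::uint32_t unit =
            (static_cast<std::uint32_t>(static_cast<unsigned char>(s[i])) << 8) |
            static_cast<std::uint32_t>(static_cast<unsigned char>(s[i + 1]));
        i += 2;
        // Combine a high surrogate with a following low surrogate.
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.size()) {
            std::uint32_t low =
                (static_cast<std::uint32_t>(static_cast<unsigned char>(s[i])) << 8) |
                static_cast<std::uint32_t>(static_cast<unsigned char>(s[i + 1]));
            if (low >= 0xDC00 && low <= 0xDFFF) {
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        out.push_back(unit);
    }
    return out;
}
```

For the four bytes FE FF C7 74 from the example tag, this yields the single
code point U+C774, so a string of that form should take the "Unicode marker"
branch rather than the pdfDocEncoding one.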

Shouldn't there be some use of  GooString  within this coding block, to
properly handle those octal character codes?
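
For reference, under the PDF literal-string rules an octal escape such as
\376 simply denotes one raw byte. The sketch below shows that decoding step in
isolation; decodePdfLiteral is a hypothetical helper written for this report,
not a Poppler function, and where Poppler performs the equivalent step is an
open question here. As a simplification, only a backslash followed by a
line feed is treated as a line continuation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Poppler): decode the escape sequences in
// the body of a PDF literal string, i.e. the text between the parentheses.
std::string decodePdfLiteral(const std::string &src) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        if (c != '\\') {
            out.push_back(c);
            continue;
        }
        if (++i >= src.size()) {
            break; // lone trailing backslash: nothing left to decode
        }
        c = src[i];
        if (c >= '0' && c <= '7') {
            // Octal escape: one to three octal digits, kept to 8 bits.
            int value = 0;
            int digits = 0;
            while (digits < 3 && i < src.size() &&
                   src[i] >= '0' && src[i] <= '7') {
                value = value * 8 + (src[i] - '0');
                ++i;
                ++digits;
            }
            --i; // the for loop steps past the last digit
            out.push_back(static_cast<char>(value & 0xFF));
        } else {
            switch (c) {
            case 'n': out.push_back('\n'); break;
            case 'r': out.push_back('\r'); break;
            case 't': out.push_back('\t'); break;
            case 'b': out.push_back('\b'); break;
            case 'f': out.push_back('\f'); break;
            case '\n': break;                 // backslash-newline: continuation
            default: out.push_back(c); break; // \( \) \\ and unknown escapes
            }
        }
    }
    return out;
}
```

Applied to the tag body \376\377\307\164, this gives the four bytes
FE FF C7 74: a UTF-16BE byte-order mark followed by one 16-bit code unit.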

There are some more similar examples, involving Korean fonts, at:
    http://www.maths.mq.edu.au/~ross/poppler/KS/


-- 
Configure bugmail: http://bugs.freedesktop.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

