[Poppler-bugs] [Bug 28282] New: pdftohtml is unable to extract the text in some PDF files

Thu May 27 07:49:03 PDT 2010

https://bugs.freedesktop.org/show_bug.cgi?id=28282

           Summary: pdftohtml is unable to extract the text in some PDF
                    files
           Product: poppler
           Version: unspecified
          Platform: x86 (IA32)
        OS/Version: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: general
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: jaime at iteisa.com

(As discussed in
http://lists.freedesktop.org/archives/poppler/2010-May/005791.html)

It seems poppler is unable to extract text from some PDF files (I'm not
attaching the file to this bug report due to its size):

http://iteisa.com/tmp/poppler-sample.pdf (11 MB)

pdftohtml from poppler 0.12.4 and 0.12.2 is not able to extract the
text, and evince displays the document correctly but is unable to select
its text. However, acroread displays and selects the text correctly (so
it's normal, editable text and not an image).
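
For anyone trying to reproduce the symptom from the command line, here is
a minimal shell sketch. It assumes pdftotext from poppler-utils is
installed and uses it as a complementary check alongside pdftohtml; the
names has_text and check_pdf are hypothetical helpers made up for this
example and are not part of poppler.

```shell
#!/bin/sh
# has_text: hypothetical helper. Succeeds if stdin contains at least one
# non-whitespace character (form feeds between pages count as whitespace).
has_text() {
    [ -n "$(tr -d '[:space:]')" ]
}

# check_pdf: hypothetical wrapper. Runs pdftotext on the given file,
# writing the extracted text to stdout ("-"), and reports whether any
# text came out at all.
check_pdf() {
    if pdftotext "$1" - 2>/dev/null | has_text; then
        echo "text extracted"
    else
        echo "no text extracted"
    fi
}

# Example invocation (commented out so that sourcing this file has no
# side effects):
# check_pdf poppler-sample.pdf
```

This only tells whether any text was extracted, not whether the text is
correct, so it is a quick triage step rather than a full test.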

Everything seems OK with the file:

$ pdfinfo poppler-sample.pdf
> Title:          untitled
> Creator:        Adobe InDesign CS4 (6.0.4)
> Producer:       Acrobat Distiller 9.0.0 (Windows)
> CreationDate:   Wed May  5 09:35:12 2010
> ModDate:        Wed May  5 09:35:12 2010
> Tagged:         no
> Pages:          208
> Encrypted:      no
> Page size:      595.276 x 841.89 pts (A4)
> File size:      10536602 bytes
> Optimized:      no
> PDF version:    1.4
