[Poppler-bugs] [Bug 50739] New: Unable to convert PDF to xml using pdftohtml (empty pages)

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Tue Jun 5 09:21:14 PDT 2012


https://bugs.freedesktop.org/show_bug.cgi?id=50739

             Bug #: 50739
           Summary: Unable to convert PDF to xml using pdftohtml (empty
                    pages)
    Classification: Unclassified
           Product: poppler
           Version: unspecified
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: pdftohtml
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: pere at hungry.com


When I convert
http://nrk.no/contentfile/file/1.8116520!offentligjournal02052012.pdf
to XML using

  pdftohtml -xml -noframes 1.8116520\!offentligjournal02052012.pdf

I get an XML file with no text in it, only <page> and <fontspec> elements.
I find this rather strange, as the PDF is searchable using xpdf, okular
and evince.  Any idea where the text went?  Is there anything I can do to
get access to the text as XML?

This is the output I get:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="0" size="18" family="Helvetica" color="#000000"/>
        <fontspec id="1" size="5" family="Helvetica" color="#000000"/>
        <fontspec id="2" size="5" family="Helvetica" color="#000000"/>
        <fontspec id="3" size="7" family="Helvetica" color="#000000"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="4" size="6" family="Helvetica" color="#000000"/>
</page>
<page number="3" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="4" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="5" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="6" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="7" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="8" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="9" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="10" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="11" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="12" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="13" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="14" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="15" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="16" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="17" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="18" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="19" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="20" position="absolute" top="0" left="0" height="792" width="612">
</page>
</pdf2xml>
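
To check for this symptom mechanically, something like the following
sketch might help.  It assumes that pdftohtml -xml normally writes each
text run as a <text> child of its <page> element, and reports the pages
that have none.  The function name empty_pages and the SAMPLE string
are made up for illustration; SAMPLE is a shortened stand-in for the
output above, with a hypothetical third page containing text added for
contrast.

```python
import xml.etree.ElementTree as ET


def empty_pages(xml_data):
    """Return the numbers of all <page> elements with no <text> child.

    Assumes the pdf2xml layout shown above: <page> elements under a
    <pdf2xml> root, each carrying a "number" attribute.
    """
    root = ET.fromstring(xml_data)
    return [
        int(page.get("number"))
        for page in root.iter("page")
        if page.find("text") is None
    ]


# Shortened stand-in for the reported output.  Page 3 and its <text>
# element are hypothetical and only show what a non-empty page looks like.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
        <fontspec id="0" size="18" family="Helvetica" color="#000000"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="3" position="absolute" top="0" left="0" height="792" width="612">
        <text top="10" left="10" width="50" height="12" font="0">example</text>
</page>
</pdf2xml>"""

print(empty_pages(SAMPLE))
```

For a real output file, the contents can be read as bytes and passed to
empty_pages() unchanged.  For the output above, all 20 pages would be
listed.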

This problem is also reported to Debian as http://bugs.debian.org/676238
