[Poppler-bugs] [Bug 50739] New: Unable to convert PDF to xml using pdftohtml (empty pages)
bugzilla-daemon at freedesktop.org
Tue Jun 5 09:21:14 PDT 2012
https://bugs.freedesktop.org/show_bug.cgi?id=50739
Bug #: 50739
Summary: Unable to convert PDF to xml using pdftohtml (empty pages)
Classification: Unclassified
Product: poppler
Version: unspecified
Platform: Other
OS/Version: All
Status: NEW
Severity: normal
Priority: medium
Component: pdftohtml
AssignedTo: poppler-bugs at lists.freedesktop.org
ReportedBy: pere at hungry.com
When I convert
http://nrk.no/contentfile/file/1.8116520!offentligjournal02052012.pdf
to XML using
pdftohtml -xml -noframes 1.8116520\!offentligjournal02052012.pdf
I get an XML file with no text content. I find this rather strange,
as the text in the PDF is searchable using xpdf, okular and evince. Any
idea where the text went? Is there anything I can do to get access to
the text as XML?
This is the output I get:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="792" width="612">
<fontspec id="0" size="18" family="Helvetica" color="#000000"/>
<fontspec id="1" size="5" family="Helvetica" color="#000000"/>
<fontspec id="2" size="5" family="Helvetica" color="#000000"/>
<fontspec id="3" size="7" family="Helvetica" color="#000000"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="792" width="612">
<fontspec id="4" size="6" family="Helvetica" color="#000000"/>
</page>
<page number="3" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="4" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="5" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="6" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="7" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="8" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="9" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="10" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="11" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="12" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="13" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="14" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="15" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="16" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="17" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="18" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="19" position="absolute" top="0" left="0" height="792" width="612">
</page>
<page number="20" position="absolute" top="0" left="0" height="792" width="612">
</page>
</pdf2xml>
This problem has also been reported to Debian as http://bugs.debian.org/676238
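
For anyone who wants to check other files for the same symptom, here is a
minimal Python sketch that counts the <text> elements on each <page> of a
pdftohtml -xml output file and lists the pages that have none. The function
names are made up for illustration and are not part of poppler. It assumes
the output is well-formed XML in which text lines appear as <text> elements
inside <page> elements, as pdftohtml normally writes them. It only detects
the problem; it does not explain why the text is missing.

```python
import xml.etree.ElementTree as ET


def count_text_per_page(xml_data):
    """Map each page number to the number of <text> elements it contains.

    xml_data is the content of a pdftohtml -xml output file (bytes or str).
    Hypothetical helper, not part of poppler.
    """
    root = ET.fromstring(xml_data)
    counts = {}
    for page in root.iter("page"):
        number = int(page.get("number"))
        # fontspec elements are ignored; only actual text lines are counted
        counts[number] = sum(1 for _ in page.iter("text"))
    return counts


def empty_pages(xml_data):
    """Return the sorted page numbers that contain no <text> element."""
    counts = count_text_per_page(xml_data)
    return sorted(n for n, c in counts.items() if c == 0)
```

To use it, read the generated XML file in binary mode and pass the bytes to
empty_pages(). For the output shown above, every page would be listed, since
none of them contains a <text> element.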
--
Configure bugmail: https://bugs.freedesktop.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.