[poppler] Compatibility between poppler's pdfunite and JHOVE.
Russell McOrmond
Russell.McOrmond at canadiana.ca
Fri Apr 7 17:09:05 UTC 2017
The organisation I work for currently uses poppler's pdfunite
utility as part of our preservation system. We scan documents, run
them through ABBYY Recognition Server to generate a PDF for each page,
and then use pdfunite to join those files into multi-page PDFs, which
are available for download.
We recently started to investigate adopting JHOVE
(http://jhove.openpreservation.org/), which identifies and validates
files, including PDF files. JHOVE reports problems with the files we
create with pdfunite as well as with the files we previously created
with `pdftk cat`.
The thread in the JHOVE forum can be seen at
http://lists.openpreservation.org/pipermail/jhove/2017-April/thread.html#3
A pdfunite generated file is available via
http://pub.canadiana.ca/view/omcn.MississaugaNews_2 (download link
beside the zoom buttons).
With `pdftk cat` the problem only appears once the output exceeds a
certain size (around 795 pages with the sample pages I used).
Using only two of those single-page PDF files as an example, I get the
following with the latest release of pdfunite (compiled on Ubuntu
14.04):
cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2/0001.pdf | grep '<status'
<status>Well-Formed and valid</status>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2/0002.pdf | grep '<status'
<status>Well-Formed and valid</status>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/local/bin/pdfunite MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf pdfunite.pdf
cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml pdfunite.pdf
java.lang.ArrayIndexOutOfBoundsException: 60
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
at Jhove.main(Jhove.java:292)
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://hul.harvard.edu/ois/xml/ns/jhove"
xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove
http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd" name="Jhove"
release="1.16.5" date="2017-03-20">
<date>2017-04-07T12:22:35-04:00</date>
<repInfo uri="pdfunite.pdf">
<reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
<lastModified>2017-04-07T12:22:16-04:00</lastModified>
<size>2888705</size>
<format>PDF</format>
<status>Not well-formed</status>
<sigMatch>
<module>PDF-hul</module>
</sigMatch>
<messages>
<message offset="2888253" severity="error">46</message>
<message offset="0" severity="error">No document catalog dictionary</message>
</messages>
<mimeType>application/pdf</mimeType>
</repInfo>
</jhove>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/local/bin/pdfunite -v
pdfunite version 0.53.0
Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
cihm@russell-desktop:/opt/wip/Temp/rwm$
The result is the same with the older version of Poppler distributed
with Ubuntu 14.04:
cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/bin/pdfunite MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf pdfunite.pdf
cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml pdfunite.pdf
java.lang.ArrayIndexOutOfBoundsException: 60
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
at Jhove.main(Jhove.java:292)
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://hul.harvard.edu/ois/xml/ns/jhove"
xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove
http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd" name="Jhove"
release="1.16.5" date="2017-03-20">
<date>2017-04-07T12:25:50-04:00</date>
<repInfo uri="pdfunite.pdf">
<reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
<lastModified>2017-04-07T12:25:42-04:00</lastModified>
<size>2885504</size>
<format>PDF</format>
<status>Not well-formed</status>
<sigMatch>
<module>PDF-hul</module>
</sigMatch>
<messages>
<message offset="2885066" severity="error">46</message>
<message offset="0" severity="error">No document catalog dictionary</message>
</messages>
<mimeType>application/pdf</mimeType>
</repInfo>
</jhove>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/bin/pdfunite -v
pdfunite version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
cihm@russell-desktop:/opt/wip/Temp/rwm$
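Both runs fail in the same place: JHOVE throws
ArrayIndexOutOfBoundsException: 60 while resolving the document
catalog. One possible reading, which is an assumption and not something
confirmed here, is that the trailer's /Root points at object 60 while
the cross-reference table has room for only objects 0 to 59. A minimal
sketch for checking that on a merged file follows. It assumes the file
ends with a classic `trailer` dictionary rather than a cross-reference
stream, and the function names are hypothetical.

```python
import re


def trailer_root_and_size(pdf_bytes):
    """Return (root_object_number, size) from the last classic trailer.

    Assumes a plain 'trailer << ... >>' dictionary; files that use a
    cross-reference stream instead are not handled by this sketch.
    """
    pos = pdf_bytes.rfind(b"trailer")
    if pos == -1:
        raise ValueError("no classic 'trailer' keyword found")
    tail = pdf_bytes[pos:]
    size = re.search(rb"/Size\s+(\d+)", tail)
    root = re.search(rb"/Root\s+(\d+)\s+\d+\s+R", tail)
    if size is None or root is None:
        raise ValueError("trailer lacks a readable /Size or /Root entry")
    return int(root.group(1)), int(size.group(1))


def root_within_xref(pdf_bytes):
    """True if the /Root object number is below the trailer's /Size.

    /Size is meant to be one greater than the highest object number in
    the file, so a /Root equal to or above it falls outside the table.
    """
    root, size = trailer_root_and_size(pdf_bytes)
    return root < size


if __name__ == "__main__":
    # 'pdfunite.pdf' is the merged file produced in the runs above.
    with open("pdfunite.pdf", "rb") as handle:
        data = handle.read()
    print(trailer_root_and_size(data), root_within_xref(data))
```

If the check returns False for the merged file but True for the
single-page inputs, that would point at the trailer written during the
merge; if it returns True, the cause lies elsewhere.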
Any suggestions? I can make the source PDF files available if that
would help. For reference, pdfinfo reports the following for the two
source files:
cihm@russell-desktop:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2/0001.pdf
Producer: ABBYY Recognition Server
CreationDate: Sun Mar 12 09:44:20 2017 EDT
ModDate: Sun Mar 12 09:44:20 2017 EDT
Tagged: yes
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 733.45 x 1486.1 pts
Page rot: 0
File size: 2388234 bytes
Optimized: no
PDF version: 1.4
cihm@russell-desktop:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2/0002.pdf
Producer: ABBYY Recognition Server
CreationDate: Sun Mar 12 09:43:59 2017 EDT
ModDate: Sun Mar 12 09:43:59 2017 EDT
Tagged: yes
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 783.35 x 1428.5 pts
Page rot: 0
File size: 511205 bytes
Optimized: no
PDF version: 1.4
cihm@russell-desktop:/opt/wip/Temp/rwm$
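For checking more than a couple of files, the `grep '<status'` step used
above could be replaced by parsing the JHOVE XML report. The following
is a minimal sketch, not part of JHOVE itself. It uses the namespace and
element names visible in the output above; the function name and the
trimmed sample report are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the JHOVE XML output shown earlier.
JHOVE_NS = {"j": "http://hul.harvard.edu/ois/xml/ns/jhove"}

# Trimmed copy of the report for pdfunite.pdf, including the first line
# of the Java stack trace that appears before the XML on the terminal.
SAMPLE_REPORT = """java.lang.ArrayIndexOutOfBoundsException: 60
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove" name="Jhove"
release="1.16.5" date="2017-03-20">
<repInfo uri="pdfunite.pdf">
<status>Not well-formed</status>
<messages>
<message offset="2888253" severity="error">46</message>
<message offset="0" severity="error">No document catalog dictionary</message>
</messages>
</repInfo>
</jhove>
"""


def summarize_jhove_report(xml_text):
    """Return a list of (uri, status, messages) tuples, one per <repInfo>.

    Each entry in messages is a (severity, text) pair.
    """
    # Skip anything printed before the XML, such as a stack trace.
    start = xml_text.find("<?xml")
    if start == -1:
        start = xml_text.find("<jhove")
    if start == -1:
        raise ValueError("no JHOVE XML found in input")
    root = ET.fromstring(xml_text[start:].encode("utf-8"))

    results = []
    for rep in root.findall("j:repInfo", JHOVE_NS):
        status = rep.findtext("j:status", default="", namespaces=JHOVE_NS)
        messages = [
            (msg.get("severity"), msg.text)
            for msg in rep.findall("j:messages/j:message", JHOVE_NS)
        ]
        results.append((rep.get("uri"), status, messages))
    return results


if __name__ == "__main__":
    for uri, status, messages in summarize_jhove_report(SAMPLE_REPORT):
        print(uri, status, messages)
```

In practice the report text would come from running
`jhove -m pdf-hul -h xml <file>` on each file and capturing its output.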
--
System Administration and software developer,
Canadiana.org http://www.canadiana.ca