[poppler] Compatibility between poppler's pdfunite and JHOVE.

Russell McOrmond Russell.McOrmond at canadiana.ca
Fri Apr 7 17:09:05 UTC 2017


  The organisation I work for currently uses poppler's pdfunite
utility as part of our preservation system.  We scan documents, run
them through ABBYY Recognition Server to generate a PDF for each page,
and then use pdfunite to join those files into multi-page PDFs, which
are made available for download.

  We recently started to investigate adopting JHOVE
(http://jhove.openpreservation.org/), which identifies and validates
files, including PDF files.  JHOVE reports problems with the files we
create with pdfunite, as well as with the files we previously created
with `pdftk cat`.

The thread in the JHOVE forum can be seen at
http://lists.openpreservation.org/pipermail/jhove/2017-April/thread.html#3

A pdfunite-generated file is available at
http://pub.canadiana.ca/view/omcn.MississaugaNews_2  (the download link
is beside the zoom buttons).


With `pdftk cat` the problem only appears once the output reaches a
certain size (around 795 pages with the sample pages I used).
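
One way to narrow down that page count would be a binary search over the
number of joined pages.  The following is only a sketch: the function
names, the JHOVE and PDFTK variables, and the example bounds are
hypothetical, and it assumes that once a page count fails, every larger
count fails too.

```shell
#!/bin/sh
# Assumed tool locations; override through the environment if they differ.
JHOVE="${JHOVE:-/opt/jhove/jhove}"
PDFTK="${PDFTK:-pdftk}"

# joined_ok DIR N OUT: join the first N page PDFs in DIR with `pdftk cat`
# and succeed only if JHOVE reports OUT as "Well-Formed and valid".
# Relies on zero-padded names (0001.pdf, 0002.pdf, ...) so that the glob
# sorts in page order.  Runs in a subshell so its variables stay local.
joined_ok() (
    dir=$1 n=$2 out=$3
    set --
    for f in "$dir"/*.pdf; do
        [ "$#" -lt "$n" ] || break
        set -- "$@" "$f"
    done
    rm -f "$out"
    "$PDFTK" "$@" cat output "$out" || exit 1
    # JHOVE's stderr is discarded here to keep the search output quiet.
    "$JHOVE" -m pdf-hul -h xml "$out" 2>/dev/null |
        grep -q '<status>Well-Formed and valid</status>'
)

# first_bad_count DIR GOOD BAD: binary search between a page count known
# to validate (GOOD) and one known to fail (BAD); prints the smallest
# failing count.
first_bad_count() (
    dir=$1 good=$2 bad=$3
    tmpdir=$(mktemp -d) || exit 1
    while [ $((bad - good)) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if joined_ok "$dir" "$mid" "$tmpdir/joined.pdf"; then
            good=$mid
        else
            bad=$mid
        fi
    done
    rm -rf "$tmpdir"
    echo "$bad"
)

# Example with hypothetical bounds:
#   first_bad_count MississaugaNews_2 1 1000
```

Each step joins the pages again from scratch, so the search costs about
log2(BAD - GOOD) runs of pdftk and JHOVE.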


Using only two of those single-page PDF files as an example, I get the
following with the latest release of pdfunite (compiled on Ubuntu
14.04):

cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2/0001.pdf | grep '<status'
  <status>Well-Formed and valid</status>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2/0002.pdf | grep '<status'
  <status>Well-Formed and valid</status>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/local/bin/pdfunite MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf pdfunite.pdf
cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml pdfunite.pdf
java.lang.ArrayIndexOutOfBoundsException: 60
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
at Jhove.main(Jhove.java:292)
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://hul.harvard.edu/ois/xml/ns/jhove"
xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove
http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd" name="Jhove"
release="1.16.5" date="2017-03-20">
 <date>2017-04-07T12:22:35-04:00</date>
 <repInfo uri="pdfunite.pdf">
  <reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
  <lastModified>2017-04-07T12:22:16-04:00</lastModified>
  <size>2888705</size>
  <format>PDF</format>
  <status>Not well-formed</status>
  <sigMatch>
  <module>PDF-hul</module>
  </sigMatch>
  <messages>
   <message offset="2888253" severity="error">46</message>
   <message offset="0" severity="error">No document catalog dictionary</message>
  </messages>
  <mimeType>application/pdf</mimeType>
 </repInfo>
</jhove>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/local/bin/pdfunite -v
pdfunite version 0.53.0
Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
cihm@russell-desktop:/opt/wip/Temp/rwm$


The same happens with the older version of Poppler that is distributed with Ubuntu 14.04:

cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/bin/pdfunite MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf pdfunite.pdf
cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml pdfunite.pdf
java.lang.ArrayIndexOutOfBoundsException: 60
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
at Jhove.main(Jhove.java:292)
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://hul.harvard.edu/ois/xml/ns/jhove"
xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove
http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd" name="Jhove"
release="1.16.5" date="2017-03-20">
 <date>2017-04-07T12:25:50-04:00</date>
 <repInfo uri="pdfunite.pdf">
  <reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
  <lastModified>2017-04-07T12:25:42-04:00</lastModified>
  <size>2885504</size>
  <format>PDF</format>
  <status>Not well-formed</status>
  <sigMatch>
  <module>PDF-hul</module>
  </sigMatch>
  <messages>
   <message offset="2885066" severity="error">46</message>
   <message offset="0" severity="error">No document catalog dictionary</message>
  </messages>
  <mimeType>application/pdf</mimeType>
 </repInfo>
</jhove>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/bin/pdfunite -v
pdfunite version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
cihm@russell-desktop:/opt/wip/Temp/rwm$
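
Both sessions above follow the same steps: validate each input, join
the inputs, then validate the result.  As a rough sketch, those steps
could be wrapped in a small script so that different pdfunite builds can
be compared side by side.  The function names and the JHOVE and PDFUNITE
variables are made up for illustration; the command-line options are the
ones used in the sessions above.

```shell
#!/bin/sh
# Defaults are assumptions based on the sessions above; override through
# the environment, e.g. PDFUNITE=/usr/bin/pdfunite for the distribution
# build.
JHOVE="${JHOVE:-/opt/jhove/jhove}"
PDFUNITE="${PDFUNITE:-/usr/local/bin/pdfunite}"

# jhove_status FILE: print only the text of JHOVE's <status> element,
# for example "Well-Formed and valid".
jhove_status() {
    "$JHOVE" -m pdf-hul -h xml "$1" |
        sed -n 's:.*<status>\(.*\)</status>.*:\1:p'
}

# check_merge OUT IN...: report the status of every input, join the
# inputs into OUT with pdfunite, then report the status of OUT.
check_merge() {
    merged=$1
    shift
    for f in "$@"; do
        printf '%s: %s\n' "$f" "$(jhove_status "$f")"
    done
    "$PDFUNITE" "$@" "$merged" || return 1
    printf '%s: %s\n' "$merged" "$(jhove_status "$merged")"
}

# Example:
#   check_merge pdfunite.pdf MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf
```

With the files from this report, the two inputs would be expected to
print "Well-Formed and valid" and pdfunite.pdf "Not well-formed".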


Any suggestions?  I can make the source PDF files available if that
would help.

cihm@russell-desktop:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2/0001.pdf
Producer:       ABBYY Recognition Server
CreationDate:   Sun Mar 12 09:44:20 2017 EDT
ModDate:        Sun Mar 12 09:44:20 2017 EDT
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      733.45 x 1486.1 pts
Page rot:       0
File size:      2388234 bytes
Optimized:      no
PDF version:    1.4
cihm@russell-desktop:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2/0002.pdf
Producer:       ABBYY Recognition Server
CreationDate:   Sun Mar 12 09:43:59 2017 EDT
ModDate:        Sun Mar 12 09:43:59 2017 EDT
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      783.35 x 1428.5 pts
Page rot:       0
File size:      511205 bytes
Optimized:      no
PDF version:    1.4
cihm@russell-desktop:/opt/wip/Temp/rwm$

-- 
System Administration and software developer,
Canadiana.org   http://www.canadiana.ca

