[poppler] Compatibility between poppler's pdfunite and JHOVE.
Leonard Rosenthol
lrosenth at adobe.com
Fri Apr 7 17:33:25 UTC 2017
Can I assume that you are aware that JHOVE is NOT a PDF validator in any way? In addition, its support for modern PDF features is quite out of date, and their own site (<http://jhove.openpreservation.org/modules/pdf/>) says as much. I suspect that if you ran these files through a more thorough PDF validator, such as the one in Adobe Acrobat Pro, it would not report any problems.
Leonard
On 4/7/17, 1:09 PM, "poppler on behalf of Russell McOrmond" <poppler-bounces at lists.freedesktop.org on behalf of Russell.McOrmond at canadiana.ca> wrote:
The organisation I work for currently uses poppler's pdfunite
utility as part of our preservation system. We scan documents, run
them through ABBYY Recognition Server to generate a PDF for each page,
and have been using pdfunite to join those files into multi-page PDFs,
which are available for download.
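
For illustration, the joining step described above could be scripted roughly as below. This is a minimal sketch, not the actual preservation system: the helper names `build_pdfunite_command` and `join_pages` are hypothetical, and it assumes zero-padded page names such as 0001.pdf so that sorting by name gives page order. The only detail taken from pdfunite itself is its argument order (source files first, destination last).

```python
import subprocess
from pathlib import Path


def build_pdfunite_command(page_dir, output, pdfunite="pdfunite"):
    """Return the argv list that joins every *.pdf in page_dir into output.

    Pages are sorted by file name, which assumes zero-padded names
    such as 0001.pdf, 0002.pdf (an assumption made for this sketch).
    """
    pages = sorted(str(p) for p in Path(page_dir).glob("*.pdf"))
    if not pages:
        raise ValueError(f"no PDF pages found in {page_dir}")
    # pdfunite takes the source files first and the destination last.
    return [pdfunite, *pages, str(output)]


def join_pages(page_dir, output):
    """Run pdfunite; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_pdfunite_command(page_dir, output), check=True)
```

Keeping the command construction separate from the call to `subprocess.run` makes it possible to inspect the page order without pdfunite installed.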
We recently started to investigate adopting JHOVE
http://jhove.openpreservation.org/ which identifies and validates
files including PDF files. JHOVE is indicating there are problems
with the files we create with pdfunite as well as the files we
previously created with `pdftk cat`.
The thread in the JHOVE forum can be seen at
http://lists.openpreservation.org/pipermail/jhove/2017-April/thread.html#3
A pdfunite generated file is available via
http://pub.canadiana.ca/view/omcn.MississaugaNews_2 (download link
beside the zoom buttons).
With `pdftk cat` the problem happens after a certain size (around 795
pages from the sample pages I used).
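
One way to pin down such a size threshold is a binary search over the page count. The sketch below is only an illustration: `is_valid` is a hypothetical callback that would join the first n pages and report whether the result validates, and the search assumes that once a page count fails, every larger count fails too. That monotonic behaviour is an assumption, not something established here.

```python
def first_failing_count(max_pages, is_valid):
    """Return the smallest n in 1..max_pages for which is_valid(n) is False.

    Returns None if max_pages itself validates. Assumes failures are
    monotonic in n (an assumption made for this sketch).
    """
    lo, hi = 1, max_pages
    if is_valid(hi):
        return None
    # Invariant: hi is known to fail; every count below lo is treated as valid.
    while lo < hi:
        mid = (lo + hi) // 2
        if is_valid(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo
```

This needs on the order of log2(max_pages) join-and-validate runs instead of one per page count.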
Using only two of those single-page PDF files as an example, I get the
following with the latest release of pdfunite (compiled on Ubuntu
14.04):
cihm at russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2/0001.pdf | grep '<status'
<status>Well-Formed and valid</status>
cihm at russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2/0002.pdf | grep '<status'
<status>Well-Formed and valid</status>
cihm at russell-desktop:/opt/wip/Temp/rwm$ /usr/local/bin/pdfunite MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf pdfunite.pdf
cihm at russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml pdfunite.pdf
java.lang.ArrayIndexOutOfBoundsException: 60
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
at Jhove.main(Jhove.java:292)
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://hul.harvard.edu/ois/xml/ns/jhove"
xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove
http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd" name="Jhove"
release="1.16.5" date="2017-03-20">
<date>2017-04-07T12:22:35-04:00</date>
<repInfo uri="pdfunite.pdf">
<reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
<lastModified>2017-04-07T12:22:16-04:00</lastModified>
<size>2888705</size>
<format>PDF</format>
<status>Not well-formed</status>
<sigMatch>
<module>PDF-hul</module>
</sigMatch>
<messages>
<message offset="2888253" severity="error">46</message>
<message offset="0" severity="error">No document catalog dictionary</message>
</messages>
<mimeType>application/pdf</mimeType>
</repInfo>
</jhove>
cihm at russell-desktop:/opt/wip/Temp/rwm$ /usr/local/bin/pdfunite -v
pdfunite version 0.53.0
Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
cihm at russell-desktop:/opt/wip/Temp/rwm$
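
Rather than grepping for `<status`, a report like the one above can be summarized with a few lines of Python. This is a sketch under stated assumptions: the function name `summarize_jhove_report` is hypothetical, the namespace is the one JHOVE declares on its root element (http://hul.harvard.edu/ois/xml/ns/jhove), and only the XML portion of the output is passed in, not the Java stack trace that precedes it in the transcript.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <jhove> root element of the report.
JHOVE_NS = {"j": "http://hul.harvard.edu/ois/xml/ns/jhove"}


def summarize_jhove_report(xml_text):
    """Return (status, messages) for the first <repInfo> in a JHOVE XML report.

    messages is a list of (severity, offset, text) tuples, in document order.
    """
    root = ET.fromstring(xml_text)
    rep = root.find("j:repInfo", JHOVE_NS)
    if rep is None:
        raise ValueError("no repInfo element found")
    status = rep.findtext("j:status", default="", namespaces=JHOVE_NS)
    messages = [
        (m.get("severity"), m.get("offset"), (m.text or "").strip())
        for m in rep.findall("j:messages/j:message", JHOVE_NS)
    ]
    return status, messages
```

Returning the messages alongside the status keeps the byte offsets, which is where to start looking in the joined file.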
Same with the older version of Poppler that is distributed with Ubuntu 14.04:
cihm at russell-desktop:/opt/wip/Temp/rwm$ /usr/bin/pdfunite MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf pdfunite.pdf
cihm at russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml pdfunite.pdf
java.lang.ArrayIndexOutOfBoundsException: 60
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
at Jhove.main(Jhove.java:292)
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://hul.harvard.edu/ois/xml/ns/jhove"
xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove
http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd" name="Jhove"
release="1.16.5" date="2017-03-20">
<date>2017-04-07T12:25:50-04:00</date>
<repInfo uri="pdfunite.pdf">
<reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
<lastModified>2017-04-07T12:25:42-04:00</lastModified>
<size>2885504</size>
<format>PDF</format>
<status>Not well-formed</status>
<sigMatch>
<module>PDF-hul</module>
</sigMatch>
<messages>
<message offset="2885066" severity="error">46</message>
<message offset="0" severity="error">No document catalog dictionary</message>
</messages>
<mimeType>application/pdf</mimeType>
</repInfo>
</jhove>
cihm at russell-desktop:/opt/wip/Temp/rwm$ /usr/bin/pdfunite -v
pdfunite version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
cihm at russell-desktop:/opt/wip/Temp/rwm$
Any suggestions? I can make the source PDF files available if that
would help.
cihm at russell-desktop:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2/0001.pdf
Producer: ABBYY Recognition Server
CreationDate: Sun Mar 12 09:44:20 2017 EDT
ModDate: Sun Mar 12 09:44:20 2017 EDT
Tagged: yes
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 733.45 x 1486.1 pts
Page rot: 0
File size: 2388234 bytes
Optimized: no
PDF version: 1.4
cihm at russell-desktop:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2/0002.pdf
Producer: ABBYY Recognition Server
CreationDate: Sun Mar 12 09:43:59 2017 EDT
ModDate: Sun Mar 12 09:43:59 2017 EDT
Tagged: yes
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 783.35 x 1428.5 pts
Page rot: 0
File size: 511205 bytes
Optimized: no
PDF version: 1.4
cihm at russell-desktop:/opt/wip/Temp/rwm$
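
To compare the two source pages field by field, the `pdfinfo` output above can be parsed as plain "Key: value" lines. This is a rough sketch with hypothetical helper names (`parse_pdfinfo`, `differing_fields`); it assumes only the pdfinfo output itself is passed in, without the shell prompt lines.

```python
def parse_pdfinfo(text):
    """Parse pdfinfo output ("Key: value" per line) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:  # lines without any colon are skipped
            info[key.strip()] = value.strip()
    return info


def differing_fields(a, b):
    """Return the sorted keys whose values differ between two parsed outputs."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

For the two files shown here, everything except the dates, page size, and file size matches.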
--
System Administration and software developer,
Canadiana.org http://www.canadiana.ca
_______________________________________________
poppler mailing list
poppler at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/poppler