[poppler] Compatibility between poppler's pdfunite and JHOVE.

Leonard Rosenthol lrosenth at adobe.com
Fri Apr 7 17:33:25 UTC 2017


Can I assume that you are aware that JHOVE is NOT a PDF validator in any way?  In addition, its support for modern PDF features is quite out of date!  And their own site (<http://jhove.openpreservation.org/modules/pdf/>) says as much.  I suspect that if you ran these files through a more thorough PDF validator, such as the one in Adobe Acrobat Pro, it would not report any problems.

Leonard

On 4/7/17, 1:09 PM, "poppler on behalf of Russell McOrmond" <poppler-bounces at lists.freedesktop.org on behalf of Russell.McOrmond at canadiana.ca> wrote:

      The organisation I work for currently uses poppler's pdfunite
    utility as part of our preservation system.  We scan documents, run
    them through ABBYY Recognition Server to generate a PDF for each page,
    and have been using pdfunite to join those files into multi-page PDFs
    which are available for download.
    
      We recently started to investigate adopting JHOVE
    http://jhove.openpreservation.org/ which identifies and validates
    files including PDF files.   JHOVE is indicating there are problems
    with the files we create with pdfunite as well as the files we
    previously created with `pdftk cat`.
    
    The thread in the JHOVE forum can be seen at
    http://lists.openpreservation.org/pipermail/jhove/2017-April/thread.html#3
    
    A pdfunite-generated file is available via
    http://pub.canadiana.ca/view/omcn.MississaugaNews_2  (download link
    beside the zoom buttons).
    
    
    With `pdftk cat` the problem happens after a certain size (around 795
    pages from the sample pages I used).
    
    
    Using only two of those single-page PDF files as an example, I get the
    following with the latest release of pdfunite (compiled on Ubuntu
    14.04):
    
    cihm at russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2/0001.pdf | grep '<status'
      <status>Well-Formed and valid</status>
    cihm at russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2/0002.pdf | grep '<status'
      <status>Well-Formed and valid</status>
    cihm at russell-desktop:/opt/wip/Temp/rwm$ /usr/local/bin/pdfunite MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf pdfunite.pdf
    cihm at russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml pdfunite.pdf
    java.lang.ArrayIndexOutOfBoundsException: 60
    at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
    at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
    at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
    at Jhove.main(Jhove.java:292)
    <?xml version="1.0" encoding="UTF-8"?>
    <jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://hul.harvard.edu/ois/xml/ns/jhove"
    xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove
    http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd" name="Jhove"
    release="1.16.5" date="2017-03-20">
     <date>2017-04-07T12:22:35-04:00</date>
     <repInfo uri="pdfunite.pdf">
      <reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
      <lastModified>2017-04-07T12:22:16-04:00</lastModified>
      <size>2888705</size>
      <format>PDF</format>
      <status>Not well-formed</status>
      <sigMatch>
      <module>PDF-hul</module>
      </sigMatch>
      <messages>
       <message offset="2888253" severity="error">46</message>
       <message offset="0" severity="error">No document catalog dictionary</message>
      </messages>
      <mimeType>application/pdf</mimeType>
     </repInfo>
    </jhove>
    cihm at russell-desktop:/opt/wip/Temp/rwm$ /usr/local/bin/pdfunite -v
    pdfunite version 0.53.0
    Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
    Copyright 1996-2011 Glyph & Cog, LLC
    cihm at russell-desktop:/opt/wip/Temp/rwm$
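
The `grep '<status'` check used above can also be scripted. What follows is a minimal Python sketch, not part of the original message. It assumes the combined JHOVE output (the Java stack trace followed by the XML report, as in the session above) has already been captured into a string. The function name `summarize_jhove` and the sample text are made up for illustration; the namespace URI is the one JHOVE declares in its report.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the JHOVE XML report.
JHOVE_NS = {"j": "http://hul.harvard.edu/ois/xml/ns/jhove"}


def summarize_jhove(output):
    """Return one dict per <repInfo>: uri, status, and (severity, text) messages.

    `output` is the captured text of a JHOVE run with the XML handler.
    Anything printed before the XML declaration (for example a Java
    stack trace) is skipped.
    """
    start = output.find("<?xml")
    if start == -1:
        raise ValueError("no XML declaration found in JHOVE output")
    # Parse bytes so the encoding declaration is honoured by the parser.
    root = ET.fromstring(output[start:].encode("utf-8"))
    reports = []
    for rep in root.findall("j:repInfo", JHOVE_NS):
        messages = [
            (msg.get("severity"), msg.text)
            for msg in rep.findall("j:messages/j:message", JHOVE_NS)
        ]
        reports.append(
            {
                "uri": rep.get("uri"),
                "status": rep.findtext("j:status", namespaces=JHOVE_NS),
                "messages": messages,
            }
        )
    return reports


if __name__ == "__main__":
    # Invented sample, shaped like the report in this thread.
    sample = (
        "java.lang.ArrayIndexOutOfBoundsException: 60\n"
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove" name="Jhove">\n'
        ' <repInfo uri="pdfunite.pdf">\n'
        "  <status>Not well-formed</status>\n"
        "  <messages>\n"
        '   <message offset="0" severity="error">No document catalog dictionary</message>\n'
        "  </messages>\n"
        " </repInfo>\n"
        "</jhove>\n"
    )
    for report in summarize_jhove(sample):
        print(report["uri"], report["status"], report["messages"])
```

Collecting the messages together with the status makes it easier to see why a file was rejected, which the `grep` on `<status` alone does not show.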
    
    
    Same with the older version of Poppler that is distributed with Ubuntu 14.04:
    
    cihm at russell-desktop:/opt/wip/Temp/rwm$ /usr/bin/pdfunite MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf pdfunite.pdf
    cihm at russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml pdfunite.pdf
    java.lang.ArrayIndexOutOfBoundsException: 60
    at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
    at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
    at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
    at Jhove.main(Jhove.java:292)
    <?xml version="1.0" encoding="UTF-8"?>
    <jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://hul.harvard.edu/ois/xml/ns/jhove"
    xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove
    http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd" name="Jhove"
    release="1.16.5" date="2017-03-20">
     <date>2017-04-07T12:25:50-04:00</date>
     <repInfo uri="pdfunite.pdf">
      <reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
      <lastModified>2017-04-07T12:25:42-04:00</lastModified>
      <size>2885504</size>
      <format>PDF</format>
      <status>Not well-formed</status>
      <sigMatch>
      <module>PDF-hul</module>
      </sigMatch>
      <messages>
       <message offset="2885066" severity="error">46</message>
       <message offset="0" severity="error">No document catalog dictionary</message>
      </messages>
      <mimeType>application/pdf</mimeType>
     </repInfo>
    </jhove>
    cihm at russell-desktop:/opt/wip/Temp/rwm$ /usr/bin/pdfunite -v
    pdfunite version 0.24.5
    Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
    Copyright 1996-2011 Glyph & Cog, LLC
    cihm at russell-desktop:/opt/wip/Temp/rwm$
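
In both runs the `ArrayIndexOutOfBoundsException: 60` is raised while JHOVE resolves an indirect object on its way to the document catalog. One way to see what the merged file itself declares is to read the `/Root` and `/Size` entries from its trailer. The Python sketch below is a rough illustration, not a diagnosis of these files. It assumes the file ends with a classic `trailer` dictionary rather than a cross-reference stream; the helper name `trailer_root_and_size` is hypothetical and the sample bytes are invented.

```python
import re


def trailer_root_and_size(pdf_bytes):
    """Return (root_object_number, size) from the last trailer, or None.

    Only classic "trailer << ... >>" dictionaries are handled; files that
    use a cross-reference stream instead will yield None.
    """
    # The last "trailer" keyword belongs to the most recent revision.
    start = pdf_bytes.rfind(b"trailer")
    if start == -1:
        return None
    tail = pdf_bytes[start:]
    root = re.search(rb"/Root\s+(\d+)\s+\d+\s+R", tail)
    size = re.search(rb"/Size\s+(\d+)", tail)
    if root is None or size is None:
        return None
    return int(root.group(1)), int(size.group(1))


if __name__ == "__main__":
    # Invented bytes; a real check would read the merged file from disk,
    # e.g. open("pdfunite.pdf", "rb").read().
    sample = (
        b"%PDF-1.4\n"
        b"trailer\n<< /Size 5 /Root 3 0 R >>\n"
        b"startxref\n9\n%%EOF\n"
    )
    print(trailer_root_and_size(sample))
```

In a consistent file the `/Root` object number is smaller than `/Size`. If that holds for the merged file, the reader is the more likely place to look; if it does not, the file itself would deserve a closer look.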
    
    
    Any suggestions?  I can make the source PDF files available if that
    would help.
    
    cihm at russell-desktop:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2/0001.pdf
    Producer:       ABBYY Recognition Server
    CreationDate:   Sun Mar 12 09:44:20 2017 EDT
    ModDate:        Sun Mar 12 09:44:20 2017 EDT
    Tagged:         yes
    UserProperties: no
    Suspects:       no
    Form:           none
    JavaScript:     no
    Pages:          1
    Encrypted:      no
    Page size:      733.45 x 1486.1 pts
    Page rot:       0
    File size:      2388234 bytes
    Optimized:      no
    PDF version:    1.4
    cihm at russell-desktop:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2/0002.pdf
    Producer:       ABBYY Recognition Server
    CreationDate:   Sun Mar 12 09:43:59 2017 EDT
    ModDate:        Sun Mar 12 09:43:59 2017 EDT
    Tagged:         yes
    UserProperties: no
    Suspects:       no
    Form:           none
    JavaScript:     no
    Pages:          1
    Encrypted:      no
    Page size:      783.35 x 1428.5 pts
    Page rot:       0
    File size:      511205 bytes
    Optimized:      no
    PDF version:    1.4
    cihm at russell-desktop:/opt/wip/Temp/rwm$
    
    -- 
    System Administration and software developer,
    Canadiana.org   http://www.canadiana.ca
    


