<html>
    <head>
      <base href="https://bugs.freedesktop.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - pdfimages extracts lots of same images with the same object number."
   href="https://bugs.freedesktop.org/show_bug.cgi?id=99883">99883</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>pdfimages extracts lots of same images with the same object number.
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>poppler
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>unspecified
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>Other
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>medium
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>utils
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>poppler-bugs@lists.freedesktop.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>ryanorz@126.com
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Created <span class=""><a href="attachment.cgi?id=129787" name="attach_129787" title="problem file">attachment 129787</a> <a href="attachment.cgi?id=129787&action=edit" title="problem file">[details]</a></span>
problem file

I have a PDF file for which pdfimages lists a lot of images with the same
object number. These images are identical. There are only about a thousand
pictures with different object numbers, but pdfimages lists more than 256,000
items. pdfimages then extracts every listed picture, so most of the output
files are duplicates and their total size is huge. I have uploaded the PDF, and
my simple patch is below (it may not be good, but it works :D).
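
The idea behind the patch can be sketched on its own, outside poppler. This is
only a minimal illustration: ImageRef and SeenImages are hypothetical names,
ImageRef stands in for poppler's Ref, and keying on the generation number as
well as the object number is an assumption (the patch below keys on the object
number only).

```cpp
#include <set>
#include <utility>

// Hypothetical stand-in for poppler's Ref (object number + generation).
struct ImageRef {
  int num;
  int gen;
};

// Remembers which image objects have already been written out.
class SeenImages {
public:
  // Returns true the first time an object is seen, false on every repeat.
  bool markFirstUse(const ImageRef &ref) {
    // std::set::insert reports whether the key was newly inserted.
    return seen.insert(std::make_pair(ref.num, ref.gen)).second;
  }

private:
  std::set<std::pair<int, int> > seen;
};
```

A caller would write an image only when markFirstUse() returns true and skip it
otherwise, which is what the early return in the patch does.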

From 237f4e0887eff2f22d5542dfed33fa94a8c7b0ff Mon Sep 17 00:00:00 2001
From: Ryan <<a href="mailto:ryanorz@126.com">ryanorz@126.com</a>>
Date: Tue, 21 Feb 2017 16:11:53 +0800
Subject: [PATCH] Fix(poppler-utils): pdfimages extract too many same pictures
 with the same object number.

---
 utils/ImageOutputDev.cc | 8 ++++++++
 utils/ImageOutputDev.h  | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/utils/ImageOutputDev.cc b/utils/ImageOutputDev.cc
index 5de51ad..26bf95b 100644
--- a/utils/ImageOutputDev.cc
+++ b/utils/ImageOutputDev.cc
@@ -442,6 +442,14 @@ void ImageOutputDev::writeImageFile(ImgWriter *writer, ImageFormat format, const
 void ImageOutputDev::writeImage(GfxState *state, Object *ref, Stream *str,
                                int width, int height,
                                GfxImageColorMap *colorMap, GBool inlineImg) {
+  if (ref->isRef()) {
+    const Ref imageRef = ref->getRef();
+    if (refNums.find(imageRef.num) != refNums.end())
+      return;
+    else
+      refNums.insert(imageRef.num);
+  }
+
   ImageFormat format;

   if (dumpJPEG && str->getKind() == strDCT &&
diff --git a/utils/ImageOutputDev.h b/utils/ImageOutputDev.h
index a694bbc..89c67ac 100644
--- a/utils/ImageOutputDev.h
+++ b/utils/ImageOutputDev.h
@@ -35,6 +35,7 @@
 #endif

 #include <stdio.h>
+#include <set>
 #include "goo/gtypes.h"
 #include "goo/ImgWriter.h"
 #include "OutputDev.h"
@@ -173,6 +174,7 @@ private:
   int pageNum;                 // current page number
   int imgNum;                  // current image number
   GBool ok;                    // set up ok?
+  std::set<int> refNums;
 };

 #endif
-- 
2.10.2</pre>
        </div>
      </p>


    </body>
</html>