[poppler] PDF editing operations

Tue May 19 15:27:45 PDT 2009

On Tue, May 19, 2009 at 2:57 PM, Albert Astals Cid <aacid at kde.org> wrote:
> A Dimarts, 19 de maig de 2009, Shawn Rutledge va escriure:
>> Is there any plan to support some basic editing operations, some of
>> which pdftk can do, like rearranging page order, renumbering pages,
>> editing metadata or OCR text inside the PDF?
>
> Not from my side, though page reordering and metadata editing should be
> quite easy to achieve.
>
> By page renumbering, do you mean saying that page 4 is really page 14? That
> should be "doable" too.

Yes.  The usual example with magazines is that some of them number the
inserts differently: a special advertising section gets a prefixed
page number, or the subscription cards are numbered as if they were
pages (and I don't scan those because they jam the ADF).
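
To make the renumbering idea concrete, here is a minimal sketch in
plain Python.  It does not touch poppler at all; it only models the
idea of label ranges (a starting scan, a starting printed number and
an optional prefix), loosely following how PDF page labels are
described.  The names LabelRange and page_labels are hypothetical,
made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class LabelRange:
    """Hypothetical description of one run of consistently numbered scans."""
    first_scan: int   # 0-based index of the first scan in this run
    start: int = 1    # printed page number of that first scan
    prefix: str = ""  # e.g. "A-" for a special advertising section


def page_labels(num_scans, ranges):
    """Return the printed label for every scan, given the label ranges."""
    ordered = sorted(ranges, key=lambda r: r.first_scan)
    labels = []
    for i in range(num_scans):
        # The range that applies is the last one starting at or before i.
        current = None
        for r in ordered:
            if r.first_scan <= i:
                current = r
        if current is None:
            labels.append(str(i + 1))  # no range yet: fall back to scan order
        else:
            number = current.start + (i - current.first_scan)
            labels.append(f"{current.prefix}{number}")
    return labels
```

For example, two ordinary pages, a two-page advertising insert, and
then pages 5 and 6 (because two subscription cards were skipped) would
be page_labels(6, [LabelRange(0, 1), LabelRange(2, 1, "A-"),
LabelRange(4, 5)]).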

> OCR is something you do in an upper layer once poppler has rendered the page
> to an image, not sure what you want poppler to offer here.

Yeah, poppler doesn't need anything for that.  The tool I want to
write would probably crop some page image fragments (the likely
areas where page numbers might be found), rotate them through all 4
90-degree angles, pipe them to a gocr process, and see what comes out
(in XML format, with coordinates for each "line" of text and each
character within it).
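
A rough sketch of that crop / rotate / pipe loop, in plain Python, is
below.  It assumes the rendered page is already available as a
row-major grid of 8-bit gray values.  The 10% margin strips are a
guess at the "likely areas", and all function names are made up for
this illustration.  The gocr call assumes gocr is on the PATH, reads
a PNM image from stdin with "-i -" and emits XML with "-f XML"; check
that against the installed version before relying on it.

```python
import subprocess


def margin_boxes(width, height, percent=10):
    """Return (left, top, right, bottom) strips along each page edge."""
    mw = max(1, width * percent // 100)
    mh = max(1, height * percent // 100)
    return {
        "top": (0, 0, width, mh),
        "bottom": (0, height - mh, width, height),
        "left": (0, 0, mw, height),
        "right": (width - mw, 0, width, height),
    }


def crop(grid, box):
    left, top, right, bottom = box
    return [row[left:right] for row in grid[top:bottom]]


def rotate90(grid):
    """Rotate a row-major grid 90 degrees clockwise."""
    return [list(col) for col in zip(*grid[::-1])]


def all_rotations(grid):
    """Yield (angle, grid) for 0, 90, 180 and 270 degrees."""
    for angle in (0, 90, 180, 270):
        yield angle, grid
        grid = rotate90(grid)


def to_pgm(grid):
    """Encode an 8-bit grayscale grid as a binary PGM (P5) image."""
    header = f"P5\n{len(grid[0])} {len(grid)}\n255\n".encode("ascii")
    return header + bytes(px for row in grid for px in row)


def ocr_with_gocr(pgm_bytes):
    # Assumption: "gocr -f XML -i -" reads PNM from stdin and prints XML.
    result = subprocess.run(["gocr", "-f", "XML", "-i", "-"],
                            input=pgm_bytes, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def scan_margins(grid, ocr=ocr_with_gocr):
    """Yield (margin name, angle, OCR output) for every strip and rotation."""
    height, width = len(grid), len(grid[0])
    for name, box in margin_boxes(width, height).items():
        for angle, rotated in all_rotations(crop(grid, box)):
            yield name, angle, ocr(to_pgm(rotated))
```

The ocr argument is a parameter so that the loop can be tried out with
a stand-in function before gocr is wired in.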

>> I saw in the Qt4 binding
>> documentation that it's possible to write an open PDF document as a
>> new PDF, and there is a flag to preserve changes or not, but what are
>> the changes that it supports?
>
> Writing form contents.

That makes sense.

>> I'm scanning a bunch of old magazines that take up too much space in
>> boxes (Radio-Electronics, Popular Science etc.) and was thinking of
>> writing a program to recognize the name and date of each scan (look
>> for the known magazine titles, month names etc. in the margins), and
>> auto-number the pages (look for page numbers in known likely
>> locations).  I confirmed that GOCR is good enough to extract page
>> numbers from page images.
>
> Lucky.  I did not use GOCR, but some other people I know did and told me the
> results were quite bad; maybe it has improved lately, though.

It's not as good as Acrobat's OCR, but Acrobat has not been OCRing the
page numbers at all.  For one thing I guess they are too close to the
edge, where Acrobat isn't expecting to find text.  Another issue is
that it tries to find a single orientation for the page text, so if
the title or page number is written sideways along one edge, it's not
going to find that because the rest of the text is left-to-right.
(Sometimes it turns a page of schematics sideways though, if most of
the labels on the schematics have that orientation.  :-)  Another
great thing about page numbers is that they are usually contiguous, so
if I just catch a few of them that I'm sure about, I can interpolate
the rest with reasonably high probability, and I don't need it to be
100% perfect.  Likewise, magazine titles and dates are pretty
predictable, and there are plenty of redundant opportunities to read
them.
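
The interpolation could be sketched roughly like this, in plain
Python.  It assumes the confident OCR hits arrive as a dict mapping
the 0-based scan index to the printed page number.  The function name
is made up for this illustration, and leaving a gap as None when the
neighbours disagree is just one possible policy.

```python
def interpolate_page_numbers(num_scans, confident):
    """Fill in page numbers between confidently recognized ones.

    confident maps scan index -> printed page number.  A missing scan is
    filled in only when its nearest confident neighbours agree on the
    offset (page number minus scan index); otherwise it is left as None,
    which is what happens around skipped or differently numbered inserts.
    """
    anchors = sorted(confident.items())
    result = []
    for i in range(num_scans):
        if i in confident:
            result.append(confident[i])
            continue
        # Nearest confident hit on each side, if there is one.
        left = max((a for a in anchors if a[0] < i), default=None)
        right = min((a for a in anchors if a[0] > i), default=None)
        neighbours = [a for a in (left, right) if a is not None]
        offsets = {page - idx for idx, page in neighbours}
        result.append(i + offsets.pop() if len(offsets) == 1 else None)
    return result
```

With hits on only the second and fifth scans,
interpolate_page_numbers(6, {1: 11, 4: 14}) fills in the whole run
from 10 to 15, while two hits that disagree leave the scans between
them unnumbered for a human (or another OCR pass) to sort out.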

Do you have a better suggestion?  There is a GNU OCR project too, but
judging by its web site it doesn't seem to be any further along, and
there are a bunch of really half-baked projects.  (I was searching
freshmeat; it's surprising how many there are, actually.)