[poppler] On converting PDFs
Jauco Noordzij
jauco at jauco.nl
Mon Jul 24 05:28:54 PDT 2006
Hello,
A short while ago, I asked whether you could open up the API of some
OutputDevs so I could use them to convert PDFs to structured
documents for my Summer of Code project. The general reply was that it
would be much better to have the conversion inside Poppler. My mentor
and I agreed, and I have been working on it.
For the first few weeks I simply built prototypes of different parts
of the PDF conversion process (http://jauco.nl/blog/?p=22), but after
the half-term evaluation I started building a proper framework. I have
a set-up finished now, and I would like to hear how you feel about it
and whether it meets Poppler's standards.
My approach was to build a new OutputDev that takes a PDF and parses
it into an xmlDocPtr using the callback functions. In the endPage
callback I can then do the actual conversion of small text fragments
into lines and paragraphs. When finished, the OutputDev returns the
xmlDocPtr to the calling program (I can easily add an overloaded
constructor that writes it to a given filename, if that would be
appreciated).
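
To make the endPage step a bit more concrete, here is a rough sketch
of the kind of grouping I mean. It is written in Python for brevity
and is not the actual C++ code: the Fragment type, the
group_into_lines name and the tolerance value are all made up for
illustration, and it assumes that y grows downward on the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """A small run of text plus its position (hypothetical type)."""
    x: float   # left edge of the fragment
    y: float   # baseline of the fragment
    text: str


def group_into_lines(fragments: List[Fragment],
                     y_tolerance: float = 2.0) -> List[str]:
    """Merge fragments whose baselines are close together into lines."""
    lines: List[List[Fragment]] = []
    # Walk the fragments from top to bottom (assumes y grows downward).
    for frag in sorted(fragments, key=lambda f: f.y):
        # Compare against the first fragment of the current line.
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order the fragments from left to right.
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x))
            for line in lines]


page = [
    Fragment(120.0, 100.5, "world"),
    Fragment(72.0, 100.0, "Hello"),
    Fragment(72.0, 114.0, "Second"),
    Fragment(130.0, 114.2, "line"),
]
page_lines = group_into_lines(page)
```

Grouping lines into paragraphs would work along the same lines, for
example by comparing the vertical gap between consecutive lines.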
Some things I would like to note explicitly:
1. The OutputDev uses libxml:
This gives me a document structure in which I can rearrange and group
document parts. Using libxml also lets me run XPath queries, which is
invaluable. The export format is XML anyway, and libxml prevents
easily introduced bugs such as incorrect nesting. The dependency
should not be a burden for most people, since libxml is installed on
most Linux systems anyway (and there is a Windows port, should you
ever wish to make one for Poppler).
2. At the moment it outputs AbiWord documents (.abw):
For SoC I am under some time pressure and I don't think I can get to
producing actual ODF. It is quite a complex format and it might
introduce new dependencies (for one thing, an ODF file is a .jar-style
zip archive). I am not really happy with this, because I believe
pretty firmly in document standards for the sake of standards, so I
_will_ implement ODF output, only probably after the SoC deadline.
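
As an illustration of point 1, the sketch below shows the kind of
tree manipulation I have in mind: query for some nodes, then regroup
them under a new parent. It uses Python's standard ElementTree module
(which supports only a subset of XPath) instead of libxml, and the
element and attribute names are invented for the example; they are
not the ones the OutputDev produces.

```python
import xml.etree.ElementTree as ET

# Build a tiny document tree (hypothetical element names).
doc = ET.Element("document")
page = ET.SubElement(doc, "page", number="1")
for font, text in [("body", "First line"),
                   ("heading", "A title"),
                   ("body", "Second line")]:
    line = ET.SubElement(page, "line", font=font)
    line.text = text

# XPath-style query: every <line> child of the page set in the body font.
body_lines = page.findall("./line[@font='body']")

# Regroup the matching lines under a new <paragraph> element.
para = ET.SubElement(page, "paragraph")
for line in body_lines:
    page.remove(line)   # detach from the page ...
    para.append(line)   # ... and re-attach under the paragraph

xml_text = ET.tostring(doc, encoding="unicode")
```

Because the serializer writes the tree out itself, the result is
always properly nested, which is the class of bug I want to avoid
having to think about.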
So, my actual question: if I write such an OutputDev, will it be
included in Poppler? And if not, what would you want me to do
differently?
If you have any questions, I'm on #poppler most of the time.
--
groeten,
Jauco Noordzij
P.S. In my spare time I'm also working on an SVGOutputDev so that
applications like Inkscape can load PDFs and tinker with them. Would
that be a candidate for inclusion?