[poppler] Access to Poppler internal C++ API by GDAL

Albert Astals Cid aacid at kde.org
Mon Sep 11 09:52:04 UTC 2017


On Sunday, 10 September 2017, at 15:41:13 CEST, Even Rouault wrote:
> Hi,

Hi, thanks for sending this email :)

> I'm one of the developers of the GDAL library (http://gdal.org) that reads
> various raster & vector formats, mostly geospatial, including PDF and its
> georeferencing extensions (either expressed with Adobe Supplement to ISO
> 32000 or with Open Geospatial Consortium Best Practice:
> https://portal.opengeospatial.org/files/?artifact_id=40537 )
> 
> Currently we use the Poppler internal C++ API and regularly must adjust for
> changes in it. Recently we had to make adjustments to accommodate Poppler
> 0.58 changes. Supporting multiple Poppler versions is beginning to make our
> code ugly. So packagers from Linux distributions and I are wondering if
> there would be a way to access a more stable C++ API.
> 
> Besides rendering as image, we need really low-level access to PDF objects,
> to be able to parse georeferencing objects, retrieve layers, turn on/off
> OCG, or even access streams to decode drawing instructions so as to build
> vector objects

Having the list of "what you want to do" as described above is much more 
important than "what classes/functions you use", because I seriously doubt you 
*need* the GooString class at all, or the Catalog class, or I dare say even the 
Ref class: why would you need to know the reference of PDF objects? 

I don't think you *really* want to access the raw stream either, since as you 
said, what you want is "get the drawing instructions" and for that you don't 
really need a class that gets you the PDF contents byte by byte.

And you don't want to inherit from SplashOutputDev either, you just want to 
control how it renders.

So if possible I'd like two things:
 * Try to think at the higher level, "what do we really want to get from 
poppler", rather than at the low level, because I'd say it's the high level 
you really care about, not the low level.

 * Think about whether you could contribute part of your already existing code 
to fulfill those higher level needs.

Cheers,
  Albert

> 
> I've tried to summarize below our current use of Poppler C++ API. I probably
> missed a few calls, but you should get the overall picture:
> - Object class: getType(), getTypeName(), getBool(), getInt(), getReal(),
> getString(), getName(), getStream(), getArray()
> - Dict class: lookupNF(), lookup(), getLength(), getKey()
> - Array class: getLength(), getNF(), get()
> - Stream class: getDict(), reset(), getChar(), fillGooString()
> - Catalog class: getPage(), getPageRef(), readMetadata()
> - GooString: getCString(), getLength()
> - Ref class: access to num and gen
> - PDFDoc class: isOk(), displayPageSlice(), getCatalog(),
> getOptContentConfig(), getNumPages(), getDocInfo(), getErrorCode(), str
> private member (accessed through an ugly "#define private public" before
> including poppler! We need to access it to be able to delete it with our
> heap, since we allocated the stream object provided to the PDFDoc()
> constructor. This is to avoid potential cross-heap problems on Windows)
> - Page class: isOk(), pageObj private member (accessed through an ugly
> "#define private public" before including poppler!), getMediaBox()
> - OCGs class: isOk(), getOCGs()
> - GooList class: getLength(), get()
> - OptionalContentGroup class: setState()
> - SplashBitmap class: getBitmap(), getWidth(), getHeight(), getDataPtr(),
> getAlphaPtr(), getAlphaRowSize(), getRowSize()
> - SplashOutputDev class: we subclass this class and override all/most
> virtual methods to be able to turn on/off rendering of various elements as
> we offer options to render selectively vector, raster and/or text elements
> (so basically just a conditional test to decide whether to return as a
> no-op or call the base implementation)
> - BaseStream class: we subclass this class to use GDAL's own I/O abstraction
> layer (which beyond regular files can read in .zip files, in-memory files,
> files available through HTTP, etc...). So we implement copy(),
> makeSubStream(), getPos(), getStart(), setPos(), moveStart(), getKind(),
> getFileName(), getChar(), lookChar(), reset(), unfilteredReset(), close(),
> hasGetChars(), getChars()
> - GlobalParams class: setPrintCommands()
> - setErrorCallback() function
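
The SplashOutputDev item above describes overriding virtual methods so that
each one either returns as a no-op or calls the base implementation. A minimal
sketch of that pattern follows. It assumes a made-up RenderDevice base class
with three drawing hooks; RenderDevice, RenderFilter, FilteringDevice and
renderSample are hypothetical names for illustration and are not part of
Poppler or GDAL.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a rendering device with virtual drawing hooks.
// This is NOT Poppler's SplashOutputDev; names and signatures are illustrative.
class RenderDevice {
public:
    virtual ~RenderDevice() = default;
    virtual void drawPath(const std::string& id)  { log.push_back("vector:" + id); }
    virtual void drawImage(const std::string& id) { log.push_back("raster:" + id); }
    virtual void drawText(const std::string& id)  { log.push_back("text:" + id); }
    std::vector<std::string> log;  // records what actually got rendered
};

// Which kinds of content are allowed to reach the base implementation.
struct RenderFilter {
    bool showVector = true;
    bool showRaster = true;
    bool showText = true;
};

// Each override is just a conditional test: no-op, or forward to the base.
class FilteringDevice : public RenderDevice {
public:
    explicit FilteringDevice(RenderFilter filter) : filter_(filter) {}

    void drawPath(const std::string& id) override {
        if (filter_.showVector) RenderDevice::drawPath(id);
    }
    void drawImage(const std::string& id) override {
        if (filter_.showRaster) RenderDevice::drawImage(id);
    }
    void drawText(const std::string& id) override {
        if (filter_.showText) RenderDevice::drawText(id);
    }

private:
    RenderFilter filter_;
};

// Replays a fixed sequence of drawing calls through the base-class interface
// and returns the list of elements that were actually rendered.
std::vector<std::string> renderSample(RenderFilter filter) {
    FilteringDevice device(filter);
    RenderDevice& base = device;  // the caller only sees the base interface
    base.drawPath("road");
    base.drawImage("ortho");
    base.drawText("label");
    return device.log;
}
```

Calling renderSample() with showText set to false leaves only the vector and
raster entries in the returned log. The cost of this design is that every
virtual method added or changed in the base class has to be mirrored in the
subclass.
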
> 
> If you want to glance at the code, the most relevant files are:
> https://github.com/OSGeo/gdal/blob/trunk/gdal/frmts/pdf/pdfobject.cpp
> https://github.com/OSGeo/gdal/blob/trunk/gdal/frmts/pdf/pdfio.cpp
> https://github.com/OSGeo/gdal/blob/trunk/gdal/frmts/pdf/pdfdataset.cpp
> 
> I'm not sure whether it would be feasible for Poppler to provide a more
> stable API for our use. At least, this makes you aware of external users of
> this API.
> 
> Best regards,
> 
> Even
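
The BaseStream item in the quoted message describes backing a stream's
getChar(), lookChar(), getPos(), setPos() and reset() with an application's own
I/O layer. A rough sketch of that adapter idea follows, under stated
assumptions: ByteSource, MemorySource and SourceStream are hypothetical names,
the interface is far smaller than the real BaseStream, and end of data is
assumed to be signalled by returning EOF.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>   // EOF
#include <cstring>  // std::memcpy
#include <string>
#include <utility>

// Hypothetical I/O abstraction, standing in for an application's own layer
// (files, in-memory buffers, remote resources, ...).
class ByteSource {
public:
    virtual ~ByteSource() = default;
    virtual std::size_t size() const = 0;
    // Copies up to n bytes starting at offset into buf; returns bytes copied.
    virtual std::size_t readAt(std::size_t offset, unsigned char* buf,
                               std::size_t n) const = 0;
};

// In-memory implementation of the hypothetical ByteSource.
class MemorySource : public ByteSource {
public:
    explicit MemorySource(std::string data) : data_(std::move(data)) {}
    std::size_t size() const override { return data_.size(); }
    std::size_t readAt(std::size_t offset, unsigned char* buf,
                       std::size_t n) const override {
        if (offset >= data_.size()) return 0;
        const std::size_t count = std::min(n, data_.size() - offset);
        std::memcpy(buf, data_.data() + offset, count);
        return count;
    }

private:
    std::string data_;
};

// Adapter exposing a character-oriented stream on top of any ByteSource.
// The source must outlive the stream, since only a reference is stored.
class SourceStream {
public:
    explicit SourceStream(const ByteSource& source) : source_(source) {}

    void reset() { pos_ = 0; }
    std::size_t getPos() const { return pos_; }
    // Positions past the end are clamped to the end of the source.
    void setPos(std::size_t pos) { pos_ = std::min(pos, source_.size()); }

    // Peeks at the next byte without consuming it; EOF at end of data.
    int lookChar() const {
        unsigned char c = 0;
        return source_.readAt(pos_, &c, 1) == 1 ? static_cast<int>(c) : EOF;
    }
    // Returns the next byte and advances; EOF at end of data.
    int getChar() {
        const int c = lookChar();
        if (c != EOF) ++pos_;
        return c;
    }

private:
    const ByteSource& source_;
    std::size_t pos_ = 0;
};
```

With a MemorySource holding "%PDF", lookChar() and then getChar() both yield
'%', and getChar() after setPos(4) yields EOF. Keeping the adapter separate
from the source means the same stream code can sit on top of any backend that
implements readAt().
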



