[libreoffice-design] pdf import design docs?

Wed Oct 5 18:06:06 UTC 2016

On 10/05/2016 10:24 AM, Michael Meeks wrote:
 > Hi Larry,
 >
 > 	First - really great to have you looking at that
 >       code ! =)

Thanks for the encouragement Michael.

 >
 > On 10/05/2016 04:10 PM, Larry Evans wrote:
 >> I'm trying to understand how the pdf import code works.
 >> I've tried looking at the code; however, that's hard to
 >> follow; hence, I was hoping there was some sort of design
 >> document explaining somewhat how the code works.
 >
 > 	Second - the design list is really for User Experience / developer
 > interaction, and this seems like a real gnarly coding problem - so I've
 > re-sent it to the dev-list =)

OOPS.  Sorry about that.

 >
 >> TIA for any pointers.
 >
 > 	Sure - so the PDF import is a bit of a mess; it currently spawns a
 > remote process using poppler to parse the PDF, and then extracts (via a
 > simple text protocol) data from poppler's rendering to re-constitute into
 > internal ODF callbacks to produce an internal document; at least -
 > that's if I got it right =)

Well, I did see code here:

   sdext/source/pdfimport/pdfparse/pdfparse.cxx

but that looked like it used boost/spirit to parse the pdf file
(about line 553):

             boost::spirit::parse( pBuffer,
                                   pBuffer+nLen,
                                   aGrammar,
                                   boost::spirit::space_p );

but then, trying to find where that (or the caller of that) was called
led me to:

   sdext/source/pdfimport/wrapper/wrapper.cxx

where there is a call (around line 927):

   std::unique_ptr<pdfparse::PDFEntry> pEntry(
   pdfparse::PDFReader::read( aPDFFile.getStr() ));

but that's called in a function:

  bool checkEncryption

whose name doesn't suggest any translation into something
like the XML that LibreOffice stores its files as,
IIUC:

   https://en.wikipedia.org/wiki/OpenOffice.org_XML

but, looking further in that file, there's, as you mention,
what looks like a remote process call in function:

   bool xpdf_ImportFromFile

on about line 1079:

         osl_executeProcess_WithRedirectedIO(converterURL.pData,
                                             args,
                                             nArgs,
                                             osl_Process_SEARCHPATH|osl_Process_HIDDEN,
                                             pSecurity,
                                             nullptr, nullptr, 0,
                                             &aProcess, &pIn, &pOut, &pErr);

So that's where I wanted some overall design help, because I
thought it odd that boost::spirit was used to parse the
file, I guess just to determine whether it was encrypted,
and then an xpdf process was used to parse the same file
again.  That seemed awfully redundant.
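
To check my understanding of the "simple text protocol" part, here is
a minimal sketch of how I picture the reading side: the helper process
writes one command per line and the importer turns each line into a
callback.  The command names (setFont, drawText), the Sink struct and
parseStream are all made up for illustration; I have not matched them
against what wrapper.cxx actually reads.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sink: stands in for whatever builds the internal
// document.  Here it only records the callbacks it receives.
struct Sink {
    std::vector<std::string> log;

    void setFont(const std::string& name, double size) {
        std::ostringstream os;
        os << "font " << name << " " << size;
        log.push_back(os.str());
    }

    void drawText(double x, double y, const std::string& text) {
        std::ostringstream os;
        os << "text " << x << " " << y << " " << text;
        log.push_back(os.str());
    }
};

// Reads one command per line from `in` (in the real importer this
// would be the child's redirected stdout) and dispatches it to `sink`.
// Assumed line shapes: "setFont <name> <size>" and
// "drawText <x> <y> <rest of line>".
// Returns the number of lines that could not be understood.
int parseStream(std::istream& in, Sink& sink) {
    int unknown = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty())
            continue;
        std::istringstream fields(line);
        std::string cmd;
        fields >> cmd;
        if (cmd == "setFont") {
            std::string name;
            double size = 0;
            if (fields >> name >> size)
                sink.setFont(name, size);
            else
                ++unknown;
        } else if (cmd == "drawText") {
            double x = 0, y = 0;
            if (fields >> x >> y) {
                // Everything after the coordinates is the text itself.
                std::string text;
                std::getline(fields >> std::ws, text);
                sink.drawText(x, y, text);
            } else {
                ++unknown;
            }
        } else {
            ++unknown;
        }
    }
    return unknown;
}
```

If that picture is roughly right, then the font-size and letter-position
problems could come from either side of the pipe: wrong values written
by the helper, or right values misread by the importer.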

 >
 > 	Poppler/xpdf has a GPL license and so requires all this silliness.
 >

Hence, I guess Poppler/xpdf does some sophisticated
processing that the use of boost::spirit does not do or is
incapable of doing.  Of course, I'm jumping to conclusions
which hopefully people on the dev list will correct :)

 > 	In general - it would be -way- better to pick up something like e.g.
 > pdfium - and add a rendering front-end there to match first, the same
 > protocol (but we can do this in-process), and subsequently to simplify
 > and factor lots of that madness out =) PDFium seems to be gaining
 > traction in browsers (Chrome + Firefox) and so on.

Thanks for the pointer.  I'm googling for PDFium now.

 >
 > 	Does that make sense ? out of interest, what bug or mis-feature are you
 > interested in there ? are you looking at:
 >
 > 	filter/source/pdf
 > and	sdext/source/pdfimport

The latter.

 >
 > 	? =)

I'm trying to solve the problem I posed earlier in this
post:

https://lists.freedesktop.org/archives/libreoffice/2014-January/059106.html

I've also noticed that the font sizes and locations of
letters are sometimes not correct; hence, I'd like to figure
out how to correct that.

Thanks for your interest, Michael.

-regards,
Larry