[libreoffice-design] pdf import design docs?

Thorsten Behrens thb at libreoffice.org
Wed Oct 5 23:56:28 UTC 2016


Larry Evans wrote:
> Well, I did see code here:
> 
>   sdext/source/pdfimport/pdfparse/pdfparse.cxx
> 
> but that looked like it used boost/spirit to parse the pdf file
> (about line 553):
> 
>             boost::spirit::parse( pBuffer,
>                                   pBuffer+nLen,
>                                   aGrammar,
>                                   boost::spirit::space_p );
> 
That's chiefly to deal with hybrid PDF, where the code needs to detect
early on that, instead of parsing PDF, it should load the embedded ODF
file. So for understanding real PDF import, simply ignore that part.
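
To make the idea of that early check concrete, here is a minimal sketch.
It is not the actual LibreOffice code: the function name, the enum, and
the marker key "/AdditionalStreams" are all assumptions for illustration,
and a real implementation would parse the trailer properly rather than
search for a substring.

```cpp
#include <string>

// Hypothetical result of the early check: which import path to take.
enum class ImportPath { EmbeddedOdf, RealPdf };

// Hypothetical helper, not LibreOffice code: look only at the text from the
// last "trailer" keyword onwards and test for an assumed marker key.
ImportPath chooseImportPath(const std::string& pdfBytes,
                            const std::string& marker = "/AdditionalStreams")
{
    const std::string::size_type trailerPos = pdfBytes.rfind("trailer");
    if (trailerPos == std::string::npos)
        return ImportPath::RealPdf;        // no trailer found: treat as plain PDF

    const std::string tail = pdfBytes.substr(trailerPos);
    return tail.find(marker) != std::string::npos
               ? ImportPath::EmbeddedOdf   // hybrid file: load the embedded ODF
               : ImportPath::RealPdf;      // ordinary file: go to the PDF importer
}
```

The point is only that the decision happens before any real PDF
processing starts; everything after that is a separate code path.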

> Hence, I guess Poppler/xpdf does some sophisticated
> processing that the use of boost::spirit does not do or is
> incapable of doing.  Of course, I'm jumping to conclusions
> which hopefully people of the devel list will correct :)
> 
Yes. Poppler does the actual PDF processing (it's also powering most
of the Linux desktop PDF viewers, like Okular or Evince).

> > 	In general - it would be -way- better to pick up something like eg.
> > pdfium - and add a rendering front-end there to match first, the same
> > protocol (but we can do this in-process), and subsequently to simplify
> > and factor lots of that madness out =) PDFium seems to be gaining
> > traction in browsers (Chrome + Firefox) and so on.
> 
> Thanks for the pointer.  I'm googling for PDFium now.
> 
For the import of PDF into Draw/Writer (compared to simply rendering
PDF as a picture), the above is a bit of a red herring. The added
complexity in terms of code for doing this in a separate process is
pretty low; the challenge for that sort of thing really is decent
layout detection. There's been a GSoC project proposal to hook up
something like Tesseract or other OCR engines to help with that, sadly
with little traction so far. ;)

> I'm trying to solve the problem I posed earlier in this
> post:
> 
> https://lists.freedesktop.org/archives/libreoffice/2014-January/059106.html
> 
Ah, XFA. Well then, Poppler does not have support for that; PDFium
apparently has a branch: https://pdfium.googlesource.com/pdfium/+/xfa
- no idea how usable that is, though. And from what I hear on the
grapevine, XFA seems pretty dead as an architecture?

> I've also noticed that the font sizes and locations of
> letters are sometimes not correct; hence, I'd like to figure
> out how to correct that.
> 
That's mostly due to prioritizing editability over accuracy. The code
to look at is in sdext/source/pdfimport/tree/drawtreevisiting.cxx,
which writes out ODF from the render tree.

Hope that helps,

-- Thorsten