[libreoffice-design] pdf import design docs?
Thorsten Behrens
thb at libreoffice.org
Wed Oct 5 23:56:28 UTC 2016
Larry Evans wrote:
> Well, I did see code here:
>
> sdext/source/pdfimport/pdfparse/pdfparse.cxx
>
> but that looked like it used boost/spirit to parse the pdf file
> (about line 553):
>
> boost::spirit::parse( pBuffer,
>                       pBuffer+nLen,
>                       aGrammar,
>                       boost::spirit::space_p );
>
That's chiefly to deal with hybrid PDF, where the importer needs to
detect early on that, instead of parsing the PDF, it should load the
embedded ODF file. So for understanding real PDF import, simply ignore
that part.
> Hence, I guess Poplar/xpdf does some sophisticated
> processing that the use of boost::spirit does not do or is
> incapable of doing. Of course, I'm jumping to conclusions
> which hopefully people of the devel list will correct :)
>
Yes. Poppler does the actual PDF processing (it also powers most of
the Linux desktop PDF viewers, like Okular or Evince).
> > In general - it would be -way- better to pick up something like eg.
> > pdfium - and add a rendering front-end there to match first, the same
> > protocol (but we can do this in-process), and subsequently to simplify
> > and factor lots of that madness out =) PDFium seems to be gaining
> > traction in browsers (Chrome + Firefox) and so on.
>
> Thanks for the pointer. I'm googling for PDFium now.
>
For the import of PDF into Draw/Writer (compared to simply rendering
PDF as a picture), the above is a bit of a red herring. The added
complexity in terms of code for doing this in a separate process is
pretty low; the challenge for that sort of thing really is decent
layout detection. There's been a GSoC project proposal to hook up
something like Tesseract or other OCR engines to help with that, sadly
with little traction so far. ;)
> I'm trying to solve the problem I posed earlier in this
> post:
>
> https://lists.freedesktop.org/archives/libreoffice/2014-January/059106.html
>
Ah, XFA. Well then, poppler does not support that; pdfium apparently
has a branch: https://pdfium.googlesource.com/pdfium/+/xfa
- no idea how usable that is, though. And from what I hear through the
grapevine, XFA seems pretty dead as an architecture?
> I've also noticed that the font sizes and locations of
> letters are sometimes not correct; hence, I'd like to figure
> out how to correct that.
>
That's mostly due to prioritizing editability over accuracy. The code
to look at is in sdext/source/pdfimport/tree/drawtreevisiting.cxx,
which writes out ODF from the render tree.
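
To make the "writes out ODF from the render tree" idea a bit more
concrete, here is a minimal, self-contained sketch of a tree visitor
that emits simplified ODF-like markup. All names in it (Element,
TextRun, Frame, Visitor, OdfWriter, writeOdf) are hypothetical and
chosen for illustration only; they are not the actual classes or
functions in drawtreevisiting.cxx. The sketch also shows one way an
editability-over-accuracy trade-off can arise: only the frame keeps its
position, while the text runs inside it are allowed to reflow.

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node types; the real tree in sdext/source/pdfimport/tree differs.
struct TextRun;
struct Frame;

// Visitor interface: one overload per node type.
struct Visitor {
    virtual ~Visitor() = default;
    virtual void visit(const TextRun&) = 0;
    virtual void visit(const Frame&) = 0;
};

// Base node: every element carries the position it had on the PDF page.
struct Element {
    double x = 0, y = 0;
    virtual ~Element() = default;
    virtual void accept(Visitor&) const = 0;
};

struct TextRun : Element {
    std::string text;
    TextRun(double px, double py, std::string t) : text(std::move(t)) { x = px; y = py; }
    void accept(Visitor& v) const override { v.visit(*this); }
};

struct Frame : Element {
    std::vector<std::unique_ptr<Element>> children;
    void accept(Visitor& v) const override { v.visit(*this); }
};

// Emits simplified ODF-like markup (no XML escaping, no styles).
class OdfWriter : public Visitor {
    std::ostringstream out_;
public:
    void visit(const TextRun& t) override {
        // The run's own x/y are deliberately dropped: the text flows inside
        // its paragraph, which keeps it editable but loses exact placement.
        out_ << "<text:span>" << t.text << "</text:span>";
    }
    void visit(const Frame& f) override {
        // Only the frame is positioned absolutely.
        out_ << "<draw:frame svg:x=\"" << f.x << "pt\" svg:y=\"" << f.y
             << "pt\"><draw:text-box><text:p>";
        for (const auto& child : f.children)
            child->accept(*this);
        out_ << "</text:p></draw:text-box></draw:frame>";
    }
    std::string str() const { return out_.str(); }
};

// Walks the tree from the given root and returns the generated markup.
std::string writeOdf(const Element& root) {
    OdfWriter writer;
    root.accept(writer);
    return writer.str();
}
```

Keeping exact per-run positions instead would mean emitting one
positioned frame per run, which is closer to the page but much less
pleasant to edit afterwards.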
Hope that helps,
-- Thorsten