[Libreoffice-ux-advise] [Bug 152143] Provide a mechanism to export PDF to text

Sun Nov 20 19:10:33 UTC 2022

https://bugs.documentfoundation.org/show_bug.cgi?id=152143

V Stuart Foote <vsfoote at libreoffice.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |libreoffice-ux-advise at lists
                   |                            |.freedesktop.org,
                   |                            |vsfoote at libreoffice.org
             Status|UNCONFIRMED                 |RESOLVED
           See Also|                            |https://bugs.documentfounda
                   |                            |tion.org/show_bug.cgi?id=15
                   |                            |1598,
                   |                            |https://bugs.documentfounda
                   |                            |tion.org/show_bug.cgi?id=11
                   |                            |7428
           Keywords|                            |needsUXEval
         Resolution|---                         |DUPLICATE

--- Comment #1 from V Stuart Foote <vsfoote at libreoffice.org> ---
(In reply to Hossein from comment #0)
> Description:
> Currently it is not possible to export PDF files loaded in LibreOffice
> (Draw) to text.

Not true. LO currently has the 'Consolidate text' feature; see the work done
for bug 118370 [1]. It is functional, just inconvenient for moving PDF imported
text to the Writer canvas for filter export. And this is a dupe of bug 32249,
or at most of bug 151598 to implement 'Consolidate text' on the Writer canvas.

In a reasonable workflow, we now take an imported PDF (opened via Draw) to the
Draw vcl canvas. The textboxes representing the text streams read out from the
PDF structures are placed discretely onto the vcl canvas.

So you can already select and consolidate entire pages of imported draw shape
textboxes (their text recovered by glyph index lookup in a ToUnicode CMap) into
a single draw shape textbox--a sentence or paragraph of text. Then select that
text, copy it and paste it as needed, and correct it as lexically necessary.
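
As a rough illustration of that lookup-then-consolidate step, here is a minimal
Python sketch. It is not LibreOffice or poppler code: the TO_UNICODE table, the
glyph indices and the helper names are all made up for the example, and a real
ToUnicode CMap would be parsed from the PDF font.

```python
# Hypothetical ToUnicode-style table: glyph index -> Unicode text.
# The entries are invented for illustration, not taken from a real font.
TO_UNICODE = {
    0x0024: "T",
    0x0048: "e",
    0x005B: "x",
    0x0057: "t",
    0x00C0: "fi",  # one ligature glyph may map to several characters
}


def decode_run(glyph_ids, cmap):
    """Map each glyph index to text; unmapped glyphs become U+FFFD."""
    return "".join(cmap.get(g, "\ufffd") for g in glyph_ids)


def consolidate(runs, cmap):
    """Decode each run and join them, in the order given, into one string."""
    return " ".join(decode_run(run, cmap) for run in runs)


if __name__ == "__main__":
    runs = [[0x0024, 0x0048, 0x005B, 0x0057], [0x00C0]]
    print(consolidate(runs, TO_UNICODE))
```

Glyphs missing from the table come out as U+FFFD, which is the kind of result
that would still need manual correction afterwards.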

Also, because PDF provides no lexical sense to the runs in a document (it is a
published presentation format)--the discrete imported draw shape text boxes
*must be selected in sequence* for a manual merge. That would remain the case
working with draw shape textboxes on the Writer canvas and is a limitation of
the published rendering encoded into PDF.
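
To make the ordering point concrete, a small Python sketch follows. The TextBox
type and merge_in_order helper are hypothetical and only model the idea; they
are not how Draw implements 'Consolidate text'.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical stand-in for one imported draw shape textbox."""
    x: float
    y: float
    text: str


def merge_in_order(selected):
    """Concatenate textbox contents strictly in selection order."""
    return " ".join(box.text for box in selected)


if __name__ == "__main__":
    first = TextBox(x=10.0, y=20.0, text="Export PDF")
    second = TextBox(x=90.0, y=20.0, text="to text")
    print(merge_in_order([first, second]))  # intended reading order
    print(merge_in_order([second, first]))  # same boxes, wrong sequence
```

Nothing in the boxes themselves records which one comes first lexically, so the
merged result depends entirely on the sequence handed to the merge.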

PDF provides an /ActualText construct that could be used more effectively than
glyph index lookup in a ToUnicode CMap.
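
For illustration only, a Python sketch of reading /ActualText from a marked
content property list. It assumes the value is written as a hex string (UTF-16BE
with a byte order mark is the usual form for Unicode text strings in PDF) and it
ignores literal "(...)" strings. The actual_text helper and the sample property
list are hypothetical, not part of our import filter.

```python
import re

# Matches /ActualText followed by a hex string, for example:
#   /Span <</ActualText <FEFF00660069>>> BDC
ACTUAL_TEXT = re.compile(r"/ActualText\s*<([0-9A-Fa-f\s]+)>")


def actual_text(property_list):
    """Return the decoded /ActualText value, or None if there is none."""
    match = ACTUAL_TEXT.search(property_list)
    if match is None:
        return None
    raw = bytes.fromhex("".join(match.group(1).split()))
    if raw.startswith(b"\xfe\xff"):
        # UTF-16BE with a leading byte order mark
        return raw[2:].decode("utf-16-be")
    # Rough approximation: treat anything else as Latin-1 rather than
    # implementing PDFDocEncoding in full.
    return raw.decode("latin-1")


if __name__ == "__main__":
    print(actual_text("/Span <</ActualText <FEFF00660069>>> BDC"))
```

An importer could prefer this value when it is present and fall back to the
glyph index lookup otherwise.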

For bug 66597, LibreOffice export filter support for the PDF /ActualText
construct is already in place [2] for PDF creation, but only at the grapheme
cluster run. Bug 117428 is open to refactor PDF export to provide /ActualText
at the word boundary.

What is unclear is how our poppler PDF import filter(s) would need to be
refactored to use the lexical details to load draw shape textboxes with
/ActualText--for roundtrip, or for import of PDF from other sources.

More efficient, high-fidelity text extraction from PDF into ODF paragraphs is
the end goal of bug 32249.

Export of lexically correct words, sentences or paragraphs to other document
formats then becomes routine export filtering that is already present.

=-ref-=
[1] https://gerrit.libreoffice.org/c/core/+/75043/
[2] https://gerrit.libreoffice.org/c/core/+/53315/

*** This bug has been marked as a duplicate of bug 32249 ***
