[Libreoffice-ux-advise] [Bug 152143] Provide a mechanism to export PDF to text

Sun Nov 20 21:00:39 UTC 2022

https://bugs.documentfoundation.org/show_bug.cgi?id=152143

Hossein <hossein at libreoffice.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|Writer                      |Draw

--- Comment #2 from Hossein <hossein at libreoffice.org> ---
I don't think this is a duplicate of tdf#32249. The title of that one is:

  Bug 32249
  "When importing PDF with text in it , it will be better to have a easy
  and fluent option to edit the imported Text".

So that issue is about being able to edit the imported text. Here I am
talking about exporting the PDF as a text file. These are clearly different
requests, even if their implementations have parts in common.

> So you can already select and consolidate entire pages of imported draw shape
> textboxes (by glyph index lookup in a ToUnicode CMap) into a single draw
> shape textbox--a sentence or paragraph of text. And then select that text,
> copy it and paste it as needed. Then correct as lexically necessary.
I disagree. This is not what this feature request intends. I have
specifically requested a means of exporting the whole PDF document as a text
file, both via the UI and the command line. The consolidation feature above
might help internally when implementing such a feature, but it is not what I
have asked for.

> Also, because PDF provides no lexical sense to the runs in a document (it is a 
> published presentation format)--the discrete imported draw shape text boxes
> *must be selected in sequence* for a manual merge. That would remain the case
> working with draw shape textboxes on the Writer canvas and is a limitation of
> the published rendering encoded into PDF.
I disagree again. We have text boxes in LibreOffice, MS Office and elsewhere,
and we can still export their contents to text files. I haven't asked for
smart software that understands the meaning of the document. The goal is to
export the contents to a text file.
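
To illustrate what such a plain dump could look like, here is a minimal
Python sketch. It assumes the imported text boxes are available as simple
(x, y, text) records. The TextBox type, the dump_text function and the
line_tolerance parameter are hypothetical names for illustration only; they
are not LibreOffice APIs. The boxes are ordered by position alone, with no
attempt to understand the content.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TextBox:
    # Hypothetical record for one imported text box; not a LibreOffice type.
    x: float   # left edge of the box
    y: float   # top edge of the box, growing downward on the page
    text: str


def _join_line(boxes: List[TextBox]) -> str:
    # Within one visual line, order the boxes left to right.
    return " ".join(b.text for b in sorted(boxes, key=lambda b: b.x))


def dump_text(boxes: Iterable[TextBox], line_tolerance: float = 2.0) -> str:
    """Dump text boxes in a naive reading order: top to bottom, left to right.

    Boxes whose top edges lie within line_tolerance of the first box of the
    current line are treated as belonging to that line.
    """
    ordered = sorted(boxes, key=lambda b: (b.y, b.x))
    lines: List[str] = []
    current: List[TextBox] = []
    for box in ordered:
        # Start a new line when this box sits clearly below the current one.
        if current and abs(box.y - current[0].y) > line_tolerance:
            lines.append(_join_line(current))
            current = []
        current.append(box)
    if current:
        lines.append(_join_line(current))
    return "\n".join(lines)


if __name__ == "__main__":
    page = [
        TextBox(50, 10, "world"),
        TextBox(10, 10.5, "Hello"),
        TextBox(10, 30, "Second line"),
    ]
    print(dump_text(page))
```

A purely positional ordering like this will get multi-column layouts wrong,
but it is enough to produce a text file without any manual selection.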

> Doing more efficient and high fidelity text extraction from PDF into ODF
> paragraphs is the end goal of bug 32249.
>
> Export of lexically correct word, sentence or paragraph to other document
> formats then becomes routine export filtering that is already present. 
Even if we accept this implementation path, this feature request would depend
on tdf#32249; it would not be a duplicate of it.
