mso-dumper: making a PPT text extractor

Thorsten Behrens thb at documentfoundation.org
Sun Nov 24 14:30:05 PST 2013


jf at dockes.org wrote:
> The modifications to mso-dumper would be as follows:
> 
>  - Make sure that all current output goes through the output() method
>    inside glob.py instead of using direct print() calls
>  - Add a command-line option to suppress printing from output()
>  - Add a command-line option to accumulate and print the text from the
>    String() and UniString() classes inside pptrecord.py: this would be the
>    text extractor proper.
> 
All very useful IMO.
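
A minimal sketch of the three changes listed above, assuming a single
helper object that every record handler writes through. The class name
Output, the quiet flag, and add_text()/extracted_text() are hypothetical
and are not taken from glob.py or pptrecord.py; the sketch is written for
Python 3.

```python
import io
import sys


class Output:
    """Hypothetical stand-in for the output helper in glob.py."""

    def __init__(self, quiet=False, stream=None):
        self.quiet = quiet                # set by a (hypothetical) CLI option
        self.stream = stream if stream is not None else sys.stdout
        self.text_chunks = []             # text collected from string records

    def output(self, msg):
        # All dump output goes through here, so one flag can silence it.
        if not self.quiet:
            self.stream.write(msg)

    def add_text(self, text):
        # Would be called by the String()/UniString() handlers.
        self.text_chunks.append(text)

    def extracted_text(self):
        return "\n".join(self.text_chunks)


buf = io.StringIO()
out = Output(quiet=True, stream=buf)
out.output("record header dump\n")        # suppressed because quiet=True
out.add_text("Slide title")
out.add_text("First bullet")
result = out.extracted_text()
```

With quiet=True nothing reaches the stream, and only the accumulated
text is left to print at the end.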

> A quick test of the approach based on commenting and changing the
> appropriate statements shows that the text extraction works very well (after
> the charset conversion fixes posted earlier).
> 
> However this would need small changes to the code (print statements mostly)
> in many places, so I would appreciate some kind of pre-approval from the
> code owners before going to work.
> 
Go for it! Just one small wish: could you sort the output into
suitable classes, so that people can control output verbosity,
e.g. by requesting only 'Atom output', 'Text output', etc.?

Just add one extra parameter, naming the output class, to all output()
method calls.
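
One way the extra output-class parameter could look, as a rough sketch
only. The constants ATOM and TEXT, the Dumper class, and the "category"
parameter name are assumptions made for illustration; none of them come
from the mso-dumper source.

```python
import io
import sys

# Hypothetical output classes; the names are illustrative only.
ATOM = "atom"
TEXT = "text"


class Dumper:
    def __init__(self, enabled=(ATOM, TEXT), stream=None):
        self.enabled = set(enabled)       # classes the user asked for
        self.stream = stream if stream is not None else sys.stdout

    def output(self, msg, category=ATOM):
        """Write msg only if its output class was requested."""
        if category in self.enabled:
            self.stream.write(msg)


buf = io.StringIO()
dumper = Dumper(enabled=[TEXT], stream=buf)
dumper.output("atom header dump\n", category=ATOM)    # filtered out
dumper.output("Hello from slide 1\n", category=TEXT)  # written
shown = buf.getvalue()
```

Giving the parameter a default keeps existing output() calls working
while they are converted one by one.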

> It would also be useful to rename the "src" directory to something like
> "msodump" so that the code can be installed as a Python package in a more
> standard way (this would also need a change to the import statements in the
> top level utilities).
>
Hah. Yeah, totally. And feel free to add whatever idiomatic code for
easy installation / packaging / testing to the toplevel dir.

Thanks a lot for your work there,

-- Thorsten