mso-dumper: making a PPT text extractor

Mon Nov 18 02:12:28 PST 2013

Hi,

While looking for a PPT text extractor, I found that mso-dumper was quite
close to doing a good job. I propose to make a few simple modifications to
make it work.

This would be quite useful because such a program is currently lacking, and
it is needed by text indexers:

 - catppt does not work at all on any semi-recent file.
 - unoconv works but it is extremely slow and often crashes.
 - Apache Tika probably has something, but it's big and needs Java.

The modifications to mso-dumper would be as follows:

 - Make sure that all current output goes through the output() method
   inside glob.py instead of using direct print() calls.
 - Add a command-line option to suppress printing from output().
 - Add a command-line option to accumulate and print the text from the
   String() and UniString() classes inside pptrecord.py: this would be the
   text extractor proper.
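
To make the idea concrete, here is a rough sketch of what I have in mind.
It is not taken from the mso-dumper code: OutputSettings, collect_text()
and extracted_text() are names made up for illustration, and the real
output() in glob.py may well have a different signature. The point is only
that one flag silences the dump while another one collects the strings.

```python
import sys


class OutputSettings:
    """Hypothetical holder for the two proposed command-line options."""

    def __init__(self, quiet=False, extract_text=False):
        self.quiet = quiet                # suppress the normal dump output
        self.extract_text = extract_text  # accumulate text from string records
        self.text_chunks = []


# Module-level settings, assumed to be filled in by the option parser.
settings = OutputSettings()


def output(msg, stream=None):
    """Single choke point for dump output; writes nothing when quiet is set."""
    if settings.quiet:
        return
    if stream is None:
        stream = sys.stdout
    stream.write(msg)


def collect_text(text):
    """Would be called by the String()/UniString() handlers (assumption)."""
    if settings.extract_text and text:
        settings.text_chunks.append(text)


def extracted_text():
    """Return everything collected so far, one chunk per line."""
    return "\n".join(settings.text_chunks)
```

With both options set, the record handlers would keep calling output() as
usual but nothing would be printed, and the extracted text would be written
out once at the end of the run.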

A quick test of the approach, done by commenting out and changing the
appropriate statements, shows that the text extraction works very well
(after the charset conversion fixes posted earlier).

However, this would need small changes to the code (print statements
mostly) in many places, so I would appreciate some kind of pre-approval
from the code owners before starting the work.

It would also be useful to rename the "src" directory to something like
"msodump" so that the code can be installed as a Python package in a more
standard way (this would also need a change to the import statements in the
top level utilities).

Cheers,

jf