mso-dumper: making a PPT text extractor
jf at dockes.org
Tue Nov 26 06:00:49 PST 2013
Thorsten Behrens writes:
> jf at dockes.org wrote:
> > The modifications to mso-dumper would be as follows:
> > - Make sure that all current output goes through the output() method
> > inside glob.py instead of using direct print() calls
> > - Add a command-line option to suppress printing from output()
> > - Add a command-line option to accumulate and print the text from the
> > String() and UniString() classes inside pptrecord.py: this would be the
> > text extractor proper.
> All very useful IMO.
> > A quick test of the approach based on commenting and changing the
> > appropriate statements shows that the text extraction works very well (after
> > the charset conversion fixes posted earlier).
> > However, this would need small changes to the code (print statements mostly)
> > in many places, so I would appreciate some kind of pre-approval from the
> > code owners before going to work.
> Go for it! Just one small wish, could you sort potential output into
> ~fitting classes of stuff, such that people could control output
> verbosity, by e.g. requesting only 'Atom output', 'Text output' etc?
> Just stick one extra parameter with the output class into all output()
> method calls.
I've looked into this just a bit, and it's not obvious to me how this
should be done. For example the PPT-specific code accumulates text from
many places, and then prints the whole bunch by calling globals.output().
So for now, I have kept the modifications to the existing code minimal,
just separating the PPT text from all the rest.
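
As a rough illustration of that separation, here is a minimal sketch. It is
not the code from the repository: OutputSettings, add_text(),
collected_text() and demo() are hypothetical names, and only the idea of a
single output() function that can be silenced comes from the discussion
above.

```python
import io
import sys


class OutputSettings:
    """Hypothetical switches, mirroring the two new command-line options."""
    show_structure = True   # cleared by --no-struct-output
    collect_text = False    # set by --dump-text


settings = OutputSettings()
_text_chunks = []  # slide text gathered while records are parsed


def output(msg, stream=None):
    """Single exit point for the structure dump; does nothing when silenced."""
    if not settings.show_structure:
        return
    if stream is None:
        stream = sys.stdout
    stream.write(msg)


def add_text(text):
    """Called where string records are decoded; keeps text only on request."""
    if settings.collect_text:
        _text_chunks.append(text)


def collected_text():
    """Return everything gathered so far, one chunk per line."""
    return "\n".join(_text_chunks)


def demo():
    """Simulate a run with both new options switched on."""
    settings.show_structure = False
    settings.collect_text = True
    del _text_chunks[:]
    dump = io.StringIO()
    output("TextCharsAtom (made-up record line)\n", dump)
    add_text("Slide title")
    add_text("First bullet")
    return dump.getvalue(), collected_text()
```

With both switches set as in demo(), the structure stream stays empty and
only the accumulated strings remain, which is the behaviour a text
extractor needs.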
> > It would also be useful to rename the "src" directory to something like
> > "msodump" so that the code can be installed as a Python package in a more
> > standard way (this would also need a change to the import statements in the
> > top level utilities).
> Hah. Yeah, totally. And feel free to add whatever idiomatic code for
> easy installation / packaging / testing to the toplevel dir.
I have renamed src/ to msodumper/ and added a basic setup.py to enable
using the usual Python setuptools.
"python setup.py install" will install ppt-dump.py to /usr/local/bin and
the package to the appropriate python dir under /usr/local/lib, with the
usual magic to choose a different prefix etc. It would be trivial to add
the other scripts (xls-dump etc.): just add them to the 'scripts' list.
This should be an easy base for eventual distribution packaging.
The current code is here: https://github.com/medoc92/mso-dump
Everything should behave as previously except that:
- Option --no-struct-output to ppt-dump.py will silence the structure dump.
- Option --dump-text to ppt-dump.py will print the slide text as UTF-8.
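
For readers who want to see how two such switches could be wired up, here
is a hedged sketch. Only the option names come from the list above; the
use of argparse, the positional file argument, and the parse_args() and
write_text() helpers are assumptions made for illustration, and the real
ppt-dump.py may parse its command line differently.

```python
import argparse


def parse_args(argv):
    """Parse a ppt-dump.py style command line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="ppt-dump.py")
    parser.add_argument("--no-struct-output", action="store_true",
                        help="silence the structure dump")
    parser.add_argument("--dump-text", action="store_true",
                        help="print the slide text as UTF-8")
    # Assumption: input files are given as positional arguments.
    parser.add_argument("files", nargs="+", help="PPT file(s) to read")
    return parser.parse_args(argv)


def write_text(text, binary_stream):
    """Write extracted text as UTF-8 bytes, followed by a newline."""
    binary_stream.write((text + "\n").encode("utf-8"))
```

Calling parse_args(["--no-struct-output", "--dump-text", "deck.ppt"]) gives
an object whose no_struct_output and dump_text attributes are both True,
which is the combination that yields text only.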
I have checked on a number of files that the structure output is unchanged
in general (except for changes due to the Unicode fix).
I'm not opposed to doing a little more work on output selection, but we
should first discuss the desired features, maybe off-list as this is
probably not of general interest?