Document conversion engine

Fri Jul 6 12:13:30 PDT 2012

Hi Flavio,

On Tue, 2012-07-03 at 11:45 +0100, Flavio Moringa wrote:
> my name is Flávio Moringa, I'm from Portugal and I'm starting my
> Masters Dissertation next September (Master in Open Source software -
> http://moss.dcti.iscte.pt ).

	Welcome :-)

> I'm not a programmer, so what I'm interested in doing is something in
> the lines of investigating the main conversion problems, identifying
> the possible conversion flows, analysing the way the conversion flow
> is implemented in LibreOffice, and eventually trying to improve this
> flow somehow.

	So - it will be hard to improve the flow without being a programmer,
I'm afraid :-)

> From your reply I assume that testing the filters, and doing
> regression tests is something I could do, maybe identifying the main
> conversion issues in groups of documents and kind of creating a "major
> conversion issues" table, and prioritizing those issues. Is there
> already something like that?

	There is a useful QA role in prioritising bug reports and
interoperability issues; we have a real problem with masses of bug
reports many of which could be duplicates. Having said that -
interoperability has many, many known feature / impedance mismatches
that are non-trivial development problems to fix.

	One thing that -would- be really useful, and that Microsoft have
internally, is an analysis tool for Microsoft's XML document formats -
so that we can get a good idea of which elements and attributes are
actually used much, i.e. by analysing and comparing a large corpus of
documents out there, we can answer questions such as:

	"should we implement surface charts, or 3D doughnut charts ?"

	given whatever amount of feature-development time we have - simply by
referring to the database of crunched XML files to work out which one is
used most.

	It'd be nice to have that for ODF as well, of course, for when we
have to make zero-sum back-compatibility decisions; but for
interoperability, crunching those MS documents would be really good.

	Is that something you could do ? A bit of Perl, zip extraction, XML
parsing, etc. ?
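
	To give a flavour of the sort of crunching I mean, here is a rough
sketch - in Python rather than Perl, and purely illustrative. The names
names_in_document and crunch are made up for this example, and it assumes
each document is a zip package whose .xml parts can be parsed directly
(true of the OOXML and ODF packages I have in mind, but not of the old
binary formats). It counts, for each element and attribute name, how many
documents in the corpus use it at least once:

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def names_in_document(document):
    """Return the set of element / attribute names used in one package.

    `document` is a path or file-like object for a zipped document.
    Element names come back as "{namespace}local"; attributes are
    recorded as "<element name>/@<attribute name>".
    """
    names = set()
    with zipfile.ZipFile(document) as package:
        for part in package.namelist():
            if not part.endswith(".xml"):
                continue  # skip images, thumbnails, .rels parts, etc.
            with package.open(part) as stream:
                try:
                    for _, element in ET.iterparse(stream, events=("start",)):
                        names.add(element.tag)
                        for attribute in element.attrib:
                            names.add(element.tag + "/@" + attribute)
                except ET.ParseError:
                    # Keep whatever was seen before the broken XML.
                    pass
    return names


def crunch(documents):
    """Count how many documents in the corpus use each name."""
    usage = Counter()
    for document in documents:
        try:
            usage.update(names_in_document(document))
        except zipfile.BadZipFile:
            continue  # not a zip package at all; ignore it
    return usage
```

	Something like crunch(list_of_paths).most_common(50) would then give a
first ranking; comparing the counts for the doughnutChart and surfaceChart
elements in the DrawingML chart namespace would be one way to approach the
chart question above. A real tool would want a database behind it, and
probably per-application breakdowns, but that is the general shape.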

	Developers are -much- more likely to let themselves be led by
objective statistics on real documents out there, rather than subjective
feelings of priority - which can prove rather controversial :-)

	Thanks !

		Michael.

-- 
michael.meeks at suse.com  <><, Pseudo Engineer, itinerant idiot