Document conversion engine

Mon Jul 9 01:22:06 PDT 2012

On Mon, 2012-07-09 at 00:25 +0100, Flavio Moringa wrote:
> nice to hear from someone so "up the ranks" like you.. makes me feel
> much more important :-)

	Ho hum; we try to avoid unpleasant hierarchy as much as possible.

> I probably won't be able to do a conversion engine by myself...
> but I can definitely mess around with code...

	Great :-)

> Yes, it's definitely something I can do... I do believe that the
> hardest part is getting that "large corpus of documents out
> there...". At least in my experience, I've found that it's hard
> to get users to send us the documents they use... either due to privacy
> concerns or enterprise policies... But a tool like that makes a lot
> of sense

	Oh - so; getting the documents is not -that- hard; Google has a
document-type search that can be automated; just search for:

	filetype:docx

	And start scraping; as well as turning up some 7 million files, we
get to take advantage of Google's popularity ranking to fetch the most
popular 100 (or whatever) first :-)

> For now then I'll start doing as you suggest and look in bugzilla for
> documents with conversion problems to try and compile as many examples
> as I can. Then maybe using the latest beta to do the conversion and
> see which problems are still there. Then maybe starting a perl script
> that can scrape the OOXML files to find the most used tags... and start
> from there...

	We also have tools for dumping all the documents out of bugzilla - see
the main 'core' repository:

	bin/get-bugzilla-attachments-by-mimetype

	so really the fun piece is writing the parser & element / attribute
value parser / database to analyse what pieces are popular and provide a
pretty UI or command-line for hackers to grok that.
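
	A minimal sketch of that counting step, assuming Python rather than
perl and nothing beyond the standard library: it treats each file as a
zip package, parses every XML part, and tallies element and attribute
names. The function names (count_ooxml_names, count_corpus) are made up
for illustration; tallying attribute values and storing the results in
a database are left out.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_ooxml_names(package_bytes):
    """Count element and attribute names across the XML parts of one package."""
    elements = Counter()
    attributes = Counter()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for part in package.namelist():
            # Only the XML parts are of interest; skip images and other media.
            if not part.endswith((".xml", ".rels")):
                continue
            try:
                root = ET.fromstring(package.read(part))
            except ET.ParseError:
                continue  # skip a damaged part rather than abort the whole file
            for node in root.iter():
                elements[node.tag] += 1
                for name in node.attrib:
                    attributes[name] += 1
    return elements, attributes


def count_corpus(paths):
    """Sum the per-file counts over many files, skipping unreadable ones."""
    elements, attributes = Counter(), Counter()
    for path in paths:
        try:
            with open(path, "rb") as handle:
                file_elements, file_attributes = count_ooxml_names(handle.read())
        except (OSError, zipfile.BadZipFile):
            continue  # missing file, or not a zip package at all
        elements.update(file_elements)
        attributes.update(file_attributes)
    return elements, attributes
```

	Something like count_corpus(paths)[0].most_common(20) would then list
the twenty most used elements; names come back in ElementTree's
{namespace}local form. Scraped files are untrusted input, so a hardened
XML parser may be a better choice than ElementTree for a real run.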

	It'd be just great to have that data in hand.

	Thanks !

		Michael.

-- 
michael.meeks at suse.com  <><, Pseudo Engineer, itinerant idiot