Document conversion engine
Michael Meeks
michael.meeks at suse.com
Mon Jul 9 01:22:06 PDT 2012
On Mon, 2012-07-09 at 00:25 +0100, Flavio Moringa wrote:
> nice to hear from someone so "up the ranks" like you.. makes me feel
> much more important :-)
Ho hum; we try to avoid unpleasant hierarchy as much as possible.
> I probably won't be able to do a conversion engine by myself...
> but I can definitely mess around with code...
Great :-)
> Yes, it's definitely something I can do... I do believe that the
> harder part is getting that " large corpus of documents out
> there...". At least as my experience goes, I've found that it's hard
> to get users to send us documents they use... either due to privacy
> questions or enterprise policies... But a tool like that makes a lot
> of sense
Oh - so; getting the documents is not -that- hard; Google has a
document-type search that can be automated; just search for:
filetype:docx
and start scraping; as well as 7 million files, we get to take
advantage of Google's popularity ranking to fetch the most popular
100 (or whatever) first :-)
> For now then I'll start doing as you suggest and look in bugzilla for
> documents with conversion problems to try and compile as many examples
> as I can. Then maybe using the latest beta to do the conversion and
> see which problems are still there. Then maybe starting a perl script
> that can scrape the OOXML files to find the most used tags... and start
> from there...
We also have tools for dumping all the documents out of bugzilla - see
the main 'core' repository:
bin/get-bugzilla-attachments-by-mimetype
so really the fun piece is writing the parser and the element /
attribute-value database to analyse which pieces are popular, and
providing a pretty UI or command-line tool for hackers to grok that.
It'd be just great to have that data in hand.
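As a rough illustration of the counting step only, here is a minimal sketch in Python. It assumes each sample is an OOXML package (a zip archive holding XML parts) and simply tallies element and attribute names. The function name count_ooxml_names and the top-20 report are made up for this example and are not an existing LibreOffice tool; attribute values are not recorded here.

```python
import sys
import zipfile
from collections import Counter
import xml.etree.ElementTree as ET


def count_ooxml_names(docx_file, elements=None, attributes=None):
    """Tally element and attribute names over every XML part of one package.

    Hypothetical helper: docx_file may be a path or a file-like object.
    Names are reported in ElementTree's '{namespace}local' form.
    """
    elements = Counter() if elements is None else elements
    attributes = Counter() if attributes is None else attributes
    with zipfile.ZipFile(docx_file) as package:
        for part in package.namelist():
            # Skip images, embedded binaries and other non-XML parts.
            if not part.endswith((".xml", ".rels")):
                continue
            try:
                root = ET.fromstring(package.read(part))
            except ET.ParseError:
                continue  # a malformed part should not abort the survey
            for node in root.iter():
                elements[node.tag] += 1
                for name in node.attrib:
                    attributes[name] += 1
    return elements, attributes


if __name__ == "__main__":
    all_elements, all_attributes = Counter(), Counter()
    for path in sys.argv[1:]:
        try:
            count_ooxml_names(path, all_elements, all_attributes)
        except zipfile.BadZipFile:
            print(f"skipping {path}: not a zip package", file=sys.stderr)
    # Most frequently used elements first.
    for name, total in all_elements.most_common(20):
        print(total, name)
```

Saved under any file name and run with a list of .docx paths as arguments, it prints the twenty most common element names with their counts. Feeding the counters into a real database and a friendlier front end is the part that is left open above.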
Thanks !
Michael.
--
michael.meeks at suse.com <><, Pseudo Engineer, itinerant idiot