[ooo-build] GSoC participation from Go oo

Thu Apr 1 00:09:42 PDT 2010

Hi Manu,

Many thanks for your interest in Go-OpenOffice GSoC 2010. It would be
easier for us to explain on IRC what the tokenizer and layout
recognition mean.

In the meantime, here is a first idea of what the RTF project is
about:

The .doc, .rtf and .docx file formats can be considered cousins:
their internal concepts are basically the same, though they are not
encoded in the same way (binary, raw text and XML respectively). There
is currently one import filter for each of them, and they are starting
to get old: the code is hard to maintain.

An effort to create a common filter has therefore been started: it lives
in the writerfilter directory at the root of the OOo sources. The basic
idea is to have some format-specific code that splits the file into
tokens. This "Tokenizer" then passes the tokens to other code that maps
them to the OOo internals (through UNO): that one is the
DomainMapper.
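
To make the split a bit more concrete, here is a rough sketch of the
idea. It is written in Python only for brevity (the real writerfilter
code is C++), and every name in it (Token, tokenize_pipe_format,
tokenize_tag_format, ToyMapper, run_import) as well as the two toy input
formats are made up for illustration; none of them come from the OOo
sources. It only shows two format-specific front ends producing the same
kind of tokens for one shared mapper:

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Token:
    """Format-independent token (hypothetical shape, not the OOo one)."""
    kind: str        # "property" or "text"
    name: str = ""   # property name, e.g. "bold"
    value: str = ""  # property value or text content


def tokenize_pipe_format(data: str) -> Iterator[Token]:
    """Toy front end A: input looks like 'bold=1|Hello|bold=0| world'."""
    for part in data.split("|"):
        if "=" in part:
            name, value = part.split("=", 1)
            yield Token("property", name, value)
        else:
            yield Token("text", value=part)


def tokenize_tag_format(data: str) -> Iterator[Token]:
    """Toy front end B: input looks like '<b>Hello</b> world'."""
    for piece in re.split(r"(</?b>)", data):
        if piece == "<b>":
            yield Token("property", "bold", "1")
        elif piece == "</b>":
            yield Token("property", "bold", "0")
        elif piece:  # skip empty strings produced by re.split
            yield Token("text", value=piece)


class ToyMapper:
    """Stand-in for the DomainMapper role: turns tokens into a model."""

    def __init__(self) -> None:
        self.bold = False
        self.runs: List[Tuple[str, bool]] = []

    def handle(self, token: Token) -> None:
        if token.kind == "property" and token.name == "bold":
            self.bold = token.value == "1"
        elif token.kind == "text":
            self.runs.append((token.value, self.bold))


def run_import(tokens: Iterable[Token]) -> List[Tuple[str, bool]]:
    """Feed any tokenizer's output to the shared mapper."""
    mapper = ToyMapper()
    for token in tokens:
        mapper.handle(token)
    return mapper.runs
```

Both run_import(tokenize_pipe_format("bold=1|Hello|bold=0| world")) and
run_import(tokenize_tag_format("<b>Hello</b> world")) give the same list
of text runs, which is the point: only the front end knows about the
file encoding.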

There are currently two tokenizers implemented: the ones for the .doc
and .docx file formats. The tokenizer for RTF files is still missing, and
your task would be to start implementing it. The DomainMapper has
already been started, though it is still quite buggy: that leaves you
free to concentrate on the RTF token extraction.
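
As a first feel for what "RTF token extraction" means, here is a minimal
sketch, again in Python for brevity and not taken from writerfilter. It
assumes only the basic RTF syntax: groups delimited by { and }, control
words made of a backslash, letters, an optional signed number and an
optional delimiting space, and control symbols made of a backslash plus
one non-letter character. The function name tokenize_rtf and the tuple
layout are hypothetical, and things like \'hh hex escapes, \bin data and
character encodings are deliberately left out:

```python
import re
from typing import Iterator, Optional, Tuple

# (kind, name_or_text, numeric_parameter)
RtfToken = Tuple[str, str, Optional[int]]

# Backslash, letters, optional signed integer, optional single space.
_CONTROL_WORD = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?")


def tokenize_rtf(data: str) -> Iterator[RtfToken]:
    """Split raw RTF text into group, control and text tokens."""
    pos = 0
    text = []  # characters of the text run being accumulated
    while pos < len(data):
        ch = data[pos]
        if ch in "{}\\":
            # Any structural character ends the current text run.
            if text:
                yield ("text", "".join(text), None)
                text = []
            if ch == "{":
                yield ("group_start", "", None)
                pos += 1
            elif ch == "}":
                yield ("group_end", "", None)
                pos += 1
            else:
                match = _CONTROL_WORD.match(data, pos)
                if match:
                    raw = match.group(2)
                    param = int(raw) if raw is not None else None
                    yield ("control_word", match.group(1), param)
                    pos = match.end()
                else:
                    # Control symbol such as \{ or \~ (one character).
                    yield ("control_symbol", data[pos + 1:pos + 2], None)
                    pos += 2
        elif ch in "\r\n":
            pos += 1  # bare line breaks carry no meaning in RTF text
        else:
            text.append(ch)
            pos += 1
    if text:
        yield ("text", "".join(text), None)
```

For example, list(tokenize_rtf(r"{\rtf1\ansi Hello \b world\b0 }"))
yields a group start, the control words rtf (parameter 1) and ansi, the
text "Hello ", the control word b, the text "world", the control word b
with parameter 0, and a group end. A real tokenizer would then hand such
tokens over to the DomainMapper instead of returning them as tuples.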

The RTF specs can be found here:
http://www.microsoft.com/downloads/details.aspx?FamilyId=DD422B8D-FF06-4207-B476-6B5396A18A2B&displaylang=en

I can't answer your question about layout recognition, as this is not
my area :)

Hope that helps,

--
Cedric

On Sat, 2010-03-27 at 00:35 +0530, manu c wrote:
> Hello all,
>     I am a student who wishes to participate in GSoC as a Go-oo student.
> I have been looking at the list of ideas for quite a few days; the
> ideas seem interesting and I am actually unsure which one to choose.
> Ideas like "Improve RTF Import (RTF Tokenizer)" and
> "Use PDF import's layout recognition for other vector formats (e.g.
> postscript, wmf/emf)"
> are the ones I would really like to work on, and since I have a fair
> knowledge of C and C++,
> I think that I can take up these projects.
> But since I know little about different file formats (I have worked with
> formats like .bmp and .jpeg in the past
> for a hobby project),
> will someone please help me understand the different file formats
> involved in the project,
> and also the concepts of "Tokenizer" and "layout recognition"?
> 
> Thanks and regards ,
>   Manu C