[Libreoffice] [GSoc] Progress report - Visio import filter

Thu May 26 00:39:17 PDT 2011

Hello, Eilidh,

In our private conversation you asked for some guidance on how to
structure the library. Here are my basic thoughts. They come from having
contributed to several libraries, but they are not gospel.

1) Since you will quite often have to parse compressed chunks of the
stream, it might be useful to write a class like the following one:

class VSDInternalStream : public WPXInputStream
{
public:
  VSDInternalStream(WPXInputStream *input, size_t dataSize, bool isCompressed);
  ~VSDInternalStream();
  (...)
private:
  std::vector<unsigned char> m_buffer;
  // Declared private and left unimplemented: no default construction, no copying.
  VSDInternalStream();
  VSDInternalStream(const VSDInternalStream &);
};

It would be constructed by reading dataSize bytes of the input into
m_buffer and, if the data is compressed, decompressing it on the fly.
That way this task, which will be quite frequent, lives in one place.
Another advantage is that the resulting stream would be seekable and you
would just read it like any other WPXInputStream.
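
To make the idea concrete, here is a minimal sketch of such a buffering
stream. It is not libwpd code: ByteSource is a hypothetical, cut-down
stand-in for WPXInputStream (the real interface has more members and
different signatures), BufferedStream stands in for VSDInternalStream,
and the decompressor is passed in as a hypothetical Inflater hook because
the Visio compression scheme itself is out of scope here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical, cut-down stand-in for WPXInputStream (assumption).
class ByteSource
{
public:
  virtual ~ByteSource() {}
  // Returns a pointer to at most numBytes bytes and advances the position.
  virtual const unsigned char *read(std::size_t numBytes, std::size_t &numBytesRead) = 0;
  virtual bool seek(long offset) = 0; // absolute offset from the start
  virtual long tell() = 0;
  virtual bool atEOS() = 0;
};

// Hypothetical decompression hook: compressed bytes in, plain bytes out.
typedef std::function<std::vector<unsigned char>(const std::vector<unsigned char> &)> Inflater;

class BufferedStream : public ByteSource
{
public:
  // Wrap bytes that are already in memory (handy for tests).
  explicit BufferedStream(const std::vector<unsigned char> &bytes)
    : m_buffer(bytes), m_offset(0) {}

  // Read dataSize bytes from input and decompress them once, up front.
  BufferedStream(ByteSource &input, std::size_t dataSize, bool isCompressed,
                 const Inflater &inflate = Inflater())
    : m_buffer(), m_offset(0)
  {
    std::size_t got = 0;
    const unsigned char *data = input.read(dataSize, got);
    if (data && got > 0)
      m_buffer.assign(data, data + got);
    if (isCompressed)
    {
      assert(static_cast<bool>(inflate)); // a decompressor must be supplied
      m_buffer = inflate(m_buffer);
    }
  }

  const unsigned char *read(std::size_t numBytes, std::size_t &numBytesRead)
  {
    numBytesRead = 0;
    if (numBytes == 0 || m_offset >= m_buffer.size())
      return 0;
    const std::size_t available = m_buffer.size() - m_offset;
    numBytesRead = numBytes < available ? numBytes : available;
    const unsigned char *p = &m_buffer[m_offset];
    m_offset += numBytesRead;
    return p;
  }

  bool seek(long offset)
  {
    if (offset < 0 || static_cast<std::size_t>(offset) > m_buffer.size())
      return false;
    m_offset = static_cast<std::size_t>(offset);
    return true;
  }

  long tell() { return static_cast<long>(m_offset); }
  bool atEOS() { return m_offset >= m_buffer.size(); }

private:
  std::vector<unsigned char> m_buffer; // decompressed data
  std::size_t m_offset;                // current read position in m_buffer
  BufferedStream(const BufferedStream &);            // not copyable
  BufferedStream &operator=(const BufferedStream &); // not assignable
};
```

Because everything ends up in m_buffer, seeking is just index arithmetic,
and a sub-stream can itself be the input of another sub-stream.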

2) Since I see that in the isSupported function you are distinguishing
two versions of the Visio document format, I would suggest that you
write a base parser class, something like:

class VSDXParser
{
public:
  VSDXParser(WPXInputStream *input);
  ~VSDXParser();
protected:
  ....
private:
  ....
};

It would contain the functions common to all the formats as well as the
common state that you will need to keep. It could have two derived
classes, for n=11 and n=6:

class VSD<n>Parser : protected VSDXParser
{
public:
  VSD<n>Parser(WPXInputStream *input);
  ~VSD<n>Parser();
  bool parse(libwpg::WPGPaintInterface *iface); // true on success (assumed return type)
private:
  ....
};

Those would contain the functions specific to the given file-format
version as well as any specific state information that cannot be
extracted into the VSDXParser.

Now in the VisioDocument::parse(...) function, one could detect which
file-format we are parsing, construct the corresponding VSD<n>Parser and
call the parse on it.
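
Put together, the dispatch could look roughly like the sketch below. It
is only an illustration under a few assumptions: Input and PaintLog are
hypothetical stand-ins for WPXInputStream and libwpg::WPGPaintInterface,
parseDocument plays the role of VisioDocument::parse(...), the version
number is assumed to have been read from the header already, and the
parse functions do nothing except record which parser ran.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins (assumptions, not the libwpd/libwpg types).
struct Input { std::vector<unsigned char> bytes; };
struct PaintLog { std::vector<std::string> calls; };

// Base class: helpers and state shared by all format versions.
class VSDXParser
{
public:
  explicit VSDXParser(Input *input) : m_input(input) {}
  virtual ~VSDXParser() {}
protected:
  bool hasData() const { return m_input && !m_input->bytes.empty(); }
  Input *m_input;
};

// Version-specific parsers; only parse() is visible to the caller.
class VSD6Parser : protected VSDXParser
{
public:
  explicit VSD6Parser(Input *input) : VSDXParser(input) {}
  bool parse(PaintLog *iface)
  {
    if (!hasData())
      return false;
    iface->calls.push_back("VSD6"); // placeholder for the real work
    return true;
  }
};

class VSD11Parser : protected VSDXParser
{
public:
  explicit VSD11Parser(Input *input) : VSDXParser(input) {}
  bool parse(PaintLog *iface)
  {
    if (!hasData())
      return false;
    iface->calls.push_back("VSD11"); // placeholder for the real work
    return true;
  }
};

// Pick the parser matching the detected version and run it.
bool parseDocument(Input *input, unsigned version, PaintLog *iface)
{
  switch (version)
  {
  case 6:
  {
    VSD6Parser parser(input);
    return parser.parse(iface);
  }
  case 11:
  {
    VSD11Parser parser(input);
    return parser.parse(iface);
  }
  default:
    return false; // unsupported version
  }
}
```

With protected inheritance the caller cannot treat a VSD6Parser as a
VSDXParser, so the switch constructs the concrete class directly instead
of going through a base-class pointer.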

3) As to the development process, I would suggest first getting some dry
parsing in place, with functions that read the different elements of the
Visio document without really processing them. You can plant several
VSD_DEBUG_MSG((...)); statements inside the functions (include
libvisio_utils.h and, optionally, un-comment the #define VERBOSE_DEBUG 1
for the period of heavy development). That way you get the maximum of
information on your console without the parser actually calling any of
the interface callbacks. Then you can go on from there and start
processing the useful content.
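
As an illustration of dry parsing, here is a small self-contained sketch.
The macro definition is my assumption of how a VSD_DEBUG_MSG switched by
VERBOSE_DEBUG could be written, not a copy of libvisio_utils.h, and the
chunk layout (one type byte, one length byte, then the payload) is
invented purely for the example; the real Visio layout is different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

#define VERBOSE_DEBUG 1 // comment out to silence the messages

// Double parentheses let the macro forward a whole printf argument list.
#ifdef VERBOSE_DEBUG
#define VSD_DEBUG_MSG(M) std::printf M
#else
#define VSD_DEBUG_MSG(M)
#endif

// Walk over the chunks, report each one, and call no interface callbacks.
// Returns the number of complete chunks seen. The layout is hypothetical.
std::size_t dryParse(const std::vector<unsigned char> &data)
{
  std::size_t pos = 0;
  std::size_t count = 0;
  while (pos + 2 <= data.size())
  {
    const unsigned type = data[pos];
    const std::size_t length = data[pos + 1];
    pos += 2;
    if (length > data.size() - pos)
    {
      VSD_DEBUG_MSG(("truncated chunk, type 0x%02x\n", type));
      break;
    }
    VSD_DEBUG_MSG(("chunk type 0x%02x, %lu bytes at offset %lu\n", type,
                   static_cast<unsigned long>(length),
                   static_cast<unsigned long>(pos)));
    pos += length; // skip the payload without interpreting it
    ++count;
  }
  return count;
}
```

Once the console output matches what you expect from the file, the
skipped payloads can be replaced one by one with real processing.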

Myself, I would maybe write a VSDElement class that constructs itself
from a pointer to the current input stream and has some kind of
processContent function that decides whether to call a private
_readContent(...) for supported elements or _skipContent(...) for
unsupported ones. But again, this is too much implementation detail, and
I freely confess that I am biased by what we did in libwpd and libwpg.
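
For what it is worth, such an element class could be sketched as follows.
Everything beyond the names VSDElement, processContent, _readContent and
_skipContent is an assumption made for the example: the input is a plain
byte vector with a shared offset instead of a WPXInputStream, the element
layout (type byte, length byte, payload) is invented, and treating type
0x01 as the only supported element is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class VSDElement
{
public:
  // Reads the (hypothetical) two-byte element header at pos and advances pos.
  VSDElement(const std::vector<unsigned char> &input, std::size_t &pos)
    : m_input(input), m_pos(pos), m_type(0), m_length(0), m_content()
  {
    assert(m_pos <= m_input.size());
    if (m_pos + 2 <= m_input.size())
    {
      m_type = m_input[m_pos];
      m_length = m_input[m_pos + 1];
      m_pos += 2;
    }
    // Never run past the end of the input on a truncated element.
    if (m_length > m_input.size() - m_pos)
      m_length = m_input.size() - m_pos;
  }

  // Returns true if the content was read, false if it was skipped.
  // Either way the shared offset ends up just past this element.
  bool processContent()
  {
    if (isSupported(m_type))
    {
      _readContent();
      return true;
    }
    _skipContent();
    return false;
  }

  const std::vector<unsigned char> &content() const { return m_content; }

private:
  // Arbitrary choice for the example: only type 0x01 is "supported".
  static bool isSupported(unsigned char type) { return type == 0x01; }

  void _readContent()
  {
    m_content.assign(m_input.begin() + m_pos, m_input.begin() + m_pos + m_length);
    m_pos += m_length;
  }

  void _skipContent() { m_pos += m_length; }

  const std::vector<unsigned char> &m_input;
  std::size_t &m_pos; // offset shared with the caller
  unsigned char m_type;
  std::size_t m_length;
  std::vector<unsigned char> m_content;
};
```

The caller only ever constructs an element and calls processContent, so
unsupported elements can never leave the stream in the wrong position.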

4) The bottom line of a good FOSS development model is to push small
changes often. This has two big advantages:
a) it is easier to bisect changes when something breaks;
b) it gives a nice overview of progress.
If atomic changes are committed and pushed (or at least the day's work
at the end of the day), I will be able to look at them often and pat you
on the back if things are wonderful, marvelous, beyond the wildest
imagination; or ask questions, seek clarification and discuss
directions if needed. Communication is the main challenge of any GSoC
endeavour, and the git repository can help us get it right.
Sometimes GSoC students are scared that publicly pushing code of
questionable quality will hurt them when a prospective employer googles
for their work. This is largely a myth; the evidence is that if it were
true, I would probably have spent all my life living on social welfare :)

Happy hacking

Fridrich

On Sun, 2011-05-08 at 17:08 +0100, Tibby Lickle wrote:
> Hi,
> 
> Just an update on where I am. So far I've been working on the basics
> of extracting the data from the .vsd file.
> To read Visio files, the steps are roughly:
> 1. Get the interesting part ("VisioDocument") from the OLE container.
> 2. Parse the header to get a pointer to the trailer stream (as well as
> version, length of file, etc.)
> 3. Inflate compressed trailer.
> 4. Parse out pointers in trailer to the various - potentially
> compressed - streams that hold the actual Visio document content.
> 
> I've done 1 - 3. I'm using the WPXStream and its implementation from
> libwpd (WPXStreamImplementation.h here) to read/extract OLE streams.
> The implementation of LZW-esque decompression of the trailer is
> translated from Python to C++ (i.e. shamelessly ripped off) from
> oletoy (thanks frob). 
> I suspect most of what I'll be doing will be stand-alone for now -
> developing and debugging will be too slow if LO integration is
> included at this early stage. Once I've got a very basic parser, the
> callback interface discussed in my proposal will be implemented and
> integration with LO should in theory be relatively easy.
> 
> Note to my mentor -- I've got a paper due for next Saturday so my main
> focus will be on that. I will, however, be spending some time on the
> next stage.
> 
> Eilidh 