[Clipart] Clip Art Navigator 0.31

Wed Aug 24 14:27:49 PDT 2005

On Mon, Aug 22, 2005 at 09:44:34PM -0400, Greg Steffensen wrote:
> The index file currently uses Python's internal serialization format
> (which Python calls "pickling").  I did that mainly for performance;
> it's faster to load the relevant data structures from that frozen
> format than to recreate them by parsing XML.  I suspect that the speed
> difference here is irrelevant, though; both are plenty fast in
> practice.  I'll do some benchmarks tomorrow.  I saw index.xml, but my
> original reason for not using it was that I wanted users to have the
> flexibility to add additional content to their local clip art store.

I've noticed that in the open source community a general 'best
practice' is to prefer ASCII formats over binary ones, even when the
binary file performs a bit better, because a text format makes it
simpler for other tools to operate on the data as well.  This is a good
case in point; the pickled file is limited in that only Python scripts
can make use of it.

People have also come up with some best practices for addressing the
performance issues of XML.  For example, ASCII files tend to be larger
than binary ones, so developers have adopted the practice of
compressing them with gzip.  This mitigates the size issue (although
the result still won't be as small as a custom format would be) while
still allowing other tools to access the data, since gzip support is
very widespread.  Note that gzip is preferred over other formats like
bzip2 because it is more ubiquitous; other compression libraries may
give better compression ratios, but the goal is to minimize file size
without losing accessibility to the content.
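
As a rough sketch of what that could look like in Python, using only
the standard gzip and xml.etree.ElementTree modules.  The <index> and
<clipart> element names and the write_index/read_index helpers are made
up for illustration; they are not the actual index.xml schema:

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

def write_index(path, entries):
    """Write (filename, title) pairs as gzip-compressed XML."""
    root = ET.Element("index")  # hypothetical element names
    for filename, title in entries:
        item = ET.SubElement(root, "clipart", {"file": filename})
        item.text = title
    with gzip.open(path, "wb") as f:
        f.write(ET.tostring(root, encoding="utf-8"))

def read_index(path):
    """Read the compressed index back into (filename, title) pairs."""
    with gzip.open(path, "rb") as f:
        root = ET.parse(f).getroot()
    return [(item.get("file"), item.text) for item in root.iter("clipart")]

# Small round trip in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "index.xml.gz")
write_index(demo_path, [("apple.svg", "Red apple")])
print(read_index(demo_path))
```

Any other tool can still get at the same data with a plain
"gunzip -c index.xml.gz".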

Another approach to minimizing the performance issues of XML is taken
by libraries like XML::Twig, which let you access and modify a file
without building the whole document tree in memory.  You specify which
tags you're interested in, and the library only builds the parts of the
tree needed to get at those tags.
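
XML::Twig is a Perl module, but Python's standard library offers
something in a similar spirit with ElementTree's iterparse.  This is
only a sketch, and not equivalent to XML::Twig: it still reads the
whole stream, it just avoids keeping the full tree around.  The
titles_only helper and the <title> tag are assumptions for the example:

```python
import io
import xml.etree.ElementTree as ET

def titles_only(source, wanted="title"):
    """Yield the text of each <wanted> element, discarding the rest."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == wanted:
            yield elem.text
        # Drop the element's contents once we are done with it,
        # so memory use stays small on large files.
        elem.clear()

SAMPLE = (b"<index>"
          b"<clipart><title>Apple</title><author>A</author></clipart>"
          b"<clipart><title>Pear</title><author>B</author></clipart>"
          b"</index>")

print(list(titles_only(io.BytesIO(SAMPLE))))
```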

Also, OSS developers tend to follow the adage "don't prematurely
optimize".  That is, strive to first do things in the easiest possible
way, one that leaves things as open and flexible as possible, and only
optimize once you know there is a measurable performance problem.
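
Measuring is cheap.  Here is a minimal sketch of the kind of benchmark
that would settle it, using timeit from the standard library.  The
1000-entry index and its element names are synthetic stand-ins, not
the real data, so the numbers only mean something once you run it
against the real index:

```python
import pickle
import timeit
import xml.etree.ElementTree as ET

# Synthetic index: 1000 (filename, title) pairs.
entries = [("img%04d.svg" % i, "Title %d" % i) for i in range(1000)]

# The same data in both formats.
pickled = pickle.dumps(entries)
root = ET.Element("index")  # hypothetical element names
for filename, title in entries:
    ET.SubElement(root, "clipart", {"file": filename}).text = title
xml_bytes = ET.tostring(root)

def load_pickle():
    return pickle.loads(pickled)

def load_xml():
    tree = ET.fromstring(xml_bytes)
    return [(e.get("file"), e.text) for e in tree.iter("clipart")]

# Both loaders must produce the same structure before timing them.
assert load_pickle() == load_xml()

t_pickle = timeit.timeit(load_pickle, number=20)
t_xml = timeit.timeit(load_xml, number=20)
print("pickle: %.4fs  xml: %.4fs  (20 loads each)" % (t_pickle, t_xml))
```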

> In retrospect, the best way to do that is to allow them to create
> their own indexes, but in OCAL's XML format.  Is there a tool to do
> this already in the OCAL tools package?  Letting the packagers do the
> indexing also has the advantage of not using Python's XML parser
> (expat), which was unable to parse around 30 of the 0.16 images.

This is a good question.

Bryce