simple search api (was Re: mimetype standardisation by testsets)

Mon Nov 27 13:32:21 PST 2006

2006/11/27, Joe Shaw <joeshaw at novell.com>:
> On Mon, 2006-11-27 at 21:42 +0100, Jos van den Oever wrote:
> > Hmm, in Strigi text fragments are returned with every query and the
> > results are about as fast as I can type, so I guess this depends on
> > the search engine. Since the text fragments are an important part of
> > the user experience, I think we should have them.
> > At the moment we only return the fragment for the 'content' field though.
>
> Are you storing the full text of the document in the index?
Yes, and yes, that makes the index big. This will be configurable.

> What we've found is storing the full text in the index (a) makes the
> index huge and (b) searching slow.  At the same time, extracting the
> content from the source document is pretty slow, especially if it's not
> a text document.  We've taken to caching the text content of structured
> files, but we compress the files to make disk usage a little more
> reasonable.  But finding the N terms in a potentially large document
> tends to slow down searches quite a bit.
I've not noticed that it slows down searching. At work I have a 1.5 GB
index, and it's no sweat. Extracting text from a source doc is only
slow if it is deep in a zip or tar.
Also, you only want the first X hits.
At the moment we send back the complete text content of each hit with
every query. It is then up to the client to find the highlights. I've
not had any problems with this. We do limit the size of the stored text
per doc to about 100k.
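
To make the idea concrete, here is a rough sketch of what a client could
do with the returned text. It is not Strigi code: the function names,
the 100,000-character cap and the context width are all assumptions
made up for illustration.

```python
import re

# Assumed cap, loosely mirroring the "about 100k" limit per document.
MAX_STORED_CHARS = 100_000


def truncate_for_index(text, limit=MAX_STORED_CHARS):
    """Keep only the first `limit` characters of a document's text."""
    return text[:limit]


def find_fragments(text, terms, context=30, max_fragments=3):
    """Return short snippets of `text` around the first term matches.

    Hypothetical client-side helper: the server hands back the stored
    text of a hit, and the client picks the fragments to display.
    """
    terms = [t for t in terms if t]  # ignore empty query terms
    if not terms:
        return []
    pattern = re.compile("|".join(re.escape(t) for t in terms),
                         re.IGNORECASE)
    fragments = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        fragments.append(text[start:end])
        if len(fragments) >= max_fragments:
            break
    return fragments
```

Since the text per hit is capped and only the first few hits are shown,
the scan stays cheap on the client side.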

Cheers,
Jos