Shared documentation system

Wed Dec 10 19:48:16 EET 2003

On Wed, 9 Dec 2003, Shaun McCance wrote:

> > Indexing documentation for its own purpose is a good idea too. We can
> > imagine at least three kinds of searches:
> >
> > Search for programs
> > Search in documentation
> > Search in user files.
> >
> > Of these, at least the first two should be considered in the same context.
>
> Indeed.  Let me repeat this point, because it's important:  The help
> system should not impose policy.  Having the ability to search all the
> documentation is probably more important and more useful than having a
> nicely-categorized listing of all the documentation.
>
> How indexing and searching is done is an implementation detail.  Right
> now I'm working on search capability for Yelp.  All I really need from
> the help system is a listing of the documents installed, though having
> more metadata is always better than less.

Agreed. The main point of my post was not to suggest a particular
implementation for search or indexing. I just wanted to point out that I
think it is a good idea for the documentation metadata and the .desktop
files to be aware of each other somehow. (I think you mentioned this as
well.) At the least, I think it should be possible to get from a .desktop
file to the documentation metadata, and from there to the
documentation. If this were possible, many nice options could be presented
to the user, and documentation could be made available in many
interesting, yet-to-be-thought-of places.
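
To make the lookup chain concrete, here is a rough sketch in Python. It is
only an illustration under my own assumptions: the X-Doc-Metadata key, the
[Document] group, the Location key and the file names are all made up for
the example and are not part of any existing specification.

```python
import configparser

# Hypothetical key name, invented for this sketch.
DOC_METADATA_KEY = "X-Doc-Metadata"


def read_group(text, group):
    """Parse key=value text and return one group as a plain dict."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key case; .desktop keys are case-sensitive
    parser.read_string(text)
    return dict(parser[group])


def find_documentation(desktop_text, metadata_files):
    """Follow .desktop -> documentation metadata -> document location.

    metadata_files maps a metadata path to that file's text. It stands in
    for the file system so the sketch stays self-contained.
    Returns None when the .desktop file points at no known metadata.
    """
    entry = read_group(desktop_text, "Desktop Entry")
    metadata_path = entry.get(DOC_METADATA_KEY)
    if metadata_path is None or metadata_path not in metadata_files:
        return None
    # "Document" and "Location" are hypothetical names as well.
    metadata = read_group(metadata_files[metadata_path], "Document")
    return metadata.get("Location")


DESKTOP = """[Desktop Entry]
Type=Application
Name=Example Editor
Exec=example-editor
X-Doc-Metadata=example-editor.docmeta
"""

METADATA_FILES = {
    "example-editor.docmeta": """[Document]
Title=Example Editor Manual
Location=help/example-editor/index.xml
""",
}

print(find_documentation(DESKTOP, METADATA_FILES))
```

The point is only that one extra key in the .desktop file would be enough
for a launcher, a help browser or a crawler to reach the documentation.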

> Now, one could argue that documentation should ship a special file that
> lists the index terms and where they are located in the document.  And
> this could be useful.  It's done that way on MacOS X.  In GNOME, we work
> directly with DocBook.  A big advantage to this is that index terms are
> part of the format.  But I certainly don't oppose a standard index file
> that can be attached to HTML or other formats.  In fact, it could even
> be useful for DocBook, as a pre-generated index file will take less
> processing overhead.
>
> I view this as somewhat orthogonal to the original proposal, though it's
> certainly something to keep in mind.  To me, an index file is just some
> metadata that is attached to a copy of a document.  If we have a good
> file format for metadata, this sort of stuff can easily be worked in.
> This is why it's important for the system to be extensible.
>
> And a final point, which is a bit off-topic, but still worth saying:
> You mentioned that you got some pretty bad results, because documents
> didn't have good index terms.  Ultimately, the construction of good
> indexes is a job for humans.  Even in DocBook, although the physical
> index can be automatically generated, the index terms are still placed
> in the markup of the document by human editors.
>
> The creation of *good* indexes is a hard task.  Real publishing firms
> have employees whose sole task is to create and edit indexes.  There
> exist entire books written on the subject of creating effective indexes.

Perhaps (I am not sure) the indexes we are talking about are not really
the same thing. I was thinking of indexes created by Lucene, filled with
text and other data gathered by a crawler that is aware of both .desktop
files and documentation metadata. These indexes are fast to look up
and usually include all the text (except stop words) that is put into them.
With Lucene, it is also possible to adjust the "score" of data that is
inserted into the index, so that data from one source can rank higher
than data from another. I would think text from an abstract could be
ranked higher than text from the body itself, since it probably has a
higher concentration of relevant words.
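
As a toy illustration of the boosting idea (this is not Lucene's API, just
a minimal inverted index I sketched in Python), each piece of text is added
with a weight, stop words are dropped, and a search sums the weights. The
class name, the stop-word list and the boost values are arbitrary
assumptions.

```python
from collections import defaultdict

# A tiny, arbitrary stop-word list for the sketch.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


class TinyIndex:
    """Toy inverted index with per-source boosts (not Lucene itself)."""

    def __init__(self):
        # term -> {doc_id: accumulated weight}
        self.postings = defaultdict(lambda: defaultdict(float))

    def add(self, doc_id, text, boost=1.0):
        """Index every non-stop word of text, weighted by boost."""
        for word in text.lower().split():
            word = word.strip(".,;:!?()\"'")
            if word and word not in STOP_WORDS:
                self.postings[word][doc_id] += boost

    def search(self, query):
        """Return (doc_id, score) pairs, best score first."""
        scores = defaultdict(float)
        for word in query.lower().split():
            if word in self.postings:
                for doc_id, weight in self.postings[word].items():
                    scores[doc_id] += weight
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = TinyIndex()
# Abstracts get a higher boost than body text (3.0 is an arbitrary choice).
index.add("editor-manual", "Printing documents", boost=3.0)
index.add("editor-manual", "Open the file menu", boost=1.0)
index.add("viewer-manual", "View images", boost=3.0)
index.add("viewer-manual", "Printing is available in the file menu", boost=1.0)

print(index.search("printing"))
```

Here both documents mention printing, but the one that mentions it in its
abstract comes first.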

Of course, your points are still valid.

Take a look at Lucene, if you haven't already. It is a very
interesting open-source product. I, at least, got many ideas after
learning more about what it can do.

Claes