Shared documentation system

Wed Dec 10 02:51:00 EET 2003

On Mon, 2003-12-08 at 12:56, Claes Holmerson wrote:
> I like this idea a lot. I was thinking something similar recently. Let me
> add some thoughts to the subject, from a slightly different angle.
> 
> As start menus become more and more crowded, I think it is time to take a
> step back and consider if there are any better ways to find and launch
> programs. Instead of hierarchical presentation, I think a search feature
> would be very useful, perhaps combined with some history which keeps track
> of recent and/or most used programs. Compare for example Google with
> Yahoo's hierarchical index to see which is most successful :-)
> 
> My vision is that searching among installed programs should produce a
> result page that looks somewhat like a freshmeat search: a more detailed
> description of the program, together with ways to start it, a pointer to
> its documentation, and, for example, a link to it on the web. In order to
> do this, .desktop files need to hold more metadata about their programs.
> 
> This is where this documentation proposal comes in so well. If there was a
> way to find relevant documentation from a .desktop file, that
> documentation could be indexed as belonging to this .desktop file, and
> that would allow much more text to be indexed.
> 
> For fun, I experimented with indexing all the .desktop-files I could find
> on my system (Suse 8.2). For this I used the Lucene indexer and search
> engine, which is part of the Apache Jakarta project
> (http://jakarta.apache.org/lucene). Lucene works with the concept of
> "documents", which are filled with "fields" containing the searchable
> data, and then stored in an index. Queries to the index will return hits
> which refer to the documents that were put in it. Lucene includes a
> sophisticated query parser, and this combined with the ability to search in
> a combination of fields makes it pretty powerful. Normal search engine
> syntax, such as AND, OR, NOT, as well as prefix operators such as + and -
> are supported.
> 
> Lucene is a Java library, and not an ideal dependency for the desktop, but
> it is popular enough to have many ports in progress. At Sourceforge, there
> are a number of porting projects, to Python and C++ among others. For my
> prototype, this did not matter. My goal was more to investigate whether
> .desktop files contain enough information to build a useful index. Note
> that Lucene is not a web crawler or a web search engine. Nothing in it
> ties it to the web, and it can easily be used to index files in a file
> system. Another useful feature is to index the user's documents, but that
> is a different issue.
> 
> My idea was to create a Lucene document for every .desktop file. In each
> Lucene document I stored the path to the .desktop file, which makes this
> path available in each hit. There is not a huge amount of data in a
> .desktop file that makes sense to index. Name, GenericName, Comment and
> Categories are the ones that are obvious. They are the only ones that
> contain text that the user is likely to search for. I also looked up the
> MIME type description for each MimeType from the MIME definitions in
> freedesktop.org.xml and indexed that in the document too, in the cases
> where MIME types were specified.
> 
> After this, I searched against the index. It worked ok, but not great. In
> many cases grep for the same terms would give approximately the same
> results. A big problem, I think, is that .desktop files do not include
> much human-readable information. The comments are designed to be
> shown in brief tooltips, and more information about the programs is not
> readily available from the .desktop file itself. A simple example: a
> search for "mp3" resulted in far fewer hits than I expected. The reason is
> that many mp3-capable players only describe themselves as media
> players, and lookup of mimetype "audio/x-mp3" in
> freedesktop.org.xml results in "MPEG layer 3 audio" rather than "MP3
> audio" or similar. That is strictly speaking correct, but not as likely
> to be searched for. With more text to index for each program, I believe
> the results would improve. The documentation is likely to mention mp3 I
> think :-)
> 
> With this proposal, if there was a way to simply find the relevant
> documentation for a .desktop file, indexing would be much more useful. I
> was also thinking about adding documentation metadata to the .desktop file
> itself, but the .desktop file format is not well suited to having lots of
> readable text in it. I also agree with the "nesting problem" regarding
> .desktop files.
> 
> Indexing documentation for its own purpose is a good idea too. We can
> imagine at least three kinds of searches:
> 
> Search for programs
> Search in documentation
> Search in user files.
> 
> Of these, at least the first two should be considered in the same context.

Indeed.  Let me repeat this point, because it's important:  The help
system should not impose policy.  Having the ability to search all the
documentation is probably more important and more useful than having a
nicely-categorized listing of all the documentation.

How indexing and searching are done is an implementation detail.  Right
now I'm working on search capability for Yelp.  All I really need from
the help system is a listing of the documents installed, though having
more metadata is always better than less.
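
To make the quoted experiment a bit more concrete, here is a rough sketch
of that kind of per-.desktop-file index.  It is plain Python with a tiny
inverted index, not Lucene, and not what Yelp or the original prototype
actually does.  The sample .desktop contents, the file names, and the
MIME_DESCRIPTIONS table (a stand-in for a lookup in freedesktop.org.xml)
are all made up for illustration.

```python
import re
from collections import defaultdict

# Keys whose values are indexed as text, as listed in the quoted message.
TEXT_KEYS = ("Name", "GenericName", "Comment", "Categories")


def parse_desktop_entry(text):
    """Return the key/value pairs of the [Desktop Entry] group."""
    fields = {}
    in_group = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("["):
            in_group = line == "[Desktop Entry]"
            continue
        if in_group and "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(entries, mime_descriptions=None):
    """entries maps path -> file contents; returns term -> set of paths."""
    mime_descriptions = mime_descriptions or {}
    index = defaultdict(set)
    for path, text in entries.items():
        fields = parse_desktop_entry(text)
        chunks = [fields.get(key, "") for key in TEXT_KEYS]
        # Index the description of each MIME type, where one is known.
        for mime in fields.get("MimeType", "").split(";"):
            chunks.append(mime_descriptions.get(mime.strip(), ""))
        for chunk in chunks:
            for term in tokenize(chunk):
                index[term].add(path)
    return index


def search(index, query):
    """Return the paths that match every term of the query (implicit AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


# Hypothetical sample data, not taken from any real system.
ENTRIES = {
    "tunes.desktop": "[Desktop Entry]\nName=Tunes\nGenericName=Media Player\n"
                     "Comment=Play your music collection\n"
                     "Categories=AudioVideo;Audio;\nMimeType=audio/x-mp3;\n",
    "notes.desktop": "[Desktop Entry]\nName=Notes\nGenericName=Text Editor\n"
                     "Comment=Edit text files\nCategories=Utility;TextEditor;\n"
                     "MimeType=text/plain;\n",
}
MIME_DESCRIPTIONS = {"audio/x-mp3": "MPEG layer 3 audio"}

index = build_index(ENTRIES, MIME_DESCRIPTIONS)
print(search(index, "media player"))  # finds tunes.desktop
print(search(index, "mp3"))           # finds nothing
```

With this sample data, a search for "mp3" comes back empty even though the
player handles MP3 files, which is the same problem described in the quoted
message: the only indexed text is "Media Player" and "MPEG layer 3 audio".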

Now, one could argue that documentation should ship a special file that
lists the index terms and where they are located in the document.  And
this could be useful.  It's done that way on Mac OS X.  In GNOME, we work
directly with DocBook.  A big advantage to this is that index terms are
part of the format.  But I certainly don't oppose a standard index file
that can be attached to HTML or other formats.  In fact, it could even
be useful for DocBook, as a pre-generated index file would require less
processing.
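
As a rough illustration of what pre-generating such an index from DocBook
might look like, the sketch below collects the <primary> text of each
<indexterm> and records the id of the nearest enclosing element that has
one.  The tab-separated output format, the function names, and the sample
document are invented for this example; nothing here is a proposed
standard, and it assumes un-namespaced DocBook 4 style markup.

```python
import xml.etree.ElementTree as ET


def collect_index_terms(docbook_xml):
    """Map each primary index term to the ids of the elements containing it."""
    root = ET.fromstring(docbook_xml)
    index = {}

    def walk(element, nearest_id):
        # Remember the closest ancestor-or-self element that carries an id.
        nearest_id = element.get("id", nearest_id)
        if element.tag == "indexterm":
            primary = element.find("primary")
            if primary is not None and primary.text:
                term = " ".join(primary.text.split())
                targets = index.setdefault(term, [])
                if nearest_id not in targets:
                    targets.append(nearest_id)
        for child in element:
            walk(child, nearest_id)

    walk(root, None)
    return index


def format_index_file(index):
    """Serialize the index as 'term<TAB>id' lines (a made-up format)."""
    lines = []
    for term in sorted(index):
        for target in index[term]:
            lines.append(f"{term}\t{target or ''}")
    return "\n".join(lines) + "\n"


# Hypothetical sample document.
SAMPLE = """<article id="player-manual">
  <sect1 id="formats">
    <title>Supported formats</title>
    <para><indexterm><primary>MP3</primary></indexterm>
    The player handles MPEG layer 3 audio.</para>
  </sect1>
  <sect1 id="playlists">
    <title>Playlists</title>
    <para><indexterm><primary>playlist</primary></indexterm>
    <indexterm><primary>MP3</primary></indexterm>
    Playlists can mix MP3 and other files.</para>
  </sect1>
</article>"""

terms = collect_index_terms(SAMPLE)
print(format_index_file(terms), end="")
```

A search tool could then read the small generated file instead of parsing
the full document each time.  Note that the terms themselves still come
from markup that a human placed in the document.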

I view this as somewhat orthogonal to the original proposal, though it's
certainly something to keep in mind.  To me, an index file is just some
metadata that is attached to a copy of a document.  If we have a good
file format for metadata, this sort of stuff can easily be worked in. 
This is why it's important for the system to be extensible.

And a final point, which is a bit off-topic, but still worth saying: 
You mentioned that you got some pretty bad results, because documents
didn't have good index terms.  Ultimately, the construction of good
indexes is a job for humans.  Even in DocBook, although the physical
index can be automatically generated, the index terms are still placed
in the markup of the document by human editors.

The creation of *good* indexes is a hard task.  Real publishing firms
have employees whose sole task is to create and edit indexes.  There
exist entire books written on the subject of creating effective indexes.

--
Shaun