[packagekit] [gentoo-dev] Inviting you to project "PackageMap"

Wed Jun 17 17:09:33 PDT 2009

Marijn Schouten (hkBst) wrote:
> Sebastian Pipping wrote:
>> I start to understand the real benefits of moving a larger
>> part of the maintenance down to the distro level as you proposed.
> 
>> Okay, let's add support for CPEs at distro package level
>> and sync up and down with the central packagemap database.
>> Please contact me for collaboration on sync scripts
>> and "modeling" of details.
> 
> Do we not already have enough information available to automatically determine
> derived unique identifiers like CPE?
> 
> We have the package homepage and the package name (and the package category) and
> the combination should be enough information to do direct comparisons to data
> gathered from other repos (assuming they also contain such data).

You are asking a valid question.  Homepage links can be a great
help in mapping, and they already were for the mapping of the
first 1000 Gentoo packages in packagemap.

However, it might not be as easy as you make it sound, as a few
things complicate matters and produce extra work:

 - In many cases a project can be reached from several URLs.
   For a project on SF.net you might have
   - http://sf.net/projects/${name}
   - http://${name}.sf.net/
   - http://www.${name}.org/
   That case can be handled rather easily but there are many more
   special cases and a manual map may be required for stuff that's
   not hosted on a larger hosting site.

 - Split packages (think Git or Qt) may all have the same homepage.
   In Debian the source package might help there; in Gentoo you'd
   have to do common prefix detection or similar.  Those are special
   cases again, and they need continuous review to make sure the
   detection still does what you need.
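
As a rough sketch of the SF.net case from the first point, the helper
below reduces the two SF.net URL forms to one key.  The function name
canonical_homepage and the key format sf:${name} are made up for
illustration, and treating sourceforge.net as an alias of sf.net is an
assumption.  The http://www.${name}.org/ form cannot be recognized by
pattern, so it is passed through unchanged; that is where a manual map
would still be needed.

```shell
# Hypothetical helper (not part of packagemap): map known SF.net URL
# forms to a single key "sf:<name>"; print anything else unchanged.
canonical_homepage() {
    printf '%s\n' "$1" | sed -E \
        -e 's#^https?://(www\.)?(sf|sourceforge)\.net/projects/([^/]+)/?$#sf:\3#' \
        -e 's#^https?://([^./]+)\.(sf|sourceforge)\.net/?$#sf:\1#'
}

canonical_homepage 'http://sf.net/projects/foo'   # prints sf:foo
canonical_homepage 'http://foo.sf.net/'           # prints sf:foo
canonical_homepage 'http://www.foo.org/'          # printed unchanged
```

The sketch does not treat the bare site URL (http://www.sf.net/)
specially, which is one more special case to add.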

> For example you can determine automatically that gentoo:dev-scheme/gambit and
> debian:gambc are the same package because although their names differ they have
> the same homepage and share a category.

To detect equal categories you need a map of categories across all
participating distros.  Yes, it's smaller than mapping all packages,
but it still means maintaining a manual map and keeping it in sync.
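
Such a category map could be as simple as a text file with one pair
per line.  A minimal sketch, assuming a hypothetical file format of
"<distro>:<category> <shared-label>" per line; the helper names
shared_label and same_category are made up, and the shared labels
would have to be agreed on by hand.

```shell
# Hypothetical helper: look up the shared label for "<distro>:<category>"
# in a map file with lines of the form "<distro>:<category> <shared-label>".
# Prints nothing if the category is not mapped.
shared_label() {
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Succeed only if both categories are mapped to the same shared label.
same_category() {
    a=$(shared_label "$1" "$2")
    b=$(shared_label "$1" "$3")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

It would be called as "same_category MAPFILE gentoo:dev-scheme
debian:SECTION", with the map file kept up to date manually.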

Another word on homepage collisions:  A few days ago I wrote
a script that builds a map from homepages to package names for the
whole Gentoo tree (code/gentoo/gentoo-world-to-homepage-map.sh).
The table generated by my run was 12330 lines long, one line per
package.

If you run an analysis over that table you see that many
homepages appear far more often than once.
Here's the top ten:

     68 http://www.gnome.org/
     67 http://www.gentoo.org/
     58 http://www.gentoo.org/proj/en/perl/
     42 http://lingucomponent.openoffice.org/
     26 http://www.kde.org/
     25 http://www.gentoo.org
     20 http://sourceforge.net/projects/synce/
     19 http://www.trolltech.com/
     19 http://search.cpan.org/~rjbs/
     18 http://opensuse.foehr-it.de/

The command I used is

  $ sed 's|  *.*$||' homepage-to-package.txt \
    | sort | uniq -c | sort -n -r | head -n 10
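
Note that the list counts http://www.gentoo.org/ and
http://www.gentoo.org as two different homepages.  A variant of the
command that strips one trailing slash before counting might look
like this; it assumes the same homepage-to-package.txt layout
(homepage first, then spaces, then the package name), and the
function name top_homepages is made up.

```shell
# Same pipeline as the command above, but strip one trailing slash from
# each homepage first so that ".../" and "..." are counted together.
top_homepages() {
    sed -e 's|  *.*$||' -e 's|/$||' "$1" \
        | sort | uniq -c | sort -n -r | head -n 10
}
```

It would be run as "top_homepages homepage-to-package.txt".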

I think these three cases alone show that it would
- also be a lot of work
- involve many special cases
- still require manual mappings here and there

Another disadvantage: the current static XML approach of
packagemap is language independent, and we would lose that.
We can easily build tools for packagemap in any language that
has an XML parser.  If the data actually is the code, we suddenly
have to keep code in different languages in precise sync, special
cases included.

I'm not sure if the approach you describe is less work in total.
I guess to find out we'd have to do both in parallel :-)

It would be interesting to see how much the lists of homepages
of, say, Debian packages and Gentoo packages overlap.
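
Given one file of homepages per distro, the overlap could be measured
with comm.  A minimal sketch, assuming two hypothetical input files
with one homepage URL per line; the function name homepage_overlap is
made up, and no URL normalization is done here.

```shell
# Print the homepages that occur in both files (one URL per line each).
# The body runs in a subshell so the temporary names stay local.
homepage_overlap() (
    a=$(mktemp) && b=$(mktemp) || exit 1
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -12 "$a" "$b"    # keep column 3 only: lines common to both files
    rm -f "$a" "$b"
)
```

Piping the result through "wc -l" would give the number of shared
homepages.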

Sebastian