[packagekit] [gentoo-dev] Inviting you to project "PackageMap"

Marijn Schouten (hkBst) hkBst at gentoo.org
Thu Jun 18 02:07:00 PDT 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sebastian Pipping wrote:
> Marijn Schouten (hkBst) wrote:
>> Sebastian Pipping wrote:
>>> I start to understand the real benefits of moving a larger
>>> part of the maintenance down to the distro level as you proposed.
>>> Okay, let's add support for CPEs at distro package level
>>> and sync up and down with the central packagemap database.
>>> Please contact me for collaboration on sync scripts
>>> and "modeling" of details.
>> Do we not already have enough information available to automatically determine
>> derived unique identifiers like CPE?
>>
>> We have the package homepage and the package name (and the package category) and
>> the combination should be enough information to do direct comparisons to data
>> gathered from other repos (assuming they also contain such data).
> 
> You are asking a valid question.  The homepage links can be a great
> helper in mapping and they have been of help already for the mapping
> of the first 1000 Gentoo packages in packagemap.
> 
> However it might not be as easy as you make it sound, as there are
> a few things that complicate matters and produce extra work:
> 
>  - In many cases a project can be reached from several URLs.
>    For a project on SF.net you might have
>    - http://sf.net/projects/${name}
>    - http://${name}.sf.net/
>    - http://www.${name}.org/
>    That case can be handled rather easily but there are many more
>    special cases and a manual map may be required for stuff that's
>    not hosted on a larger hosting site.

But the homepage is just ONE of the things that help you identify a package.
Some packages that are the same will have different homepages, and some packages
that are different will have the same homepage. If you take into account just
the homepage, the package name, and the fact that packages from the same repo
are distinct, you can probably match over 95% of all packages correctly.
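
As a rough illustration of the homepage part, here is a minimal Python sketch
that collapses the SF.net URL variants mentioned above into one key and strips
the scheme, "www." and trailing slashes everywhere else. The function name
normalize_homepage and the "sf:" key prefix are made up for this example, and
the http://www.${name}.org/ variant cannot be derived mechanically, so it is
left alone.

```python
from urllib.parse import urlsplit

# sourceforge.net is treated as an alias of sf.net.
SF_HOSTS = ("sf.net", "sourceforge.net")


def normalize_homepage(url: str) -> str:
    """Reduce a homepage URL to a comparison key (illustrative only)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")

    for sf in SF_HOSTS:
        # http://sf.net/projects/NAME
        if host == sf and path.lower().startswith("/projects/"):
            segments = path.split("/")  # ['', 'projects', NAME, ...]
            if len(segments) >= 3 and segments[2]:
                return "sf:" + segments[2].lower()
        # http://NAME.sf.net/
        if host.endswith("." + sf):
            return "sf:" + host[: -len(sf) - 1]

    # Everything else: host plus path, without scheme or trailing slash.
    return host + path
```

With this, http://www.gentoo.org/ and http://www.gentoo.org produce the same
key, and so do http://sf.net/projects/synce and http://synce.sf.net/.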

>  - Split packages (think Git or Qt) may all have the same homepage.
>    In Debian the source package might help there, in Gentoo you'd
>    have to do common prefix detection or so, that's special
>    cases again, and continuous review that it still does what you need.

Neither of the git packages Gentoo has seems very split, so I'll only address
Qt. Gentoo has qt-core and qt-svg (and many more). I would say that each of them
would have to get a different CPE, and that none of them is equivalent to a
package, in another or the same distro, that has all of Qt combined. Packages
that get manually split are a minority AFAIK, though texlive is another big one
that comes to mind. Debian splits packages into "normal" and "devel" packages.
Has it been decided what to do with those?

Now that you've got me thinking about split packages, I realize that the exact
list of files installed by a package is, all by itself, another way to get over
95% correct matching. For distros (like Gentoo) whose packages have flags that
influence the list of installed files, you must decide whether to add them to
the database last or to try to use an imprecise file list.
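
To make the file-list idea concrete, here is a small Python sketch that scores
two packages by the overlap of their installed files (Jaccard similarity) and
picks the best candidate from another repo. The function names and the idea of
using Jaccard specifically are my own choices for illustration; an imprecise
file list (for example one that depends on flags) simply lowers the score
instead of breaking the comparison.

```python
def jaccard(files_a, files_b):
    """Share of files common to both packages, between 0.0 and 1.0."""
    a, b = set(files_a), set(files_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(files, candidates):
    """Return (name, score) of the candidate whose file list overlaps most.

    `candidates` maps package names from the other repo to their file lists.
    """
    scored = [(jaccard(files, other), name) for name, other in candidates.items()]
    if not scored:
        return None, 0.0
    score, name = max(scored)
    return name, score


# File lists below are invented for the example, not taken from real packages.
gentoo_gambit = {"/usr/bin/gsc", "/usr/bin/gsi", "/usr/include/gambit.h"}
debian = {
    "gambc": {"/usr/bin/gsc", "/usr/bin/gsi", "/usr/share/doc/gambc/README"},
    "guile": {"/usr/bin/guile"},
}
print(best_match(gentoo_gambit, debian))
```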

>> For example you can determine automatically that gentoo:dev-scheme/gambit and
>> debian:gambc are the same package because although their names differ they have
>> the same homepage and share a category.
> 
> To detect equal categories you need a map for categories for all
> participating distros.  Yes, it's smaller than mapping all packages
> but it involves a manual map and keeping it in sync.

No, there need not be a manual mapping. There is no reason to do true/false
comparisons. All we need is a distance function, for example Levenshtein
distance (http://en.wikipedia.org/wiki/Levenshtein_distance). Actually, on second
thought, Levenshtein distance is probably not what we want, since we would be
more interested in how much strings have in common than in how much they differ.
I think the idea is clear though.
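
A minimal Python sketch of what I mean, using difflib.SequenceMatcher from the
standard library, whose ratio() measures how much two strings have in common
(1.0 for identical strings, 0.0 for nothing shared). The package_score helper
and its 0.7 weight are assumptions for illustration only; real weights would
have to be tuned against known matches.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Commonality of two strings, case-insensitive, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def package_score(name_a, cat_a, name_b, cat_b, w_name=0.7):
    """Weighted mix of name and category similarity (hypothetical weights)."""
    return (w_name * similarity(name_a, name_b)
            + (1 - w_name) * similarity(cat_a, cat_b))


# Names that differ can still score well above unrelated ones.
print(similarity("gambit", "gambc"))
print(similarity("gambit", "texlive"))
```

Category strings from each distro would be passed in as they are, so no
hand-maintained category map is needed; dissimilar categories just contribute
less to the score.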

> Another word on homepage collisions:  A few days ago I wrote
> a script that builds a map from homepages to packagenames for the
> whole Gentoo tree (code/gentoo/gentoo-world-to-homepage-map.sh).
> The generated table from my run was 12330 lines long, each line for
> a different package.
> 
> If you run an analysis over that table you see that many
> homepages appear many more times than just once.
> Here's the top ten:
> 
>      68 http://www.gnome.org/
>      67 http://www.gentoo.org/
>      58 http://www.gentoo.org/proj/en/perl/
>      42 http://lingucomponent.openoffice.org/
>      26 http://www.kde.org/
>      25 http://www.gentoo.org
>      20 http://sourceforge.net/projects/synce/
>      19 http://www.trolltech.com/
>      19 http://search.cpan.org/~rjbs/
>      18 http://opensuse.foehr-it.de/

texlive (http://www.tug.org/texlive/) seems to be missing from this list.

$ eix -H http://www.tug.org/texlive/ | tail -n 1
Found 79 matches.

I suspect you used grep (or whatever) to construct your data, instead of using
the package manager or a tool that knows how to extract the data available in
packages (and eclasses).

> The command I used is
> 
>   $ sed 's|  *.*$||' homepage-to-package.txt \
>     | sort | uniq -c | sort -n -r | head -n 10
> 
> I think these three cases alone show that it would be

I'm not sure which 3 cases you mean.

> - also a lot of work
> - be many special cases
> - still require manual mappings here and there
> 
> Another disadvantage is the current static XML approach of
> packagemap is language independent.  We can easily build
> tools for packagemap in any language that has an XML parser.

I agree that XML is a disadvantage, but not that it is language independent. ;P

> If the data actually is the code we suddenly have to keep
> code from different languages in precise special case sync.

I did not argue for a data format nor for a specific language nor coding style
nor anything that seems to match what you are saying here; I only spoke about
how to populate the CPE database.

> I'm not sure if the approach you describe is less work in total.
> I guess to find out we'd have to do both in parallel :-)
> 
> It could be interesting how much the list of homepages
> in say Debian packages and Gentoo packages overlap.

It would certainly be interesting.

Marijn

- --
If you cannot read my mind, then listen to what I say.

Marijn Schouten (hkBst), Gentoo Lisp project, Gentoo ML
<http://www.gentoo.org/proj/en/lisp/>, #gentoo-{lisp,ml} on FreeNode
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAko6A7QACgkQp/VmCx0OL2wl/wCgpSNzob7skilge+56ynbmawHY
/1EAoJnOOG2Bix0IpWqySP063AJIWDta
=L9t+
-----END PGP SIGNATURE-----