[Freedesktop-sdk] License blacklisting [Was: license-checking script for BuildStream projects]

Tristan Van Berkom tristan.vanberkom at codethink.co.uk
Thu Aug 27 11:18:28 UTC 2020


Hi,

Forking this thread because I think this needs a wider discussion
outside of the scope of this license checker tool.

Also: Cross posting this to the BuildStream dev list as I think this is
quite relevant there. Here is a link to the freedesktop-sdk thread for
reference:

    https://lists.freedesktop.org/archives/freedesktop-sdk/2020-August/000054.html

On Tue, 2020-08-25 at 20:22 +0100, Douglas Winship wrote:
> Following on from the previous email, I've put together a basic
> license-checker in python and tested it in a CI Pipeline. I'd be very
> interested to get feedback on the html and json output.
> 
> In particular I'd be interested to get opinions about how to
> implement the blacklist: we're planning to design the license checker
> with a blacklist option, where users can supply a list of blacklisted
> licenses (possibly as regular expressions). If any blacklisted
> licenses are detected, these would be reported in the html and json
> outputs, but I'm not sure what form that ought to take.

First, I think blacklisting of the licenses should be out of scope for
this script, which essentially will scan source code and give us
summary feedback of detected licenses (and as such, provides valuable
input for project maintainers in other stages).


Here is how I would envision a workflow which involves reliable checks
and blacklisting. I will describe this in two sections, since I only
recently became aware of the benefits we can gain with SPDX[0].


Traditional approach
~~~~~~~~~~~~~~~~~~~~
Traditionally, Linux distributions need to audit and consciously
understand what rights they have for every given module they distribute
in binary form, and then make a conscious decision under which license
they distribute those binaries (in the cases where the upstream module
is dual licensed and provides some choice to the distribution).

Binary package based distributions, such as those using rpm or deb
packages, often encode this decision into the package metadata. Custom
Linux integration tools like buildroot and yocto do the same. For
example, yocto has the LICENSE[1] variable, which is manually encoded
into all of the recipes in the poky distribution. Users of the poky
distribution (who typically /derive/ poky to create something custom)
can then set the INCOMPATIBLE_LICENSE[2] variable for their
distribution, which will cause build errors if their distribution ever
inadvertently tries to include a module with a license on their
decided blacklist.

For a vast portion of open source / free software available in the
wild, this conscious interpretation and decision needs to be made by a
human being.

I would see this implemented in BuildStream in the following way:

  * Declare a new "licenses" public data format in the bst public data
    domain[3]

    This is a place where BuildStream project maintainers can record
    the decided license for the module being built, similar to yocto's
    LICENSE variable[1].

    For compatibility across tooling, and consideration of possible
    further automation (see further below), we should probably assert
    that these license annotations be valid SPDX license
    identifiers[4].

  * We would add a new Element plugin in BuildStream, and call it
    something like `assertlicense`

    In this element's `config`, it would allow the user to declare
    a blacklist.

    This element could output a manifest of licenses in the artifact,
    or produce no output at all. The important part is that this
    element can be added to the pipeline, depend on some elements,
    and halt the build with an error in the case that blacklisted
    licenses are detected.
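
As a rough illustration of the check such an element would perform,
here is a minimal Python sketch. It is not a BuildStream plugin and
uses no BuildStream API: the function names, the shape of the
`declared` mapping, and the element names are all hypothetical. SPDX
license expressions are handled only naively, by splitting on
whitespace and parentheses and dropping the AND/OR/WITH operators (so
an exception name following WITH is treated like any other identifier).

```python
import re

# Hypothetical input: element name -> SPDX license expression, as a
# maintainer might record it in a "licenses" public data domain.
declared = {
    "components/zlib.bst": "Zlib",
    "components/readline.bst": "GPL-3.0-or-later",
    "components/dual.bst": "MIT OR GPL-2.0-only",
}

# Hypothetical blacklist, as it might appear in the element's `config`.
blacklist = ["GPL-3.0-only", "GPL-3.0-or-later"]


def license_ids(expression):
    """Naively split an SPDX expression into its identifiers."""
    tokens = re.split(r"[\s()]+", expression)
    return {t for t in tokens if t and t not in ("AND", "OR", "WITH")}


def find_blacklisted(declared, blacklist):
    """Return {element: sorted offending identifiers} for violations."""
    banned = set(blacklist)
    violations = {}
    for element, expression in declared.items():
        hits = license_ids(expression) & banned
        if hits:
            violations[element] = sorted(hits)
    return violations


violations = find_blacklisted(declared, blacklist)
for element, hits in sorted(violations.items()):
    print(f"{element}: blacklisted license(s): {', '.join(hits)}")
```

Note that this sketch flags any mention of a blacklisted identifier,
even inside an "OR" expression where the distributor could choose the
other license. Whether that strictness is desirable is a policy
decision the real element would need to expose.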


Enhanced approach
~~~~~~~~~~~~~~~~~
From my limited understanding, SPDX now provides a format for upstream
project maintainers to encode machine readable information, including
"license expressions" in an "spdx" file in their module.

This would allow for a (possibly weaker, possibly stronger) trust chain
where the distributor places trust in the upstream module maintainer to
have the spdx file up to date, if that upstream does maintain one (I
suspect that depending on the use cases, a full license audit will
still be preferred).

This allows us some room to maneuver, and provide automation in the
cases where an upstream provides an spdx file. One downside I can see
from a quick blog read[5]:

    "The SPDX specification doesn't specify a file extension or file
     naming convention."

If this is true, then we would *still* need project maintainers to at
least annotate their element declarations with a bit of public data
which tells us which file is the SPDX file.

An implementation which seems suitable to me for this, building on top
of the previous "Traditional approach" would look like this:

  * Block on the ability to have elements depend on the sources of
    their dependencies in BuildStream, or another solution to the
    same problem.

    As discussed in a recent thread[6], there are already a few
    use cases needing similar capability, including the Bazel
    build plugin which wants to stage many dependency sources
    in one sandbox.

  * With the ability to depend on dependency source availability
    at build time, the new `assertlicense` Element plugin could
    have the ability to:

    * Depend on some SPDX parsing tooling, which it could stage
      in the `/` of the sandbox.

    * Stage sources for any of the dependency elements which do
      not already list manually specified licenses in their
      public data.

    * Attempt to scan the code for an spdx file.

    In this way the license assertion could be made based both
    on manually specified licenses (for any modules which do not
    export any SPDX file), and can be automated for modules which
    provide the SPDX file.
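
The fallback order described above can be sketched in Python as
follows. This is only an illustration under several assumptions: the
helper names are hypothetical, the path to the SPDX file is assumed to
be supplied by the element's public data, and the file is assumed to
use the SPDX tag-value format, from which only the
`PackageLicenseDeclared` field is read. A real implementation would
more likely rely on dedicated SPDX parsing tooling staged in the
sandbox.

```python
import os
import tempfile


def read_spdx_declared(path):
    """Collect PackageLicenseDeclared values from a tag-value SPDX file."""
    found = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tag, sep, value = line.partition(":")
            if sep and tag.strip() == "PackageLicenseDeclared":
                found.append(value.strip())
    return found


def resolve_license(manual, spdx_path):
    """Prefer the manually recorded license, else fall back to SPDX."""
    if manual:
        return [manual]
    if spdx_path and os.path.exists(spdx_path):
        return read_spdx_declared(spdx_path)
    return []


# Demonstration with a throwaway SPDX document.
with tempfile.TemporaryDirectory() as tmp:
    spdx = os.path.join(tmp, "module.spdx")
    with open(spdx, "w", encoding="utf-8") as f:
        f.write("SPDXVersion: SPDX-2.2\n")
        f.write("PackageName: example\n")
        f.write("PackageLicenseDeclared: MIT OR Apache-2.0\n")

    from_spdx = resolve_license(None, spdx)
    from_manual = resolve_license("Zlib", spdx)
    unknown = resolve_license(None, os.path.join(tmp, "missing.spdx"))

print(from_spdx, from_manual, unknown)
```

An empty result means neither source yielded a license; the element
would presumably treat that as an error rather than as a pass. Values
such as NOASSERTION in the SPDX file would also need explicit handling.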


Summary
~~~~~~~
I think that the license checker script has value on its own, as it
provides some automated feedback for those actors who need to audit the
distribution and understand what it is they are distributing, but by
itself it is ultimately not a suitable place to add blacklist
assertions.

Any thoughts on the above approaches for general license metadata
checking?


Cheers,
    -Tristan


PS: Please note that there is *another* problem related to licenses,
and that is the actual *distribution* of license files themselves,
e.g. it can be desirable to publish the COPYING/LICENSE files found in
upstream modules in the artifact payloads somewhere so that they can be
handed over at the distribution phase - the entire text above does not
address this bit, and I think it is yet another separate problem.


[0]: https://spdx.dev/
[1]: https://www.yoctoproject.org/docs/latest/ref-manual/ref-manual.html#var-LICENSE
[2]: https://www.yoctoproject.org/docs/latest/ref-manual/ref-manual.html#var-INCOMPATIBLE_LICENSE
[3]: https://docs.buildstream.build/master/format_public.html#builtin-public-data
[4]: https://spdx.org/licenses/
[5]: https://github.com/david-a-wheeler/spdx-tutorial
[6]: https://lists.apache.org/thread.html/r3ff35d36e085d1ca51f753707b24ac5e3111b5b53d74807085076033%40%3Cdev.buildstream.apache.org%3E



