[alpine-devel] RFC: Fixing license field in APKBUILDs (or a bit more)

From: Przemysław Pawełczyk <przemoc_at_zoho.com>
Date: Mon, 29 Jan 2018 23:23:32 +0100

Preface
-------

This is a kind of follow-up to the previous thread, started a month ago:

    License naming in APKBUILD - SPDX License List

Please check it if you haven't already.


Intro
-----

Converting from the simplistic and imprecise license naming that was used
before in Alpine Linux (e.g. GPL, GPL2, BSD, etc.) to slightly more
verbose but also more precise and standardized license naming will
undoubtedly raise the quality of Alpine Linux packages.

SPDX license identifiers are already being adopted in many
open-source circles. I believe that Alpine Linux did a good thing by
deciding to use SPDX over half a year ago. Unluckily, or maybe luckily,
the conversion didn't truly follow back then. There were some changes
here and there, but nothing on a greater scale to really nail all
existing packages. I wrote "luckily", because at the end of 2017 the SPDX
License List got a new version 3, which has some changes compared to
version 2.x.

I believe, as I already wrote in the previous thread, that we should stick
to this new version, and most likely to its updates too, when they are
ready, as I doubt they will be disruptive.

One unfortunate thing about sticking with version 3 of the list is that
one of the distros using Alpine Linux as its base, Adelie Linux, is
apparently fixed on an older version of the SPDX License List, so the
changes already done and those upcoming may not be entirely welcomed by
them, but I hope we'll be able to resolve all problems eventually and
that the relationship between Alpine Linux and Adelie Linux will remain
good and healthy.

It will be a great achievement if we manage to correctly define the
licenses of all available packages before releasing Alpine Linux 3.8.
There are roughly 3 months for that. It's not much for 4000+ packages!
It's most likely not even enough, but we won't know without trying!


Present
-------

Some changes in license fields are already happening, but we need to
pause for a moment and look at how they're done right now, or at least
how they were done so far.

Roughly 2 kinds of activity have happened in the aports repository since
2017-12-30 regarding the license field in various APKBUILDs:
- individual changes,
- massive changes.

There was only one massive change, already mentioned in the previous
thread, and as Jakub himself stated in his commit 63f5e7d29565, "no
verification has been done if the specified license information is
correct!" Therefore all packages that were part of this massive change
will need to be investigated anyway.

There have been dozens of individual changes to date. They're hopefully
correct. They seem like casual changes of the kind "I read that mail, so
I'll fix this APKBUILD", and they're appreciated, but they're not good
enough in the big picture. I'll explain why soon.


Problems
--------

How could these efforts be improved, and what needs to be changed to be
able to do it properly, i.e. actually fix license fields and not only
replace one group of letters with another group of letters and
pretend we're done?

Let's mention the problems we're facing now.

0. Lack of organized work.

1. Lack of trackability.

   The sheer number of packages in Alpine Linux makes the casual-change
   approach impractical. A corrected license field in one APKBUILD is
   indistinguishable from another one that hasn't been scrutinized yet,
   which is unacceptable.

2. Lack of verifiability.

   That may sound harsh, but I think that one pair of eyes per package is
   not enough. Why? Because providing wrong license information is
   worse than not providing it at all, therefore such information must
   be verified by others.

3. Lack of subpackage licenses.

   Well, they're theoretically possible already in APKBUILDs. You have
   to redefine the license variable in the subpackage function. It is
   very rarely done, though, and it's kind of understandable why,
   considering the inconvenience of redefining variables.

   Let me give you an example. Let's look at the LZ4 library.
   Its README.md file states "LZ4 library is provided as open-source
   software using BSD 2-Clause license." So BSD-2-Clause, easy, right?
   Checking the README file is not enough. The LICENSE file gives a better
   picture, because you can read there "all other files [not in the lib
   dir] use a GPLv2 license, unless explicitly stated otherwise". But if
   you look into the source code of the test and CLI tools, you'll find
   that it's not GPL-2.0-only, as one could presume, but actually
   GPL-2.0-or-later (and I think this is the reason why SPDX decided to
   abandon the GPL-2.0 and GPL-2.0+ naming style, as the first one is too
   similar to the casual GPLv2, which can mean both in practice).
   Test tools usually aren't shipped in packages, so that wouldn't be a
   problem, but CLI tools are shipped. So the lz4 package should have the
   GPL-2.0-or-later license only, while lz4-libs should have the
   BSD-2-Clause license only.

4. Lack of non-space license separator.

   A space is not good enough, because complex license expressions can
   contain spaces.
   Example: LGPL-2.1-only WITH Nokia-Qt-exception-1.1

   The power of SPDX doesn't come only from its wide license list, but
   from the fact that the people behind it actually thought it through
   and came up with license expressions, so not only exceptions can be
   expressed, like in the given example, but also dual-licensing, etc.

   So you may ask, why is there a need for some separator if there are
   these expressions? I'm not an expert in this field, but I believe
   there is a difference between a multiple-licensed source file
   (depending on the conjunctive or disjunctive character of the
   licensing, you'll use the AND or OR operators, e.g. Apache-2.0 AND MIT,
   GPL-2.0-only OR MIT) and having different licenses for different
   source files that are all part of one final product. If half of a
   program's source code is licensed under MIT, and the other half is
   licensed under Apache-2.0, in my opinion you shouldn't describe it as
   MIT AND Apache-2.0 or MIT OR Apache-2.0, as both descriptions are
   misleading. The only way I see to describe it would be:
   MIT<separator>Apache-2.0. The separator definitely feels like "and",
   but it's different from AND and I think it's better to preserve such
   a distinction.

5. Support for non-SPDX licenses.

   The SPDX License List, including license exceptions, is quite broad,
   but there may still be some custom licenses that aren't widely used
   and therefore haven't been recognized by SPDX so far, but are used in
   some of the packages available in Alpine Linux. Putting
   license="custom" is not a solution. Leaving the license field empty
   and introducing a !spdx option (*) is also bad, because a project may
   use a mix of SPDX and non-SPDX licenses.

   (*) I'm assuming that in the future there will be support in abuild
       for checking whether the licenses mentioned in the license field
       conform to SPDX names; Carlo together with Natanael already did
       some work toward that, which is appreciated, but with this message
       I hope it becomes clear that the PoC presented so far is not good
       enough and ultimately some dedicated library/tool may be needed to
       properly deal with that, because parsing in a shell script may not
       necessarily be an easy and sane way.

6. Lack of reusability.

   This part may interest the Alpine Linux community the least, but if
   there are efforts related to documenting the open-source world, it's
   better if they're done in a manner that makes them easy for others to
   reuse. The APKBUILD format may look nifty, being in fact a busybox ash
   script, but it brings not only nice possibilities (that can be
   abused), but also many limitations, like poor data types, lack of
   nested structs, etc.


Solutions?
----------

I was thinking about it for a considerable time, and my ideas actually
changed through this process. I would like to share them with you and
hear your feedback. First I'll address the problems mentioned above.

1. APKBUILDs with fixed licenses need some kind of marking.
   In my last mail I suggested adding a !license option to practically
   all APKBUILDs, so that after fixing the license the option would be
   removed, and that's how we could differentiate APKBUILDs that already
   passed license inspection. But I'm not fond of this idea anymore, as
   I'm no longer sure that the options field is the right place for such
   stuff.
   (Also, license inspection should not overlook new packages that were
   added this year, supposedly already with good license info,
   because license inspection should happen independently of the standard
   reviews happening for new aports that land in testing. My point is
   to always try to have the correct license for new packages, but not to
   stress it too much before the release of Alpine Linux 3.8, because it
   will be kind of a transitory period and we can become much more strict
   later, and promotion from testing to community or main should always
   be preceded by a thorough license inspection anyway.)

2. License verification needs to be recorded, so people won't be
   rechecking stuff that has already reached some threshold (I think
   that 3 people sounds good for starters), and whenever a mistake is
   found, previous reviews must be invalidated.
   Git commit messages alone aren't good enough for that, because you
   won't be able to invalidate them.

3. APKBUILD format needs to be somehow changed, extended or replaced.
   I believe it's a topic worth discussing, but possibly in some
   separate RFC thread.

   I don't want to dwell on it too much here now, but I think that
   introducing another file, e.g. APKBUILD.meta, for structured data in
   a human-readable format (like JSON, YAML, etc.) that would take all
   variables from APKBUILD and be able to put them in some hierarchy,
   would make package info more manageable and more maintainable.
   Shell scripts are quite unfortunate to work with as data storage
   containers. So after such an extraction APKBUILD wouldn't have any
   variables, or at least no package-related variables, and would
   contain only the functions necessary to describe building and
   packaging. There may be a need for some kind of mechanism exposing the
   information stored in APKBUILD.meta to APKBUILD, but in most cases it
   shouldn't really be needed and abuild would simply need to learn to
   read such an additional file.

   Instead of creating a separate file, the data could be embedded into
   one big variable, but that could be more error-prone because of the
   lack of a proper syntax check, etc.

   Anyway, any smaller or bigger revolutions regarding APKBUILD (& co)
   won't happen soon (or sadly, may not happen at all, because I can
   foresee great opposition to such changes), but the bigger and more
   widely used Alpine Linux becomes, the harder it is to revisit some
   older decisions, so it's better to approach it sooner rather than
   later.

4. License expressions can be separated with a comma, for instance.
   It seems like a natural choice, and for better appearance such commas
   could be followed by a space.

5. Non-SPDX licenses need some kind of unique naming.
   That will allow us to spot whether such a license is used more than
   once. Then we can try to request that the license be added to the SPDX
   License List. Anyway, we need to track all non-SPDX licenses seen in
   packages and introduce some temporary identifiers for them that must
   be clearly discernible from SPDX identifiers. I think that putting
   non-SPDX identifiers in angle brackets, e.g. <Alpine-1.0>, which are
   commonly used for placeholders, should do the job, yet still make it
   possible to easily parse them and discern them even if they are part
   of a multiple-license expression.

6. As I wrote earlier, shell scripts are a poor solution for data
   storage, therefore I think canonical information regarding licenses
   shouldn't be put in aports, but in a completely new repository with
   a flat hierarchy of software projects. No, I'm not proposing removing
   the license field from APKBUILDs, but having these fields populated or
   fixed in aports with the help of some scripts (that aren't written
   yet, but should be easy to do for 99% of cases) using data from this
   new repository, on a regular basis; weekly or every two weeks sounds
   reasonable.
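
A minimal Python sketch of points 4 and 5, assuming a comma as the
separator and angle brackets around non-SPDX identifiers as suggested
above. The function names and the sample field value are hypothetical,
and no validation against the actual SPDX License List is attempted.

```python
import re

def split_license_field(field):
    """Split a license field into license expressions on commas."""
    return [expr.strip() for expr in field.split(",") if expr.strip()]

def non_spdx_identifiers(expression):
    """Return the <bracketed> placeholder identifiers in one expression."""
    return re.findall(r"<([^<>\s]+)>", expression)

# Hypothetical field value mixing an exception, dual licensing
# and a temporary non-SPDX identifier.
field = ("LGPL-2.1-only WITH Nokia-Qt-exception-1.1, "
         "GPL-2.0-or-later OR <Alpine-1.0>")

expressions = split_license_field(field)
for expr in expressions:
    print(expr, "->", non_spdx_identifiers(expr) or "SPDX identifiers only")
```

Splitting the same value on whitespace would tear "WITH" and "OR" away
from their operands, which is the reason a non-space separator is needed.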

Having a dedicated repository (I'll call it spdxify for now) for
gathering data about licenses used by various software projects seems
like the best way to move forward.

It will reduce noise in aports by allowing fixed licenses to be imported
in batches, and will avoid adding additional stuff to APKBUILDs just to
track the progress. aports is also a moving target, so working outside
of it will get rid of many collisions that would be inevitable otherwise.

I think that spdxify repository layout could look like:

    +- lz4
    |  +- 0NAME         -- official name of the project
    |  +- 0REPO         -- official repository       \ at least one of these
    |  +- 0SRC          -- official tarball location / should be present
    |  +- licenses      -- license expressions covering main
    |  |                   software product (library in this case);
    |  |                   one license expression per line
    |  +- licenses-cli  -- license expressions covering supplementary
    |  |                   software products (CLI tools in this case)
    |  |                   if they differ from main ones
    |  |                   one license expression per line
    |  +- licenses-doc  -- license expressions covering documentation
    |  |                   if they differ from main ones
    |  .
    |  . (perhaps more licenses* files)
    |  .
    |  |
    |  +- reviewers     -- ISO 8601 date and reviewer's full name
    |                      per line
    .

The hierarchy should be flat, because there is no need for favoritism:
what is in testing today in Alpine Linux can be in community a few
weeks later, and I think that reflecting the Alpine Linux hierarchy
wouldn't be beneficial here, leading only to noise such as the mentioned
moves.

0NAME, 0REPO, 0SRC are files that will make the information contained in
the repository useful in a standalone manner, i.e. without access to
aports. There can be identically named projects that will need
different directory names (obviously), so it's important to be able
to tell which actual project a given directory refers to.
First come, first served should work fine, and new colliding project
names would get a suffix _N, where N denotes the N-th collision.
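
The first come, first served rule could be sketched as follows. The
function name is hypothetical, and numbering collisions from 1 is an
assumption, since the text only says that N denotes the N-th collision.

```python
def directory_name(project, taken):
    """Return the plain name if it is free, else name_N for the N-th collision."""
    if project not in taken:
        return project
    n = 1
    # Skip suffixes already handed out to earlier collisions.
    while f"{project}_{n}" in taken:
        n += 1
    return f"{project}_{n}"

print(directory_name("lz4", {"lz4"}))
```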

There will be at least one licenses file for each project, and more
if building/installing it yields several products that may not
necessarily be bundled together. Each licenses file should have one
SPDX license expression per line, and the first line should contain the
most prominent license expression if there are many in the project.
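
Reading such files could look roughly like the sketch below. It builds a
throwaway lz4 directory in a temporary location purely for demonstration,
using the license split from the LZ4 example earlier in this mail; the
function name is hypothetical.

```python
from pathlib import Path
import tempfile

def read_licenses(project_dir):
    """Map each licenses* file name to its list of license expressions."""
    result = {}
    for path in sorted(Path(project_dir).glob("licenses*")):
        lines = path.read_text(encoding="utf-8").splitlines()
        # One expression per line; blank lines are ignored.
        result[path.name] = [line.strip() for line in lines if line.strip()]
    return result

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "lz4"
    project.mkdir()
    (project / "licenses").write_text("BSD-2-Clause\n", encoding="utf-8")
    (project / "licenses-cli").write_text("GPL-2.0-or-later\n", encoding="utf-8")
    lz4 = read_licenses(project)

# The first line of a licenses file is the most prominent expression.
primary = lz4["licenses"][0]
print(lz4, primary)
```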

An integral part of the whole idea is the concept of reviewers.
A reviewer is a person who clones the repository or downloads the most
recent tarball of a software project, inspects whether the licenses found
there match what the licenses* files state, and fixes any mistakes.
If there are mistakes, then old entries in the reviewers file are
removed before adding the new one, but if there are no mistakes, then
the new reviewer is simply appended. Each reviewer's name should be
preceded by the date (in ISO 8601) when the review was finished.
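
The append-or-invalidate rule can be sketched like this. The function
name, the reviewer names and the single space between date and name are
assumptions made for illustration; the reviewers file is modelled as a
plain list of lines.

```python
from datetime import date

def record_review(reviewers, name, found_mistakes, when=None):
    """Return the new list of reviewers lines after one finished review."""
    when = when or date.today()
    entry = f"{when.isoformat()} {name}"
    if found_mistakes:
        # A fix invalidates all earlier reviews of the old content.
        return [entry]
    return reviewers + [entry]

lines = record_review([], "Alice Example", False, date(2018, 2, 1))
lines = record_review(lines, "Bob Example", False, date(2018, 2, 3))
print(lines)
lines = record_review(lines, "Carol Example", True, date(2018, 2, 5))
print(lines)
```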

Inside such a repository there should also be a .scripts folder with
simple shell scripts to ease some tasks, like adding an entry to
reviewers files (based on user.name from git's config) followed by a
git commit with an automatic message, finding software with a particular
number of reviewers or not yet reviewed by you, etc.
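
The lookup part of such helpers is sketched below in Python rather than
shell, only to show the logic. It assumes reviewers lines of the form
"YYYY-MM-DD Full Name" and the threshold of 3 reviewers suggested
earlier; the function names and sample data are hypothetical.

```python
def reviewer_names(lines):
    """Extract the names from 'YYYY-MM-DD Full Name' lines."""
    return [line.split(" ", 1)[1] for line in lines if " " in line]

def needs_review(projects, me, threshold=3):
    """Projects below the threshold that `me` has not reviewed yet."""
    return sorted(
        name
        for name, lines in projects.items()
        if len(lines) < threshold and me not in reviewer_names(lines)
    )

# Hypothetical state: project name -> contents of its reviewers file.
projects = {
    "lz4": ["2018-02-01 Alice Example", "2018-02-03 Bob Example"],
    "zlib": [],
    "musl": [
        "2018-02-01 Alice Example",
        "2018-02-02 Bob Example",
        "2018-02-04 Carol Example",
    ],
}
print(needs_review(projects, "Bob Example"))
```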

Outside of this repository we will need the Alpine Linux-specific
scripts mentioned earlier, which will aid converting what's in licenses
files into the license field of APKBUILD files, and some mapping file for
non-obvious cases (obvious cases are when a package is named exactly the
same in aports as in spdxify and there is only one licenses file), e.g.:

    lz4 lz4:cli
    lz4:libs lz4

Combining more than one licenses* file into one license field will
also be possible in such a mapping.
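
One possible reading of the mapping is sketched below. It assumes that
the first column names the aports package or subpackage, that each
following column names a spdxify project with an optional ":suffix"
selecting the licenses-suffix file, and that several targets on one line
are combined with ", ". None of this syntax is settled; all function
names are hypothetical.

```python
def parse_mapping(text):
    """Parse lines of 'package target [target ...]' into a dict."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            mapping[fields[0]] = fields[1:]
    return mapping

def licenses_file(target):
    """'lz4:cli' -> ('lz4', 'licenses-cli'); 'lz4' -> ('lz4', 'licenses')."""
    project, _, part = target.partition(":")
    filename = f"licenses-{part}" if part else "licenses"
    return project, filename

def license_field(package, mapping, data):
    """Build the license field value for one package."""
    expressions = []
    # Unmapped packages are the obvious case: same name, plain licenses file.
    for target in mapping.get(package, [package]):
        project, filename = licenses_file(target)
        expressions.extend(data[project][filename])
    return ", ".join(expressions)

# Contents of the spdxify repository, modelled as nested dicts.
data = {"lz4": {"licenses": ["BSD-2-Clause"],
                "licenses-cli": ["GPL-2.0-or-later"]}}
mapping = parse_mapping("lz4 lz4:cli\nlz4:libs lz4\n")
print(license_field("lz4", mapping, data))
print(license_field("lz4:libs", mapping, data))
```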

That's roughly how I see it. I'm sure I didn't cover all the corners,
but you should get some picture after reading this wall of text.

I don't have all these scripts written yet and the spdxify repository has
not been created yet either. I plan to "snapshot" the aports state very
soon (hopefully on 2018-01-30 or 2018-01-31) and use the packages in
main/ as the base to create the first set of software projects that will
need to be inspected. There are over 2000 packages in main, so I plan to
split them into batches of ~500, which means [a-g]* packages from main
will be the first ones. I don't even plan on importing existing license
fields from APKBUILDs, because I think it may be harmful and more
error-prone. It's better to start our license journey from scratch and
not be biased by what was already put in some APKBUILDs (I've seen some
mistakes in the past and I'm afraid there may still be many more of
them).

I haven't started working on all that yet, because I wanted to get some
feedback on whether people see value in such an organized approach toward
fixing license matters in Alpine Linux (that may actually also benefit
other distributions in the future) or not.


Final notes
-----------

It may look like I take licensing very seriously. Some may argue that
maybe even too seriously. A common view is that only functionality is
important and as long as the job can be done it doesn't matter what
license is behind the tool used for it. It may be true for most users,
but others interested in utilizing Alpine Linux for their products,
services and/or solutions may not always have this nice freedom of
choice.

Fixing the current license mess will show that Alpine Linux cares about
quality in yet another department, and I believe it can be beneficial
to its overall image, but also to all users and developers being part
of this great community, by raising awareness that licenses do matter.


Regards,
Przemek



Received on Mon Jan 29 2018 - 23:23:32 GMT