Tooling for analysing dependency licenses

If it wasn’t already apparent from my recent ruminations, I’ve been thinking about licenses a fair bit lately. Part of this involved a long and meandering conversation on Slack about the impact of license changes, and of packages gaining deps with different licenses.

As long as Julia projects work by downloading source files, license files included, there’s not much to worry about here. However, with the advent of static compilation and the increased attention to SBOMs that we’re seeing with new developments like the EU’s Cyber Resilience Act, it seems there’s a growing need for users and developers to be able to audit the licenses in a project/application.

Currently there are a few packages floating around that do part of the work required (LicenseCheck.jl, LicenseGrabber.jl), but nothing as easy as pkg> license.

Looking at Rust, they have developed a few options like this, the most prominent of which would seem to be cargo-license (onur/cargo-license on GitHub), a Cargo subcommand to see the licenses of dependencies.

I’m thinking that it would be valuable to have a tool that will scan all of your dependencies, detect situations like dual licensing where possible, flag files not covered by a license and major license obligations — all with an easy to use entrypoint.

Thoughts?


I mentioned PackageAnalyzer on Slack, but this might be a good place to demo what it can do. I don’t think PackageAnalyzer is a complete solution here, but it does have some pieces that I hope can be a useful part of broader solutions.

Here let us analyze a manifest with DSP.jl and Arrow.jl.

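using PackageAnalyzer, DataFrames
using Pkg

# Work in a throwaway environment containing only DSP, Arrow, and their dependencies
Pkg.activate(; temp=true)
Pkg.add(["DSP", "Arrow"])

pkgs = DataFrame(analyze_manifest())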

gives

julia> pkgs = DataFrame(analyze_manifest())
50×22 DataFrame
 Row │ name                         uuid                                  repo                               subdir       reachable  docs    ⋯
     │ String                       Base.UUID                             String                             String       Bool       Bool    ⋯
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ TableTraits                  3783bdb8-4a98-5b6b-af9a-565f29a5fe9c  https://github.com/queryverse/Ta…                    true   true   ⋯
   2 │ Arrow                        69666777-d1a9-59fb-9406-91d4454c9d45  https://github.com/apache/arrow-…                    true   true
   3 │ ConstructionBase             187b0558-2788-49d3-abe0-74a17ed4e7c9  https://github.com/JuliaObjects/…                    true   true
   4 │ DSP                          717857b8-e6f2-59f4-9121-6e50c889abd2  https://github.com/JuliaDSP/DSP.…                    true   true
   5 │ EnumX                        4e289a0a-7415-4d19-859d-a7e5c4648b56  https://github.com/fredrikekre/E…                    true  false   ⋯
   6 │ IrrationalConstants          92d709cd-6900-40b7-9082-c6be49f344b6  https://github.com/JuliaMath/Irr…                    true  false
   7 │ OrderedCollections           bac558e1-5e72-5ebc-8fee-abe8a469f55d  https://github.com/JuliaCollecti…                    true   true
   8 │ IteratorInterfaceExtensions  82899510-4779-5014-852e-03e436cf321d  https://github.com/queryverse/It…                    true   true
  ⋮  │              ⋮                                ⋮                                    ⋮                       ⋮           ⋮        ⋮     ⋱
  43 │ DataValueInterfaces          e2d170a0-9d28-54be-80f0-106bbe20a464  https://github.com/queryverse/Da…                    true  false   ⋯
  44 │ oneTBB_jll                   1317d2d5-d96f-522e-a858-c73665f53c3e  https://github.com/JuliaBinaryWr…                    true  false
  45 │ CodecZstd                    6b39b394-51ab-5f42-8807-6242bab2b4c2  https://github.com/JuliaIO/Codec…                    true  false
  46 │ Tables                       bd369af6-aec1-5ad0-b16a-f7cc5008161c  https://github.com/JuliaData/Tab…                    true   true
  47 │ Scratch                      6c6a2e73-6563-6170-7368-637461726353  https://github.com/JuliaPackagin…                    true   true   ⋯
  48 │ MacroTools                   1914dd2f-81c6-5fcd-8719-6d5c9610ff09  https://github.com/FluxML/MacroT…                    true   true
  49 │ Setfield                     efcf1570-3423-57d1-acb7-fd33fddbac46  https://github.com/jw3126/Setfie…                    true   true
  50 │ FFTW                         7a1cc6ca-52ef-59f5-83cd-3a7055c09341  https://github.com/JuliaMath/FFT…                    true   true
                                                                                                                16 columns and 34 rows omitted

Note there is a lot more here than licenses (including a custom line-of-code counting implementation that correctly handles Julia docstrings), but let us focus on those.

flat = flatten(pkgs, :license_files)
select(flat, :name, :license_files => AsTable)

gives

julia> select(flat, :name, :license_files => AsTable)
57×4 DataFrame
 Row │ name                 license_filename      licenses_found  license_file_percent_covered
     │ String               String                Vector{String}  Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────
   1 │ TableTraits          LICENSE.md            ["MIT"]                              93.8547
   2 │ Arrow                LICENSE               ["Apache-2.0"]                       78.3333
   3 │ Arrow                codecov.yaml          ["Apache-2.0"]                       71.2121
   4 │ Arrow                .gitignore            ["Apache-2.0"]                       69.6296
   5 │ Arrow                .JuliaFormatter.toml  ["Apache-2.0"]                       65.7343
   6 │ Arrow                .asf.yaml             ["Apache-2.0"]                       44.5498
   7 │ Arrow                Project.toml          ["Apache-2.0"]                       33.0986
   8 │ Arrow                README.md             ["Apache-2.0"]                       25.6831
  ⋮  │          ⋮                    ⋮                  ⋮                      ⋮
  50 │ DataValueInterfaces  LICENSE.md            ["MIT"]                              93.9227
  51 │ oneTBB_jll           LICENSE               ["MIT"]                              67.2065
  52 │ CodecZstd            LICENSE.md            ["MIT"]                              93.8202
  53 │ Tables               LICENSE               ["MIT"]                              98.8166
  54 │ Scratch              LICENSE               ["MIT"]                              98.8439
  55 │ MacroTools           LICENSE.md            ["MIT"]                              93.9227
  56 │ Setfield             LICENSE.md            ["MIT"]                              93.8202
  57 │ FFTW                 LICENSE               ["MIT"]                              97.7401
                                                                                41 rows omitted

This uses the output of licensecheck. The way to interpret this is something like:

TableTraits has a file LICENSE.md which was detected as containing one or more licenses, in this case, just the MIT license. In that file, 93.8% of the lines could be identified as belonging to an SPDX license. If we look at the actual file, the remaining 6% or so is likely just the first line “The TableTraits.jl package is licensed under the MIT “Expat” License”.

Note that we get a bunch of rows for Arrow, since it contains the Apache-2.0 URL in the header of every file, and that header is enough for licensecheck to count it as the Apache license (see also google/licensecheck issue #40, “Apache-2.0 URL alone is enough to match the license; is this intentional?”).
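One rough way to separate dedicated license files from such header matches is to filter on the filename. This is only a sketch: the rows are copied from the output above, and both the `looks_like_license_file` helper and its filename pattern are my own assumptions, not something PackageAnalyzer or licensecheck provides.

```julia
# A few of the Arrow rows from the table above, as plain named tuples.
rows = [
    (name = "Arrow", license_filename = "LICENSE",      percent = 78.3333),
    (name = "Arrow", license_filename = "codecov.yaml", percent = 71.2121),
    (name = "Arrow", license_filename = ".gitignore",   percent = 69.6296),
    (name = "Arrow", license_filename = "README.md",    percent = 25.6831),
]

# Hypothetical heuristic: treat files whose name starts with LICENSE,
# LICENCE, COPYING or NOTICE (case-insensitive) as dedicated license files.
looks_like_license_file(filename) =
    occursin(r"^(licen[sc]e|copying|notice)"i, filename)

dedicated = filter(r -> looks_like_license_file(r.license_filename), rows)
header_matches = filter(r -> !looks_like_license_file(r.license_filename), rows)

println("dedicated license files: ", [r.license_filename for r in dedicated])
println("header-only matches:     ", [r.license_filename for r in header_matches])
```

The header matches are still worth keeping around (they do say something about how the files are licensed); this just keeps them from drowning out the dedicated license files.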

If the package had more than one file with a license, it should show up here, and if there was more than 1 license in the file, that should also show up here.

Note one can also identify if only a small proportion of the file is identified as a license. If that is the case, one may need to manually investigate to understand why. For example, one can see above that oneTBB_jll only had 67% of its LICENSE file identified as MIT license, and the rest of the file unidentified. If we go to the file, we see it starts with:

The Julia source code within this repository (all files under src/) are
released under the terms of the MIT “Expat” License, the text of which is
included below. This license does not apply to the binary package wrapped by
this Julia package and automatically downloaded by the Julia package manager
upon installing this wrapper package. The binary package’s license is shipped
alongside the binary itself and can be found within the
share/licenses/oneTBB directory within its prefix.

So here the low % indicates there is something else going on, which could lead us to find this caveat and/or missing licenses.
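That kind of manual triage could be sketched roughly like this. The rows are copied from the output above (in practice they would come from the `select(...)` call), and the 90% cutoff and the `needs_review` helper are arbitrary assumptions on my part, not anything PackageAnalyzer defines.

```julia
# A few rows from the table above, as plain named tuples.
rows = [
    (name = "TableTraits", license_filename = "LICENSE.md",
     licenses_found = ["MIT"], license_file_percent_covered = 93.8547),
    (name = "oneTBB_jll", license_filename = "LICENSE",
     licenses_found = ["MIT"], license_file_percent_covered = 67.2065),
    (name = "Tables", license_filename = "LICENSE",
     licenses_found = ["MIT"], license_file_percent_covered = 98.8166),
]

# Hypothetical cutoff: anything below it gets a manual look.
const COVERAGE_THRESHOLD = 90.0

needs_review(row; threshold = COVERAGE_THRESHOLD) =
    row.license_file_percent_covered < threshold

flagged = filter(needs_review, rows)

for row in flagged
    println(row.name, ": ", row.license_filename, " is only ",
            round(row.license_file_percent_covered; digits = 1),
            "% covered, check it by hand")
end
```

With these rows only oneTBB_jll falls below the cutoff, which matches the case discussed above.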


Oh, I’ve just come across another relevant post/package:


I would love to see something like this supported in Pkg or similar. I think it’s essential for adoption of large open source dependencies in a corporate environment.

My inspiration would be cargo-deny (EmbarkStudios/cargo-deny on GitHub), a Cargo plugin for linting your dependencies, which adds an additional layer of protection when pulling in dependencies and also warns of unmaintained packages or known security issues. That is why I thought tight integration with Pkg might make the most sense. PkgToSoftwareBOM also looks interesting, though; I hadn’t seen it before.

It might be a good starting point to require SPDX identifiers during the registration of packages.


Mmm, Cargo has license / license-file attributes in a package’s Cargo.toml manifest: The Manifest Format - The Cargo Book

Python has PEP 639 to introduce a similar thing to Python packages: PEP 639 – Improving License Clarity with Better Package Metadata | peps.python.org

I think something similar wouldn’t be out of place in Julia’s Project.toml files.
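To make that concrete, here is a minimal sketch of what reading such a field could look like. Note that the `license` key is hypothetical (Project.toml has no such field today), and the package name and UUID are placeholders.

```julia
using TOML

# Hypothetical Project.toml contents: the `license` key is an assumption,
# modelled on Cargo's `license` field; name and uuid are placeholders.
project_toml = """
name = "MyPackage"
uuid = "00000000-0000-0000-0000-000000000000"
version = "0.1.0"
license = "MIT OR Apache-2.0"
"""

project = TOML.parse(project_toml)

# Fall back to `nothing` when the (hypothetical) field is absent.
declared = get(project, "license", nothing)

if declared === nothing
    println("no license declared")
else
    println("declared license: ", declared)
end
```

Since Project.toml is already TOML, tools could pick the field up with the TOML standard library and nothing else.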

There was some discussion of these fields in the PR where I added license checking to RegistryCI and the followup discussion. At the time we did not start using those fields.

I don’t think they are strictly necessary for implementing something like cargo-deny though; LicenseCheck.jl can already find licenses for every package that was registered, or had a version registered, since 2021 when we added the check (otherwise they wouldn’t have gotten merged). Our 2023 PackageAnalyzer JuliaCon talk shows the percentages over time here.

The declarative approach using Project.toml fields is a bit tighter, but requires ecosystem wide opt-in, which probably means it needs to be enforced in AutoMerge. I would imagine that could work by doing some kind of announcement that they will become required later, doing a mass auto-PR to add the fields to packages based on LicenseCheck results, then eventually starting to require it in AutoMerge (maybe first for new packages, then later for new versions).


On one hand I like the idea that we can just determine what the license situation is by analyzing a package, but on the other hand I’m not sure we can do so accurately with dual and modified licenses. For example, would we actually be able to accurately detect SPDX expressions such as “GPL-2.0-or-later WITH Bison-exception-2.2 OR Apache-2.0” or “MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)”?

If so, that’s great! It just looks rather difficult to me.
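For what it’s worth, the declared side is the easy half: pulling the identifiers out of an SPDX expression once someone has written it down is simple. Here is a rough sketch (the `spdx_identifiers` helper is my own, it does not validate identifiers against the SPDX list, and it ignores operator precedence). Detecting the right expression from the files themselves is the hard part.

```julia
const SPDX_OPERATORS = ("AND", "OR", "WITH")

# Collect license and exception identifiers from an SPDX expression.
# Sketch only: no validation against the SPDX license list, and the
# AND/OR structure of the expression is discarded.
function spdx_identifiers(expr::AbstractString)
    # Split on whitespace and parentheses; identifiers contain neither.
    tokens = split(expr, r"[\s()]+"; keepempty = false)
    licenses = String[]
    exceptions = String[]
    after_with = false
    for tok in tokens
        if tok == "WITH"
            after_with = true          # the next identifier is an exception
        elseif tok in SPDX_OPERATORS
            after_with = false
        else
            push!(after_with ? exceptions : licenses, String(tok))
            after_with = false
        end
    end
    return (licenses = licenses, exceptions = exceptions)
end

a = spdx_identifiers("GPL-2.0-or-later WITH Bison-exception-2.2 OR Apache-2.0")
b = spdx_identifiers("MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)")
println(a)
println(b)
```

The lowercase “or” inside GPL-2.0-or-later is not a problem here, because operators are only recognised as whole uppercase tokens.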

Yeah, I don’t think we can accurately analyze that. However, would the Cargo solution work? You could just point license-file at some complicated document; they don’t require the license field, just one or the other.

I think Trivy is a good fit here. I already added Julia support to it for dependency tracking, I can work on license detection as well.


Also I should add, I support revisiting this and getting a license field requirement added to General. I could also work on automating PRs to registered packages to add missing license fields.


Would we want Cargo’s semantics? It seems they try to do license-file OR license to avoid contradictions between the two, where license-file means “nonstandard” and displays in their metadata UI that way (see rust-lang/cargo issue #8537, “`license-file` without `license` seems ill-advised”). I guess the analog to that UI is JuliaHub; I think maybe they use their own license heuristics currently, or they take it from GitHub? Not sure.

I think their semantics are fine and we could adopt them wholesale.
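As a sketch of what “one or the other” could mean in practice, under the reading of the semantics given above (this is not a statement of what Cargo itself enforces, and the `license` / `license_file` keys and the `license_source` helper are hypothetical):

```julia
# Classify a parsed Project.toml (as a Dict) by how it declares its license.
# Assumes hypothetical `license` and `license_file` keys.
function license_source(project::AbstractDict)
    has_expr = haskey(project, "license")
    has_file = haskey(project, "license_file")
    if has_expr && has_file
        # Reject the combination so the two can never contradict each other.
        return (:error, "declare either `license` or `license_file`, not both")
    elseif has_expr
        return (:spdx, project["license"])
    elseif has_file
        return (:nonstandard, project["license_file"])
    else
        return (:missing, "")
    end
end

println(license_source(Dict("license" => "MIT")))
println(license_source(Dict("license_file" => "LICENSE.md")))
println(license_source(Dict("name" => "NoLicenseInfo")))
```

A registry check could then accept `:spdx` and `:nonstandard`, and reject or warn on `:missing` and `:error`.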


Mmm, to me having the license / license_file is a definite improvement over the status quo.

That said, if LicenseCheck.jl is accurate enough, I wonder if we can infer the licensing and only require license / license_file in ambiguous cases? A downside of an explicit field that I’ve seen mentioned in my reading is that if, for instance, a package grabs some code under a compatible but different license, the authors need to remember to update the license / license_file attribute. In this respect automatic detection (when it works) is better.

Bonus: if we have a good quality “examine the licenses in a package” approach, it can be used to detect potential discrepancies where license may not be accurate or needs to be updated.
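A minimal sketch of that discrepancy check, assuming we already have the declared identifiers (from a hypothetical license field) and the detected ones (e.g. the licenses_found column from PackageAnalyzer); the `license_discrepancies` helper and the example inputs are made up for illustration:

```julia
# Compare declared SPDX identifiers against detected ones.
function license_discrepancies(declared::Vector{String}, detected::Vector{String})
    return (
        # Found in the files but not mentioned in the declaration.
        undeclared = sort(setdiff(detected, declared)),
        # Declared but not found by the scanner.
        undetected = sort(setdiff(declared, detected)),
    )
end

# Made-up example: the package declares MIT, but a vendored file is BSD-3-Clause.
result = license_discrepancies(["MIT"], ["MIT", "BSD-3-Clause"])
println(result)
```

A non-empty `undeclared` list would be the signal that the license field may need updating; a non-empty `undetected` list could mean a license file is missing or was not recognised.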