I mentioned PackageAnalyzer on Slack, but this might be a good place to demo what it can do. I don't think PackageAnalyzer is a complete solution here, but it does have some pieces that I hope can be a useful part of broader solutions.
Let's analyze a manifest containing DSP.jl and Arrow.jl.
using PackageAnalyzer, DataFrames
] activate --temp
] add DSP, Arrow
pkgs = DataFrame(analyze_manifest())
gives
julia> pkgs = DataFrame(analyze_manifest())
50×22 DataFrame
Row │ name uuid repo subdir reachable docs ⋯
│ String Base.UUID String String Bool Bool ⋯
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ TableTraits 3783bdb8-4a98-5b6b-af9a-565f29a5fe9c https://github.com/queryverse/Ta… true true ⋯
2 │ Arrow 69666777-d1a9-59fb-9406-91d4454c9d45 https://github.com/apache/arrow-… true true
3 │ ConstructionBase 187b0558-2788-49d3-abe0-74a17ed4e7c9 https://github.com/JuliaObjects/… true true
4 │ DSP 717857b8-e6f2-59f4-9121-6e50c889abd2 https://github.com/JuliaDSP/DSP.… true true
5 │ EnumX 4e289a0a-7415-4d19-859d-a7e5c4648b56 https://github.com/fredrikekre/E… true false ⋯
6 │ IrrationalConstants 92d709cd-6900-40b7-9082-c6be49f344b6 https://github.com/JuliaMath/Irr… true false
7 │ OrderedCollections bac558e1-5e72-5ebc-8fee-abe8a469f55d https://github.com/JuliaCollecti… true true
8 │ IteratorInterfaceExtensions 82899510-4779-5014-852e-03e436cf321d https://github.com/queryverse/It… true true
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
43 │ DataValueInterfaces e2d170a0-9d28-54be-80f0-106bbe20a464 https://github.com/queryverse/Da… true false ⋯
44 │ oneTBB_jll 1317d2d5-d96f-522e-a858-c73665f53c3e https://github.com/JuliaBinaryWr… true false
45 │ CodecZstd 6b39b394-51ab-5f42-8807-6242bab2b4c2 https://github.com/JuliaIO/Codec… true false
46 │ Tables bd369af6-aec1-5ad0-b16a-f7cc5008161c https://github.com/JuliaData/Tab… true true
47 │ Scratch 6c6a2e73-6563-6170-7368-637461726353 https://github.com/JuliaPackagin… true true ⋯
48 │ MacroTools 1914dd2f-81c6-5fcd-8719-6d5c9610ff09 https://github.com/FluxML/MacroT… true true
49 │ Setfield efcf1570-3423-57d1-acb7-fd33fddbac46 https://github.com/jw3126/Setfie… true true
50 │ FFTW 7a1cc6ca-52ef-59f5-83cd-3a7055c09341 https://github.com/JuliaMath/FFT… true true
16 columns and 34 rows omitted
Note there is a lot more here than licenses (including a custom line-of-code counting implementation that correctly handles Julia docstrings), but let us focus on those.
flat = flatten(pkgs, :license_files)
select(flat, :name, :license_files => AsTable)
gives
julia> select(flat, :name, :license_files => AsTable)
57×4 DataFrame
Row │ name license_filename licenses_found license_file_percent_covered
│ String String Vector{String} Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────
1 │ TableTraits LICENSE.md ["MIT"] 93.8547
2 │ Arrow LICENSE ["Apache-2.0"] 78.3333
3 │ Arrow codecov.yaml ["Apache-2.0"] 71.2121
4 │ Arrow .gitignore ["Apache-2.0"] 69.6296
5 │ Arrow .JuliaFormatter.toml ["Apache-2.0"] 65.7343
6 │ Arrow .asf.yaml ["Apache-2.0"] 44.5498
7 │ Arrow Project.toml ["Apache-2.0"] 33.0986
8 │ Arrow README.md ["Apache-2.0"] 25.6831
⋮ │ ⋮ ⋮ ⋮ ⋮
50 │ DataValueInterfaces LICENSE.md ["MIT"] 93.9227
51 │ oneTBB_jll LICENSE ["MIT"] 67.2065
52 │ CodecZstd LICENSE.md ["MIT"] 93.8202
53 │ Tables LICENSE ["MIT"] 98.8166
54 │ Scratch LICENSE ["MIT"] 98.8439
55 │ MacroTools LICENSE.md ["MIT"] 93.9227
56 │ Setfield LICENSE.md ["MIT"] 93.8202
57 │ FFTW LICENSE ["MIT"] 97.7401
41 rows omitted
This uses the output of licensecheck. The way to interpret this is something like: TableTraits has a file LICENSE.md which was detected as containing one or more licenses, in this case just the MIT license. In that file, about 93.9% of the file could be identified as belonging to an SPDX license. If we look at the actual file, the remaining ~6% is likely just the first line “The TableTraits.jl package is licensed under the MIT “Expat” License”.
Note that we get a bunch of rows for Arrow, since it contains the Apache 2.0 URL in the header of every file, and that header is enough for licensecheck to count it as the Apache license (see also google/licensecheck issue #40, “Apache-2.0 URL alone is enough to match the license; is this intentional?”).
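If those header matches are too noisy, one option is to keep only rows whose filename looks like a dedicated license file. This is just a sketch on toy data shaped like the table above. The filename pattern is my own heuristic, not something PackageAnalyzer or licensecheck provides, and it would hide packages that only state their license in a README.

```julia
using DataFrames

# Toy stand-in for the flattened license table (values copied from the output above).
licenses = DataFrame(
    name = ["TableTraits", "Arrow", "Arrow", "Arrow"],
    license_filename = ["LICENSE.md", "LICENSE", "codecov.yaml", ".gitignore"],
    licenses_found = [["MIT"], ["Apache-2.0"], ["Apache-2.0"], ["Apache-2.0"]],
)

# Assumed heuristic: dedicated license files start with LICENSE, LICENCE, or COPYING.
license_like = r"^(licen[cs]e|copying)"i

dedicated = subset(licenses, :license_filename => ByRow(f -> occursin(license_like, f)))
```

On this toy table only the LICENSE.md and LICENSE rows survive, so Arrow goes from three rows to one.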
If a package had more than one file with a license, each file should show up here as its own row, and if there was more than one license in a file, that should also show up here.
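Both cases can be pulled out with ordinary DataFrames operations. Here is a minimal sketch on toy data shaped like the table above. HypotheticalPkg and its dual-license row are made up purely for illustration.

```julia
using DataFrames

# Toy stand-in for the flattened license table; HypotheticalPkg is invented.
licenses = DataFrame(
    name = ["TableTraits", "Arrow", "Arrow", "HypotheticalPkg"],
    license_filename = ["LICENSE.md", "LICENSE", "README.md", "LICENSE"],
    licenses_found = [["MIT"], ["Apache-2.0"], ["Apache-2.0"], ["MIT", "Apache-2.0"]],
)

# Packages with more than one file in which a license was detected.
files_per_pkg = combine(groupby(licenses, :name), nrow => :n_files)
multi_file = subset(files_per_pkg, :n_files => ByRow(>(1)))

# Files in which more than one license was detected.
multi_license = subset(licenses, :licenses_found => ByRow(l -> length(l) > 1))
```

With the real table, the same two queries should work on the result of the select call above, assuming the column names stay as shown.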
Note one can also identify when only a small proportion of a file is identified as a license. If that is the case, one may need to manually investigate to understand why. For example, one can see above that oneTBB_jll only had 67% of its LICENSE file identified as the MIT license, with the rest of the file unidentified. If we go to the file, we see it starts with:
The Julia source code within this repository (all files under src/) are
released under the terms of the MIT “Expat” License, the text of which is
included below. This license does not apply to the binary package wrapped by
this Julia package and automatically downloaded by the Julia package manager
upon installing this wrapper package. The binary package’s license is shipped
alongside the binary itself and can be found within the share/licenses/oneTBB
directory within its prefix.
So here the low percentage indicates there is something else going on, which could lead us to find this caveat and/or missing licenses.
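To surface such cases without eyeballing the whole table, one could filter on the coverage column. A minimal sketch on toy data copied from the output above follows. The 80% cutoff is an arbitrary number I picked for illustration, not a recommended threshold.

```julia
using DataFrames

# Toy stand-in for the flattened license table (values copied from the output above).
licenses = DataFrame(
    name = ["TableTraits", "oneTBB_jll", "Tables"],
    license_filename = ["LICENSE.md", "LICENSE", "LICENSE"],
    licenses_found = [["MIT"], ["MIT"], ["MIT"]],
    license_file_percent_covered = [93.8547, 67.2065, 98.8166],
)

# Arbitrary cutoff: flag files where less than 80% was matched to a license.
threshold = 80.0
low_coverage = subset(licenses, :license_file_percent_covered => ByRow(<(threshold)))

# Lowest coverage first, so the most suspicious files are at the top.
low_coverage = sort(low_coverage, :license_file_percent_covered)
```

On this toy table only the oneTBB_jll row is flagged. On a real manifest, files that match only on a header (like the Arrow rows above) would be flagged as well, so the result is a list of things to look at by hand rather than a list of problems.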