[ANN] PackageAnalyzer v1.0

PackageAnalyzer is a tool @giordano created while writing this awesome blog post. After that, I helped expand some of the functionality, and together we used the tool to survey the General registry, resulting in this blog post and 2021 JuliaCon talk. Recently, we updated the package with a bit more functionality and cut a v1.0 release.

What can PackageAnalyzer do?

PackageAnalyzer downloads the code associated with a package and runs some very basic static analysis: looking for the presence of CI scripts and documentation, counting lines of source code and tests, and checking licenses. It can also optionally gather contributor data from the GitHub API. It is multithreaded, robust and somewhat battle-hardened, as Mosè ran PackageAnalyzer v0.1 daily on the whole General registry for a long time to collect statistics over time. Note that a lot of internals have changed in v1.0, so it is possible it has regressed on its “battle-hardened” status, although we have tried to keep in mind the lessons learned from earlier versions :slight_smile:.

The API is very simple; one calls analyze("DataFrames") for example to analyze the package DataFrames:

julia> analyze("DataFrames")
Package DataFrames:
  * repo: https://github.com/JuliaData/DataFrames.jl.git
  * uuid: a93c6f00-e57d-5684-b7b6-d8193f3e46c0
  * version: 1.4.3
  * is reachable: true
  * tree hash: 0f44494fe4271cc966ac4fea524111bef63ba86c
  * Julia code in `src`: 18778 lines
  * Julia code in `test`: 28766 lines (60.5% of `test` + `src`)
  * documentation in `docs`: 6761 lines (26.5% of `docs` + `src`)
  * documentation in README: 21 lines
  * has license(s) in file: MIT
    * filename: LICENSE.md
    * OSI approved: true
  * has `docs/make.jl`: true
  * has `test/runtests.jl`: true
  * has continuous integration: true
    * GitHub Actions

PackageAnalyzer uses RegistryInstances.jl, which is based on code taken from Pkg.jl, in order to query all installed registries for the package name, and thus supports multiple registries. The input to analyze can also be a local path or a URL.
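As a minimal sketch of the non-registry inputs mentioned above, the snippet below passes a local path and a URL to analyze. The choice of PackageAnalyzer's own source directory as the local path and the variable names local_result and url_result are assumptions made for illustration; the URL is the repo address printed in the DataFrames output above, and that call needs network access.

```julia
using PackageAnalyzer

# A local path: `pkgdir` (from Base) returns the root directory of a loaded
# package, here PackageAnalyzer's own installed source.
local_result = analyze(pkgdir(PackageAnalyzer))

# A URL: the repository address shown in the DataFrames output above.
# This call needs network access.
url_result = analyze("https://github.com/JuliaData/DataFrames.jl.git")
```

Both calls should display a summary in the same format as the registry lookup by name.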

One can also analyze an entire manifest with analyze_manifest(path) (where path defaults to the manifest of the current active project). For example, analyzing a temporary environment in which I’ve added PackageAnalyzer:

pkg> activate --temp

pkg> add PackageAnalyzer

julia> using PackageAnalyzer

julia> @time results = analyze_manifest();
  0.117077 seconds (317.67 k allocations: 43.424 MiB)

julia> summary(results)
"33-element Vector{PackageAnalyzer.Package}"

PackageAnalyzer will respect the versions of each dependency in the Manifest, meaning it will take care to analyze the associated code (and not, say, the latest development code). It also properly handles code on branches (from e.g. Pkg.add(; rev=...)) and dev’d dependencies. It will download code if required, but if the code already exists in your .julia folder, it will find and use that (and verify the git tree hash to ensure the contents are as expected according to the hash in the manifest or registry). This makes analyzing manifests which have been instantiate’d very quick.

One can easily post-process the results, since a Vector{PackageAnalyzer.Package} is a Tables.jl-compatible row table. Continuing the example above,

pkg> add DataFrames

julia> using DataFrames

julia> df = DataFrame(results)
33×22 DataFrame
 Row │ name                     uuid                               repo                               subdir           reachable  docs   runtests  github_actions  travis  appve ⋯
     │ String                   Base.UUID                          String                             String           Bool       Bool   Bool      Bool            Bool    Bool  ⋯
   1 │ libsodium_jll            a9144af2-ca23-56d9-984f-0d03f7b5…  https://github.com/JuliaBinaryWr…                        true  false     false           false   false     fa ⋯
   2 │ HTTP                     cd3eb016-35fb-5094-929b-558a96fa…  https://github.com/JuliaWeb/HTTP…                        true   true      true            true   false     fa
   3 │ licensecheck_jll         4ecb348a-8b88-51ea-b912-4c460483…  https://github.com/JuliaBinaryWr…                        true  false     false           false   false     fa
   4 │ PackageAnalyzer          e713c705-17e4-4cec-abe0-95bf5bf3…  https://github.com/JuliaEcosyste…                        true   true      true            true   false     fa
  ⋮  │            ⋮                             ⋮                                  ⋮                         ⋮             ⋮        ⋮       ⋮            ⋮           ⋮        ⋮  ⋱
  31 │ RegistryInstances        2792f1a3-b283-48e8-9a74-f99dce51…  https://github.com/GunnarFarneba…                        true  false      true            true   false     fa ⋯
  32 │ LazilyInitializedFields  0e77f7df-68c5-4e49-93ce-4cd80f55…  https://github.com/KristofferC/L…                        true  false      true            true   false     fa
  33 │ LicenseCheck             726dbf0d-6eb6-41af-b36c-cd770e0f…  https://github.com/ericphanson/L…                        true  false      true            true   false     fa
                                                                                                                                                    13 columns and 26 rows omitted

julia> code = select!(flatten(df, :lines_of_code), :name, :version, :lines_of_code => identity => AsTable);

julia> sort!(code, :code)
235×9 DataFrame
 Row │ name                     version    directory     language  sublanguage  files  code   comments  blanks
     │ String                   VersionN…  String        Symbol    Union…       Int64  Int64  Int64     Int64
   1 │ libsodium_jll            1.0.20+0   README.md     Markdown                   1      0        27      11
   2 │ HTTP                     1.5.5      docs          Markdown                   5      0       465     191
   3 │ HTTP                     1.5.5      README.md     Markdown                   1      0        51      29
   4 │ HTTP                     1.5.5      CHANGELOG.md  Markdown                   1      0       218      24
   5 │ HTTP                     1.5.5      LICENSE.md    Markdown                   1      0        22       2
   6 │ licensecheck_jll         0.3.101+0  README.md     Markdown                   1      0        33      16
   7 │ PackageAnalyzer          1.0.0      docs          Markdown                   3      0        90      40
   8 │ PackageAnalyzer          1.0.0      README.md     Markdown                   1      0        22      13
  ⋮  │            ⋮                 ⋮           ⋮           ⋮           ⋮         ⋮      ⋮       ⋮        ⋮
 228 │ MbedTLS                  1.1.7      src           Julia                     13   2289        48     237
 229 │ JSON3                    1.12.0     src           Julia                     10   2512        68     199
 230 │ OpenSSL                  1.3.2      src           Julia                      2   2918       219     521
 231 │ Parsers                  2.5.1      src           Julia                      9   3252       136     154
 232 │ HTTP                     1.5.5      test          Julia                     26   4537       185     488
 233 │ URIs                     1.4.1      test          JSON                       1   4771         0       0
 234 │ LazilyInitializedFields  1.2.0      page          CSS                       94   5944       810    1167
 235 │ HTTP                     1.5.5      src           Julia                     36   6712       459     725
                                                                                               219 rows omitted
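
As one more illustration of this kind of post-processing, here is a minimal sketch that sums the Julia lines of code per package. So that it runs on its own, it builds a small stand-in for the code table from a few rows copied from the output above (only a subset of the rows and columns, so the sums are not real package totals). The names julia_code, totals and julia_lines are introduced here for illustration only.

```julia
using DataFrames

# Stand-in for the `code` table above: a few rows copied from the output
# shown in this post, with only the columns needed for the aggregation.
code = DataFrame(
    name      = ["HTTP", "HTTP", "HTTP", "Parsers"],
    directory = ["src", "test", "docs", "src"],
    language  = [:Julia, :Julia, :Markdown, :Julia],
    code      = [6712, 4537, 0, 3252],
)

# Keep only the rows counted as Julia code.
julia_code = subset(code, :language => ByRow(==(:Julia)))

# Sum the `code` column per package, then sort largest first.
totals = combine(groupby(julia_code, :name), :code => sum => :julia_lines)
sort!(totals, :julia_lines; rev=true)
```

With the real code table from the session above, the same subset, groupby and combine calls should apply unchanged, since they only rely on the name, language and code columns.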

There are plenty more features and analyses that could be added to the package, so check out the issue tracker if you would like to get involved!

We hope others find it a useful way to get a quantitative understanding of their dependencies, as well as of the OSS ecosystem as a whole.


What does

julia> analyze("DataFrames")
ERROR: ArgumentError: collection must be non-empty


Well, that shouldn’t happen. Do you have PackageAnalyzer v1.0 loaded? What is the full stacktrace?

Curious: just adding PackageAnalyzer only installs 0.1.0. I have to do an update to get 1.0.0.
Now it works!