How fresh is the General package registry?

I was wondering how many orphaned, single-release “maybe should never have been registered” packages there are in the General registry. This is something I have heard other communities lament: pollution of their package namespace with one-offs that never see any support.

I wrote a small script to check the time-since-last-release for all the packages in the registry and this is what I found (JLL packages are not counted):

Does the number of orphans grow over time?

The plots show the total number of packages, the number of brand new packages per month, and the number of new releases per month (whether an update or a new package). They also include a few lines related to “stale” packages: packages with no more than two releases, packages with a single release, and packages that have zero releases in the General registry (and were simply copied over from before we used a package registry like this).

Overall, the health seems pretty good; the “staleness” does not seem to be significant.
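
To make the “stale” categories concrete, here is a minimal sketch of how they can be derived from monthly snapshots. The function names `count_updates` and `staleness` are made up for this illustration and do not appear in the script further down, which does the same thing with DataFrames. Since only one snapshot per month is compared, several releases within one month count as a single update.

```julia
# `snapshots` holds, for one package, the timestamp of its latest release
# as seen at each monthly checkout (`missing` if none was found).
function count_updates(snapshots)
    n = 0
    for (prev, cur) in zip(snapshots, Iterators.drop(snapshots, 1))
        # `!==` compares by identity, so `missing !== missing` is `false`
        n += prev !== cur
    end
    return n
end

# Hypothetical labels for the categories plotted above.
staleness(n_updates) = n_updates == 0 ? :single_release :
                       n_updates == 1 ? :two_releases : :active
```

For example, `count_updates([missing, 100, 100, 250])` counts two changes, so that package would be labeled `:active`.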

In the plot you see two dips. They are respectively:

Freshness histogram for each month

Here is a per-month snapshot of the freshness distribution of packages: for each month I plot a histogram of how long ago each package last saw a release. You might need to open the image in a new tab.

Similarly to above, these histograms seem to show most packages see regular updates.
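
As a rough illustration of what each of these histograms contains, here is a dependency-free sketch. The helper names `age_years` and `freshness_histogram` are hypothetical, and the binning is simplified compared to what Makie's `hist!` does in the plotting script: values at or beyond `maxage` are simply clamped into the last bin.

```julia
const SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Age of the latest release, in years, relative to the checkout date.
# Both arguments are Unix timestamps in seconds.
age_years(checkout_sec, lastrelease_sec) =
    (checkout_sec - lastrelease_sec) / SECONDS_PER_YEAR

# Fraction of packages falling into each of `nbins` equal-width age bins
# covering the range from 0 to `maxage` years. Missing ages are skipped.
function freshness_histogram(ages, nbins, maxage)
    counts = zeros(Int, nbins)
    width = maxage / nbins
    for a in skipmissing(ages)
        i = clamp(floor(Int, a / width) + 1, 1, nbins)
        counts[i] += 1
    end
    return counts ./ max(sum(counts), 1)
end
```

A registry where most packages see regular updates has most of its mass in the first few bins.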

Today, age vs freshness

Lastly, I was curious whether in the present state of the registry one can see a wave of abandoned packages. I tried to plot something to expose a correlation between how long ago a package was first created vs how long ago it got its latest release. I do not see anything particularly interesting in this plot though.

The scripts to generate the data and the plots themselves

An INCREDIBLY inefficient fish script to gather all the data from the registry (checking out the repo for each historical month and then searching, for each package, for the last Registrator or METADATA sync commit). You can probably write something a million times faster.

  for year in (seq 18 24)
      for month in (seq -f"%02g" 1 12)
          set checkoutdate "20$year-$month-01"
          rm -f ../"$checkoutdate".ages
          set checkoutdatesec (date -d"$checkoutdate" +%s)
          git checkout (git rev-list -n 1 --first-parent --before="$checkoutdate" master)
          for dir in (find . -maxdepth 1 -type d -and \( -not -name ".*" \) | grep -v jll)
              echo $checkoutdate $dir
              for subdir in $dir/*
                  #echo $subdir
                  # look up the commit dates first: the conversions below depend on them
                  set regdate (git log -1 --author=Registrator --pretty="format:%ci" "$subdir"/?ersions.toml)
                  set syncdate (git log -1 --grep="automatic sync with METADATA" --pretty="format:%ci" "$subdir"/?ersions.toml)
                  if test -z "$regdate"
                      set regdatesec ""
                  else
                      set regdatesec (date -d"$regdate" +%s)
                  end
                  if test -z "$syncdate"
                      set syncdatesec ""
                  else
                      set syncdatesec (date -d"$syncdate" +%s)
                  end
                  echo "$regdatesec,$syncdatesec,$checkoutdatesec,$subdir" >> ../"$checkoutdate".ages
              end
          end
      end
  end
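
One way the data gathering might be sped up is to walk the git history once instead of checking out every month. The sketch below is an assumption-laden illustration, not a drop-in replacement for the script above: it treats any commit touching a `Versions.toml` file as a release (it does not filter by the Registrator author or the METADATA sync message, and it ignores the older lowercase `versions.toml` spelling), and the names `release_times` and `last_release_before` are made up here. It expects the lines printed by `git log --format='@%ct' --name-only -- '*/Versions.toml'` in a clone of General.

```julia
# Collect, per package, the sorted commit timestamps (Unix seconds) of all
# commits that touched its Versions.toml.
function release_times(lines)
    times = Dict{String,Vector{Int}}()
    t = 0
    for line in lines
        if startswith(line, "@")
            # header line of a commit: "@<unix timestamp>"
            t = parse(Int, line[2:end])
        elseif endswith(line, "/Versions.toml")
            # path such as "A/ACME/Versions.toml"; the package is the folder name
            pkg = basename(dirname(line))
            push!(get!(times, pkg, Int[]), t)
        end
    end
    foreach(sort!, values(times))
    return times
end

# Latest release at or before `cutoff`, or `missing` if there is none.
function last_release_before(ts, cutoff)
    i = searchsortedlast(ts, cutoff)
    return i == 0 ? missing : ts[i]
end
```

With the `git log` output saved to a file, `release_times(readlines(file))` builds the table once, and `last_release_before` can then be queried for every package and every monthly cutoff without touching git again.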

And the Julia script for plotting:

using CairoMakie
using PairPlots
using DataFrames
using Glob
using CSV
using Dates
using AlgebraOfGraphics

##
minmissing(::Missing, x) = x
minmissing(x, ::Missing) = x
minmissing(::Missing, ::Missing) = missing
minmissing(x, y) = min(x, y)

maxmissing(::Missing, x) = x
maxmissing(x, ::Missing) = x
maxmissing(::Missing, ::Missing) = missing
maxmissing(x, y) = max(x, y)

dfs = []
eras = []
tmpbigdf = DataFrame()
for file in glob("*.ages")
    df = DataFrame(CSV.File(file, types=[Union{Int,Missing},Union{Int,Missing},Int,String]), [:reg, :sync, :checkout, :pkgpath])
    era = file[1:end-5]
    df[!, :pkg] .= (x->split(x,"/")[end]).(df.pkgpath)
    df[!, :isjll] .= endswith.(df.pkg, "_jll")
    dfjll = df[df.isjll, :]
    df = df[.!df.isjll, :]
    sort!(df, :pkg)
    df[!, :era] .= era
    df[!, :firstappearance] .= false
    df[!, :changed] .= false
    df[!, :singleton] .= false
    df[!, :a_single_update] .= false
    df[!, :frombeforeregistry] .= false
    df[!, :untouchedfrombeforeregistry] .= true
    #df[df.agesec.==maximum(df.agesec),:agesec] .= missing
    if !isempty(dfs)
        lastdf = last(dfs)
        df[!, :firstappearance] .= [p ∉ lastdf.pkg for p in df.pkg]
        known_idx = in.(df.pkg, Ref(lastdf.pkg))
        lastdf_idx = in.(lastdf.pkg, Ref(df[known_idx, :pkg]))
        # does not count first appearances as a change
        df[known_idx, :changed] .= (df[known_idx, :sync] .!== lastdf[lastdf_idx, :sync]) .|| (df[known_idx, :reg] .!== lastdf[lastdf_idx, :reg])
        tmpbigdf = vcat(tmpbigdf,df)
        historical_changes_df = combine(groupby(tmpbigdf, :pkg), :changed=>sum=>:changes, :firstappearance=>any=>:seenappear)
        historical_changes_df = historical_changes_df[in.(historical_changes_df.pkg, Ref(df.pkg)), :]
        sort!(historical_changes_df, :pkg)
        df[!, :singleton] .= historical_changes_df.changes .== 0
        df[!, :a_single_update] .= historical_changes_df.changes .== 1
        df[!, :frombeforeregistry] .= historical_changes_df.seenappear .== false
        df[!, :untouchedfrombeforeregistry] .= df.frombeforeregistry .& df.singleton
    else
        tmpbigdf = df
    end
    push!(eras, era)
    push!(dfs, df)
end
df = vcat(dfs...)
df[!, :change] .= maxmissing.(df.sync, df.reg)
df[!, :agesec] .= df.checkout .- df.change
df[!, :agesec] .-= minimum(skipmissing(df.agesec)) # a bit of a mismatch with how we start counting
df[!, :age] .= df.agesec / 3600 / 24 / 365.25
dfcount = combine(groupby(df, :era), nrow)
maxage = maximum(skipmissing(df.age))
sort!(eras)

##

function skipmissingexceptifall(xs)
    if all(ismissing, xs)
        return [missing]
    else
        return skipmissing(xs)
    end
end

df_current = combine(groupby(df, :pkg),
  :change => minimum∘skipmissingexceptifall => :firstchange,
  :change => maximum∘skipmissingexceptifall => :lastchange,
  :firstappearance => (!)∘any => :frombeforeregistry,
  :changed => sum∘skipmissing => :totalchanges, # TODO skipmissing is needed here because we do not treat the first month correctly
)

##

df_aggregate = combine(groupby(df, :era),
  nrow => :total,
  :firstappearance => sum => :new,
  :changed => sum => :updated,
  :singleton => sum => :singleversions,
  :a_single_update => sum => :twoversions,
  :untouchedfrombeforeregistry => sum => :untouchedfrombeforeregistry
)
df_aggregate[!, :date] = Date.(df_aggregate.era)
df_aggregate[!, :up_to_one_update] = df_aggregate.twoversions .+ df_aggregate.singleversions

##

function age_hist_by_month(df; title="")
    fig = Figure(size=(800,50*length(eras)))
    bins = 30
    offset = 3/bins*1.5
    yticks = (1:length(eras)).*offset
    ax = Axis(fig[1, 1],
        xlabel = "years since latest release in the General registry",
        ylabel = "",
        yticks = (yticks, eras),
        title = title)

    for (i,era) in enumerate(eras)
        df_era = df[df.era .== era, :]
        d = hist!(collect(skipmissing(df_era.age)),
            #colormap=:thermal,
            #color=1:bins,
            color=(:gray,0.5),
            normalization=:probability,
            bins=(0:bins-1) .* (maxage/bins),
            offset=i*offset
        )
        #translate!(d, 0, 0, 1)
    end
    ylims!(ax, 0, offset*(length(eras)+1.2))
    return fig
end

f = age_hist_by_month(df, title="all packages")

##

labels = [
    "total",
    "new this month",
    "updated this month",
    "total from before General without a release since General's birth",
    "total with only a single release since General's birth",
    "total with only one or two releases since General's birth"]
plt = data(df_aggregate) * (
    mapping(:date, [:total,:new,:updated,:untouchedfrombeforeregistry,:singleversions,:up_to_one_update],
    color=dims(1) => renamer(labels) => "Packages") *
    visual(Stairs, step=:pre)
)
fg = draw(plt, axis=(width=800, height=400))

##

df_c = df_current[df_current.totalchanges .> 1, :]
fig = Figure()
ax = Axis(fig[1, 1], xlabel="time of first release in the registry", ylabel="time of last release in the registry",
    aspect = DataAspect(), title="age vs freshness")
scatter!(ax, df_c.firstchange, df_c.lastchange, color=(:black,0.05))
hidedecorations!(ax, label=false)
fig

I do not really have any closing thoughts or morals from this story, but I thought folks would be interested to see the data.


For new registrations, I tend to put “nice” package names under a bit more scrutiny. The shorter and more general the name, the more important it is that the package is expected to receive some long-term maintenance. In an academic setting, this would usually mean that there is a PI (someone with a permanent position) who is committed to having multiple generations of students or postdocs maintaining the package. Or, there are several people actively involved in maintaining the package.

For “personal” projects, like a student who will most likely move on once they finish their thesis projects, if the package name is long and specific, there’s not much of a problem with “polluting the namespace”.

For any package that has many dependents, I believe we should also make a deliberate effort to ensure that such packages are hosted by GitHub organizations with multiple owners, not personal accounts. There are quite a few organizations that have chosen to make some of the Julia core developers owners, even if those developers aren’t actively maintaining any of the packages in the organization. This allows for the community to let new maintainers take over if a particular package becomes unmaintained, and seems like an excellent idea.


I don’t think that we should focus on trying to guess if a package will continue to be maintained, because by and large that is an impossible task. Even for organizations, priorities shift, a researcher in a permanent position might be interested in something else, etc.

Instead, I think we should come up with a collection of customs (not rules, just conventions) that allow an individual, or ideally a group of motivated people, to take over a package. Package authors should recognize that losing interest in maintaining a package is a normal thing and allow for graceful retirement or takeover.

E.g., most of the time it should be acknowledged that a takeover might lead to a major refactoring and to breaking the API, because that is how software works. I think that the major problem with “community maintained” packages is that only minor fixes happen, if at all. These become harder and harder, and community maintainers lose motivation.


This is a question rather than a suggestion: the name of a Julia package is just a shorthand for a UUID, correct? Such that in principle, if a name were to be reclaimed by the General registry, any downstream software which was in fact using the original package would continue to get it, even if a different package of that name were registered in its place?

Again, I am not suggesting that General start doing this now, or ever necessarily. As languages grow, their communities do start to run into this problem, so we can hope Julia will reach the point where users start wanting that; it’s a symptom of success.


I don’t think that we should focus on trying to guess if a package will continue to be maintained

I wouldn’t guess, I would just ask. If someone were to try registering, say, Thermodynamics.jl, I’d point out to them that a package name that general would carry some expectations to be authoritative in scope and to have a high likelihood of being maintained in the longer term. If it turns out the author is an undergraduate who got carried away with solving their “Intro to Thermodynamics” homework I’d probably ask them to reconsider the name.

Of course, there are never any guarantees, but trying to ensure that a high-profile / “ambitious” package is more likely than not to be maintained for the foreseeable future seems wise.

Instead, I think we should come up with a collection of customs that allow an individual, or ideally a group of motivated people to take over a package.

I agree with that 100%. I feel like we’re generally on a pretty good track with this sort of thing. I’ve witnessed several instances of packages transitioning to new maintainers, both in an orderly handover and in situations where the original maintainer disappeared, and the package had to be forked to an organization and re-registered. (I’m actually maintaining one such package myself, DocumenterCitations).

And yes, new maintainer = new major version should not raise any eyebrows whatsoever.


A registry can support multiple packages of the same name. The only real issue is when you do a pkg> add Package, if there are multiple packages with that name you need to also specify the UUID to disambiguate. As a policy, General currently only allows one package per name.
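
As a small illustration of specifying the UUID together with the name, here is a sketch using the `Example` package, whose UUID in General is `7876af07-990d-54b4-ab0e-23690620f79a`. The choice of `Example` is only for illustration; since General currently has one package per name, the UUID is redundant there today.

```julia
using Pkg

# Pin down the package by both name and UUID, so that a second package
# with the same name in some registry could not be picked by accident.
spec = Pkg.PackageSpec(name = "Example",
                       uuid = "7876af07-990d-54b4-ab0e-23690620f79a")

# Uncomment to actually install (needs network access):
# Pkg.add(spec)
```

The REPL equivalent is `pkg> add Example=7876af07-990d-54b4-ab0e-23690620f79a`.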


Great, so if at some future point it makes sense to recycle package names which are moribund, it won’t change the functionality of any downstream code which relies on it, and Pkg could inform users of the new name of the package and suggest that they switch to that. There are benefits to getting the basic architecture right to begin with!


Yes, and at that point these old packages could get something like a deprecated flag whereby add Package would default to the non-deprecated package (while still allowing add Package=old_uuid to install the deprecated one).
