I was wondering how many orphaned, single-release “maybe should never have been registered” packages there are in the General registry. This is something I have heard other communities lament: pollution of their package namespace with one-offs that never see any support.
I wrote a small script to check the time-since-last-release for all the packages in the registry and this is what I found (JLL packages are not counted):
Does the number of orphans grow over time?
The plots show the total number of packages, the number of brand new packages per month, and the number of new releases (whether an update or a new package) per month. There are also a few lines related to “stale” packages: packages with no more than two releases, packages with a single release, and packages that have zero releases in the General registry (and were simply copied over from before we used a package registry like this).
Overall, the health seems pretty good: the “staleness” does not seem to be significant.
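To make the three “stale” categories concrete, here is a minimal Julia sketch. It is not the script used for the plots. It assumes a hypothetical per-package summary (`history`) recording whether the package was seen appearing in a monthly snapshot and in how many later snapshots its latest release date changed. The package names and the helper `stale_category` are made up for illustration.

```julia
# Hypothetical per-package summary (names and numbers are made up):
# seen_appear: the package first showed up in some snapshot after the first one
# changes:     number of later monthly snapshots in which its release date changed
history = Dict(
    "PkgA" => (seen_appear = true,  changes = 0),
    "PkgB" => (seen_appear = true,  changes = 1),
    "PkgC" => (seen_appear = false, changes = 0),
    "PkgD" => (seen_appear = true,  changes = 7),
)

# Assumption: categories are made mutually exclusive here for readability.
function stale_category(h)
    if !h.seen_appear && h.changes == 0
        return :untouched_from_before_registry
    elseif h.changes == 0
        return :single_release
    elseif h.changes == 1
        return :two_releases
    else
        return :active
    end
end

categories = Dict(pkg => stale_category(h) for (pkg, h) in history)
```

Because this counts changes between monthly snapshots, several releases within the same month would register as a single change, so “no more than two releases” is an approximation under that assumption.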
In the plot you see two dips. They are respectively:
- “remove packages and versions of packages that do not support Julia 1.0” by KristofferC (Pull Request #4169, JuliaRegistries/General)
- “Move JLLs to own folder” by oxinabox (Pull Request #79064, JuliaRegistries/General). I do not know why there is a dip for this one, since I am supposedly not counting JLLs.
Freshness histogram for each month
Here is a per-month snapshot of the freshness distribution of packages: for each month I plot a histogram of how long ago each package last saw a release. You might need to open the image in a new tab.
Similarly to above, these histograms seem to show most packages see regular updates.
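As a rough illustration of what one such histogram measures, here is a small sketch using only the `Dates` standard library. The release dates, the snapshot date, and the helper `year_bins` are invented for the example; the real plots use 30 bins over the full age range rather than one-year bins.

```julia
using Dates

# Hypothetical latest-release dates for four packages (made up for illustration)
last_release = [Date(2023, 12, 15), Date(2023, 6, 1), Date(2021, 1, 1), Date(2019, 1, 1)]
snapshot = Date(2024, 1, 1)

# Freshness: years elapsed between the snapshot and the latest release
age_years = [Dates.value(snapshot - d) / 365.25 for d in last_release]

# Count packages per one-year bin: bin 1 is [0, 1) years, bin 2 is [1, 2), and so on
function year_bins(ages, nbins)
    counts = zeros(Int, nbins)
    for a in ages
        b = min(floor(Int, a) + 1, nbins)  # very old packages are clamped into the last bin
        counts[b] += 1
    end
    return counts
end

counts = year_bins(age_years, 6)
fractions = counts ./ sum(counts)  # comparable to normalization=:probability
```

A registry where most of the weight sits in the first bin or two is one where most packages have seen a recent release.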
Today, age vs freshness
Lastly, I was curious whether in the present state of the registry one can see a wave of abandoned packages. I tried to plot something to expose a correlation between how long ago a package was first created vs how long ago it got its latest release. I do not see anything particularly interesting in this plot though.
The scripts to generate the data and the plots themselves
An INCREDIBLY inefficient fish script to gather all the data from the registry. It checks out the repo for each historical month and then, for each package, looks up the last Registrator commit and the last METADATA sync commit. You can probably write something a million times faster.
for year in (seq 18 24)
    for month in (seq -f"%02g" 1 12)
        set checkoutdate "20$year-$month-01"
        rm -f ../"$checkoutdate".ages
        set checkoutdatesec (date -d"$checkoutdate" +%s)
        # check out the registry as it was at the start of this month
        git checkout (git rev-list -n 1 --first-parent --before="$checkoutdate" master)
        for dir in (find . -maxdepth 1 -type d -and \( -not -name ".*" \) | grep -v jll)
            echo $checkoutdate $dir
            for subdir in $dir/*
                #echo $subdir
                # look up the dates first, so the checks below see this package's values
                # rather than the ones left over from the previous iteration
                set regdate (git log -1 --author=Registrator --pretty="format:%ci" "$subdir"/?ersions.toml)
                set syncdate (git log -1 --grep="automatic sync with METADATA" --pretty="format:%ci" "$subdir"/?ersions.toml)
                if test -z "$regdate"
                    set regdatesec ""
                else
                    set regdatesec (date -d"$regdate" +%s)
                end
                if test -z "$syncdate"
                    set syncdatesec ""
                else
                    set syncdatesec (date -d"$syncdate" +%s)
                end
                echo "$regdatesec,$syncdatesec,$checkoutdatesec,$subdir" >> ../"$checkoutdate".ages
            end
        end
    end
end
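Each line of a resulting `.ages` file has the form `regdatesec,syncdatesec,checkoutdatesec,path`, where either of the first two fields may be empty. As a minimal sketch of how such a line can be read without any packages (the helper names `parse_sec`, `parse_ages_line`, and `latest_change` and the sample values are made up for illustration, and this is not what the plotting script does):

```julia
# Empty timestamp fields become `missing`
parse_sec(s) = isempty(s) ? missing : parse(Int, s)

# Split one line into its four fields; `limit = 4` keeps the path intact
function parse_ages_line(line)
    reg, sync, checkout, path = split(line, ","; limit = 4)
    return (reg = parse_sec(reg), sync = parse_sec(sync),
            checkout = parse(Int, checkout), pkgpath = String(path))
end

# Latest known release timestamp, ignoring whichever field is missing
latest_change(r) = ismissing(r.reg) ? r.sync :
                   ismissing(r.sync) ? r.reg : max(r.reg, r.sync)

# Hypothetical sample line: a Registrator date but no METADATA sync date
row = parse_ages_line("1700000000,,1704067200,./A/AbstractFFTs")
pkg = split(row.pkgpath, "/")[end]
age_years = (row.checkout - latest_change(row)) / 3600 / 24 / 365.25
```

Here `age_years` is the time between the checkout date and the latest known release, which is the quantity the histograms are built from.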
And the Julia script for plotting:
using CairoMakie
using PairPlots
using DataFrames
using Glob
using CSV
using Dates
using AlgebraOfGraphics
##
minmissing(::Missing, x) = x
minmissing(x, ::Missing) = x
minmissing(::Missing, ::Missing) = missing
minmissing(x, y) = min(x, y)
maxmissing(::Missing, x) = x
maxmissing(x, ::Missing) = x
maxmissing(::Missing, ::Missing) = missing
maxmissing(x, y) = max(x, y)
dfs = []
eras = []
tmpbigdf = DataFrame()
for file in glob("*.ages")
    df = DataFrame(CSV.File(file, types=[Union{Int,Missing},Union{Int,Missing},Int,String]), [:reg, :sync, :checkout, :pkgpath])
    era = file[1:end-5] # strip the ".ages" extension, leaving the checkout date
    df[!, :pkg] .= (x->split(x,"/")[end]).(df.pkgpath)
    df[!, :isjll] .= endswith.(df.pkg, "_jll")
    dfjll = df[df.isjll, :]
    df = df[.!df.isjll, :]
    sort!(df, :pkg)
    df[!, :era] .= era
    df[!, :firstappearance] .= false
    df[!, :changed] .= false
    df[!, :singleton] .= false
    df[!, :a_single_update] .= false
    df[!, :frombeforeregistry] .= false
    df[!, :untouchedfrombeforeregistry] .= true
    #df[df.agesec.==maximum(df.agesec),:agesec] .= missing
    if !isempty(dfs)
        lastdf = last(dfs)
        df[!, :firstappearance] .= [p ∉ lastdf.pkg for p in df.pkg]
        known_idx = in.(df.pkg, Ref(lastdf.pkg))
        lastdf_idx = in.(lastdf.pkg, Ref(df[known_idx, :pkg]))
        # does not count first appearances as a change
        df[known_idx, :changed] .= (df[known_idx, :sync] .!== lastdf[lastdf_idx, :sync]) .|| (df[known_idx, :reg] .!== lastdf[lastdf_idx, :reg])
        tmpbigdf = vcat(tmpbigdf,df)
        # number of month-to-month changes seen so far for each package
        historical_changes_df = combine(groupby(tmpbigdf, :pkg), :changed=>sum=>:changes, :firstappearance=>any=>:seenappear)
        historical_changes_df = historical_changes_df[in.(historical_changes_df.pkg, Ref(df.pkg)), :]
        sort!(historical_changes_df, :pkg)
        df[!, :singleton] .= historical_changes_df.changes .== 0
        df[!, :a_single_update] .= historical_changes_df.changes .== 1
        df[!, :frombeforeregistry] .= historical_changes_df.seenappear .== false
        df[!, :untouchedfrombeforeregistry] .= df.frombeforeregistry .& df.singleton
    else
        tmpbigdf = df
    end
    push!(eras, era)
    push!(dfs, df)
end
df = vcat(dfs...)
df[!, :change] .= maxmissing.(df.sync, df.reg)
df[!, :agesec] .= df.checkout .- df.change
df[!, :agesec] .-= minimum(skipmissing(df.agesec)) # a bit of a mismatch with how we start counting
df[!, :age] .= df.agesec / 3600 / 24 / 365.25
dfcount = combine(groupby(df, :era), nrow)
maxage = maximum(skipmissing(df.age))
sort!(eras)
##
function skipmissingexceptifall(xs)
    if all(ismissing, xs)
        return [missing]
    else
        return skipmissing(xs)
    end
end
df_current = combine(groupby(df, :pkg),
    :change => minimum∘skipmissingexceptifall => :firstchange,
    :change => maximum∘skipmissingexceptifall => :lastchange,
    :firstappearance => (!)∘any => :frombeforeregistry,
    :changed => sum∘skipmissing => :totalchanges, # TODO skipmissing is needed here because we do not treat the first month correctly
)
##
df_aggregate = combine(groupby(df, :era),
    nrow => :total,
    :firstappearance => sum => :new,
    :changed => sum => :updated,
    :singleton => sum => :singleversions,
    :a_single_update => sum => :twoversions,
    :untouchedfrombeforeregistry => sum => :untouchedfrombeforeregistry
)
df_aggregate[!, :date] = Date.(df_aggregate.era)
df_aggregate[!, :up_to_one_update] = df_aggregate.twoversions .+ df_aggregate.singleversions
##
function age_hist_by_month(df; title="")
    fig = Figure(size=(800,50*length(eras)))
    bins = 30
    offset = 3/bins*1.5
    yticks = (1:length(eras)).*offset
    ax = Axis(fig[1, 1],
        xlabel = "years since latest release in the General registry",
        ylabel = "",
        yticks = (yticks, eras),
        title = title)
    # one histogram per month, stacked vertically by `offset`
    for (i,era) in enumerate(eras)
        df_era = df[df.era .== era, :]
        d = hist!(collect(skipmissing(df_era.age)),
            #colormap=:thermal,
            #color=1:bins,
            color=(:gray,0.5),
            normalization=:probability,
            bins=(0:bins-1) .* (maxage/bins),
            offset=i*offset
        )
        #translate!(d, 0, 0, 1)
    end
    ylims!(ax, 0, offset*(length(eras)+1.2))
    return fig
end
f = age_hist_by_month(df, title="all packages")
##
labels = [
    "total",
    "new this month",
    "updated this month",
    "total from before General without a release since General's birth",
    "total with only a single release since General's birth",
    "total with only one or two releases since General's birth"]
plt = data(df_aggregate) * (
    mapping(:date, [:total,:new,:updated,:untouchedfrombeforeregistry,:singleversions,:up_to_one_update],
        color=dims(1) => renamer(labels) => "Packages") *
    visual(Stairs, step=:pre)
)
fg = draw(plt, axis=(width=800, height=400))
##
df_c = df_current[df_current.totalchanges .> 1, :]
fig = Figure()
ax = Axis(fig[1, 1],
    xlabel = "time of first release in the registry",
    ylabel = "time of last release in the registry",
    aspect = DataAspect(),
    title = "age vs freshness")
scatter!(ax, df_c.firstchange, df_c.lastchange, color=(:black,0.05))
hidedecorations!(ax, label=false)
fig
I do not really have any closing thoughts or morals from this story, but I thought folks would be interested to see the data.