Pkg Ecosystem Statistics

We are pleased to announce that the Pkg Server ecosystem has matured to the point where we’re ready to start analyzing the Pkg Server logs and providing processed log output to the community. The Pkg Server maintainers cannot foresee every desirable log analysis, however, so in the spirit of open source we are building a log analysis framework instead. It will allow community developers to propose and implement log analysis passes, whose results will be exported to a database that any community member can then build interesting infrastructure on top of.

Because the raw logs themselves contain sensitive information (such as IP addresses), we do not publish the full logs, nor do we store them for longer than our data retention policy allows (30 days as of this writing). We only publish and store aggregations of the data. An anonymized test dataset will be provided within the analysis git repository for developers to experiment with. All pull requests adding new analyses will be inspected for possible privacy concerns before merging, and once merged, the analysis will be run as part of the regularly scheduled log analysis pipeline.

The logs being analyzed are nginx access logs with a custom log format. The log analysis package is essentially an enormous regular expression that parses the nginx access logs out to compressed .csvz files, then loads those files and allows aggregations to be run on them. As an example, the first analysis pass written calculates downloads per package per day, then plots them and saves them out to a .csv. Here’s the sample output for the last two weeks (note the meteoric rise of JLLWrappers_jll, Julia’s hottest new package! :wink: ):

The full data is also available as a CSV.
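To make the parse-then-aggregate idea concrete, here is a minimal sketch in Python. It is not the actual analysis package. The log line shape (nginx "combined"-like), the `/package/<uuid>/<hash>` request path, the UUIDs, and the function name `downloads_per_package_per_day` are all assumptions made for illustration, since the real custom log format is not shown in this post.

```python
import re
from collections import Counter
from datetime import datetime

# ASSUMPTION: lines look like nginx's "combined" format. The client address
# is matched but deliberately not captured, so it never leaves the parser.
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

# ASSUMPTION: package downloads are requests to /package/<uuid>/<hash>.
PACKAGE_RE = re.compile(r'^/package/(?P<uuid>[0-9a-f-]{36})/[0-9a-f]+$')


def downloads_per_package_per_day(lines):
    """Count successful package downloads keyed by (ISO date, package UUID)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None or m["method"] != "GET" or m["status"] != "200":
            continue  # skip unparseable lines and failed requests
        pkg = PACKAGE_RE.match(m["path"])
        if pkg is None:
            continue  # not a package download (e.g. a registry request)
        day = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z").date()
        counts[(day.isoformat(), pkg["uuid"])] += 1
    return counts


# Made-up sample lines using documentation-range IPs and fake UUIDs.
PKG_A = "11111111-1111-1111-1111-111111111111"
PKG_B = "22222222-2222-2222-2222-222222222222"
sample = [
    f'203.0.113.7 - - [01/Sep/2020:12:00:00 +0000] "GET /package/{PKG_A}/abc123 HTTP/1.1" 200 512 "-" "Pkg"',
    f'203.0.113.8 - - [01/Sep/2020:13:30:00 +0000] "GET /package/{PKG_A}/abc123 HTTP/1.1" 200 512 "-" "Pkg"',
    f'203.0.113.7 - - [02/Sep/2020:09:15:00 +0000] "GET /package/{PKG_A}/def456 HTTP/1.1" 200 512 "-" "Pkg"',
    f'203.0.113.9 - - [02/Sep/2020:09:16:00 +0000] "GET /package/{PKG_B}/abc123 HTTP/1.1" 404 0 "-" "Pkg"',
    '203.0.113.9 - - [02/Sep/2020:09:17:00 +0000] "GET /registries HTTP/1.1" 200 64 "-" "Pkg"',
]

counts = downloads_per_package_per_day(sample)
for (day, uuid), n in sorted(counts.items()):
    print(f"{day},{uuid},{n}")
```

Only the aggregated `(date, package, count)` rows come out of this sketch, which mirrors the point above about publishing aggregations rather than raw log lines.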

There are still open questions around the workflow of the log analysis, how it should be presented to developers, what kinds of useful things we can build with this information, etc… If you’re interested in building this kind of infrastructure, head on over to the git repository and take a look at the open issues to see where we’re headed! Help us build a solid foundation so that future developers can build beautiful graphs to show those upward curves. :slight_smile:


I love this! Thanks so much. I just saw that the total downloads don’t add up. Is this the case because you only plotted the last 14 days but there was data before this?

I’m also wondering which downloads are actually being counted. I assume you filter out Travis and GitHub Actions?

Yes, we are filtering out CI in these logs. There’s probably an ordering issue in the statistics analysis. Feel free to check out the source; it’s quite straightforward. :slight_smile:
