Announcing: package download stats!

After some years of getting the package server architecture in place and working (mostly) reliably, @staticfloat and I finally had some time to work on collecting logs into a data warehouse (we’re using Snowflake) and designing a set of queries over the logs that we can run and publish regularly. The current public aggregated stats are available with the prefix

https://julialang-logs.s3.amazonaws.com/public_outputs/current/

followed by a rollup name and the suffix .csv.gz indicating that all the files are gzip-compressed CSV files. The rollups that we currently publish are variations on the following basic views, listed here with the rollup name and the log fields that each one is indexed by:

  • client_types: client_type
  • julia_systems: julia_system, client_type
  • julia_versions: julia_version_prefix, client_type
  • resource_types: resource_type, status, client_type
  • package_requests: package_uuid, status, client_type

In addition to these, there are variants of these rollups that also have region and/or date as keys, indicated by appending _by_region and/or _by_date to the rollup name.
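
For example, here’s one way to pull a rollup into Julia (just a sketch; it assumes the HTTP.jl, CodecZlib.jl, CSV.jl and DataFrames.jl packages, but any HTTP client and CSV reader will do):

using HTTP, CodecZlib, CSV, DataFrames

const PREFIX = "https://julialang-logs.s3.amazonaws.com/public_outputs/current/"

# Build the URL for a rollup name (e.g. "package_requests" or
# "client_types_by_region"), download it, and decompress it into a DataFrame.
function fetch_rollup(name::AbstractString)
    resp = HTTP.get(PREFIX * name * ".csv.gz")
    return CSV.read(transcode(GzipDecompressor, resp.body), DataFrame)
end

fetch_rollup("client_types_by_region")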

There is also a log_recency CSV file published alongside the rollups.

This CSV file has two columns: time_utc and pkgserver, with a row for each package server where time_utc is the timestamp of the latest log entry available for that server. This is helpful to check if some package servers are delayed in uploading their logs. There is also a row in this file where the pkgserver is empty (null/missing) and the time_utc field is the time at which the query was run, which should be just before all the CSV files were uploaded.
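
For example, to see how far behind each server’s logs are (again just a sketch, reusing fetch_rollup from above; the exact file name is my assumption based on the naming scheme):

using DataFrames

# Reuses fetch_rollup from the sketch above; the name "log_recency" is an assumption.
recency = fetch_rollup("log_recency")

# One row per package server, plus a row with a missing pkgserver holding the
# time the query ran; servers whose time_utc lags well behind that row are late
# in uploading their logs.
sort(recency, :time_utc)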

These CSV files are generated each day at 4am UTC over the past year’s worth of logs. Logs are uploaded from servers every 2 hours, so aggregating at 4am and excluding the current day means that we should typically have complete logs from the last day that’s included in the stats, but we’ll never have stats including the current day. For the first four hours of each UTC-day, the last day of logs will be from the day before yesterday.

Rollup keys

The fields by which we aggregate various rollups are the following:

  • client_type: one of user, ci or empty (null/missing) indicating whether the request appears to be from a normal user, a CI run or some other kind of client like a normal web browser, web crawler or some server that’s mirroring the package servers.

  • julia_system: a hyphen-separated “tuple” indicating the client system’s characteristics that Pkg uses to determine what system-specific pre-compiled binaries to install on that system (a small parsing sketch follows this list). For example, the most common system tuple for regular users (client_type == "user") over the past month is x86_64-linux-gnu-libgfortran4-cxx11-libstdcxx26-julia_version+1.6.1, which indicates:

    • x86_64 hardware
    • linux kernel
    • GNU libc
    • version 4 of the gfortran ABI
    • version 11 of the C++ ABI
    • version 26 of the C++ standard library ABI
    • version 1.6.1 of the Julia “ABI”

    Amazingly enough, these are all things that need to be taken into consideration when deciding what binaries will work on a system and BinaryBuilder pre-builds all the possible combinations of ABIs for all the versions of platforms that we support. The package logs contain 198 distinct platform tuples, so many combinations really do occur in the wild.

  • julia_version_prefix: the Julia version of the client with any build number stripped away. For releases this is just the version number, but for people who build Julia from source, version numbers may look like 1.8.0-DEV.485 (my current version number), which would be truncated to 1.8.0-DEV in rollups.

  • resource_type: this classifies the types of things clients can request from package servers into:

    • package – a package tarball
    • artifact – an artifact tarball
    • registry – a specific registry tarball
    • registries – the literal resource /registries (to get a list of current registry versions)
    • meta – any resource path that starts with /meta
    • empty (null/missing) for anything else

  • status: the HTTP status code of the request, where 2xx responses indicate success and other status codes indicate various kinds of failure.

  • package_uuid: for package requests this is the UUID that identifies the package being requested. When this is one of the key columns, logs are limited to those with resource_type == "package".

  • region: when clients make requests to pkg.julialang.org they are redirected to a specific package server that should be close to them geographically, which helps make sure that package downloads are fast no matter where you are in the world. This field identifies the region of the package server that served the request. The regions where we currently have package servers are:

    • au – Australia
    • cn-east – China East
    • cn-northeast – China North East
    • cn-southeast – China South East
    • eu-central – Europe Central
    • in – India
    • sa – South America
    • sg – Singapore
    • us-east – US East
    • us-west – US West

    We’ll be adding more servers as AWS adds Lightsail support in more regions. For example, we’d like to add a server in Africa.

  • date: the date on which a request was made in the UTC timezone.
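
As promised above, here’s a rough sketch of pulling that example julia_system tuple apart with naive string splitting (Julia’s Base.BinaryPlatforms module has a fuller model of these platform strings):

# Naive split of the example tuple above into its hyphen-separated pieces.
split("x86_64-linux-gnu-libgfortran4-cxx11-libstdcxx26-julia_version+1.6.1", '-')
# returns ["x86_64", "linux", "gnu", "libgfortran4", "cxx11", "libstdcxx26", "julia_version+1.6.1"]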

Rollup values

Each rollup table is keyed by some subset of the above fields, but what we’re interested in is various aggregated data about the set of request logs that have that set of key field values. These are the aggregates we currently compute for each slice of logs:

  • request_addrs: the approximate number of unique requesting IP addresses. Details below on why this is approximate and not exact.

  • request_count: the number of requests.

  • successes: the number of requests which resulted in a 2xx HTTP response code. Only included if status is not one of the key fields of the rollup. To get the success rate, divide by request_count.

  • cache_misses: the number of requests which resulted in the package server attempting to fetch a resource from an upstream storage server. To get the cache miss rate, divide by request_count.

  • body_bytes_sent: total number of bytes served in the bodies of HTTP responses for all requests (i.e. not including HTTP headers or TLS/IP data). To get the average response body size per request, divide by request_count.

  • request_time: total time spent serving these requests. To get average request time, divide by request_count.

  • date_count: the number of distinct UTC dates when requests occurred. Only included if date is not one of the key fields of the rollup.

  • date_min: the earliest date of any request in this group. Only included if date is not one of the key fields of the rollup.

  • date_max: the latest date of any request in this group. Only included if date is not one of the key fields of the rollup.
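
Putting the “divide by request_count” notes above together, here’s a sketch of deriving the per-row rates and averages from a rollup (using fetch_rollup from the earlier sketch; client_types has neither status nor date as a key, so all of the value columns are present):

using DataFrames

# Each quantity below is a total divided by request_count, as described above.
df = fetch_rollup("client_types")
df.success_rate     = df.successes        ./ df.request_count
df.cache_miss_rate  = df.cache_misses     ./ df.request_count
df.avg_body_bytes   = df.body_bytes_sent  ./ df.request_count
df.avg_request_time = df.request_time     ./ df.request_count
df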

Anonymity and unique IP address count approximation

The reported numbers of unique IP addresses for package server requests are approximate because we don’t store IP addresses in our data warehouse. Instead, we store a HyperLogLog hash value of each IP address that allows us to accurately approximate the number of unique IP addresses to within ±1.6% error on average. This technique allows us to accurately estimate the number of unique IP addresses in any set of request logs while preserving quite strong anonymity properties with respect to individual IP addresses:

  • No IP address can be uniquely identified by its hash value.

  • There are only 102,400 distinct hash values, each less than 17 bits.

  • Since there are more than 300k distinct IP addresses that have made requests in the past month, there are already about three IP addresses among the ones we’ve actually seen (which we don’t record) per distinct hash value on average. This ratio will only increase over time since the set of hash values remains the same but we will see more IP addresses.

  • Most IP addresses have hash values with far more collisions than this: a typical IP address will share a hash value with dozens of IP addresses that we’ve seen and over half a million IPv4 addresses from the full 32-bit space, and over 4×10^34 addresses from the IPv6 address space.

This HyperLogLog hash technique allows us to estimate the number of unique IP addresses in any subset of logs while preserving the anonymity of end-users even in our internal data systems. As an extra measure of privacy protection, we do not publish even the HyperLogLog hash values, but only the aggregate counts derived from them.
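
To make that concrete, here’s a minimal, hypothetical sketch of the idea; the bucket count and hash function here are placeholders, not the actual scheme used in the warehouse:

# Hypothetical illustration of HyperLogLog-style hashing of an address; the real
# bucket count and hash function differ, but the shape of the idea is the same.
const NBUCKETS = 4096  # placeholder bucket count

function hll_value(ip::AbstractString)
    h = hash(ip)                        # 64-bit hash of the address
    bucket = h % UInt(NBUCKETS)         # low bits pick a bucket
    rank = trailing_zeros(h >> 12) + 1  # position of the first set bit in the rest
    return (bucket = Int(bucket), rank = rank)
end

# Many distinct addresses collapse onto the same small (bucket, rank) pair, so
# the stored value cannot identify an individual address:
hll_value("203.0.113.7"), hll_value("198.51.100.42")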

Examples

Here’s the current contents of the client_types_by_region rollup as a gist:

https://gist.github.com/StefanKarpinski/217fdf89f621704683dc2df4dc5b1ab8

Click through for a nice tabular view, courtesy of GitHub. To download and uncompress the latest version of this rollup table yourself, you can run this UNIX command:

curl -s https://julialang-logs.s3.amazonaws.com/public_outputs/current/client_types_by_region.csv.gz | gzcat

But any way you care to download it and uncompress it will work, and after that it’s just a (nicely formatted) CSV file. The key columns of this rollup are client_type and region. The value columns are all of the possible value columns, since neither status nor date is in the set of key columns. Some random observations:

  • The US eastern region is the busiest by both number of users (well, IP addresses) and requests for both CI traffic and real user traffic.
  • The US western region is the next busiest for CI requests whereas Europe is the next busiest in terms of real user requests.
  • India had CI requests on only two days in the last month, with a total of only 72 requests from (approximately) 13 IP addresses.

One data set that will no doubt be of particular interest to people is the package_requests rollup, which is keyed by package_uuid, status and client_type and includes, probably most interestingly, the number of unique requesting addresses for requests with client_type == "user", which is a decent proxy for “how many people use this package?” In the past month, the three most popular packages by this metric were:

*users = “Number of unique IP addresses which requested said package without any indicators of being a CI process.”
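
If you want to recreate this kind of ranking yourself, here’s a rough sketch using the published rollup (it reuses fetch_rollup from the earlier sketch; mapping UUIDs to package names, e.g. via the General registry, is left out):

using DataFrames

pkgs = fetch_rollup("package_requests")

# Approximate "how many people use this package": unique requesting addresses
# for successful (2xx) requests with client_type == "user". Summing
# request_addrs across status codes can double count an address that saw more
# than one 2xx status, so treat the result as a rough ranking.
users = filter(r -> isequal(r.client_type, "user") &&
                    coalesce(200 <= r.status < 300, false), pkgs)
ranking = combine(groupby(users, :package_uuid), :request_addrs => sum => :users)
first(sort(ranking, :users, rev=true), 20)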

Conclusion

There are certainly many more interesting things to be gleaned from all this data and we hope that some people will take a look at these data sets and do some interesting analysis, stand up some cool visualization apps, etc. There are also almost certainly more ways to slice and dice our logs, so if anyone has any suggestions for new data sets they’d like to see, please let us know!

121 Likes

This is great! Thank you to all those involved for their work on this.

5 Likes

This was just myself and @staticfloat which somewhat explains why it took so long :grimacing: :grin:. Very glad to be done with it though and we’re now in a position to add more ways to slice the data pretty easily. In the future, we’ll also add historical snapshot rollups for each year in separate files in a /historical/$year/ directory with the same general schema.

18 Likes

This is awesome!! Incredible work @StefanKarpinski and @staticfloat.

Quick question: how does one take these stats and create a GitHub badge for a package’s README?

16 Likes

Will these stats (resp. a short per-package summary) be visible on JuliaHub, eventually?

8 Likes

Love this! Would it be possible to turn the sql scripts / data warehouse definitions into a public repository so that the community can see and collaborate on the definition of these (and future) metrics?

While there is lots to be done with these aggregates, analysis tends to fizzle out over time as high-level data gets ‘wrung dry’. There’s probably even more that can be done via different raw → rollup transformations, and I’d expect the hive mind might have opinions. :wink:

While giving (anyone) access to the raw data is clearly a data privacy risk, the scripts themselves should be harmless? This might then be the seed for generating synthetic data via Soss/Turing generative models which could be incorporated into CI/CD and let more people prototype analysis against non-actual data that can then map onto the real data.

5 Likes

Yeah, it would be great if someone made that happen! Now that the devops bits and the parts that require access to data that we don’t want to publish in unaggregated form are done, hopefully others will carry on with stuff like that.

2 Likes

That would be great, I’m sure that will happen soon.

2 Likes

Yes, that’s something I want to do. Currently the code is all in Snowflake worksheets. I’d prefer to use something that’s in a git repo, but haven’t figured out a good workflow with Snowflake and git yet. But yes, the SQL queries should be public and people can propose new ones that we can add. Having synthetic generated data sets to play with is a very good idea as well!

4 Likes

This data should be added to the badges that JuliaHub currently exposes, soon(ish).

13 Likes

I guess we’d want to filter out downloads by GitHub Actions and other CI systems, if possible (based on IP)?

1 Like

I believe this has been done to some extent?

Yes, the Julia client sends indicators of whether environment variables that are set by CI services are present or not, and we check for those to decide whether something is a CI request.
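
Roughly speaking, the check looks something like this (illustrative only; the actual set of variables the client reports may differ):

# Illustrative only: environment variables commonly set by CI services.
const CI_VARIABLES = ["CI", "GITHUB_ACTIONS", "GITLAB_CI", "TRAVIS",
                      "APPVEYOR", "CIRCLECI", "BUILDKITE", "TF_BUILD"]

is_ci() = any(haskey(ENV, var) for var in CI_VARIABLES)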

1 Like

Not sure if this is relevant as I haven’t used Snowflake previously, but for Postgres and Redshift data processing, I’ve found dbt a brilliant tool for structuring and executing query-based data pipelines.

3 Likes

Here’s the data for last month with UUIDs converted into package names, if anyone is curious and doesn’t feel like wrangling the data themselves:
https://gist.github.com/baggepinnen/e05e8dc6989980f558691d9ce6771801

The first columns are sorted based on number of users and the last columns based on organization name.
The script used to produce the CSV file is here:
https://gist.github.com/baggepinnen/b98625dbfd2a7194a744f90926ef6f66

15 Likes

To be clear, these numbers refer to unique IPs installing the package within the time window, right? So stable packages with infrequent releases won’t be (re)downloaded very often by regular users, and users on dynamic IPs may be counted multiple times, correct?

4 Likes

This is cool. Here are my suggestions:

  • create a GitHub repo with FAQ, docs, etc. where everyone can contribute their scripts or code to extract, analyze, cluster, predict, and visualize the data
  • have a list of questions we want answered given the datasets, so that scripts which have solutions can be associated with these questions
  • include a BoF or workshop in the next JuliaCon related to these activities
  • have a Slack/Zulip/Discourse channel for anything about these datasets, for brainstorming and follow-up discussion about insights on the data

4 Likes

Yes, IIUC. So in theory one could just tag a new patch release to estimate the number of “unique active users”.

4 Likes

This is great. Is the content of the original post going to be put somewhere more permanent and updateable as new views of the data come along?

1 Like

Ah, sorry, I did overlook that. Nice!

1 Like