After some years of getting the package server architecture in place and working (mostly) reliably, @staticfloat and I finally had some time to work on collecting logs into a data warehouse (we’re using Snowflake) and designing a set of queries over the logs that we can run and publish regularly. The current public aggregated stats are available with the prefix https://julialang-logs.s3.amazonaws.com/public_outputs/current/ followed by a rollup name and the suffix .csv.gz, indicating that all the files are gzip-compressed CSV files. The rollups that we currently publish are variations on the following basic views, listed here with the rollup name and the log fields that each one is indexed by:
In addition to these, there are variants of these rollups that also have region and/or date as keys, which is indicated by appending _by_region and/or _by_date to the rollup name. The full list of rollups is:
There is also a log_recency CSV file published here:
This CSV file has two columns, time_utc and pkgserver, with a row for each package server where time_utc is the time of the latest log entry available for that server. This is helpful for checking whether some package servers are delayed in uploading their logs. There is also a row in this file where pkgserver is empty (null/missing) and the time_utc field is the time at which the query was run, which should be just before all the CSV files were uploaded.
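As a rough illustration of how this file can be used, the Python sketch below computes how far each server's latest log entry lags behind the query time. It assumes the CSV header names are exactly time_utc and pkgserver and that timestamps are in a form that datetime.fromisoformat accepts. The sample rows and server names are made up.

```python
import csv
import io
from datetime import datetime

def log_lags(csv_text):
    """Return {pkgserver: lag}, where lag is query time minus latest log time."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # The row with an empty pkgserver holds the time the query was run.
    query_time = next(
        datetime.fromisoformat(r["time_utc"]) for r in rows if not r["pkgserver"]
    )
    return {
        r["pkgserver"]: query_time - datetime.fromisoformat(r["time_utc"])
        for r in rows
        if r["pkgserver"]
    }

# Hypothetical sample data, not real server names or times.
sample = """time_utc,pkgserver
2021-06-10 04:00:05,
2021-06-10 03:10:00,server-a.example
2021-06-09 21:30:00,server-b.example
"""

for server, lag in sorted(log_lags(sample).items()):
    print(server, lag)
```

A server whose lag is much larger than the 2-hour upload interval is probably behind on uploading its logs.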
These CSV files are generated each day at 4am UTC over the past year’s worth of logs. Logs are uploaded from servers every 2 hours, so aggregating at 4am and excluding the current day means that we should typically have complete logs from the last day that’s included in the stats, but we’ll never have stats including the current day. For the first four hours of each UTC-day, the last day of logs will be from the day before yesterday.
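The date arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes the files appear exactly at 4am UTC, ignoring however long the queries and uploads actually take.

```python
from datetime import datetime, timedelta

def last_day_in_stats(now_utc):
    """Last UTC date covered by the most recent 4am UTC aggregation run."""
    run_date = now_utc.date()
    if now_utc.hour < 4:
        # Today's run has not happened yet; the latest run was yesterday.
        run_date -= timedelta(days=1)
    # Each run excludes its own (current) day.
    return run_date - timedelta(days=1)

print(last_day_in_stats(datetime(2021, 6, 10, 3, 0)))  # 2021-06-08
print(last_day_in_stats(datetime(2021, 6, 10, 5, 0)))  # 2021-06-09
```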
The fields by which we aggregate various rollups are the following:
client_type: one of user, ci or empty (null/missing), indicating whether the request appears to be from a normal user, a CI run, or some other kind of client like a normal web browser, a web crawler or some server that’s mirroring the package servers.
julia_system: a hyphen-separated “tuple” indicating the client system’s characteristics that Pkg uses to determine which system-specific pre-compiled binaries to install on that system. For example, the most common system tuple for regular users (client_type == "user") over the past month is x86_64-linux-gnu-libgfortran4-cxx11-libstdcxx26-julia_version+1.6.1, which indicates:
- x86_64 hardware
- linux kernel
- GNU libc
- version 4 of the gfortran ABI
- version 11 of the C++ ABI
- version 26 of the C++ standard library ABI
- version 1.6.1 of the Julia “ABI”
Amazingly enough, these are all things that need to be taken into consideration when deciding which binaries will work on a system, and BinaryBuilder pre-builds all the possible combinations of ABIs for all the versions of platforms that we support. The package logs contain 198 distinct platform tuples, so many combinations really do occur in the wild.
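To make the structure of a system tuple concrete, here is a small Python sketch that pulls out the tagged components of the example above. It is not how Pkg parses platforms. The function name and output keys are hypothetical, and since the positional parts after the architecture vary between platforms, they are simply collected in an "other" list.

```python
def parse_julia_system(system):
    """Split a julia_system tuple into its recognizable tagged parts."""
    # The Julia version may itself contain hyphens, so split it off first.
    head, _, julia_version = system.partition("-julia_version+")
    parts = head.split("-")
    info = {"arch": parts[0], "julia_version": julia_version or None, "other": []}
    for part in parts[1:]:
        if part.startswith("libgfortran"):
            info["libgfortran"] = part[len("libgfortran"):]
        elif part.startswith("libstdcxx"):
            info["libstdcxx"] = part[len("libstdcxx"):]
        elif part.startswith("cxx"):
            info["cxx"] = part[len("cxx"):]
        else:
            # Untagged parts such as the kernel and libc.
            info["other"].append(part)
    return info

example = "x86_64-linux-gnu-libgfortran4-cxx11-libstdcxx26-julia_version+1.6.1"
print(parse_julia_system(example))
```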
julia_version_prefix: the Julia version of the client with any build number stripped away. For releases this is just the version number, but for people who build Julia from source, version numbers may look like 1.8.0-DEV.485 (my current version number), in which case the trailing build number is stripped.
resource_type: this classifies the types of things clients can request from package servers into:
- package – a package tarball
- artifact – an artifact tarball
- registry – a specific registry tarball
- registries – the literal resource /registries (to get a list of current registry versions)
- meta – any resource path that starts with
- empty (null/missing) for anything else
status: the HTTP status code of the request, where 2xx responses indicate success and other status codes indicate various kinds of failure.
package_uuid: for package requests this is the UUID that identifies the package being requested. When this is one of the key columns, logs are limited to those with resource_type == "package".
region: when clients make requests to pkg.julialang.org they are redirected to a specific package server that should be close to them geographically, which helps make sure that package downloads are fast no matter where you are in the world. This field identifies the region of the package server that served their request. The regions where we currently have package servers are:
- cn-east – China East
- cn-northeast – China North East
- cn-southeast – China South East
- eu-central – Europe Central
- sa – South America
- us-east – US East
- us-west – US West
We’ll be adding more servers as AWS adds Lightsail support in more regions. For example, we’d like to add a server in Africa.
date: the date on which a request was made in the UTC timezone.
Each rollup table is keyed by some subset of the above fields, but what we’re interested in is various aggregated data about the set of request logs that have that set of key field values. These are the aggregates we currently compute for each slice of logs:
request_addrs: the approximate number of unique requesting IP addresses. Details below on why this is approximate and not exact.
request_count: the number of requests.
successes: the number of requests which resulted in a 2xx HTTP response code. Only included if status is not one of the key fields of the rollup. To get the success rate, divide by request_count.
cache_misses: the number of requests which resulted in the package server attempting to fetch a resource from an upstream storage server. To get the cache miss rate, divide by request_count.
body_bytes_sent: total number of bytes served in the bodies of HTTP responses for all requests (i.e. not including HTTP headers or TLS/IP data). To get the average request body size, divide by request_count.
request_time: total time spent serving these requests. To get the average request time, divide by request_count.
date_count: the number of distinct UTC dates when requests occurred. Only included if date is not one of the key fields of the rollup.
date_min: the earliest date of any request in this group. Only included if date is not one of the key fields of the rollup.
date_max: the latest date of any request in this group. Only included if date is not one of the key fields of the rollup.
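The per-request averages and rates can be derived from one rollup row along the following lines. This is a minimal sketch: the function name is hypothetical, it takes request_count as the denominator for every ratio, and the sample row uses made-up numbers.

```python
def derived_rates(row):
    """Compute per-request rates and averages from one rollup row (a dict)."""
    n = float(row["request_count"])
    out = {
        "cache_miss_rate": float(row["cache_misses"]) / n,
        "avg_body_bytes": float(row["body_bytes_sent"]) / n,
        "avg_request_time": float(row["request_time"]) / n,
    }
    # successes is absent from rollups that have status as a key field.
    if "successes" in row:
        out["success_rate"] = float(row["successes"]) / n
    return out

# Hypothetical row; values arrive as strings when read with csv.DictReader.
row = {
    "request_count": "200",
    "successes": "190",
    "cache_misses": "10",
    "body_bytes_sent": "1000000",
    "request_time": "50.0",
}
print(derived_rates(row))
```

Note that request_addrs is deliberately left out: unique address counts are not additive, so they should not be summed or averaged across rows.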
The reported numbers of unique IP addresses for package server requests are approximate because we don’t store IP addresses in our data warehouse. Instead, we store a HyperLogLog hash value of each IP address that allows us to approximate the number of unique IP addresses to within ±1.6% error on average. This technique allows us to accurately estimate the number of unique IP addresses in any set of request logs while preserving quite strong anonymity properties with respect to individual IP addresses:
No IP address can be uniquely identified by its hash value.
There are only 102,400 distinct hash values, each less than 17 bits.
Since there are more than 300k distinct IP addresses that have made requests in the past month, there are already about three IP addresses among the ones we’ve actually seen (which we don’t record) per distinct hash value on average. This ratio will only increase over time since the set of hash values remains the same but we will see more IP addresses.
Most IP addresses have hash values with far more collisions than this: a typical IP address will share a hash value with dozens of IP addresses that we’ve seen and over half a million IPv4 addresses from the full 32-bit space, and over 4×10^34 addresses from the IPv6 address space.
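The numbers above can be reproduced with a little arithmetic. The split of the 102,400 hash values into 4096 buckets times 25 rank values is an assumption of this sketch, not something stated here. It is chosen because it matches the stated total, fits in 17 bits, and agrees with the usual HyperLogLog standard error of about 1.04/√m for m buckets. Under that assumption, half of all addresses land on the most common rank within their bucket.

```python
import math

HASH_VALUES = 102_400   # from the text
SEEN_ADDRS = 300_000    # "more than 300k" from the text

# Average number of seen addresses per hash value.
print(round(SEEN_ADDRS / HASH_VALUES, 2))  # 2.93

# ASSUMPTION: 4096 buckets x 25 possible rank values.
BUCKETS, RANKS = 4096, 25
assert BUCKETS * RANKS == HASH_VALUES

# Addresses sharing one bucket and the most common rank (probability 1/2).
ipv4_sharing = 2**32 // BUCKETS // 2
ipv6_sharing = 2**128 / BUCKETS / 2
print(ipv4_sharing)            # 524288
print(f"{ipv6_sharing:.2e}")   # 4.15e+34

# Usual HyperLogLog standard error for this many buckets.
std_err = 1.04 / math.sqrt(BUCKETS)
print(f"{std_err:.3%}")        # 1.625%
```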
This HyperLogLog hash technique allows us to estimate the number of unique IP addresses in any subset of logs while preserving the anonymity of end-users even in our internal data systems. As an extra measure of privacy protection, we do not publish even HyperLogLog hash values, but only the aggregate counts derived from them.
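For readers unfamiliar with HyperLogLog, here is a minimal textbook-style sketch in Python of how unique counts can be estimated from small (bucket, rank) hash values alone. It is not the implementation used in our data warehouse: the use of SHA-256, the 12 bucket bits, the uncapped rank, and all function names are assumptions made for illustration.

```python
import hashlib
import math

P = 12              # bucket bits (assumed for this sketch)
M = 1 << P          # number of buckets
REST_BITS = 64 - P  # bits left over for the rank

def hll_hash(addr):
    """Map an address string to a small (bucket, rank) pair."""
    h = int.from_bytes(hashlib.sha256(addr.encode()).digest()[:8], "big")
    bucket = h >> REST_BITS
    rest = h & ((1 << REST_BITS) - 1)
    # Rank is the 1-based position of the first 1 bit in the remaining bits.
    rank = REST_BITS - rest.bit_length() + 1
    return bucket, rank

def estimate_unique(pairs):
    """Estimate the number of distinct inputs from their (bucket, rank) pairs."""
    registers = [0] * M
    for bucket, rank in pairs:
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    empty = registers.count(0)
    if raw <= 2.5 * M and empty:
        # Small-range correction (linear counting).
        return M * math.log(M / empty)
    return raw

def fake_addr(i):
    """Generate a hypothetical address string for testing."""
    return f"10.{(i >> 16) & 255}.{(i >> 8) & 255}.{i & 255}"

# 50,000 distinct hypothetical addresses, each seen twice.
stored = [hll_hash(fake_addr(i)) for i in range(50_000)] * 2
print(round(estimate_unique(stored)))
```

Only the (bucket, rank) pairs are kept. Many different addresses map to the same pair, which is where the anonymity comes from, and duplicates do not change the estimate.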
Here are the current contents of the client_types_by_region rollup as a gist:
Click through for a nice tabular view, courtesy of GitHub. To download and uncompress the latest version of this rollup table yourself, you can run this UNIX command:
curl -s https://julialang-logs.s3.amazonaws.com/public_outputs/current/client_types_by_region.csv.gz | gunzip -c
But any way you care to download it and uncompress it will work, and after that it’s just a (nicely formatted) CSV file. The key columns of this rollup are client_type and region. The value columns are all of the possible value columns since neither status nor date is in the set of key columns. Some random observations:
- The US eastern region is the busiest by both number of users (well, IP addresses) and requests for both CI traffic and real user traffic.
- The US western region is the next busiest for CI requests whereas Europe is the next busiest in terms of real user requests.
- India had CI requests for only two days in the last month including a total of only 72 requests from (approximately) 13 IP addresses.
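The download step described above can also be done from Python using only the standard library. This is a sketch: the helper names are hypothetical, the URL pattern is taken from the curl command above, and the files are assumed to be UTF-8 encoded with a header row.

```python
import csv
import gzip
import io
import urllib.request

PREFIX = "https://julialang-logs.s3.amazonaws.com/public_outputs/current/"

def rollup_url(name):
    """Build the URL for a rollup such as 'client_types_by_region'."""
    return f"{PREFIX}{name}.csv.gz"

def parse_rollup(gz_bytes):
    """Decompress gzip-compressed CSV bytes into a list of row dicts."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

def fetch_rollup(name):
    """Download and parse one rollup (requires network access)."""
    with urllib.request.urlopen(rollup_url(name)) as resp:
        return parse_rollup(resp.read())

# Usage (requires network access):
# rows = fetch_rollup("client_types_by_region")
# print(rows[0])
```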
One data set that will no doubt be of particular interest to people is the package_requests rollup, which is keyed by client_type and includes, probably most interestingly, the number of unique request addresses for requests with client_type == "user", which is a decent proxy for “how many people use this package?” In the past month, the three most popular packages by this metric were:
*users = “Number of unique IP addresses which requested said package without any indicators of being a CI process.”
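A ranking like this can be computed from the rows of the package_requests rollup roughly as follows. This is a sketch: the function name is hypothetical, the sample rows and UUID placeholders are made up, and it assumes request_addrs is written as an integer. If the rollup has several rows per package for user traffic (because of other key columns), the counts should not be summed, since unique address counts are not additive.

```python
def top_user_packages(rows, n=3):
    """Return the n (package_uuid, request_addrs) pairs with the most user addresses."""
    users = [r for r in rows if r["client_type"] == "user"]
    users.sort(key=lambda r: int(r["request_addrs"]), reverse=True)
    return [(r["package_uuid"], int(r["request_addrs"])) for r in users[:n]]

# Hypothetical rows; real rows carry actual package UUIDs.
rows = [
    {"package_uuid": "uuid-a", "client_type": "user", "request_addrs": "120"},
    {"package_uuid": "uuid-a", "client_type": "ci", "request_addrs": "900"},
    {"package_uuid": "uuid-b", "client_type": "user", "request_addrs": "300"},
    {"package_uuid": "uuid-c", "client_type": "user", "request_addrs": "45"},
    {"package_uuid": "uuid-d", "client_type": "", "request_addrs": "500"},
]
print(top_user_packages(rows, n=2))  # [('uuid-b', 300), ('uuid-a', 120)]
```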
There are certainly many more interesting things to be gleaned from all this data and we hope that some people will take a look at these data sets and do some interesting analysis, stand up some cool visualization apps, etc. There are also almost certainly more ways to slice and dice our logs, so if anyone has any suggestions for new data sets they’d like to see, please let us know!