After some years of getting the package server architecture in place and working (mostly) reliably, @staticfloat and I finally had some time to work on collecting logs into a data warehouse (we’re using Snowflake) and designing a set of queries over the logs that we can run and publish regularly. The current public aggregated stats are available with the prefix
https://julialang-logs.s3.amazonaws.com/public_outputs/current/
followed by a rollup name and the suffix .csv.gz
indicating that all the files are gzip-compressed CSV files. The rollups that we currently publish are variations on the following basic views, listed here with the rollup name and the log fields that each one is indexed by:
-
client_types
:client_type
-
julia_systems
:julia_system
,client_type
-
julia_versions
:julia_version_prefix
,client_type
-
resource_types
:resource_type
,status
,client_type
-
package_requests
:package_uuid
,status
,client_type
In addition to these, there are variants of these rollups that also have region
and/or date
as keys, which is indicated by appending _by_region
and/or _by_date
to the rollup name. The full list of rollups is:
client_types
client_types_by_region
client_types_by_date
client_types_by_region_by_date
julia_systems
julia_systems_by_region
julia_systems_by_date
julia_systems_by_region_by_date
julia_versions
julia_versions_by_region
julia_versions_by_date
julia_versions_by_region_by_date
resource_types
resource_types_by_region
resource_types_by_date
resource_types_by_region_by_date
package_requests
package_requests_by_region
package_requests_by_date
package_requests_by_region_by_date
There is also a log_recency
CSV file published here:
This CSV file has two columns: time_utc
and pkgserver
with a row for each package server where time_utc
is the latest log entry available for that server. This is helpful to check if some package servers are delayed in uploading their logs. There is also a row in this file where the pkgserver
is empty (null/missing) and the time_utc
field is the time at which the query was run, which should be just before all the CSV files were uploaded.
These CSV files are generated each day at 4am UTC over the past year’s worth of logs. Logs are uploaded from servers every 2 hours, so aggregating at 4am and excluding the current day means that we should typically have complete logs from the last day that’s included in the stats, but we’ll never have stats including the current day. For the first four hours of each UTC-day, the last day of logs will be from the day before yesterday.
Rollup keys
The fields by which we aggregate various rollups are the following:
-
client_type
: one ofuser
,ci
or empty (null/missing) indicating whether the request appears to be from a normal user, a CI run or some other kind of client like a normal web browser, web crawler or some server that’s mirroring the package servers. -
julia_system
: a hypen-separated “tuple” indicating the client system’s characteristics that Pkg uses to determine what system-specific pre-compiled binaries to install on that system. For example, the most common system tuple for regular users (client_type == "user"
) over the past month isx86_64-linux-gnu-libgfortran4-cxx11-libstdcxx26-julia_version+1.6.1
, which indicates:- x86_64 hardware
- linux kernel
- GNU libc
- version 4 of the gfortran ABI
- version 11 of the C++ ABI
- version 26 of the C++ standard library ABI
- version 1.6.1 of the Julia “ABI”
Amazingly enough, these are all things that need to be taken into consideration when deciding what binaries will work on a system and BinaryBuilder pre-builds all the possible combinations of ABIs for all the versions of platforms that we support. The package logs contain 198 distinct platform tuples, so many combinations really do occur in the wild.
-
julia_version_prefix
: the Julia version of the client with any build number stripped away. For releases this just the version number, but for people who build Julia from source, version numbers may look like1.8.0-DEV.485
(my current version number), which would be truncated to1.8.0-DEV
in rollups. -
resource_type
: this classifies the types of things clients can request from package servers into:-
package
– a package tarball -
artifact
– an artifact tarball -
registry
– a specific registry tarball -
registries
–the literal resource/registries
(to get a list of current registry versions) -
meta
– any resource path that starts with/meta
- empty (null/missing) for anything else
-
-
status
: the HTTP status code of the request, where 2xx responses indicate success and other status codes indicate various kinds of failure. -
package_uuid
: for package requests this is the UUID that identifies the package being requested. When this is one of the key columns, logs are limited to those withrequest_type == "package"
. -
region
: when clients make requests topkg.julialang.org
they are redirected to a specific package server that should be close to them geographically, which helps make sure that package downloads are fast no matter where you are in the world. This field identifies region of the package server that served their request. The regions where we currently have package servers are:-
au
– Australia -
cn-east
– China East -
cn-northeast
– China North East -
cn-southeast
– China South East -
eu-central
– Europe Central -
in
– India -
sa
– South America -
sg
– Singapore -
us-east
– US East -
us-west
– US West
We’ll be adding more servers as AWS adds Lightsail support in more regions. For example, we’d like to add a server in Africa.
-
-
date
: the date on which a request was made in the UTC timezone.
Rollup values
Each rollup table is keyed by some subset of the above fields, but what we’re interested in is various aggregated data about the set of request logs that have that set of key field values. These are the aggregates we currently compute for each slice of logs:
-
request_addrs
: the approximate number of unique requesting IP addresses. Details below on why this is approximate and not exact. -
request_count
: the number of requests. -
successes
: the number of requests which resulted in a 2xx HTTP response code. Only included ifstatus
is not one of the key fields of the rollup. To get the success rate, divide byrequest_count
. -
cache_misses
: the number of requests which resulted in the package server attempting to fetch a resource from an upstream storage server. To get the cache miss rate, divide byrequest_count
. -
body_bytes_sent
: total number of bytes served in the bodies of HTTP responses for all requests (i.e. not including HTTP headers or TLS/IP data). To get average request body size, divide byrequest_count
. -
request_time
: total time spent serving these requests. To get average request time, divide byrequest_count
. -
date_count
: the number of distinct UTC dates when requests occurred. Only included ifdate
is not one of the key fields of the rollup. -
date_min
: the earliest date of any request in this group. Only included ifdate
is not one of the key fields of the rollup. -
date_max
: the latest date of any request in this group. Only included ifdate
is not one of the key fields of the rollup.
Anonymity and unique IP address count approximation
The reported numbers of unique IP addresses for package server requests are approximate because we don’t store IP addresses in our data warehouse. Instead, we store a HyperLogLog hash value of each IP address that allows us to accurately approximate the number of unique IP address to within ±1.6% error on average. This technique allows us to accurately estimate the number of unique IP addresses in any set of request logs while preserving quite strong anonymity properties with respect to individual IP addresses:
-
No IP address can be uniquely identified by its hash value.
-
There are only 102,400 distinct hash values, each less than 17 bits.
-
Since there are more than 300k distinct IP addresses that have made requests in the past month, there are already about three IP addresses among the ones we’ve actually seen (which we don’t record) per distinct hash value on average. This ratio will only increase over time since the set of hash values remains the same but we will more IP addresses.
-
Most IP addresses have hash values with far more collisions than this: a typical IP address will share a hash value with dozens of IP addresses that we’ve seen and over half a million IPv4 addresses from the full 32-bit space, and over 4×10^34 addresses from the IPv6 address space.
This HyperLogLog hash technique allows us estimate the number of unique IP addresses in any subset of logs while preserving the anonymity of end-users even in our internal data systems. As an extra measure of privacy protection, we do not publish even HyperLogLog hash values, but only the aggregate counts derived from them.
Examples
Here’s the current contents of the client_type_by_region
rollup as a gist:
https://gist.github.com/StefanKarpinski/217fdf89f621704683dc2df4dc5b1ab8
Click through for a nice tabular view, courtesy of GitHub. To download and uncompress the latest version of this rollup table yourself, you can run this UNIX command:
curl -s https://julialang-logs.s3.amazonaws.com/public_outputs/current/client_types_by_region.csv.gz | gzcat
But any way you care to download it and uncompress it will work, and after that it’s just a (nicely formatted) CSV file. The key columns of this rollup are client_type
and region
. The value columns are all of the possible values columns since neither status
nor date
are in set of key columns. Some random observations:
- The US eastern region is the busiest by both number of users (well, IP addresses) and requests for both CI traffic and real user traffic.
- The US western region is the next busiest for CI requests whereas Europe is the next busiest in terms of real user requests.
- India had CI requests for only two days in the last month including a total of only 72 requests from (approximately) 13 IP addresses.
One data set that will no doubt be of particular interest to people is the package_requests
rollup, which is keyed by package_uuid
, status
and client_type
and includes probably most interestingly, the number of unique request addresses requests with client_type == "user"
, which is a decent proxy for “how many people use this package?” In the past month, the three most popular packages by this metric were:
*users = “Number of unique IP addresses which requested said package without any indicators of being a CI process.”
Conclusion
There’s certainly many more interesting things to be gleaned from all this data and we hope that some people will take a look at these data sets and do some interesting analysis, stand up some cool visualization apps, etc. There are also almost certainly more ways to slice and dice our logs, so if anyone has any suggestions for new data sets they’d like to see, please let us know!