Announcing: package download stats!

Yep. We can’t know what packages were downloaded before we started collecting log data, and if people change IP addresses, that messes with IP-based counting. In addition to people changing IPs, there can also be multiple people behind a NAT who appear as the same public IP address. So this metric is subject to both under- and over-counting as a way to estimate users. The only way to do better is to do something that helps identify unique client installs. I’m considering doing similar HyperLogLog hashing to count clients.

2 Likes

Is there a way to distinguish between something added as a dependency and something added directly via ] add?

2 Likes

There is not.

1 Like

Is there a way to distinguish between something added as a dependency and something added directly via ] add

In the end, I’d say any use of a package is as a dependency, whether in an “end-user” environment’s Project.toml, or used by a small local “convenience package”, or used by a registered package, or as a deep transitive dependency. IMHO each deserves equal recognition.

6 Likes

Thanks, and a half-offtopic question:

  • How can I query this info? (e.g. x86_64-linux-gnu-libgfortran4-cxx11-libstdcxx26-julia_version+1.6.1)
    • I would like to print my julia_system value to the console,
    • and to include it in my “error” reports.
  • IMHO the libgfortran4-cxx11-libstdcxx26 part is missing from the versioninfo(verbose=true) output.
    • Is it possible to add this info for us mere mortals?
julia> using Base.BinaryPlatforms

julia> HostPlatform()
macOS aarch64 {cxxstring_abi=cxx11, julia_version=1.8.0, libgfortran_version=5.0.0}

julia> triplet(HostPlatform())
"aarch64-apple-darwin-libgfortran5-cxx11-julia_version+1.8.0"
4 Likes

It occurs to me that it would be really handy to have all of the latest rollup tables available for download in a single sqlite database file that someone could download and start writing SQL queries against. I’m sure I could figure that out with a bit of effort, but I’m posting it here hoping that someone with prior expertise using sqlite might contribute a script that takes a list of URLs, downloads the compressed CSV files, creates a sqlite database, and populates that database with tables corresponding to each file. We could arrange to run that after the uploads are all done and then upload a corresponding sqlite database file each day.
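Something along these lines, just as a sketch (the URLs below are placeholders, and SQLite.jl / CSV.jl / CodecZlib.jl are merely one reasonable package combination):

using Downloads, CSV, DataFrames, SQLite, CodecZlib

# Placeholder URLs -- substitute the actual rollup file locations.
urls = Dict(
    "client_types"     => "https://example.com/client_types.csv.gz",
    "package_requests" => "https://example.com/package_requests.csv.gz",
)

db = SQLite.DB("rollups.sqlite")
for (table, url) in urls
    path = Downloads.download(url)        # fetch the compressed CSV to a temp file
    df = open(path) do io                 # decompress and parse into a DataFrame
        CSV.read(GzipDecompressorStream(io), DataFrame)
    end
    SQLite.load!(df, db, table)           # create and populate the corresponding table
end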

1 Like

As much as I agree with all of these points, I think this is where the community can hopefully step in because I don’t really have time to do all of that.

Here are some details of how the processing works. At the core of this is the logs table, which we upload new log data into regularly. Here is the schema of that table:

 Row │ name                type
     │ String              String
─────┼───────────────────────────────────────
   1 │ row_id              NUMBER(38,0)
   2 │ time_utc            TIMESTAMP_NTZ(9)
   3 │ request             VARCHAR(16777216)
   4 │ request_method      VARCHAR(16777216)
   5 │ request_url         VARCHAR(16777216)
   6 │ package             VARCHAR(16777216)
   7 │ package_uuid        VARCHAR(16777216)
   8 │ package_hash        VARCHAR(16777216)
   9 │ registry            VARCHAR(16777216)
  10 │ registry_uuid       VARCHAR(16777216)
  11 │ registry_hash       VARCHAR(16777216)
  12 │ artifact            VARCHAR(16777216)
  13 │ artifact_hash       VARCHAR(16777216)
  14 │ request_http        VARCHAR(16777216)
  15 │ status              NUMBER(38,0)
  16 │ body_bytes_sent     NUMBER(38,0)
  17 │ http_user_agent     VARCHAR(16777216)
  18 │ request_time        FLOAT
  19 │ julia_version       VARCHAR(16777216)
  20 │ julia_system        VARCHAR(16777216)
  21 │ julia_ci_variables  VARCHAR(16777216)
  22 │ julia_interactive   BOOLEAN
  23 │ julia_pkg_server    VARCHAR(16777216)
  24 │ request_id          VARCHAR(16777216)
  25 │ cache_miss          VARCHAR(16777216)
  26 │ pkgserver           VARCHAR(16777216)
  27 │ hll_hash            BINARY(8388608)

The following view definition is used to add some classification logic to the logs:

create or replace view logs_classified as
select *,
-- requester attributes
  case
    when "request_method" not in ('GET', 'HEAD') then null
    when "julia_ci_variables" regexp '.*=[^n].*' then 'ci'
    when "julia_ci_variables" is not null        then 'user'
                                                 else null
  end as "client_type",
  regexp_substr("julia_version", '^[0-9]+\.[0-9]+\.[0-9]+(-[^.]+)?') as "julia_version_prefix",
  regexp_substr("pkgserver", '^[a-z-]+') as "region",
-- requested attributes
  case
    when "request_method" not in ('GET', 'HEAD') then null
    when "request_url" regexp '/meta(|/.*)'      then 'meta'
    when "request_url" = '/registries'           then 'registries'
    when "registry" is not null                  then 'registry'
    when "package"  is not null                  then 'package'
    when "artifact" is not null                  then 'artifact'
                                                 else null
  end as "resource_type",
-- status classification
  200 <= "status" and "status" < 300 as "status_success",
-- time attributes
  date("time_utc") as "date"
from logs;

The following Julia code is then used to generate all the rollups as views (which are then used to generate the files that are uploaded):

function rollup_sql(name::AbstractString, fields::AbstractVector{<:AbstractString})
    push!(rollup_names, name)
    keys_sql = join(map(repr, fields), ", ")
    vals_sql = """
      , hll_estimate(hll_combine("hll_hash")) as "request_addrs"
      , count(*) as "request_count"
    """
    if !("status" in fields)
        vals_sql *= """
          , sum(iff("status_success", 1, 0)) as "successes"
        """
    end
    vals_sql *= """
      , sum(iff("cache_miss" = 'miss', 1, 0)) as "cache_misses"
      , sum("body_bytes_sent") as "body_bytes_sent"
      , sum("request_time") as "request_time"
    """
    if !("date" in fields)
        vals_sql *= """
          , count(distinct "date") as "date_count"
          , min("date") as "date_min"
          , max("date") as "date_max"
        """
    end
    where_clauses = String[]
    "package_uuid"  in fields && push!(where_clauses, "\"resource_type\" = 'package'")
    "registry_uuid" in fields && push!(where_clauses, "\"resource_type\" = 'registry'")
    "artifact_hash" in fields && push!(where_clauses, "\"resource_type\" = 'artifact'")
    if "date" in fields
        push!(where_clauses, "dateadd(year, -1, current_date()) <= \"date\"")
        push!(where_clauses, "\"date\" < current_date()")
    end
    from_sql = isempty(where_clauses) ?
        "logs_classified" :
        "logs_classified where $(join(where_clauses, "\n  and "))"
    return """
    create or replace view $name as
    select $keys_sql
    $(chomp(vals_sql))
    from $from_sql
    group by $keys_sql
    order by $keys_sql;
    """
end

rollups = [
    "client_types"     => ["client_type"],
    "julia_systems"    => ["julia_system", "client_type"],
    "julia_versions"   => ["julia_version_prefix", "client_type"],
    "resource_types"   => ["resource_type", "status", "client_type"],
    "package_requests" => ["package_uuid", "status", "client_type"],
]

rollup_names = ["log_recency"]
for (name, fields) in rollups
    println(rollup_sql(name, fields))
    println(rollup_sql("$(name)_by_region", [fields; "region"]))
    println(rollup_sql("$(name)_by_date", [fields; "date"]))
    println(rollup_sql("$(name)_by_region_by_date", [fields; "region"; "date"]))
end

You can run this code yourself to see all the rollup definitions, but here’s the first one for free:

create or replace view client_types as
select "client_type"
  , hll_estimate(hll_combine("hll_hash")) as "request_addrs"
  , count(*) as "request_count"
  , sum(iff("status_success", 1, 0)) as "successes"
  , sum(iff("cache_miss" = 'miss', 1, 0)) as "cache_misses"
  , sum("body_bytes_sent") as "body_bytes_sent"
  , sum("request_time") as "request_time"
  , count(distinct "date") as "date_count"
  , min("date") as "date_min"
  , max("date") as "date_max"
from logs_classified
group by "client_type"
order by "client_type";

Most of the aggregate functions are standard SQL. The hll_combine and hll_estimate functions are Snowflake-specific: hll_combine aggregates a collection of HyperLogLog sketch values (represented as binary blobs) into a single sketch describing everything that has been aggregated over; hll_estimate takes such a sketch and produces a cardinality estimate.
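If you haven’t seen HyperLogLog before, here’s a toy Julia illustration of what those two operations amount to conceptually (this is just the textbook HLL math, not Snowflake’s actual sketch representation):

# A sketch here is a vector with one register per bucket, holding the maximum
# "rank" observed for that bucket. Combining sketches is an element-wise max.
combine(a::Vector{UInt8}, b::Vector{UInt8}) = max.(a, b)

# The cardinality estimate is a bias-corrected harmonic mean over the registers
# (omitting HLL's small- and large-range corrections for brevity).
function estimate(sketch::Vector{UInt8})
    m = length(sketch)                          # number of buckets, e.g. 4096
    α = 0.7213 / (1 + 1.079 / m)                # standard HLL bias correction
    return α * m^2 / sum(2.0^-Int(r) for r in sketch)
end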

The hll_hash field is the HyperLogLog hash value of the request IP address of the log line, stored in the binary format that Snowflake understands. This hash is computed from the IP address before the logs are uploaded for storage, and then the IP address is discarded. (We do store a separate copy of the logs that retains the IP address in a different S3 bucket that is set to auto-delete after 30 days; it is only used to diagnose and mitigate attacks and never enters our data analysis systems.) The code to compute the HyperLogLog hash can be found here:

https://github.com/JuliaPackaging/PkgServerLogAnalysis.jl/blob/master/src/hll_hashing.jl

The rest of the code to process and upload package server logs can also be found in that repo. At a high level, the hashing works by encrypting each IP address using a 16-round Feistel cipher with SHA512 as its round function (this is serious overkill in every way and should definitely be secure; the security only matters for making it hard to intentionally craft an IP address with a specific hash value; the real reason I went to town here was to get a better shot at good statistical properties). It then uses 30 bits of the encrypted value to generate a 17-bit HyperLogLog hash value (12 bits for the bucket, log2(19) bits for the exponential sample value). Using encryption guarantees that every hash value has multiple IPv4 and IPv6 addresses that correspond to it, which would be likely but not guaranteed if we used a hash function instead of encryption.
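In simplified form, that last step looks roughly like this (an illustration of the idea rather than the exact repo code; in particular, the 12/18-bit split is a hypothetical layout):

# Turn 30 bits of the encrypted address into one HLL observation: a 12-bit
# bucket index plus an exponential sample value (rank) with 19 possible values.
function hll_observation(bits::UInt32)          # only the low 30 bits are used
    bucket = (bits >> 18) & 0x0fff              # top 12 of the 30 bits: 4096 buckets
    rest   = bits & 0x3ffff                     # remaining 18 bits
    # rank 1..18 from the position of the first set bit; an all-zero field => 19
    rank   = rest == 0 ? 19 : leading_zeros(rest) - 13
    return bucket, rank
end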

6 Likes

One thing I did do from this list is create a Slack channel to discuss the data: #pkg-log-data on the JuliaLang Slack.

6 Likes

Is this right? You must be padding the addresses up to 1024 bits to use the encryption you describe, but then the encryption only guarantees you have a permutation on that whole 1024-bit space, not the 32-bit space of IP addresses. Or am I missing something?

The Feistel cipher is not a block cipher; it is format-preserving, which is why I’m using it. There is no fixed block size: it adjusts to whatever the size of the data is, up to 128 bytes (2 × 512 bits), and it induces a “random” permutation of the space of all byte vectors of a given size. Of course, the space of such permutations is absolutely huge—factorial(2^(8*length(data)))—so even a fairly large key can only possibly access a tiny slice of that space, but it’s also impossible to analytically characterize what that subspace is due to the complex nature of the round function (SHA512), so encrypting with a large random key effectively induces a random permutation of the space of data vectors.

You can also tell all of this by playing around with the code. For a given key:

  • The encrypt! function encrypts any n-byte vector as another n-byte vector;
  • The decrypt! function inverts that mapping, taking the n-byte cipher text vector back to the original plain text vector.

Since this is the case for any input data, these functions must both be bijections on n-byte vectors for any n, which is another way of saying that they each compute a permutation on the set of possible input/output values. If this were a block cipher, the cipher text would be longer than the input data for data sizes that aren’t a multiple of the block size, because it would include padding.

Ok, I’ve looked at the code and believe you now. :slight_smile: Feistel is a block cipher, but what you didn’t mention is that you wrap the hash round function so that it produces the desired block length.

I guess I would say that a Feistel network is a construction that can be used to build block ciphers, rather than being a block cipher itself. But maybe that’s just splitting hairs. In any case, it is used here to construct, for each specific size of input, a cipher whose block size is precisely the input size.

The use of SHA512 for the round function also means that the construction doesn’t really care what the size of the key is—the key is just some data that’s hashed and thereby parameterizes the permutation that the cipher implements for each size.
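For the curious, the general construction looks roughly like this in simplified form (illustrative only; the names and details differ from the actual encrypt!/decrypt! code in the repo, and this version only handles even-length inputs):

using SHA

# Round function: hash the key, the round index, and one half of the data,
# then truncate the digest to the length of the other half.
round_bytes(key, r, half, len) = sha512(vcat(key, UInt8(r), half))[1:len]

function feistel_encrypt(data::Vector{UInt8}, key::Vector{UInt8}; rounds=16)
    n = length(data)
    @assert iseven(n) "this sketch only handles even lengths"
    L, R = data[1:n÷2], data[n÷2+1:end]
    for r in 1:rounds
        L, R = R, L .⊻ round_bytes(key, r, R, n ÷ 2)
    end
    return vcat(L, R)
end

function feistel_decrypt(data::Vector{UInt8}, key::Vector{UInt8}; rounds=16)
    n = length(data)
    L, R = data[1:n÷2], data[n÷2+1:end]
    for r in rounds:-1:1
        L, R = R .⊻ round_bytes(key, r, L, n ÷ 2), L
    end
    return vcat(L, R)
end

key = rand(UInt8, 32)
ip  = UInt8[192, 168, 1, 42]            # an IPv4 address as 4 bytes
ct  = feistel_encrypt(ip, key)          # another 4-byte vector
@assert feistel_decrypt(ct, key) == ip  # exact round trip, same length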

2 Likes

Thank you for this great work!

I am looking at the CSV files and was wondering if there is a way to get the package_uuid from the package name or vice versa.

The registry used defines the mapping between UUID and name. There are helper functions in the script @baggepinnen shared above.
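For example, with an unpacked copy of the General registry you can build the mapping directly from Registry.toml (the path below is an assumption; on recent Julia versions the default registry ships as a tarball, so you may need a clone of https://github.com/JuliaRegistries/General instead):

using TOML

registry_toml = joinpath(homedir(), ".julia", "registries", "General", "Registry.toml")
packages = TOML.parsefile(registry_toml)["packages"]   # uuid => Dict("name" => ..., "path" => ...)

uuid_to_name = Dict(uuid => info["name"] for (uuid, info) in packages)
name_to_uuid = Dict(name => uuid for (uuid, name) in uuid_to_name)

name_to_uuid["DataFrames"]   # => the UUID string that appears in the stats files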

2 Likes

It would be good to rig up importing the mapping from package UUIDs to names into Snowflake so that the output tables can include the names for easier analysis, even though it’s also possible to do that later on.

1 Like

Thanks!

Hi. I would like to share with you a Julia Dash app based on the data warehouse from Stefan and Elliot (thank you both for your work!). The app uses their data to plot package requests using Dash for Julia. The source code can be found at https://github.com/valerocar/JuliaPackages.jl

JuliaPackages

30 Likes

Very cool!

2 Likes