Yep. We can’t know what packages were downloaded before we started collecting log data, and if people change IP addresses, that messes with IP-based counting. In addition to people changing IPs, there can also be multiple people behind a NAT who appear as the same public IP address. So this metric is subject to both under- and over-counting as a way to estimate users. The only way to do better is to do something that helps identify unique client installs; I’m considering doing similar HyperLogLog hashing to count clients.
Is there a way to distinguish between something added as a dependency and something added directly via ] add?
There is not.
In the end, I’d say any use of a package is as a dependency, whether in an “end-user” environment’s Project.toml, or used by a small local “convenience package”, or used by a registered package, or as a deep transitive dependency. IMHO each deserves equal recognition.
Thanks, and a half off-topic question:
- How can I query this info (x86_64-linux-gnu-libgfortran4-cxx11-libstdcxx26-julia_version+1.6.1)? I would like to print my julia_system value to the console and add it to my “error” reports.
- IMHO, the libgfortran4-cxx11-libstdcxx26 part is missing from the versioninfo(verbose=true) output. Is it possible to add this info for mere mortals?
julia> using Base.BinaryPlatforms
julia> HostPlatform()
macOS aarch64 {cxxstring_abi=cxx11, julia_version=1.8.0, libgfortran_version=5.0.0}
julia> triplet(HostPlatform())
"aarch64-apple-darwin-libgfortran5-cxx11-julia_version+1.8.0"
It occurs to me that it would be really handy to have all of the latest rollup tables available for download in a single SQLite database file that someone could download and start writing SQL queries against. I’m sure I could figure that out with a bit of effort, but I’m posting it here hoping that someone with prior expertise using SQLite might contribute a script that takes a list of URLs, downloads the compressed CSV files, creates an SQLite database, and populates that database with tables corresponding to each file. We could arrange to run that after the uploads are all done and then upload a corresponding SQLite database file each day.
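In case someone wants to pick this up, a rough sketch of such a script in Julia might look like this. It assumes the CSV.jl, DataFrames.jl, SQLite.jl, and CodecZlib.jl packages, gzip-compressed CSV files, and placeholder URLs; the real rollup URLs and naming convention would need to be substituted:

using Downloads, CSV, DataFrames, SQLite, CodecZlib

# Placeholder URLs; replace with the actual rollup file URLs.
urls = [
    "https://example.com/rollups/client_types.csv.gz",
    "https://example.com/rollups/package_requests.csv.gz",
]

db = SQLite.DB("rollups.sqlite")

for url in urls
    # derive a table name from the file name, e.g. "client_types"
    table = replace(basename(url), r"\.csv\.gz$" => "")
    path = Downloads.download(url)
    # decompress the gzip payload and parse the CSV into a DataFrame
    bytes = transcode(GzipDecompressor, read(path))
    df = CSV.read(bytes, DataFrame)
    # create/populate a table with the same name in the SQLite file
    SQLite.load!(df, db, table)
end

From there, anyone can query the resulting rollups.sqlite file with plain SQL, either from Julia via DBInterface.execute or with the sqlite3 command-line tool.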
As much as I agree with all of these points, I think this is where the community can hopefully step in because I don’t really have time to do all of that.
Here are some details of how the processing works. At the core of this is the logs table, which we upload new log data into regularly. Here is the schema of that table:
 Row │ name                 type
     │ String               String
─────┼────────────────────────────────────────
   1 │ row_id               NUMBER(38,0)
   2 │ time_utc             TIMESTAMP_NTZ(9)
   3 │ request              VARCHAR(16777216)
   4 │ request_method       VARCHAR(16777216)
   5 │ request_url          VARCHAR(16777216)
   6 │ package              VARCHAR(16777216)
   7 │ package_uuid         VARCHAR(16777216)
   8 │ package_hash         VARCHAR(16777216)
   9 │ registry             VARCHAR(16777216)
  10 │ registry_uuid        VARCHAR(16777216)
  11 │ registry_hash        VARCHAR(16777216)
  12 │ artifact             VARCHAR(16777216)
  13 │ artifact_hash        VARCHAR(16777216)
  14 │ request_http         VARCHAR(16777216)
  15 │ status               NUMBER(38,0)
  16 │ body_bytes_sent      NUMBER(38,0)
  17 │ http_user_agent      VARCHAR(16777216)
  18 │ request_time         FLOAT
  19 │ julia_version        VARCHAR(16777216)
  20 │ julia_system         VARCHAR(16777216)
  21 │ julia_ci_variables   VARCHAR(16777216)
  22 │ julia_interactive    BOOLEAN
  23 │ julia_pkg_server     VARCHAR(16777216)
  24 │ request_id           VARCHAR(16777216)
  25 │ cache_miss           VARCHAR(16777216)
  26 │ pkgserver            VARCHAR(16777216)
  27 │ hll_hash             BINARY(8388608)
The following view definition is used to add some classification logic to the logs:
create or replace view logs_classified as
select *,
    -- requester attributes
    case
        when "request_method" not in ('GET', 'HEAD') then null
        when "julia_ci_variables" regexp '.*=[^n].*' then 'ci'
        when "julia_ci_variables" is not null then 'user'
        else null
    end as "client_type",
    regexp_substr("julia_version", '^[0-9]+\.[0-9]+\.[0-9]+(-[^.]+)?') as "julia_version_prefix",
    regexp_substr("pkgserver", '^[a-z-]+') as "region",
    -- requested attributes
    case
        when "request_method" not in ('GET', 'HEAD') then null
        when "request_url" regexp '/meta(|/.*)' then 'meta'
        when "request_url" = '/registries' then 'registries'
        when "registry" is not null then 'registry'
        when "package" is not null then 'package'
        when "artifact" is not null then 'artifact'
        else null
    end as "resource_type",
    -- status classification
    200 <= "status" and "status" < 300 as "status_success",
    -- time attributes
    date("time_utc") as "date"
from logs;
The following Julia code is then used to generate all the rollups as views (which are then used to generate the files that are uploaded):
function rollup_sql(name::AbstractString, fields::AbstractVector{<:AbstractString})
    push!(rollup_names, name)
    keys_sql = join(map(repr, fields), ", ")
    vals_sql = """
        , hll_estimate(hll_combine("hll_hash")) as "request_addrs"
        , count(*) as "request_count"
        """
    if !("status" in fields)
        vals_sql *= """
            , sum(iff("status_success", 1, 0)) as "successes"
            """
    end
    vals_sql *= """
        , sum(iff("cache_miss" = 'miss', 1, 0)) as "cache_misses"
        , sum("body_bytes_sent") as "body_bytes_sent"
        , sum("request_time") as "request_time"
        """
    if !("date" in fields)
        vals_sql *= """
            , count(distinct "date") as "date_count"
            , min("date") as "date_min"
            , max("date") as "date_max"
            """
    end
    where_clauses = String[]
    "package_uuid" in fields && push!(where_clauses, "\"resource_type\" = 'package'")
    "registry_uuid" in fields && push!(where_clauses, "\"resource_type\" = 'registry'")
    "artifact_hash" in fields && push!(where_clauses, "\"resource_type\" = 'artifact'")
    if "date" in fields
        push!(where_clauses, "dateadd(year, -1, current_date()) <= \"date\"")
        push!(where_clauses, "\"date\" < current_date()")
    end
    from_sql = isempty(where_clauses) ?
        "logs_classified" :
        "logs_classified where $(join(where_clauses, "\n and "))"
    return """
        create or replace view $name as
        select $keys_sql
        $(chomp(vals_sql))
        from $from_sql
        group by $keys_sql
        order by $keys_sql;
        """
end

rollups = [
    "client_types" => ["client_type"],
    "julia_systems" => ["julia_system", "client_type"],
    "julia_versions" => ["julia_version_prefix", "client_type"],
    "resource_types" => ["resource_type", "status", "client_type"],
    "package_requests" => ["package_uuid", "status", "client_type"],
]

rollup_names = ["log_recency"]

for (name, fields) in rollups
    println(rollup_sql(name, fields))
    println(rollup_sql("$(name)_by_region", [fields; "region"]))
    println(rollup_sql("$(name)_by_date", [fields; "date"]))
    println(rollup_sql("$(name)_by_region_by_date", [fields; "region"; "date"]))
end
You can run this code yourself to see all the rollup definitions, but here’s the first one for free:
create or replace view client_types as
select "client_type"
, hll_estimate(hll_combine("hll_hash")) as "request_addrs"
, count(*) as "request_count"
, sum(iff("status_success", 1, 0)) as "successes"
, sum(iff("cache_miss" = 'miss', 1, 0)) as "cache_misses"
, sum("body_bytes_sent") as "body_bytes_sent"
, sum("request_time") as "request_time"
, count(distinct "date") as "date_count"
, min("date") as "date_min"
, max("date") as "date_max"
from logs_classified
group by "client_type"
order by "client_type";
Most of the aggregate functions are standard SQL. The hll_combine and hll_estimate functions are Snowflake-specific: hll_combine aggregates a bunch of HyperLogLog sketch values (represented as binary blobs) into an aggregate sketch describing the entire collection that has been aggregated over; hll_estimate takes an aggregate sketch and produces a cardinality estimate.
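For anyone who hasn’t used these before, here is a toy Julia sketch of the semantics only, not Snowflake’s binary sketch format: the classic HyperLogLog idea with m = 2^12 buckets (the same bucket width described just below):

# A sketch is just the maximum "exponential sample value" seen per bucket.
const P = 12        # bucket bits
const M = 2^P       # number of buckets

new_sketch() = zeros(UInt8, M)

# record one observation: bucket ∈ 0:M-1, value is the position of the first 1 bit
function add!(sketch::Vector{UInt8}, bucket::Int, value::Int)
    sketch[bucket + 1] = max(sketch[bucket + 1], UInt8(value))
    return sketch
end

# what hll_combine does conceptually: element-wise max of per-bucket values
combine(a::Vector{UInt8}, b::Vector{UInt8}) = max.(a, b)

# what hll_estimate does conceptually: bias-corrected harmonic mean
# (real implementations also apply small/large range corrections)
function estimate(sketch::Vector{UInt8})
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M^2 / sum(2.0^(-Int(v)) for v in sketch)
end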
The hll_hash field is the HyperLogLog hash value of the request IP address of the log line, in the same binary format that Snowflake understands. This hash is computed based on the IP address before the logs are uploaded for storage, and then the IP address is discarded. (We do store a separate copy of the logs that retains the IP address in a different S3 bucket that is set to auto-delete after 30 days, which is only used to diagnose and mitigate attacks and never enters our data analysis systems.) The code to compute the HyperLogLog hash can be found here:
https://github.com/JuliaPackaging/PkgServerLogAnalysis.jl/blob/master/src/hll_hashing.jl
The rest of the code to process and upload package server logs can also be found in that repo. At a high level, the hashing works by encrypting each IP address using a 16-round Feistel cipher with SHA512 as its round function (this is serious overkill in every way and should definitely be secure; also, the security only matters for making it hard to intentionally craft an IP address with a specific hash value; the real reason I went to town here was to get a better shot at good statistical properties). It then uses 30 bits of the encrypted value to generate a 17-bit HyperLogLog hash value (12 bits for the bucket, and log2(19) rounded up to 5 bits for the exponential sample value). Using encryption guarantees that every hash value has multiple IPv4 and IPv6 addresses that correspond to it, which would be likely but not guaranteed if we used a hash function instead of encryption.
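To make the 12 + 5 bit split concrete, here is a small sketch of how 30 bits of the encrypted value could be turned into a bucket index and an exponential sample value; the exact bit layout in hll_hashing.jl may differ, this is just the idea:

# x is assumed to hold 30 bits taken from the encrypted IP address
function hll_parts(x::UInt32)
    bucket = Int(x >> 18)             # top 12 bits select one of 4096 buckets
    rest = x & 0x0003ffff             # remaining 18 bits
    # exponential sample: position of the first 1 bit, counting from the top;
    # an all-zero remainder gives the maximum value, so 1 ≤ value ≤ 19
    value = min(leading_zeros(rest) - 14 + 1, 19)
    return bucket, value
end

Plugging (bucket, value) into a per-address sketch like the add! example above gives exactly the kind of sketches that hll_combine and hll_estimate can then merge and count.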
One thing I did do from this list is create a Slack channel to discuss the data: #pkg-log-data on the JuliaLang Slack.
Is this right? You must be padding the addresses up to 1024 bits to use the encryption you said, but then the encryption only guarantees you have a permutation on that whole 1024-bit space, not the 32-bit space of IP addresses. Or am I missing something?
The Feistel cipher is not a block cipher; it is format-preserving, which is why I’m using it. There is no fixed block size: it adjusts to whatever the size of the data is, up to 128 bytes (2 × 512 bits), and it induces a “random” permutation of the space of all byte vectors of a given size. Of course, that space of permutations is absolutely huge (factorial(2^(8*length(data))) of them), so even a fairly large key can only possibly access a tiny slice of that space of permutations, but it’s also impossible to analytically understand what that subspace is due to the complex nature of the round function (SHA512), so encrypting with a large random key effectively induces a random permutation of the space of data vectors.
You can also tell all of this by playing around with the code. For a given key:
- The encrypt! function encrypts any n-byte vector as another n-byte vector;
- The decrypt! function inverts that mapping, taking the n-byte ciphertext vector back to the original plaintext vector.
Since this is the case for any input data, these functions must both be bijections on n-byte vectors for any n, which is another way of saying that they each compute a permutation on the set of possible input/output values. If this were a block cipher, the ciphertext would be longer than the input data for data sizes that aren’t a multiple of the block size, because it would include padding.
OK, I’ve looked at the code and believe you now. Feistel is a block cipher, but what you didn’t mention is that you wrap the hash-based round function so it gives the desired block length.
I guess I would say that a Feistel network is a construction that can be used to build block ciphers, rather than being a block cipher itself. But maybe that’s just splitting hairs. In any case, it is used here to construct, for each specific size of input, a cipher whose block size is precisely the input size.
The use of SHA512 for the round function also means that the construction doesn’t really care what the size of the key is—the key is just some data that’s hashed and thereby parameterizes the permutation that the cipher implements for each size.
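To make the construction concrete, here is a minimal Julia sketch of a balanced Feistel network over a 4-byte block (an IPv4 address) with a SHA512-based round function. It is not the code from PkgServerLogAnalysis.jl (which handles arbitrary sizes up to 128 bytes); this toy only handles even-length blocks, but it shows the bijection property:

using SHA  # provides sha512

# round function: hash the key, round number, and one half, truncated to the half's length
round_func(key, r, half) = sha512(vcat(key, UInt8(r), half))[1:length(half)]

function feistel_encrypt(key::Vector{UInt8}, data::Vector{UInt8}; rounds = 16)
    n = length(data)
    L, R = data[1:n÷2], data[n÷2+1:end]
    for r in 1:rounds
        L, R = R, L .⊻ round_func(key, r, R)   # classic Feistel round
    end
    return vcat(L, R)
end

function feistel_decrypt(key::Vector{UInt8}, data::Vector{UInt8}; rounds = 16)
    n = length(data)
    L, R = data[1:n÷2], data[n÷2+1:end]
    for r in rounds:-1:1
        L, R = R .⊻ round_func(key, r, L), L   # undo the rounds in reverse order
    end
    return vcat(L, R)
end

key = rand(UInt8, 64)
ip  = UInt8[192, 168, 1, 42]
ct  = feistel_encrypt(key, ip)
feistel_decrypt(key, ct) == ip   # true: same length in and out, and fully invertible

The round function never needs to be inverted, which is why any keyed hash works and why the output stays exactly the size of the input.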
Thank you for this great work!
I am looking at the CSV files and was wondering if there is a way to get the package_uuid from the package name or vice versa.
The registry used defines the mapping between UUID and name. There are helper functions in the script @baggepinnen shared above.
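If you have an unpacked clone of the General registry locally, a minimal sketch of building that mapping yourself looks like this; the registry path is an assumption and may differ on your machine (newer Julia versions may store the registry as a tarball instead):

import TOML

# path to a local, unpacked clone of the General registry (adjust as needed)
registry_path = joinpath(homedir(), ".julia", "registries", "General")

# Registry.toml has a [packages] section mapping UUID strings to package names
reg = TOML.parsefile(joinpath(registry_path, "Registry.toml"))

uuid_to_name = Dict(uuid => info["name"] for (uuid, info) in reg["packages"])
name_to_uuid = Dict(name => uuid for (uuid, name) in uuid_to_name)

name_to_uuid["DataFrames"]   # look up the UUID for a given package name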
It would be good to rig up importing the mapping from package UUIDs to names into Snowflake so that the output tables can include the names for easier analysis, even though it’s possible to do that later on as well.
Thanks!
Hi. I would like to share with you a Julia Dash app based on the data warehouse of Stefan and Elliot (thank you both for your work!). The app uses their data to plot package requests using Dash for Julia. The source code can be found at https://github.com/valerocar/JuliaPackages.jl
Very cool!