Pkg.jl telemetry should be opt-in

Just going to throw out one concrete data analysis that originally led me to reach out to R’s CRAN team, which would be possible with UUIDs and wouldn’t be possible without some kind of entity-level identifier: co-occurrence counts of (long-term) package installations. This isn’t as useful if it’s computed from short-term information from a single 10-minute session, as you’ll often just recover the dependency graph, but it would let you build a package recommendation system if the co-occurrences are observed over days/weeks. So this is compatible with @dlakelan’s request that UUIDs flip after N months have passed.

In practice, I think this example is worth considering for two reasons: (1) it shows there’s valuable information in package installation data that the current proposal isn’t discussing and which an aggregate like HyperLogLog wouldn’t capture, and (2) it suggests that the space of interesting analyses often requires storing a lot of bits. In this case, for an ecosystem of K packages, you could either (a) store a K-bit string for each UUID or (b) store a 64-bit integer count for each of the O(K^2) pairs of packages.
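To make the idea concrete, here is a hypothetical sketch (the data structure and function names are made up for illustration) of how such co-occurrence counts could be computed from long-lived pseudonymous IDs:

function cooccurrence_counts(installs::Dict{String, Set{String}})
    # installs maps a (long-lived, pseudonymous) ID to the set of packages it installed
    counts = Dict{Tuple{String,String}, Int}()
    for pkgs in values(installs)
        sorted = sort!(collect(pkgs))
        for i in eachindex(sorted), j in i+1:length(sorted)
            key = (sorted[i], sorted[j])
            counts[key] = get(counts, key, 0) + 1
        end
    end
    return counts
end

installs = Dict("id1" => Set(["CSV", "DataFrames", "Plots"]),
                "id2" => Set(["CSV", "DataFrames"]))
cooccurrence_counts(installs)   # ("CSV", "DataFrames") => 2, the raw material for a recommender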

4 Likes

From a technical standpoint, you could use the HyperLogLog estimator currently proposed to choose a hash of the UUID which has, say, a 50% probability of collisions, so that you retain in the logs only a deniable record, but at the same time you can analyze the usage pattern behind a given hash, knowing you’re analyzing a mixture of around 2 people. Even with an 80% chance of collision, you’d be able to analyze groups of around 5 people. Is it so important to know that this one person installed xyz packages, rather than to say that in a pool of 5 people we see the following patterns of installation…
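As a rough illustration (not the actual proposal; the bucket count below is an arbitrary example), hashing each UUID into a deliberately small ID space gives you this kind of group-level record:

using SHA

# Map each client UUID into one of nbuckets buckets; with U active users and
# B buckets, each stored bucket mixes roughly U/B people, so analyses run on
# small anonymous groups rather than on individuals.
function bucket_id(uuid::AbstractString, nbuckets::Integer)
    x = reinterpret(UInt64, sha256(uuid)[1:8])[1]
    return Int(mod(x, UInt64(nbuckets))) + 1
end

# e.g. with ~100_000 users and 20_000 buckets you'd expect ~5 users per bucket
bucket_id("00000000-1111-2222-3333-444444444444", 20_000)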

3 Likes

Would this still work if the rule was not to re-generate every N months exactly, but to re-generate with some small probability p per day, such that the average life is N months? That way there is also no guarantee that your ID was the same yesterday.

2 Likes

This actually was my first suggestion WAY back up the thread.

Another useful idea is to have the data collection server generate a random salt with a small probability each day, then encrypt the UUID that’s sent and store that. So the actual UUID is NEVER stored on the server… but everything is still unique and no one knows the mapping, we only know that over a given time… the mapping changes.

@StefanKarpinski wdyt?

edit: thinking about it more, if the database server regenerates a salt using /dev/urandom with probability 1/100 each day, stores it only in RAM, and whenever it needs to store an identifier it simply calculates hash(ID concatenated with salt) and stores that… I’d probably have no objections.
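A minimal sketch of that scheme, assuming the server keeps the salt only in memory (the type and function names are invented for illustration):

using SHA, Random

mutable struct SaltState
    salt::Vector{UInt8}
end

# 128-bit salt drawn from the OS entropy source, held only in RAM
SaltState() = SaltState(rand(Random.RandomDevice(), UInt8, 16))

function maybe_rotate!(s::SaltState; p = 1/100)
    # called once per day; with probability p the salt (and hence the mapping
    # from UUIDs to stored identifiers) silently changes
    if rand() < p
        s.salt = rand(Random.RandomDevice(), UInt8, 16)
    end
    return s
end

# only this value is ever written to the logs, never the UUID itself
stored_identifier(s::SaltState, uuid::AbstractString) =
    bytes2hex(sha256(vcat(Vector{UInt8}(codeunits(uuid)), s.salt)))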

2 Likes

Oh I missed that, here, sorry. But a slightly different randomisation – if the expiry date is saved, then your backups from a month ago contain a little more information.

If randomising on the server, is the same salt used for all users, and updated on some days? Then I suppose the intervals between randomisations would still be obvious in its logs. Or is it done per-ID, so that changes of salt look just like users leaving & joining?

Exactly. So you’re storing a one-way encrypted version of all the UUIDs, and the encryption key is changing with, say, 1% probability per day.

For example, you could calculate sha_256(ID * salt) and take the first 48 bits… it’s not reversible, it has a small probability of collisions, but for usage stats it’s almost as useful as the UUID itself, and yet no one could know what the mapping was.
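In Julia-ish pseudocode (a sketch of the suggestion above, not a vetted design), that transform would look something like:

using SHA

# hash the UUID concatenated with the in-memory salt and keep the first 48 bits;
# stable while the salt is fixed, unlinkable once the salt rotates
truncated_id(uuid::AbstractString, salt::AbstractString) =
    bytes2hex(sha256(uuid * salt)[1:6])    # 6 bytes = 48 bits, shown as hex

truncated_id("00000000-1111-2222-3333-444444444444", "some-in-memory-salt")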

1 Like

How would the statistics look in this case if I install some packages today and then again after one month? Would the outcome be one user or two users? My understanding from the explanation was that every Pkg request would get an almost-unique ID, and thus counting the number of users wouldn’t be trivial anymore.

Rather than thinking about this as a fixed time window, think about it as analyzing fixed hashed IDs… You can be pretty sure that a given hash is not a large group of people… So you go through your database and select a random set of hashes, and calculate what those people are doing… Now you have all the info about packages used together, the variety of projects individual users are contributing to, or whatever requires being able to discern an individual.

If you just want to know how big the Julia community is, the HyperLogLog estimator they’re using gives you this information.
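For anyone unfamiliar with it, here is a toy HyperLogLog sketch (omitting the bias and small-range corrections a real implementation needs; the struct and function names are just for illustration) showing how it estimates the number of distinct IDs without storing any of them:

using SHA

struct HLL
    p::Int                    # number of index bits; 2^p registers
    registers::Vector{UInt8}
end
HLL(p::Int = 10) = HLL(p, zeros(UInt8, 1 << p))

function add!(h::HLL, id::AbstractString)
    x = reinterpret(UInt64, sha256(id)[1:8])[1]    # 64-bit hash of the ID
    idx = Int(x & ((UInt64(1) << h.p) - 1)) + 1    # low p bits pick a register
    rank = UInt8(trailing_zeros(x >> h.p) + 1)     # position of first set bit in the rest
    h.registers[idx] = max(h.registers[idx], rank)
    return h
end

function estimate(h::HLL)
    m = length(h.registers)
    alpha = 0.7213 / (1 + 1.079 / m)               # standard HyperLogLog constant
    return alpha * m^2 / sum(2.0^(-Float64(r)) for r in h.registers)
end

h = HLL()
foreach(id -> add!(h, string(id)), 1:50_000)
round(Int, estimate(h))                            # ≈ 50_000, within a few percent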

SHA256 is sensitive to small changes in the input - changing the input by a single bit creates a completely different output, making the UUID completely useless since you’re now saving random data. You lose all potential analysis you’d want to do.

Locality-sensitive hashing doesn’t help either, since similar inputs produce similar outputs and you retain the linkability of the UUID. You’re back to having a UUID, just with some constant computational cost associated with it on the server side.

Computing transforms of server-side data that allow useful analyses but don’t keep UUIDs around is something that I’ve put a lot of thought into and would like to develop, but it does not seem like all people who object to client UUIDs would be satisfied by this.

10 Likes

At the moment, I think it would satisfy all the cases I’ve discussed. Obviously it could use a little more thought than just taking the first transform that came into my head, but considering the transform suggested above:

Even if all 7 billion people on the planet are downloading Julia packages, you’re looking at an expected collision rate of something like (I’m not thinking this through in detail, so there could be some issue I’m not considering) 2^33/2^48 ≈ 3e-5, so only around one in 32k entries represents more than one person. This means that, for all practical purposes in doing statistical analysis of usage patterns, you can treat each hash in the database as representing one person.
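Checking that figure in the REPL:

julia> 2.0^33 / 2.0^48
3.0517578125e-5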

But if someone has a client UUID and wants to figure out which entries in your database correspond to that UUID… they can’t without knowing the, say, 128-bit salt, which is a number that was literally stored nowhere except transiently in the RAM of the running Pkg server (this also means the salt will be regenerated each time you reboot the Pkg daemon).

Thinking about this “one salt to rule them all” thing some more, this is a REALLY bad idea: UUIDs have a fixed size, so you can just generate candidates as fast as you want. SHA256 is also hardware accelerated since it’s not a memory-hard hash, so this is simply compute-bound. Now you just have to vary the salt - since the salt for every record of a given day is the same, you can easily parallelise this using ASICs or FPGAs (or both, or even a GPU cluster). What’s more, since the search space for UUIDs is known, you can build a rainbow table! Boom, easy deanonymization.

Please don’t roll your own crypto.

1 Like

You can’t build a rainbow table because the UUID comes first… and with a 128-bit salt you’d need more atoms than there are in the sun to store that rainbow table.

Oh I see, if you’re trying to crack a single UUID, then yes, you can build a rainbow table for that… you still need to search a 128-bit salt space… you can maybe mitigate this with sha256(salt * ID * salt).

In any case, the point wasn’t to suggest a particular implementation but rather the basic idea that you’re storing not the UUID but a much lower bit-length, irreversible transform of the UUID.

(calculation: if you’re doing 1 billion hashes a second on 1 million machines, it will take

julia> 2.0^128/1e9/1e6 / (3600*24*365)
1.0790283070806016e16

10^16 years to crack.)
(sorry, forgot the parens initially)
Also note that since you’re only storing the first 48 bits, you’d expect this to be non-unique, so knowing only a given UUID plaintext, you’ll find 2^128/2^48 ≈ 1e24 different possible solutions.
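For reference:

julia> 2.0^128 / 2.0^48
1.2089258196146292e24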

I still don’t see why server-side collection of the usage data is not taken seriously.
One objection I’ve heard was that CI would generate “fake” usage data. But that problem is solvable: Julia “knows” when it is running in CI, so Pkg can detect this and report the data accordingly.

1 Like

OK, it may be just me, but I detect a problem for Julia from this thread. People find telemetry distasteful. They have come to expect this from the big corporations that steal our data every day in every way, but somehow expect the Julia project to be “better”. Could it be that the telemetry that was supposed to “help” Julia grow will in fact hurt it?

6 Likes

Because server-side collection alone doesn’t tell you anything about the environment the code is ultimately run in. Whether it’s run in CI or on a developer’s laptop is a big difference. If the difference didn’t matter, I could simply run useless CI jobs for my favorite project that no one uses, promote it a little bit, use those statistics to apply for funding, and be on my merry way.

That’s precisely what’s being done with the CI environment variables (https://julialang.org/legal/data/#ci_variables).
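As a rough sketch of that mechanism (the variable names below are common CI indicators, not necessarily the exact list on that page):

# a client could flag a request as CI traffic by checking well-known environment variables
const CI_VARIABLES = ["CI", "CONTINUOUS_INTEGRATION", "GITHUB_ACTIONS",
                      "GITLAB_CI", "TRAVIS", "APPVEYOR", "JENKINS_URL"]

is_ci() = any(haskey(ENV, var) for var in CI_VARIABLES)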

As I understand both the code and the legal disclaimers (the often-linked julialang.org/legal/data/), Julia is trying to be better by not sending what you clicked on in a different tab. By not tracking everything you do by checking your clipboard like some apps do. By not following your day-to-day activities through social media blurbs and widgets. Putting those practices on the same level as providing a service (hosting packages) and finding out how this service (and nothing more!) is used is appalling, to say the least.

Julia is trying to exist as an open source entity, which means getting funding. That funding has to come from somewhere and has to be justified (there really aren’t that many philanthropists in the tech scene, be real). Evidently, and as can be seen in a lot of other open source projects, community contributions don’t come close to covering hosting, developer time, and conventions. I for one can’t contribute directly because I’m a student and simply can’t afford to do so.

Every person outside of this community I’ve talked to and asked about this had the same reaction: “Wow, that’s it? Feels fine”. I really doubt that this minimal telemetry here will hurt adoption.

5 Likes

Somehow I don’t see this as a real threat. The usage statistics are not likely to be the only criterion in deciding which packages get support. There are publications, stars on github, …

1 Like

The big problem is disambiguating active users from just random downloads. We have this problem for Julia itself. Julia downloads are up to 15 million or so, but we only have a very rough idea of what the actual number of active users is (“a few hundred thousand”) and even less so for packages. It’s a hard problem to solve, but the hypothesis is that looking at request patterns would separate out these groups fairly cleanly and give actual answers.

9 Likes

Computing transforms of server-side data that allow useful analyses but don’t keep UUIDs around is something that I’ve put a lot of thought into and would like to develop, but it does not seem like all people who object to client UUIDs would be satisfied by this.

Maybe since it is another promise about server-side data post-processing that the user needs to trust but cannot verify for himself.

On a lighter note: I’d be surprised and wary if someone (might even be a good friend) took my fingerprint in order to help me cook, largely because it feels unnecessary and out-of-place to invade my privacy in that way. No matter how much I trust that person or what explanations he has, it does leave a weird feeling that is not exactly beneficial.

3 Likes

Sorry, I don’t follow. When Pkg handles a request for a package to be installed, it is not a “random” request, is it?