Pkg.jl telemetry should be opt-in

But I think this has the opposite effect than you intend; when I see wild exaggeration I’m likely to dismiss a post out of hand. Conversely, a well-reasoned and plausible scenario which could directly lead to user harm and with low effort/complexity for the attacker would immediately catch my notice.

I’ll add that I’m completely confident this is true for other core contributors commenting on this thread who are certainly among the most skilled engineers I’ve ever had the pleasure of working with.

10 Likes

I think you have to agree that the following scenario is not entirely farfetched and outlandish: it’s time for little Billy’s coding camp, so he fires up his torrent client and downloads the latest hit movie. While it downloads, he signs into his coding camp, where he’s told to install packages xyz. Now some forensic investigator working for the MPAA can show that the same IP address accessed the torrent site and the Julia site at the same time, and that the UUID used with Julia proves the IP address was in use by little Billy at the time and not some other computer.

In the early 2000s multiple rounds of unfortunate parents lost their house over less than this.

This would only work if the proposal stored UUID and IP (and time) cross-linked, which we’ve gotten specific confirmation won’t happen.

It won’t happen intentionally on Julia’s side. Suppose I’m the MPAA and I am looking to make an example of some children and their families (ugh). I get cooperation from the FBI etc. to monitor the torrent site since it’s a major lawbreaker, so I have logs of IP addresses that persistently infringe… Then I go to the ISP that provides the service, get a subpoena to monitor users, and obtain a packet dump of the flows from the people who access the torrent site… I see that packets go to Julia at certain times, and then I subpoena Julia to provide UUID logs. I can link it all together, no problem. That’s exactly the kind of thing they did in the early 2000s.

Then I sue the mother, and I get a court order to dump the hard drive, and I find the UUID. I can now prove to the jury that this particular computer at this particular time was using this particular IP address and downloading these particular files…

It doesn’t even require julia to do anything “wrong” or to do the IP address linking.

Basically hidden UUIDs are kind of toxic, because they completely foil plausible deniability of much of anything.

1 Like

While it’s not untrue for me personally, I am a bit of a sucker for the use of hypotheticals in legal analysis and line-drawing arguments. I will say that I appreciate @dlakelan’s point of view, not because I think any of it is particularly likely, but because it comes from a slightly different direction than previous commentary, and because it clearly highlights one particular aspect (one needs to be careful about non-deniable long-term server logs).

All that said, I think it might be time for everyone to take a step back and breathe for a week or two. I think there’ve been a lot of good suggestions among the comments here, but I know that I personally have spent much of the past several days talking to people about this in various fora both public and private and have had very little time to actually take a step back and consider the issue as a whole. And I know others have shouldered far more of the interaction burden here, so I imagine the same is true for them. I think it’s fair to say that you have been heard and that the subject has been brightly illuminated from a plethora of perspectives. Let’s let the folks working on this take a few days away from all this, and then re-approach the topic with fresh eyes.

50 Likes

Thanks for that acknowledgement; it was effective in convincing me that someone involved understands the issue. I’m happy to step away.

4 Likes

Except that the argument about IP addresses being plausibly deniable hasn’t been true in decades: ISPs are required by law (both European and American) to keep logs of IP ↔ endpoint pairings for threat mitigation and law-enforcement purposes. Your whole scenario of subpoenaing Julia to establish a link between two endpoints is moot, since ISPs are required to keep logs for far longer than the UUID would (presumably) be saved, and past case law has shown time and again that courts happily accept ISP logs as damning evidence already. There’s no need to subpoena one of the endpoints when you’re already reading everything that comes through the pipe; it’s just extra hassle with no gain.

Just because copyright law and its associated institutions are broken to hell and back does not mean that making this telemetry available to Julia is as bad a thing as you make it out to be. You’re trying to apply a technical solution to a social problem, which never works out in practice.


Don’t get me wrong, I agree that UUIDs should rotate at least with the same frequency that old entries on the server get deleted. I do not agree with the inflammatory and hyperbolic reasoning.

6 Likes

Out of all the discussion in the last day or two, I think this is one of the best questions which has been asked. I don’t have a precise answer but I feel like you’re right and it should be possible to design more inherently privacy-preserving estimators for the quantities of interest. The main problem is the sheer amount of technical complexity this would add in designing and validating the right estimators, ensuring they’re robust against abuse and dealing with the compatibility between client and server versions in the future.

5 Likes

Sorry for the stupid question: would it help at all if we used a linear map to convert the UUIDs before saving them to the database? That way the UUIDs in the database and the UUIDs on people’s computers wouldn’t match. The server could still apply the linear map to match a UUID to its database counterpart and get accurate statistics.

2 Likes

That really just kicks the can down the road since now anyone who has the transformed UUIDs and the map can recover the original UUIDs, so in terms of security it would be window dressing. Same thing with encrypting UUIDs on the client side: if that’s done in a consistent way (which it needs to be in order to be useful), then the encrypted UUID becomes the equivalent of a UUID itself.
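To make that concrete, here’s a quick Python sketch (the constants are made up): if the “linear map” is an affine transform over the 128-bit UUID integers, anyone holding the map parameters inverts it exactly:

```python
import uuid

# Hypothetical affine map over 128-bit UUID integers: v = (a*u + b) mod 2^128.
# The multiplier a must be odd so it is invertible modulo a power of two.
M = 1 << 128
a = 6364136223846793005   # made-up odd constant
b = 1442695040888963407   # made-up constant

def transform(u: int) -> int:
    return (a * u + b) % M

def invert(v: int) -> int:
    # Anyone who has (a, b) can recover the original UUID.
    return ((v - b) * pow(a, -1, M)) % M

u = uuid.uuid4().int
assert invert(transform(u)) == u   # the "anonymization" is fully reversible
```

So the database would hold different bit patterns, but anyone with the map (or a subpoena for it) recovers the original UUIDs, which is why it’s only window dressing.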

11 Likes

Just going to throw out one concrete data analysis, the one that originally led me to reach out to R’s CRAN team, that would be possible using UUIDs and wouldn’t be possible without some kind of entity-level identifier: co-occurrence counts of (long-term) package installations. This isn’t as useful if it’s computed from short-term information in a single 10-minute session, since you’ll often just recover the dependency graph, but it would let you build a package recommendation system if the co-occurrences accrue over days/weeks. So this is compatible with @dlakelan’s request that UUIDs rotate after N months have passed.

In practice, I think this example is worth considering for two reasons: (1) it shows there’s valuable information in package installation data that the current proposal isn’t discussing and which an aggregate like HyperLogLog wouldn’t capture, and (2) it suggests that the space of interesting analyses often requires storing a lot of bits. In this case, for an ecosystem of K packages, you could either store a K-bit string for each UUID or store a 64-bit integer count for each of the O(K^2) pairs of packages.
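For the record, a toy Python sketch of what that co-occurrence analysis looks like (the IDs and package names are invented):

```python
from collections import Counter
from itertools import combinations

# Hypothetical long-term log: pseudonymous ID -> packages installed over weeks.
installs = {
    "id-a": {"DataFrames", "CSV", "Plots"},
    "id-b": {"DataFrames", "CSV"},
    "id-c": {"Plots", "Flux"},
}

cooccur = Counter()
for pkgs in installs.values():
    for pair in combinations(sorted(pkgs), 2):
        cooccur[pair] += 1

# ("CSV", "DataFrames") co-occurs for two distinct users, which is the kind
# of signal a recommender needs; without any per-entity identifier you could
# not tell two installs by one user from one install each by two users.
```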

4 Likes

From a technical standpoint, alongside the HyperLogLog estimator currently proposed, you could choose a hash of the UUID with, say, a 50% probability of collisions, so that the logs retain only a deniable record, while you can still analyze the usage pattern behind a given hash; you’d just be analyzing a mixture of around 2 people. Even with an 80% chance of collision, you’d be able to analyze groups of around 5 people. Is it so important to know that this one person installed packages xyz, rather than to say that in a pool of 5 people we see the following patterns of installation?
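A rough Python sketch of the idea (the numbers are illustrative): truncate the hash into a deliberately small bucket space, so each stored value mixes several users:

```python
import hashlib
from collections import Counter

def lossy_bucket(uid: str, n_buckets: int) -> int:
    """Map an ID into a small bucket space on purpose: collisions give
    each stored value plausible deniability."""
    h = hashlib.sha256(uid.encode()).digest()
    return int.from_bytes(h[:8], "big") % n_buckets

# 1000 simulated users into 200 buckets -> ~5 users per bucket on average,
# so per-bucket usage patterns describe a small group, not an individual.
counts = Counter(lossy_bucket(f"user-{i}", 200) for i in range(1000))
avg_group = sum(counts.values()) / len(counts)
```

The trade-off is that all per-bucket statistics are now about small mixtures of people rather than individuals, which is exactly the point.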

3 Likes

Would this still work if the rule were not to re-generate every N months exactly, but to re-generate with some small probability p per day, such that the average lifetime is N months? That way there is also no guarantee that your ID was the same yesterday.
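For what it’s worth, a per-day rotation probability gives a geometrically distributed lifetime with mean 1/p days. A small Python simulation (p chosen here for a roughly 6-month average; the numbers are illustrative):

```python
import random

def lifetime_days(p: float, rng: random.Random) -> int:
    """Days until the ID rotates, when each day it regenerates with
    probability p (geometric distribution, mean 1/p days)."""
    days = 1
    while rng.random() >= p:
        days += 1
    return days

rng = random.Random(42)
p = 1 / 180  # target: ~180-day average lifetime
mean = sum(lifetime_days(p, rng) for _ in range(100_000)) / 100_000
# mean comes out near 180 days, but any individual ID may rotate much
# sooner, so nobody can assert your ID was the same on any given past day.
```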

2 Likes

This actually was my first suggestion WAY back up the thread.

Another useful idea is to have the data-collection server generate a random salt with a small probability each day, then encrypt the UUID that’s sent and store that. So the actual UUID is NEVER stored on the server… but everything is still unique, and no one knows the mapping; we only know that over a given time window the mapping changes.

@StefanKarpinski wdyt?

edit: thinking about it more, if the database server regenerates a salt using /dev/urandom with probability 1/100 each day, stores it only in RAM, and whenever it needs to store an identifier it simply calculates hash(ID concatenated with salt) and stores that… I’d probably have no objections.

2 Likes

Oh, I missed that, here, sorry. But note a slightly different randomisation: if the expiry date is saved, then your backups from a month ago contain a little more information.

If randomising on the server, is the same salt used for all users, and updated on some days? Then I suppose the intervals between randomisations would still be obvious in its logs. Or is it done per-ID, so that changes of salt look just like users leaving & joining?

Exactly: you’re storing a one-way encrypted version of all the UUIDs, and the encryption key changes with, say, 1% probability per day.

For example, you could calculate sha_256(ID * salt) and take the first 48 bits: it’s not reversible, it has a small probability of collisions, but for usage stats it’s almost as useful as the UUID itself, and yet no one could know what the mapping was.
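In Python terms (taking the `*` above as Julia’s string concatenation), a sketch of that scheme with a made-up salt:

```python
import hashlib
import os

salt = os.urandom(16)  # in the proposal: regenerated server-side with small
                       # daily probability and kept only in RAM

def pseudonym(client_uuid: str, salt: bytes) -> bytes:
    """First 48 bits of SHA-256(UUID || salt): stable while the salt lives,
    not practically reversible, with a small chance of collisions."""
    return hashlib.sha256(client_uuid.encode() + salt).digest()[:6]

a = pseudonym("123e4567-e89b-12d3-a456-426614174000", salt)
b = pseudonym("123e4567-e89b-12d3-a456-426614174000", salt)
# Same UUID + same salt -> same 48-bit token, so usage stats still work;
# once the salt rotates, old tokens can no longer be linked to new ones.
```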

1 Like

How would the statistics look in this case if I install some packages today and then next time after one month? Would the outcome be one user or two users? My understanding of the explanation was that every Pkg request would get an almost-unique ID, and thus counting the number of users wouldn’t be trivial anymore.

Rather than thinking about this as a fixed time window, think about it as analyzing fixed hashed IDs… You can be pretty sure that a given hash is not a large group of people, so you go through your database, select a random set of hashes, and calculate what those people are doing… Now you have all the info about packages used together, the variety of projects individual users contribute to, or whatever else requires being able to discern an individual.

If you just want to know how big the Julia community is, the HyperLogLog estimator they’re using gives you that information.

SHA-256 is sensitive to small changes in the input: changing the input by even a single bit creates large changes in the output, making the stored UUID completely useless, since you’re now saving effectively random data. You lose all the potential analysis you’d want to do.
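That avalanche property is easy to check (Python sketch; the input strings are just examples):

```python
import hashlib

def differing_bits(x: bytes, y: bytes) -> int:
    return sum(bin(a ^ b).count("1") for a, b in zip(x, y))

u1 = b"123e4567-e89b-12d3-a456-426614174000"
u2 = b"123e4567-e89b-12d3-a456-426614174001"  # last character changed

diff = differing_bits(hashlib.sha256(u1).digest(), hashlib.sha256(u2).digest())
# diff lands around 128 of the 256 output bits: nearby inputs give
# unrelated digests, so there is no locality in the hashed values.
```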

Having a locality-sensitive hash doesn’t help either, since similar inputs produce similar outputs and you retain the associability of the UUID. You’re back to having a UUID, just with some constant computational cost added on the server side.

Computing transforms of server-side data that allow useful analyses but don’t keep UUIDs around is something that I’ve put a lot of thought into and would like to develop, but it does not seem like all people who object to client UUIDs would be satisfied by this.

10 Likes