Pkg.jl telemetry should be opt-in

Random as in various automated processes that don’t set CI flags, Docker build scripts, misconfigured university clusters that install Julia from scratch once an hour over 10000 nodes, etc. It’s really hard to know which requests are from actual users, because they are absolutely drowned out by automated processes. From experience, the server-side request data is not super convincing in establishing user numbers with any sort of accuracy. Ironically, the most accurate numbers I ever heard were from a big tech company that just looked at their existing data and had a fairly convincing estimate based on that, which was much better than anything we had.

9 Likes

Okay, I begin to see the problem. Thanks.

Even better! If you’re only storing 48 out of 64 bits of the output hash, we have 1e24 different salts that result in the same hash if combined with our known UUID. Now we’re down to 2^102 inputs that result in a unique output, that’s a significant reduction in complexity.

According to this, we can calculate a sha256 hash of 567 bytes (longer than we need) in ~2 cycles on a 2019 AMD EPYC 7702. At 3.2 GHz for that CPU, we get 1.6e9 hashes per second. Now sure, there aren’t a million AMD EPYC CPUs out there, but that speed is achieved because it’s done in hardware. Putting that circuit on ASICs more than a million times certainly seems feasible for the nation-state attackers you’ve brought up so often, which makes this attack that much more feasible.
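For reference, the arithmetic behind those figures can be spelled out in a few lines of Python. This is only a back-of-the-envelope sketch: the clock speed and the "~2 cycles" per hash are taken at face value from the post, the one million parallel units is the hypothetical from the post, and `years_to_enumerate` is a helper name made up for this example. It only covers exhaustive enumeration of a search space of a given size, nothing cleverer.

```python
# Figures quoted in the post, taken at face value.
CLOCK_HZ = 3.2e9            # 3.2 GHz
CYCLES_PER_HASH = 2         # "~2 cycles" per hash
UNITS = 1_000_000           # hypothetical: one million parallel hashing units
SECONDS_PER_YEAR = 365.25 * 24 * 3600

hashes_per_second = CLOCK_HZ / CYCLES_PER_HASH   # per unit
total_rate = hashes_per_second * UNITS           # across all units


def years_to_enumerate(bits: int) -> float:
    """Years needed to try every value of a `bits`-bit search space."""
    return 2**bits / total_rate / SECONDS_PER_YEAR


print(f"per-unit rate: {hashes_per_second:.2e} hashes/s")
print(f"2^128 space:   {years_to_enumerate(128):.2e} years")
print(f"2^102 space:   {years_to_enumerate(102):.2e} years")
```

Swapping in a different per-hash cost or unit count only shifts the result by the same factor, so the constants at the top are the part worth arguing about.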

This is actually a good point. Instead of doing it on the server side, I don’t see a problem with doing it on the client side…

The client generates a UUID which is permanent, and a 128 bit salt which has some percent chance of changing each day… It calculates sha256(salt * UUID * salt) and takes the first 48 bits… it sends this.

We can get “pseudo-opt-in” by making the probability of changing the salt, say, 0.25 per day, so by default Julia gets only around 4 days of history. If you “fully opt in”, you can set your own probability of changing per day to whatever you want (maybe express this as the expected lifetime). If you opt out, the salt changes every time you interact with Pkg and is not stored anywhere.
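To make the proposal concrete, here is a minimal Python sketch of the client side. It is an illustration under assumptions, not anything Pkg implements: the class name `RotatingId`, the `new_day` hook and the `p_rotate` parameter are invented for this example, and `salt * UUID * salt` is read as byte concatenation (which is what `*` does for strings in Julia).

```python
import hashlib
import random
import secrets
import uuid


class RotatingId:
    """Hypothetical client-side ID: permanent UUID plus a salt that rotates at random."""

    def __init__(self, p_rotate: float = 0.25, rng=None):
        self.client_uuid = uuid.uuid4().bytes    # permanent, never sent as-is
        self.salt = secrets.token_bytes(16)      # 128-bit salt
        self.p_rotate = p_rotate                 # chance per day of picking a new salt
        self.rng = rng if rng is not None else random.Random()

    def new_day(self) -> None:
        # Called once per day; rotates the salt with probability p_rotate.
        if self.rng.random() < self.p_rotate:
            self.salt = secrets.token_bytes(16)

    def token(self) -> str:
        # sha256(salt || UUID || salt), truncated to the first 48 bits (6 bytes).
        digest = hashlib.sha256(self.salt + self.client_uuid + self.salt).digest()
        return digest[:6].hex()


client = RotatingId(p_rotate=0.25)
print(client.token())   # the only value that would leave the machine
```

With rotation probability `p` per day, the expected lifetime of a salt is `1/p` days, so `p_rotate=0.25` gives about 4 days. Setting `p_rotate=0.0` never rotates, and `p_rotate=1.0` rotates on every call to `new_day`; a full opt-out as described above would go further and draw a fresh salt per request without storing it.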

Yes, and now we can say with certainty that there’s a 1/1e24 chance that you’re this person. Totally incriminating.

It’s like saying you rolled a die and it didn’t explode and therefore you must be convicted of the crime.

Better yet we can just send the first bit, and then half of the world is guilty all the time!

If you’re doing it client side and don’t synchronize salts between users, you lose the ability to distinguish data serverside since the server only receives random junk it can’t realistically correlate across your chosen window of variability. You’re effectively back to sending random new “UUIDs” every request/day.

You seem to be missing the point like there’s no tomorrow. Here is the desirable goal set:

  1. A person with access to a client machine cannot prove that entries in Julia’s log correspond to this machine for more than a few days at a time, with the duration settable by the client (a kind of “continuous opt in”).

  2. Julia can assume that a given ID corresponds to 1 person most of the time.

This is exactly what the class of proposal I’m discussing does.

No, only the handful of connections that actually meet that hash on any given day in the database. You’re not exhausting the hash space on the serverside here.

Besides, this is totally beside the point already. The telemetry is sent via TLS, any attacker willing enough to identify you is either not going to care about you connecting to julia (you already connected to the torrenting service and the ISP has logs of that, remember?) or they have better means of achieving their goal (Hey, random certificate authority, mind giving us the private keys to your TLS certificates? Thanks!).

This is a technical solution to a social problem. You can’t win on the grounds of technology here.

In any given situation, there will be 1e24 different salts that can produce that hash from your UUID… the fact that you were able to find one is not evidence that it was the one in use and therefore that your chosen salt was correct and therefore that you’ve proven my UUID corresponds to that hash.
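The 1e24 figure is just 2^(128-48) = 2^80 ≈ 1.2e24: a 128-bit salt mapped onto a 48-bit output leaves about that many salts per output value on average. The same effect can be seen at a toy scale in Python. This is a scaled-down sketch under assumptions: a 16-bit salt and an 8-bit truncated hash stand in for the real sizes so the whole salt space can be enumerated, and the helper names and the fixed example UUID are made up for the demonstration.

```python
import hashlib


def truncated_hash(salt: bytes, client_uuid: bytes, out_bytes: int) -> bytes:
    # sha256(salt || UUID || salt), keeping only the first out_bytes bytes.
    return hashlib.sha256(salt + client_uuid + salt).digest()[:out_bytes]


def matching_salts(client_uuid: bytes, target: bytes, salt_bytes: int, out_bytes: int):
    # Enumerate every possible salt and keep those that reproduce the target.
    return [
        s
        for s in range(256**salt_bytes)
        if truncated_hash(s.to_bytes(salt_bytes, "big"), client_uuid, out_bytes) == target
    ]


client_uuid = bytes(range(16))            # fixed stand-in for a known UUID
true_salt = (12345).to_bytes(2, "big")    # toy 16-bit salt actually "in use"
target = truncated_hash(true_salt, client_uuid, out_bytes=1)   # toy 8-bit output

candidates = matching_salts(client_uuid, target, salt_bytes=2, out_bytes=1)
print(len(candidates))        # roughly 2**(16 - 8) = 256 salts fit the same output
print(12345 in candidates)    # the real salt is just one of them

full_scale = 2 ** (128 - 48)  # same ratio at the sizes discussed in the thread
print(f"{full_scale:.2e}")
```

Finding one salt in `candidates` says nothing about which one was actually used, which is the point being made above.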

You’re missing the point. Please think this through a little.

No, I very much do get your point. I just don’t think it’s the correct point to make here.

  1. This assumes that finding out about your connection to julia is worse than having literal access to the data on your PC and the data on PkgServer. As the owner of that client machine, you have already lost at the point that your data is exposed.

  2. Julia can already assume this without complicating its UUID scheme, simply by telling the user “Hey, opt out of the UUID, HyperLogLog and statistics take it from here, mkay?” if fuzziness and blending into the crowd is what you want.

I’ve made my technical point, and already promised to step away from the argument about the goals and so forth (Pkg.jl telemetry should be opt-in - #338 by dlakelan), so I’m going to reply on technical merit alone. The point is that my proposal achieves the technical goal of not having a direct connection between data on the client machine and data on Julia’s servers. This is something that I value, and it’s something that a plethora of people keep clicking the little heart on, and that I’ve gotten private messages supporting.

Thanks for the productive discussion and suggestions, everyone. We’ve tossed this around a bit among the Julia committers group, and I think it’s clear that all this requires some additional design work, so here’s how we’re going to proceed:

  1. For 1.5, we will remove the UUID and all other stateful data from Pkg requests. Since there won’t be any tracking, that also obsoletes the opt-in/opt-out question for 1.5, since nothing is sent that allows profiling the user in any way. I have made a pull request that implements this change.

  2. A number of people have pointed out that “telemetry” is a misleading term, because it evokes the kind of all-out behavioral tracking that some of the big tech companies do, which this isn’t. (Technically, telemetry is when you collect data offline and send it later, which doesn’t happen here.) Accordingly, we will be renaming the few remaining non-user-specific headers to “request metadata”. As mentioned above, no data will be sent that can be used to track or profile anyone.

  3. Several alternative estimators have been proposed, but one big problem is that we have no way to determine whether or not such estimators would be effective, or to evaluate any privacy-precision tradeoffs. To help with this, we will keep the opt-out UUIDs on master for now. This gives us more time to evaluate appropriate trade-offs, look at some real data and experiment with alternative approaches, without the time pressure of needing it done in 1.5. With this data, we will be able to simulate several non-UUID based schemes that allow us to estimate population-level usage patterns with real request data and see what the impact on the precision of the final analysis would be. We can’t say yet what the final solution will look like, but there have been several good proposals that we want to look at.

Finally, I want to thank people for their level-headed responses and suggestions. While there was some frustration at times, I think by and large people made helpful suggestions along several different axes and we’ll try to incorporate them as we go along.

107 Likes

This page states what is sent in the header, but it doesn’t say what is stored in your database. The package server has access to additional information, such as IP address, time stamp, list of packages, etc. Which of these are stored?

At the minimal end it just stores “I have seen this UUID at some point”. At the other end, each time the package server is contacted creates a record that is stored. Presumably without the IP address, but if there is a time stamp, it can be compared to other server logs that are presumably kept to fight abuse and attacks. And if sets of package UUIDs are stored, then this can also be used to (approximately) track people when they change their UUID.
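To illustrate the last concern, here is a small Python sketch of how stored package sets alone could link an old identifier to a new one. Everything in it is hypothetical: the record layout, the IDs, the 0.8 threshold and the function names are invented for the example, package names stand in for package UUIDs for readability, and nothing here describes what the package server actually stores.

```python
def jaccard(a, b) -> float:
    """Overlap between two package sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_candidates(old_records, new_records, threshold=0.8):
    """Pair up old and new IDs whose package sets look alike."""
    links = []
    for old_id, old_pkgs in old_records.items():
        for new_id, new_pkgs in new_records.items():
            score = jaccard(old_pkgs, new_pkgs)
            if score >= threshold:
                links.append((old_id, new_id, score))
    return links


# Hypothetical stored records: identifier -> set of requested packages.
before = {
    "id-A": {"DataFrames", "Plots", "CSV", "Flux", "Zygote"},
    "id-B": {"JuMP", "Ipopt"},
}
after = {
    "id-X": {"DataFrames", "Plots", "CSV", "Flux", "Zygote", "Makie"},
    "id-Y": {"DataFrames", "CSV"},
}

for old_id, new_id, score in link_candidates(before, after):
    print(old_id, "->", new_id, round(score, 2))
```

An unusual enough combination of packages acts as a fingerprint of its own, which is why the question of what gets stored matters as much as what gets sent.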

1 Like

I am not sure what “your database” means here. It seems like you are suggesting that Stefan is creating this database for his own use, and is going to engage in the kinds of analysis you suggest. Presumably, that is not what you meant, and your use was casual. I am asking to ensure there is no unfortunate misunderstanding.

Also, @StefanKarpinski’s message that you reply to clearly mentions not using the UUID for Julia 1.5. I didn’t understand your scenario and how anyone can be tracked when there are no unique identifiers.

-viral

2 Likes

He was actually replying to a message much higher up in the thread…

That probably explains my confusion! I didn’t realize you can click and see the message to which someone is replying.

-viral

3 Likes

Apologies; I removed one “your” from my draft, but missed this one. I meant “the Julia package server database”. I did not mean to attribute any database or technology to a person, rather to the community server and service.

Also, I replied to the wrong message. (That’s not helpful in a long discussion such as this one. I had to log in before posting, and probably ended up at a different message afterwards.) I intended to reply to a message pointing to https://julialang.org/legal/data/, and wanted to point out what I think is unclear in the description there: in addition to describing what is transmitted, it should also describe what is stored.

-erik

4 Likes

Thank you. Also, on re-reading, my message does sound a bit snarky and please accept my apologies as well. I wanted to highlight the fact that the data collected is a community resource, mainly to be used for making Julia and the package ecosystem better for all of us. Thus, while the concerns raised are valid - we are actively making trade-offs as a community about what we will let ourselves collect and analyze.

Yes, that is quite a reasonable expectation.

-viral

9 Likes

Just because they won’t store it doesn’t mean an advanced persistent threat won’t store both the UUID and the IP address together as it intercepts all data.

Perhaps the transmitted UUID and metadata should be encrypted as it is transferred over the internet, instead of using plain text, to prevent data collection from packet sniffing and so on.

And I mean encrypting all Pkg server traffic over the internet as a whole, not just the UUID. That way you can separate the IP and the UUID when it is decrypted and somewhat prevent lazy attacks listening in on plain text requests for Pkg data.

I was under the impression from Pkg.jl#1377 that

Both protocols work over HTTPS, using only GET and HEAD requests

5 Likes