Pkg.jl telemetry should be opt-in

You seem to be missing the point like there’s no tomorrow. Here is the desirable goal set:

  1. A person with access to a client machine cannot prove that entries in julia’s log correspond to this machine for more than a few days at a time, with the duration settable by the client (a kind of “continuous opt in”).

  2. Julia can assume that a given ID corresponds to 1 person most of the time.

This is exactly what the class of proposal I’m discussing does.

No, only the handful of connections that actually meet that hash on any given day in the database. You’re not exhausting the hash space on the serverside here.

Besides, this is totally beside the point already. The telemetry is sent via TLS, any attacker willing enough to identify you is either not going to care about you connecting to julia (you already connected to the torrenting service and the ISP has logs of that, remember?) or they have better means of achieving their goal (Hey, random certificate authority, mind giving us the private keys to your TLS certificates? Thanks!).

This is a technical solution to a social problem. You can’t win on the grounds of technology here.

In any given situation, there will be 1e24 different salts that can produce that hash from your UUID… the fact that you were able to find one is not evidence that it was the one in use and therefore that your chosen salt was correct and therefore that you’ve proven my UUID corresponds to that hash.
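The many-salts argument above can be illustrated with a toy sketch. It assumes the server only ever sees a short tag derived from a secret salt and the UUID; the `tag` helper, the 12-bit tag length, the 16-bit salt space, and the example UUID are all hypothetical and far smaller than the numbers discussed in the thread. It is not the actual proposal, only a demonstration that finding one matching salt does not identify the salt that was used.

```python
import hashlib

def tag(uuid: str, salt: int, bits: int = 12) -> int:
    """Hypothetical short tag: the top `bits` bits of SHA-256(salt:uuid)."""
    digest = hashlib.sha256(f"{salt}:{uuid}".encode()).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - bits)

def salts_matching(uuid: str, target: int, salt_space: int, bits: int = 12) -> list:
    """Brute-force every salt in the (toy-sized) space that reproduces the tag."""
    return [s for s in range(salt_space) if tag(uuid, s, bits) == target]

# Hypothetical client state: a UUID and a secret salt.
uuid = "123e4567-e89b-12d3-a456-426614174000"
true_salt = 31337

# What the server would have logged under this toy scheme.
observed = tag(uuid, true_salt)

# An attacker holding the UUID searches the whole salt space.
candidates = salts_matching(uuid, observed, salt_space=1 << 16)

# The true salt is among the candidates, but so are others, so a match
# alone does not show which salt produced the logged tag.
print(len(candidates), true_salt in candidates)
```

Because the salt space here is much larger than the tag space, many salts collide on the same tag; the ratio between the two controls how much deniability the client gets.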

You’re missing the point. Please think this through a little.

No, I very much do get your point. I just don’t think it’s the correct point to make here.

  1. This assumes that finding out about your connection to julia is worse than having literal access to the data on your PC and the data on PkgServer. As the owner of that client machine, you have already lost at the point that your data is exposed.

  2. Julia can already assume this without having to deal with convoluting their UUID scheme by telling the user “Hey, opt out of the UUID, HyperLogLog and statistics take it from here, mkay?” if fuzziness and blending into the crowd is what you want.
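For readers unfamiliar with HyperLogLog, here is a minimal sketch of the idea, assuming the server hashes some per-client value and keeps only a small array of registers rather than the values themselves. The class name, the register count (`p=10`), and the `client-N` strings are illustrative assumptions; this is not what PkgServer does.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: estimates the number of distinct items seen."""

    def __init__(self, p: int = 10):
        self.p = p                    # number of index bits
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Take 64 bits of SHA-256 as the hash value.
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias correction for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw

hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"client-{i}")
first = hll.estimate()

# Repeat requests from the same clients do not change the registers.
for i in range(10_000):
    hll.add(f"client-{i}")
second = hll.estimate()

print(round(first), first == second)
```

With 1024 registers the typical relative error is a few percent, and the registers alone do not list which clients were seen, which is the "blending into the crowd" property mentioned above.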

I’ve made my technical point, and already promised to step away from the argument about the goals and so forth (Pkg.jl telemetry should be opt-in - #338 by dlakelan), so I’m going to reply on technical merit alone. The point is that my proposal achieves the technical goal of not having a direct connection between data on the client machine and data on julia’s servers. This is something that I value, and it’s something that a plethora of people keep clicking the little heart on, and that I’ve gotten private messages supporting.

Thanks for the productive discussion and suggestions, everyone. We’ve tossed this around a bit among the Julia committers group, and I think it’s clear that all this requires some additional design work, so here’s how we’re going to proceed:

  1. For 1.5, we will remove the UUID and all other stateful data from Pkg requests. Since there won’t be any tracking, that also obsoletes the opt-in/opt-out question for 1.5, since nothing is sent that allows profiling the user in any way. I have made a pull request that implements this change.

  2. A number of people have pointed out that “telemetry” is a misleading term, because it evokes the kind of all-out behavioral tracking that some of the big tech companies do, which this isn’t. (Technically, telemetry is when you collect data offline and send it later, which doesn’t happen here.) Accordingly, we will be renaming the few remaining non-user-specific headers to “request metadata”. As mentioned above, no data will be sent that can be used to track or profile anyone.

  3. Several alternative estimators have been proposed, but one big problem is that we have no way to determine whether or not such estimators would be effective, or to evaluate any privacy-precision tradeoffs. To help with this, we will keep the opt-out UUIDs on master for now. This gives us more time to evaluate appropriate trade-offs, look at some real data and experiment with alternative approaches, without the time pressure of needing it done in 1.5. With this data, we will be able to simulate several non-UUID based schemes that allow us to estimate population-level usage patterns with real request data and see what the impact on the precision of the final analysis would be. We can’t say yet what the final solution will look like, but there have been several good proposals that we want to look at.

Finally I want to thank people for their level-headed responses and suggestions. While there was some frustration at times, I think by and large people made helpful suggestions along several different axes and we’ll try to incorporate them as we go along.

107 Likes

This page states what is sent in the header, but it doesn’t say what is stored in your database. The package server has access to additional information, such as IP address, time stamp, list of packages, etc. Which of these are stored?

At the minimal end it just stores “I have seen this UUID at some point”. At the other end, each contact with the package server creates a record that is stored. Presumably without the IP address, but if there is a time stamp, it can be compared to other server logs that are presumably kept to fight abuse and attacks. And if sets of package UUIDs are stored, then this can also be used to (approximately) track people when they change their UUID.

1 Like

I am not sure what “your database” means here. It seems like you are suggesting that Stefan is creating this database for his own use, and is going to engage in the kinds of analysis you suggest. Presumably, that is not what you meant, and your use was casual. I am asking to ensure there is no unfortunate misunderstanding.

Also, @StefanKarpinski’s message that you reply to clearly mentions not using UUID for Julia 1.5. I didn’t understand your scenario and how anyone can be tracked when there are no unique identifiers.

-viral

2 Likes

He was actually replying to a message much higher up in the thread…

That probably explains my confusion! I didn’t realize you can click and see the message to which someone is replying.

-viral

3 Likes

Apologies; I removed one “your” from my draft, but missed this one. I meant “the Julia package server database”. I did not mean to attribute any database or technology to a person, rather to the community server and service.

Also, I replied to the wrong message. (That’s not helpful in a long discussion such as this one. I had to log in before posting, and probably ended up at a different message afterwards.) I intended to reply to a message pointing to https://julialang.org/legal/data/, and wanted to point out what I think is unclear in the description there: in addition to describing what is transmitted, it should also describe what is stored.

-erik

4 Likes

Thank you. Also, on re-reading, my message does sound a bit snarky and please accept my apologies as well. I wanted to highlight the fact that the data collected is a community resource, mainly to be used for making Julia and the package ecosystem better for all of us. Thus, while the concerns raised are valid - we are actively making trade-offs as a community about what we will let ourselves collect and analyze.

Yes, that is quite a reasonable expectation.

-viral

9 Likes

Just because they won’t store it doesn’t mean an advanced persistent threat won’t store both the UUID and the IP address together as it intercepts all data.

Perhaps the transmitted UUID and metadata should be encrypted as it is transferred over the internet, instead of using plain text, to prevent data collection from packet sniffing and so on.

And I mean encrypting all Pkg server traffic over the internet as a whole, not just the UUID. That way you can separate the IP and the UUID when it is decrypted and somewhat prevent lazy attacks listening in on plain text requests for Pkg data.

I was under the impression from Pkg.jl#1377 that:

“Both protocols work over HTTPS, using only GET and HEAD requests”

5 Likes

Do you just mean HTTPS?

1 Like

Yea, HTTPS would do it, I wasn’t aware of what kind of data transmission would be used for this. It really should be using that by default for all this.

Pkg refuses to use HTTP unless the host is local, even if you explicitly use an http:// URL as your package server value. This prevents people from accidentally leaving themselves open to snooping or MITM attacks.
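As an illustration of the kind of check described above, here is a minimal sketch. It is not Pkg's implementation: the function name `is_allowed_server` is hypothetical, and it assumes "local" means `localhost` or a loopback IP address, which may differ from what Pkg actually accepts.

```python
from urllib.parse import urlsplit
import ipaddress

def is_allowed_server(url: str) -> bool:
    """Allow HTTPS anywhere; allow plain HTTP only for a loopback host."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    if parts.scheme != "http":
        return False                      # reject any other scheme
    host = parts.hostname or ""
    if host == "localhost":
        return True
    try:
        # Accept literal loopback addresses such as 127.0.0.1 or ::1.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False                      # a non-local hostname over HTTP

print(is_allowed_server("https://pkg.julialang.org"))   # True
print(is_allowed_server("http://127.0.0.1:8000"))       # True
print(is_allowed_server("http://pkg.example.org"))      # False
```

Rejecting non-local HTTP outright, rather than warning, means a mistyped `http://` server URL fails immediately instead of silently sending requests in plain text.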

34 Likes

I think that this is uncalled for.

4 Likes

The short TL;DR is that no UUIDs are sent in Julia 1.5. Julia 1.5 sends less information and is more protective of the information than Python is, and said information is only sent if and when you download packages from a package server. The package server is easily changed. The slightly longer version is the marked solution in this thread.

To be abundantly clear:

  • The data collected is not owned or managed by MIT, the Julia Lab, nor Julia Computing. It’s a community resource.
  • IP addresses are sent because they’re needed to send the packages. It’s kinda how the internet works.
  • IP addresses are only stored to help identify abuse and DDoS attacks (intentional or not) and thus are purged on a regular basis.
  • This data is not for targeting ads or emails.
  • If your sensitive research topic can be revealed through open source package usage, you may want to re-evaluate your security model.

If you don’t care about this issue, please don’t spread FUD about it.

22 Likes

Thanks for clarifying! Not trying to spread FUD, but having worked at universities, in security-like roles, and in general industrial environments, those are the things that come to mind.

I’ll remove my posts. It’s good to know how you all decided to handle the situation and have really thought about protecting the end users. Most companies would not have gone as far as this.

Would it be worth making some kind of official post stating the resolution of this thread because it’s huge and full of twists and turns?