Pkg.jl telemetry should be opt-in

Both of which are required to even connect to any service for downloading packages, so this is hardly unique to PkgServer disclosing what else they collect. Preventing this means not connecting at all.

Ok, so the attack vector here is someone getting a hold of Pkg telemetry, correlating that data with something else to find you/your service and then launching a dedicated attack against your service because of the specific version that has been installed (never mind that the specific version of a package is not part of the telemetry)?

I’m sorry, but that’s outlandish and much too cumbersome for an imaginary attacker. If I want to attack your service, I don’t go and steal all telemetry data ever recorded, that’s much too high profile.

The specific package versions may not be part of the telemetry data, but they are necessarily part of the request you make to the package server and might linger (as was confirmed for the IP address at least) in the server logs.

I was giving a worst-case scenario to someone who asked. What is “too cumbersome” for an attacker always depends on how high-value a target we are talking about.

2 Likes

As far as I know, all traffic to PkgServer is already encrypted, so drive-by MITM attacks won’t know what you use. Attacking PkgServer just to find out which clients at some point or another used some version of some package is a great risk for an attacker. If they’re willing to take that risk and jeopardize their operation by being caught on Pkg’s side, you (as the service provider/extremely High-Value target) got much bigger problems already and should probably think twice about connecting to a public service at all.

If you weren’t the original target, but PkgServer was and by coincidence it comes to light that you were using outdated or vulnerable packages, that’s hardly the fault of PkgServer. Sure, the leak could have been prevented, but that doesn’t change the fact that running vulnerable software in and of itself is negligent. Relying on obscurity of that fact as your security model is bound to go wrong, sooner or later.

Right, at which point the minimal telemetry that is sent to the PkgServer is much less of a target than your actual service. There simply isn’t much to gain here, since an attacker can, just by observing the IP addresses in the traffic, correlate that you’re using julia anyway.

Don’t get me wrong, I’m all for taking control of one’s privacy, data and traffic, but if your attack vector is leaking metadata through a free, open source, openly discussed service you really have much bigger fish to fry, so to speak. You’re also probably in the vast, vast minority of people using that service and the good that this truly minimal amount of data can do for julia as a whole far outweighs the benefits for a small group of people that are going to check the settings and options anyway.

4 Likes

As far as I know, all traffic to PkgServer is already encrypted, so
drive-by MITM attacks won’t know what you use.

I presumed this was about the potential danger of the collected data in case it
has been compromised already. Without auditing the server infrastructure
not much can be said about how difficult it is for an attacker to obtain the
data. Usually such data gets leaked by accident and is then available for free.

you (as the service provider/extremely High-Value target) got much
bigger problems already and should probably think twice about connecting
to a public service at all.
[…] you really have much bigger fish to fry, so to speak.

A problem does not get smaller by comparing it to an even bigger problem, that
kind of misses the point of discussion. To give an example: our world has far
bigger problems than a data-privacy issue in the package manager of some
programming language, which does not mean we shouldn’t be allowed to discuss it
anyway.

Relying on obscurity of that fact as your security model is bound to go
wrong, sooner or later.

I definitely agree with you that there is no sustainable security by obscurity.
People might be running outdated versions for various reasons (mainly stability
and maintenance) and vulnerabilities might even be introduced in newer
versions. Solving software security is a different and much bigger issue, which
again does not prevent one from taking measures to reduce the risks for people
affected.

[…] far outweighs the benefits for a small group of people that are going
to check the settings and options anyway.

It is not about those who opt-out but about those who don’t.
What outweighs what is very difficult to claim without data and especially in
the field of privacy.

3 Likes

My whole argument is that you’re not giving up more of your privacy by having that telemetry compared to not having it. Not having it leaves julia development blind to how their work is used, while you’re still giving that information to GitHub/Microsoft, just by virtue of connecting to them to download a specific package. GitHub doesn’t easily share that data with julia core developers at large, so where’s the harm to you, an end user, in giving that data to the people producing the thing you want to use anyway?

To me, the choice here isn’t about whether or not this telemetry is collected, since it already is (albeit not by julia). Since it’s already being collected and we can’t really do much about this (next to using PkgServer), why not give that very minimal amount of information to the people helping us here? I understand that this can be seen as an opportunity to get away from evil Microsoft and not share anything with anyone, but that’s just not a viable strategy for hosting PkgServer itself, since hosting is very expensive.

3 Likes

[…] while you’re still giving that information to GitHub/Microsoft, just by
virtue of connecting to them to download a specific package.

One of the stated goals of redirecting downloads to the package server is to
become more independent of GitHub. Therefore it is probably not correct to
assume that GitHub will be involved in every single transaction.

so where’s the harm to you, an end user, in giving that data to the people
producing the thing you want to use anyway?

There is no immediate harm being done by giving away data in itself. It is
about the potential misuse of amassed data if it gets compromised and whether
it is a good idea to accumulate it for a non-essential reason in the first
place.

3 Likes

Thanks for clarifying. This is a valid general concern, but not specific to Pkg: it applies to all sites that log IP addresses (which is a pervasive practice). I wonder if

  1. keeping IP logs separate from the telemetry,
  2. with restricted persistence (compatible with basic security)

would address your concerns.

I am not sure this is something users should be concerned about. In case Julia is used on a server which accepts all kinds of requests, it is not much more difficult to just try out all known vulnerabilities, and/or fingerprint the OS.

There is no denying that when information is collected, it increases the chance that it will be used for malicious purposes. You are also right that not all attacks can be foreseen, especially those that result from collating with data generated after the original protocol was envisioned. That said, from a practical perspective, the marginal increase in potential security risks may be small for the average Julia user.

However, we should also consider users who are operating in a high-security environment who have different needs. From this forum, it is my understanding that in such contexts, computers are not online (firewalled) and installing software (including packages) requires pre-approval, so telemetry requests will simply not go through and no data will be collected.

4 Likes

No, the worst case is that people who have access to the disaggregated data can figure out that UUID x belongs to you (based on your git commits or whatever), and that you spent an inordinate amount of time doing julia programming with a particular high-profile politician from an IP address that happens to be assigned to a particular potentially scandalous location (a strip club, a nonprofit foundation devoted to electing vehemently racist politicians, a meeting room where KKK members congregate, private religious services for Satan worshipers, whatever; just insert whatever would be the most damaging thing you could imagine for yourself).

The point is that if the UUID is kept only on a machine that has no access to your IP address then it’s impossible to link your persistent identity to your location and the location of others.

It’s the persistent identity which tracks your install across all locations where your laptop happens to go that’s the issue. The IP should not be collected on the same machine where the UUID is collected.
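To make the difference concrete, here is a minimal sketch of the two logging designs. It is not how PkgServer stores anything; the record layout, the function names `joined_log` and `separated_logs`, and all UUID and IP values are made up for illustration. The joined store can tell you every address a given UUID was seen at, while the separated stores only yield independent counts.

```python
from collections import defaultdict

# Hypothetical request records: (client_uuid, ip_address). All values are invented.
requests = [
    ("uuid-A", "198.51.100.7"),
    ("uuid-A", "203.0.113.42"),
    ("uuid-B", "198.51.100.7"),
]

def joined_log(records):
    """One store keeps UUID and IP together: each UUID maps to every IP it used."""
    seen = defaultdict(set)
    for uuid, ip in records:
        seen[uuid].add(ip)
    return dict(seen)

def separated_logs(records):
    """Two stores with no shared key: per-UUID counts and per-IP counts only."""
    uuid_counts = defaultdict(int)
    ip_counts = defaultdict(int)
    for uuid, ip in records:
        uuid_counts[uuid] += 1
        ip_counts[ip] += 1
    return dict(uuid_counts), dict(ip_counts)

linked = joined_log(requests)                      # reveals that uuid-A moved between two addresses
uuid_counts, ip_counts = separated_logs(requests)  # no way to tell which UUID used which IP
```

Note that this sketch stores no timestamps. If both stores kept precise timestamps, the two logs could in principle be joined again, so the separation would have to cover that as well.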

5 Likes

I’m sorry, that was worded poorly on my part. What I meant was that right now, basically all packages are downloaded through GitHub, which already collects IPs and endpoints for quite the same purposes as PkgServer would.

Again, what misuse do you have in mind here that’s enabled by this telemetry should the data get compromised? I’m pretty sure just about every scenario is already possible and isn’t prevented by not having this telemetry…

Also, I fail to see how judging the size of the userbase to plan for how much bandwidth PkgServer is using and how it relates to which endpoints (CI, end users, developers…) is not essential? Further, protecting PkgServer against DOS to ensure availability is pretty much a must-have, since it’s a free, public service.

See above: Pkg.jl telemetry should be opt-in - #149 by dlakelan

Sites that collect your IP have no way of knowing who is using that IP… particularly as a machine travels around the world. But sites that collect an IP and a UUID do know when a particular machine moves around the world.

GitHub might collect IP address and user ID data in the same database/table, but if they are concerned about privacy they shouldn’t do that. Since Julia is concerned enough about privacy that it’s trying to do the right thing here… It’s important that they don’t collect UUID and IP address in a correlated way.

4 Likes

  1. keeping IP logs separate from the telemetry,
  2. with restricted persistence (compatible with basic security)

Yes, these are certainly valid ways to reduce the risks of misuse.

But on a related note: from a user point-of-view all of the data is still sent
to pkg.julialang.org and since one is in no position to validate how the data
is being separated or handled, the user needs to put a certain amount of trust
in that service.
A good way to design a trustworthy service is to only require the information that
is strictly needed for operation (GDPR calls this “essential information”). If
more data than this is sent without explicit consent then it seems natural that
people begin to raise concerns.

4 Likes

Then we need to rework it. We are not trying to trick users, we are trying to nag them. I should say, though, that the screenshot I posted appears in the area for notifications, where a lot of notifications appear and then disappear if you just ignore them. So it is not something like a modal dialog box that gives you the impression that you need to do anything or interact with it. But I take your point, we should probably add another half sentence that makes it more clear that one can just ignore the whole prompt without problems.

I have a couple other questions around the project hash, mostly just understanding what that is about. Is the general idea that you want to be able to reverse engineer the Manifest.tomls for each individual user? Or at least those parts of the Manifest.toml that make up packages that are on a given package server? That is possible with the information that you are collecting, I think?

Or is that actually not really possible, because say I have two environments, I instantiate the first (and you’ll be able to reconstruct the content of my environment from the telemetry) and then I instantiate the second one, but now the package server will only get requests for any packages in that second env that haven’t already been downloaded as part of the instantiation of the first env. So will the “picture” of my second env that the package server gets be incomplete there?

So I guess right now I’m a bit confused what the goal of the whole project hash thing is. To understand what environments users are using?
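To illustrate the two-environment scenario, here is a minimal sketch. It assumes the client only requests packages that are not already downloaded, which is exactly the behavior being asked about and not something confirmed here. The package names and the helper `packages_requested` are illustrative only.

```python
# Hypothetical environments; the package names are only examples.
env1 = {"DataFrames", "Plots", "CSV"}
env2 = {"DataFrames", "CSV", "Flux"}

def packages_requested(env, cache):
    """Return the packages that still need downloading and record them in the cache."""
    missing = env - cache   # only these would reach the package server
    cache.update(missing)   # afterwards they are available locally
    return missing

cache = set()
first = packages_requested(env1, cache)   # the server sees all of env1
second = packages_requested(env2, cache)  # the server sees only what env1 did not already fetch
```

Under that assumption the server would see the full first environment but only the new packages of the second one, so its picture of the second environment would indeed be incomplete.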

7 Likes

Alright, so JuliaComputing wants analytics on their customers/users to get more funding/improve features - good for them. I also recall this being discussed on here before, and I don’t see it as a concern and I’m a pretty paranoid person. What are they honestly going to do with it - target more free open source projects to me? Sounds great…

Keep in mind Julia is open source - they aren’t doing something evil, and even still it is optional. Maybe say “Here’s some legal crap, but you’re a software person click here to see the telemetry code”. At least it’s transparent :). If it was something evil do you know how fast it’d be on HackerNews, or like… anywhere?

I say let’s all relax and remember, JuliaComputing is a really nice inclusive group pushing OSS to its limits. Let’s try to work with them to mitigate concerns, but not freak out. Trust me, your cellphone and laptop have multiple concurrent processes doing worse things than this as you’re reading this…

8 Likes

@StefanKarpinski please please please take my data and use it to make julia better! thank you!!

16 Likes

I think that this needs clarification to avoid noise in the conversation: it’s not Julia Computing who will get the data, but “a limited subset of core Julia developers” (there may be coincidences in the people, but legally this is different).

JC may enable other telemetry options in their products (Julia Pro, etc.), which have their own terms of service.

6 Likes

As far as I know, the only way to not send the telemetry is to either disable it manually in the settings or click the small x on the notification (which just hides it once, it will show up again). If you don’t notice the x or don’t know about the settings page, you have no choice but to accept the telemetry, since that notification doesn’t just vanish iirc. I’d personally prefer all choices (accept, deny once, deny always), including an explanation of what is sent.

I think the project hash is used for determining the number of different projects (see here):

This hash value uniquely identifies the path of the active project without revealing any information about that path. Having this value allows determining when packages are dependencies of the same project, as opposed to being used in different projects on the same client: if two requests have the same project hash value, they are used by the same project; if they have different project hashes, they are not.

the hash function is applied to the client UUID, the secret salt value, and the active project path.

The salt ensures that without the Server and the corresponding requests, you can’t reconstruct which packages are used in the same project just by observing that hash. You can’t realistically create a hash collision here.
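As a rough illustration of the quoted description, here is a minimal sketch of such a salted hash. The quoted text only says that the hash is applied to the client UUID, the secret salt and the active project path; the choice of SHA-256, the length-prefixed encoding, and the function name `project_hash` are my own assumptions, not Pkg’s actual implementation.

```python
import hashlib

def project_hash(client_uuid: str, salt: str, project_path: str) -> str:
    """Hash UUID, secret salt and project path together (SHA-256 is an assumed stand-in)."""
    h = hashlib.sha256()
    for part in (client_uuid, salt, project_path):
        data = part.encode("utf-8")
        # Length prefix so that ("ab", "c") and ("a", "bc") cannot produce the same input.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

# All values below are invented for the example.
same_a = project_hash("client-uuid", "secret-salt", "/home/user/projA")
same_b = project_hash("client-uuid", "secret-salt", "/home/user/projA")
other_project = project_hash("client-uuid", "secret-salt", "/home/user/projB")
other_salt = project_hash("client-uuid", "other-salt", "/home/user/projA")
```

Requests from the same project yield the same value, a different project yields a different one, and without the salt an observer cannot recompute the value from a guessed path.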

3 Likes

On the contrary, this only goes to prove that you don’t really care about getting reliable usage data. If the usage data also counts CI usage, then you are not actually counting human users, and the usage stats will be skewed towards CI pipelines.

This is another reason why the telemetry should be opt in, if you care about getting actual human user data.

Also, I’m going to have to consider switching away from Julia if this telemetry thing continues. You say it’s okay without first consulting the community. You should have held a public discussion before even making a decision on what to do.

So you decided it is okay to automatically collect data on users, where does it stop? In Julia 1.6 you’ll probably keep adding more telemetry. I’m a bit disgusted by all this, and doubt you would be getting reliable data in the first place with an opt-out approach, since the CI pipelines are constantly downloading fresh installs of Julia.

Terrible idea, and what abuse are you attempting to prevent? Are you the package police now?

1 Like

If 0% of users want to opt in, maybe that means it is a feature most people would rather not have. Another reason to not have it or make it opt in.

2 Likes

Let me guess, in Julia v1.6 or beyond, @StefanKarpinski will decide without discussion that he needs to collect crash data from all Julia users, and make it opt out as well.

This opt-out telemetry is a breach of trust for the Julia community, since it’s not clear where this will stop. I’m sure that as this is normalized, your investors will keep asking for more data, and you’ll want their money, so you weakly give in to their invasive demands.

This is kind of weak-minded, and shows Julia cannot stand up for what is right.

Funding should not be primarily based on usage stats, since those stats can be skewed anyways. What’s to stop someone from automating and gaming the system with a VPN and MAC address randomizer? Then they could get more funding for their package by creating fake usage stats.

1 Like

A piece of well-intentioned advice: I can see this topic matters to you, but, as I’ve seen happen in previous threads on other topics, the intensity of your rhetoric and the use of a tone that suggests a personal resentment weakens your message.

32 Likes