Pkg.jl telemetry should be opt-in

Hi,

I do not use Julia much, mainly because it has not been adopted in my circle of collaborators (though there are other reasons too), so weigh my opinions below accordingly:

  1. With telemetry being opt-out, Julia just went into the pile of things that I recommend, or even cajole people into using, only with the caveat that they first understand the terms and conditions. In my experience, things in that pile that are not also widely recommended by others are rarely considered seriously, let alone adopted.
  2. Consider this decision again, imagining that the Julia community were multiple orders of magnitude larger than it currently is. I believe that would color the threat of potential misuse very differently than it does now.
  3. It does not help that Pkg.jl comes bundled with Julia by default. There are other problems with this (which evidently only I have gripes with, for any language that does this), but what is relevant here is that it sets a precedent for what participating in the Julia “platform” as a whole comes with or endorses. Pkg.jl is unlikely to remain the only “blessed” package to need telemetry, and the way it does telemetry will inform how the rest do it. Any missteps it makes will be reproduced, with mutations, for better or for worse.

Lastly, I have a suggestion framed as a question. Would it be possible to have these statistics collected and processed locally rather than on a server, with only the anonymized aggregates being transmitted? This was already suggested by someone earlier in this thread, and to me it seems more acceptable. Why should it matter, down to the millisecond, when exactly a user wanted to download a package? In fact, when data is collected and analyzed locally (and anonymized before transmission), even more, and more “precise”, information may be available, e.g. whether a package was even used after being installed as a transitive dependency.
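To make the local-aggregation idea concrete, here is a minimal sketch. It is not how Pkg.jl works; it is written in Python purely for illustration, and the function name, the log format, and the sample entries are all hypothetical. It assumes install events are first recorded only on the user's machine and are reduced to coarse monthly counts before anything is sent. Coarsening like this removes exact timestamps, but it is not by itself a complete anonymization scheme.

```python
from collections import Counter
from datetime import datetime


def aggregate_locally(events):
    """Reduce raw local install events to coarse per-package counts.

    `events` is an iterable of (iso_timestamp, package_name) pairs that are
    stored only on the user's machine (a hypothetical log format). Exact
    times are dropped; only the month survives in the result.
    """
    counts = Counter()
    for timestamp, package in events:
        month = datetime.fromisoformat(timestamp).strftime("%Y-%m")
        counts[(month, package)] += 1
    return dict(counts)


# Made-up local log entries, for illustration only.
local_log = [
    ("2020-07-01T09:15:02.123", "Plots"),
    ("2020-07-01T09:15:02.456", "DataFrames"),
    ("2020-07-14T18:40:11.000", "Plots"),
]

# Only this aggregate would be transmitted, never the raw log.
report = aggregate_locally(local_log)
```

Under these assumptions, the server would learn only that one client installed a package some number of times in a given month, not the millisecond at which each request happened.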

PS: I am sad that this had to be my first post in this community, rather than something more useful or impactful, but I had to add my vote to this discussion, since some people are counting.

5 Likes

For most people, owning a phone, and accepting the tracking that comes with it, is not optional. It has become a necessary part of modern life, and these companies have hijacked this necessity for spying.

What’s distasteful about the opt-out tracking in Julia is that it’s also hijacking a lot of users who have started to rely on free software development practices.

It can be argued that it’s not as invasive as what the phone companies do, but it is still the same hijacking, albeit to a lesser degree.

2 Likes

You should look at the data that’s being collected. There is no way for a user to send less identifiable information over the internet.

6 Likes

In contrast, what is nice about Julia’s opt-out package telemetry (compared to social media, mobile phones, etc) is that you can just… opt out. Completely, and with little effort.

5 Likes

Opting out still requires additional knowledge and understanding. It adds an extra layer of friction for new users adopting the language, and for existing users too. Now I need to give everyone a disclaimer so that they are aware of Julia’s data collection. It’s not going to help with the adoption of the language.

Instead of just getting started with Julia, new users who wish to opt out now have to configure their TOML files properly before they even run Julia for the first time.

If Julia wants to be honest, then there should be NO dark-patterns to the data collection. Editing a TOML file for new users is a hidden dark-pattern.

Having a do not track environment variable would be better, as proposed above.
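For illustration, a check of that kind could look roughly like the sketch below. This is not Pkg.jl’s actual code, it is written in Python only to show the logic, and the variable name `EXAMPLE_DO_NOT_TRACK` is a hypothetical placeholder rather than a setting that Julia or Pkg.jl is known to read.

```python
import os

# Hypothetical variable name, used here only for illustration.
OPT_OUT_VARIABLE = "EXAMPLE_DO_NOT_TRACK"

# Values that are treated as "the user has opted out".
TRUTHY_VALUES = {"1", "true", "yes", "on"}


def telemetry_enabled(environ=None):
    """Return False when the opt-out variable is set to a truthy value."""
    env = os.environ if environ is None else environ
    value = env.get(OPT_OUT_VARIABLE, "")
    return value.strip().lower() not in TRUTHY_VALUES
```

The appeal of this design is that a user, or an administrator setting up a shared machine, can opt out once in a shell profile without knowing where any configuration file lives.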

2 Likes

With all due respect of course:

Of course it is. But what counts as simple is constrained by what one values. Running an experiment multiple times is more complicated, but I value the reliability of my results. Users’ consent and privacy should be more valuable than what informs this evaluation.

The proposed solution already has that information in a “database” that is not in my control. That freaks me out more, because there is no forgiveness if I don’t opt out at the very beginning.

Yes, but very little. I imagine that users who consistently object to all kinds of data collection on principle should find it trivial (since for pretty much everything else, even mitigating data collection without completely avoiding it requires a lot more work and knowledge).

I think that the key here is that all users are informed about the telemetry and how to opt out easily. This should be covered by

and

6 Likes

I don’t disagree with you, but I’ve a question: This forum (along with millions of other websites) uses google-analytics.com, which means that, every time you use it, Google is aware of it. Would you warn people about that, too?

The way I deal with it personally is to use uMatrix to block the analytics, not just on this website but on all of them.

In other words, the sane position to take these days is to assume that everything is tracking you, and then take measures against that according to your principles. While it’s unfortunate that Julia is no longer an exception to the rule, the tracking that they do is as innocuous as can be.

In the hypothetical scenario that I could decide whether to get rid of google analytics in discourse, or tracking in Pkg.jl, I’d get rid of google analytics without hesitation.

2 Likes

Yes! I do make the warning!

Thankfully, for desktop browsers it is easy to recommend uBlock Origin, and for Android, Blokada. Those take care of many such problems, so I can say “you know it is Google, so it is going to track you, but you’re already using the blockers, right?”.

However, I have no idea what to recommend when things from within my terminal do the tracking. I need a little more experience using Pi-Hole before I make any claims about it, let alone recommend it to someone, but I am being pushed towards it.

Thanks to the discussions here (and my specific suggestion), there is already a pull request that makes opting out a trivial single-line command in the package manager, so you literally just type ] telemetry off or something similar.

7 Likes

You are required to do so according to GDPR.

I meant “personally warn people when you recommend them to use Julia’s discourse”, which is the context of @motjuste’s comment. I don’t think GDPR applies in this case.

1 Like

I have filed “suggestion issues” that do not force Pkg.jl telemetry to be opt-in, but help to alleviate a few concerns I have. My suggested changes are based in part on some of the posts in this thread. See the following top-level issue, and please ensure that the discussions which move into the issues are productive in nature: https://github.com/JuliaLang/julia/issues/36548

1 Like

There is a distinction I don’t think has been emphasized enough here. We are talking about “telemetry” (perhaps we should use a different word), and “usage data”, which both (to me) imply that data is collected at various random points in the background and sent somewhere. For example that is my understanding of what happens in VS Code. But Pkg does not do anything like that. Data is sent only when you are already doing package server operations, so the data footprint is scarcely different from a normal server log. Some here have clarified that the random client UUID is the only contentious issue to them, and I appreciate that. But let’s please focus on that instead of expanding this to “julia now spies on you”.

PyPI is a useful point of comparison: see “Analyzing PyPI package downloads” in the Python Packaging User Guide. You can click through from there to see their full schema, which includes country as well as more detailed system and distro information. AFAICT there is no UUID, but there are enough details that it seems fairly fingerprint-able.

Some might object to PyPI’s data collection as well — fair enough. But the comparison is relevant when communicating to others: if you only “warn” people about julia’s package manager and not anything else, you are sending the message that julia is somehow uniquely nefarious, so be aware of whether you intend to send that message.

Thanks to those who have filed specific issues and PRs about this; I imagine we will be taking at least some of those suggestions on board.

39 Likes

We are talking about “telemetry” (perhaps we should use a different word), and “usage data”, which both (to me) imply that data is collected at various random points in the background and sent somewhere. For example that is my understanding of what happens in VS Code.

I had the same thought. I think what Pkg is actually doing is less invasive than what people think of when they hear the word “telemetry”, and my own picture of telemetry is based on what VS Code does.

Though, I’m not sure if VS Code collects a reliable UUID. In the docs here they say

One question we expect people to ask is to see the data we collect. However, we don’t have a reliable way to do this as VS Code does not have a ‘sign-in’ experience that would uniquely identify a user. We do send information that helps us approximate a single user for diagnostic purposes (this is based on a hash of the network adapter NIC) but this is not guaranteed to be unique. For example, virtual machines (VMs) often rotate NIC IDs or allocate from a pool. This technique is sufficient to help us when working through problems, but it is not reliable enough for us to ‘provide your data’.

and I’m not sure how unique the NIC really is. Maybe it’s effectively the same thing.

What about a way to generate a new UUID? Of course the idea is that it is supposed to be persistent, but would providing people with a way to just generate a new one be considered an alternative to opting out of telemetry altogether? Or does that not check any of the boxes that people are worried about?

IMO, the risk with the UUID is that package usage patterns could be used to determine which public packages a private package depends on (rather than to identify any given user, a worry that is already addressed). To a certain extent, we should worry about this with e.g. GitHub as well. The difference is that GitHub downloads are stateless, and it is straightforward to cache them and play routing tricks to essentially randomize any other unique identification (such as IP address and download time/order). With a UUID, if multiple UUIDs have identical or correlatable download patterns, it can be inferred that certain packages are being combined. This information could be seen as sensitive. The general idea that this private package data will be leaked by default is what drove me to file https://github.com/JuliaLang/Pkg.jl/issues/1899, which seeks to allow disabling telemetry at package scope.

As for the suggestion about generating a new UUID, I think my concerns are addressed if a new UUID is generated for each request. This is equivalent to not having a UUID though, and it appears this is covered.
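A toy sketch of the difference, written in Python for illustration only: the log format, the client ID, and the package names are all made up, and this is not a description of what the package servers actually store. It groups a hypothetical request log by client identifier and compares a persistent ID against a fresh UUID per request. As noted above, a real log may also carry IP addresses and timing that allow correlation; the sketch deliberately ignores those.

```python
import uuid
from collections import defaultdict


def packages_by_client(log):
    """Group requested package names by the client identifier in each entry."""
    groups = defaultdict(set)
    for client_id, package in log:
        groups[client_id].add(package)
    return dict(groups)


# Made-up public dependencies of some private package.
requests = ["PublicLibX", "PublicLibY", "PublicLibZ"]

# Case 1: one persistent client ID is attached to every request.
persistent_log = [("client-A", package) for package in requests]

# Case 2: a fresh random UUID is generated for each request.
per_request_log = [(str(uuid.uuid4()), package) for package in requests]

linked = packages_by_client(persistent_log)
unlinked = packages_by_client(per_request_log)

# With a persistent ID, all three packages land in a single group.
largest_persistent_group = max(len(group) for group in linked.values())

# With per-request IDs, every group holds exactly one package.
largest_fresh_group = max(len(group) for group in unlinked.values())
```

In the first case the ID alone reveals that the three packages are used together; in the second case the ID carries no linking information, which is why a per-request UUID amounts to having no UUID at all.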

In terms of the level of “polish” of the existing functionality, IT admins are immediately going to want to disable this either globally or on a per-package basis (hopefully the latter, so more stats are available). If this is left up to individual users and is not controllable at a higher level, newer versions of Julia and/or Pkg.jl will be seen as a risk. I filed https://github.com/JuliaLang/Pkg.jl/issues/1900 to this end.

Because there is an upcoming release, I am concerned that the coincident changes of Pkg.jl with this functionality will reflect negatively on Julia 1.5. If telemetry were made opt-in, I think most everybody would be satisfied, except it would be useless… :slight_smile:

1 Like

If this is a serious consideration for your organization, you should really consider running your own package server, which would also bring other benefits. The LocalPackageServer package may be of interest, and maybe also the issue “Allow the package server to opt out of telemetry” (Issue #1901 in JuliaLang/Pkg.jl on GitHub).

3 Likes

Perhaps a silly question that betrays my lack of understanding of the issues, but if the goal is to track the use of the components of the Julia ecosystem, why can’t the components be tied to the Julia executable instead of the user? In other words, each Julia executable would have a unique ID, and the telemetry would report usage tied to the executable. There would be no link between the user and the executable, hence complete privacy.

This is not an unreasonable concern for some organizations, but if you do have these requirements, you probably want to be running your own package server.

7 Likes