Pkg.jl telemetry should be opt-in

Thanks, this is a very useful explanation and clarifies a lot of the motivation. It would be great to include these additional reasons in the announcement of the new telemetry feature.

6 Likes

The second image in the Telemetry article on Wikipedia shows a crocodile with a GPS and radio on its head. This radio is collecting valuable information for scientists, yet I guess this device was placed there without the crocodile’s consent. The Pkg telemetry will likewise collect undeniably valuable information that will be used in fundraising, at present without the users’ explicit consent. I expect this sort of tracking and monetization from Facebook and Google, yet find it surprising and distasteful that Pkg.jl telemetry is monetizing Julia users.

Whether or not we keep the telemetry opt-out, I’d prefer that the Pkg telemetry page on Julialang.org mention fundraising; i.e. not hide the fact that the telemetry is monetizing Julia users.

4 Likes

Saying “Julia has approximately $n users according to telemetry, therefore it is a viable platform” is very different from “we have collected detailed identifiable data on $n users that we will use to target advertisements if you pay us.”

20 Likes

Just so I understand, your argument is:

  • With telemetry it is possible to count the number of Julia users.
  • The count of Julia users is information that could possibly be useful when applying for grants and fundraising.
  • Therefore, Pkg.jl is “monetizing Julia users”.

Is that a roughly accurate implication chain?

12 Likes

Yes; see data monetization on Wikipedia. I’d prefer that we not hide the fact that this information will be used for fundraising.

2 Likes

Don’t you think that this is already clearly explained in the following statement?

5 Likes

Yes, I apologize for not reading more closely.

1 Like

I don’t feel monetized! I feel grateful that Julia exists and that there are people who work hard to make Julia happen. As a user, complaining about being monetized would really feel wrong to me (if I were complaining).

I know this is not the best argument, since by the same logic I might feel grateful for Google’s great search engine and so be happy to give all my data to the ads machine.

What’s real is that there is always a balance that needs to be evaluated. Just arguing that we are monetized and that this is bad per se is not balanced; it merely follows a kind of zeitgeist in which, because collecting data has been bad before, it must always be bad.

13 Likes

I am hugely conservative in my data footprint. I regularly purge all cookies, use privacy browser plugins, and have even gone so far as to block (and manually whitelist) JavaScript at times. I use DDG, have bailed on much social media, and such. I’ve gotten into many arguments with family members who say they “have nothing to hide”, because they definitely do. This discussion, though, is puzzling to me.

What is the threat model? The telemetry is extremely conservative, really only adding three things above and beyond what is required for any package server in any language. The biggest one is the persistent client ID that is unique to Julia. Unlike an IP address or a browser fingerprint, you cannot connect it with any other service or action.

Unlike a typical TOS, the data page is extraordinarily transparent, understandable, and legible.

37 Likes

By the way, https://julialang.org/legal/data/#opting_out is either outdated or doesn’t hold on Windows systems:

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.1 (2020-04-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> import Pkg; Pkg.telemetryinfo()
ERROR: MethodError: no method matching get_telemetry_headers(::Nothing)
Closest candidates are:
  get_telemetry_headers(::AbstractString) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\PlatformEngines.jl:770
Stacktrace:
 [1] telemetryinfo(::Base.TTY) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Pkg.jl:30 (repeats 2 times)
 [2] top-level scope at REPL[1]:1

~/.julia/servers doesn’t exist, and there is no telemetry* file anywhere in .julia.

Does this mean only linux users are monetized? :wink:

It’s in 1.5.

Oh yes. And I didn’t want to opt out; I was looking for the client_uuid to see whether this ID is really persistent. My guess is that it is persistent as long as you don’t delete it; after that it is created again and is then different from before.
From time to time I remove or rename my .julia for various reasons.
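
The guess above can be illustrated with a minimal sketch. It assumes the client ID is simply a random UUID stored in a file and regenerated when that file is missing; the function name and file name are hypothetical and are not Pkg's actual code or file layout.

```python
import tempfile
import uuid
from pathlib import Path


def load_or_create_client_uuid(path):
    """Return the stored ID if the file exists, otherwise create and store a new one."""
    path = Path(path)
    if path.exists():
        return path.read_text().strip()
    new_id = str(uuid.uuid4())  # random, not derived from anything about the user
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(new_id + "\n")
    return new_id


# Hypothetical location inside a throwaway directory standing in for .julia.
with tempfile.TemporaryDirectory() as depot:
    id_file = Path(depot) / "servers" / "client_uuid.txt"
    first = load_or_create_client_uuid(id_file)
    second = load_or_create_client_uuid(id_file)  # file exists: same ID is reused
    id_file.unlink()                              # simulate removing/renaming .julia
    third = load_or_create_client_uuid(id_file)   # file gone: a fresh ID is generated

print(first == second, first == third)
```

Under that assumption the first two calls agree and the third differs, which is also what a fresh container or a renamed .julia would look like from the server's side.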

I’m not certain I’m answering your question, yet I do have some questions that could be related:

  • What mechanism will stop a Sybil attack on Pkg.jl telemetry or any other package that has similar anonymous telemetry?
  • Now that Pkg.jl has paved the way, can we expect other packages to also send even more detailed opt-out telemetry? I’m sure all package developers would love to know more information about how their packages are used.
  • If the norm is to allow an open-source package to have opt-out telemetry, are we going to ask that the server code also be open-source?
  • Will there be a norm that open-source packages cannot require telemetry that is not anonymous?

4 Likes

Completely agree; I have never encountered such an explicit and clear explanation of user data processing in any other service.

Moreover, the data requested is unusually minimal. In fact, I was surprised not to read about other data that could legitimately be requested, for instance locale. That would be very useful for knowing how much Julia is used by people whose primary language is not English, and to what extent it would be worthwhile to invest in internationalization of documentation, location of servers, etc.

9 Likes

Pkg.jl is a very special package. I don’t think that it is comparable to community-contributed packages, if you are referring to that.

At least the code for the analysis is promised to be public (see the last section: https://julialang.org/legal/data/#data_access_analysis_amp_retention)

1 Like

“How many Julia users are there?”

Probably the code should detect that it is running inside a Docker environment.

Docker is my local poor man’s CI.

Every time I start a new Julia Docker environment, I get a new Julia-Client-UUID,

and we need a one-line warning for non-interactive users …

  • When upgrading from the Julia 1.4 Docker image to the Julia 1.5 Docker image in a batch system, it is very easy to miss the telemetry info; there is no warning …

Two runs give different Julia-Client-UUID values, and telemetry_notice() prints nothing:

user@telemetry:~$ docker run -it --rm julia:1.5.0-rc1 \
>      julia -e 'import Pkg; Pkg.telemetryinfo();Pkg.PlatformEngines.telemetry_notice(); '
Julia-Pkg-Protocol: 1.0
Julia-Version: 1.5.0-rc1.0
Julia-System: x86_64-linux-gnu-libgfortran4-cxx11
Julia-Client-UUID: 8d80039d-6f4a-405c-9ecc-5c41657aa8cc
Julia-Project-Hash: f4622112b09fb05cf352b6ce94610526c2217b0e
Julia-CI-Variables: APPVEYOR=n;CI=n;CIRCLECI=n;CONTINUOUS_INTEGRATION=n;GITHUB_ACTIONS=n;GITLAB_CI=n;JULIA_CI=n;TF_BUILD=n;TRAVIS=n
Julia-HyperLogLog: 734,1
Julia-Interactive: false


user@telemetry:~$ docker run -it --rm julia:1.5.0-rc1 \
>      julia -e 'import Pkg; Pkg.telemetryinfo();Pkg.PlatformEngines.telemetry_notice(); '
Julia-Pkg-Protocol: 1.0
Julia-Version: 1.5.0-rc1.0
Julia-System: x86_64-linux-gnu-libgfortran4-cxx11
Julia-Client-UUID: 498784a4-132c-4e59-ad33-07b2cc7a4120
Julia-Project-Hash: 1a25669047b3b619e83f455d4a860935dac55075
Julia-CI-Variables: APPVEYOR=n;CI=n;CIRCLECI=n;CONTINUOUS_INTEGRATION=n;GITHUB_ACTIONS=n;GITLAB_CI=n;JULIA_CI=n;TF_BUILD=n;TRAVIS=n
Julia-HyperLogLog: 881,1
Julia-Interactive: false


user@telemetry:~$

1 Like

There’s not much we can do about that kind of thing except look at the data and try to understand what’s going on so that we can filter out attacks. Fortunately, unlike marketplaces where Sybil attacks are used to game reputation which has real monetary value, there’s nothing to gain here. One common approach to deal with this kind of behavior is to look at “active users” where there’s some minimum threshold of activity in terms of total requests or request frequency/time span. In fact, being able to do that is one of the reasons client UUIDs are essential. Without them, how can you determine which traffic looks like a legitimate, normally behaving user versus some kind of attack bot? If all requests look alike, you can’t.
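
As a rough illustration of the “active users” idea, here is a minimal sketch that groups requests by client UUID and keeps only clients that pass both a request-count threshold and a time-span threshold. The function name, the log format and the threshold values are assumptions made for this example; they are not the analysis code that will be used for Pkg telemetry.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def active_clients(requests, min_requests=5, min_span=timedelta(days=7)):
    """Return the set of client UUIDs that pass both activity thresholds.

    `requests` is an iterable of (client_uuid, timestamp) pairs.
    """
    seen = defaultdict(list)
    for client_uuid, timestamp in requests:
        seen[client_uuid].append(timestamp)

    active = set()
    for client_uuid, stamps in seen.items():
        span = max(stamps) - min(stamps)  # time between first and last request
        if len(stamps) >= min_requests and span >= min_span:
            active.add(client_uuid)
    return active


start = datetime(2020, 7, 1)
# One client making a request every two days over ten days.
log = [("regular-user", start + timedelta(days=2 * i)) for i in range(6)]
# A hundred throwaway IDs with one request each (ephemeral containers or a Sybil attempt).
log += [(f"ephemeral-{i}", start) for i in range(100)]
# One ID sending many requests within a single minute.
log += [("burst-bot", start + timedelta(seconds=i)) for i in range(50)]

print(active_clients(log))
```

With these made-up thresholds only `regular-user` counts as active: the throwaway IDs fail the request count and the burst fails the time span. Without a per-client ID there would be nothing to group by, so none of these three patterns could be told apart.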

Now that Pkg.jl has paved the way, can we expect other packages to also send even more detailed opt-out telemetry? I’m sure all package developers would love to know more information about how their packages are used.

As I pointed out in the issue you opened, you already give packages nearly complete trust. If you don’t trust them not to spy on you, you definitely should not be letting them run arbitrary code on your machine. There’s nothing about Pkg doing responsible collection of minimal telemetry that makes it any more or less technically possible or socially acceptable for packages to do anything. We will, of course, not allow people to register packages that do illegal or shady things. Moreover, by collecting data in Pkg in a legal, responsible and transparent way, and sharing aggregated statistics with everyone, including package developers, we are eliminating the temptation for packages to try to do something on their own that might be less strictly by-the-book.

If the norm is to allow an open-source package to have opt-out telemetry, are we going to ask that the server code also be open-source?

The package server code is open source and always has been:

Once we start analyzing request telemetry data, the code that does the analysis will also be open sourced and the aggregated results (but not the raw data that includes UUIDs or any other user-level data) will be made publicly available so that the entire community benefits from it.

Will there be a norm that open-source packages cannot require telemetry that is not anonymous?

No, packages cannot do that because, among other reasons, it would be illegal (in the EU and possibly California). There is no “new norm” allowing packages to do shady and/or illegal things. We don’t allow registered packages to do sketchy things, and illegal telemetry collection would certainly be grounds for removal from the registry. Again, the fact that Pkg includes responsible, legal, transparent telemetry for the benefit of the entire community does not give anyone license to perform illegal data collection, and the fact that Pkg collects and shares usage data responsibly decreases the temptation of package developers to do anything questionable to try to get such data.

29 Likes

Looking at locale data is an interesting idea, but it would be a bit tricky to do safely and it feels like it crosses a line by sending the user’s data, which is very different from generating a random client UUID that is only used by Pkg itself. It’s very much not ok to send the contents of any environment variable, so the only way to do this responsibly would be to have a list of valid locale values and only send the locale if it’s on the list. Even with that precaution, it doesn’t seem right to send user data at all. The CI indicator variables only seem ok because (a) we really need to be able to identify CI and (b) no normal user will have any of those set, only automated systems.
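
To make the allowlist precaution concrete, here is a minimal sketch of what “only send the locale if it’s on the list” could look like. Everything in it is hypothetical (the function name, the short list of locales, and reading only `LANG`), and it is not something Pkg does.

```python
import os

# Hypothetical allowlist; a real one would need to cover many more locales.
KNOWN_LOCALES = frozenset(
    {"en_US", "en_GB", "de_DE", "fr_FR", "ja_JP", "pt_BR", "zh_CN"}
)


def reportable_locale(environ=os.environ):
    """Return a locale tag only if it exactly matches an allowlisted value, else None."""
    raw = environ.get("LANG", "")
    # Drop the encoding and modifier, e.g. "de_DE.UTF-8@euro" -> "de_DE".
    tag = raw.split(".", 1)[0].split("@", 1)[0]
    # The return value is always a member of KNOWN_LOCALES or None, so the
    # raw contents of the environment variable can never be sent.
    return tag if tag in KNOWN_LOCALES else None


print(reportable_locale({"LANG": "de_DE.UTF-8"}))
print(reportable_locale({"LANG": "hunter2-secret"}))
```

The first call yields `de_DE` and the second yields `None`, because an unexpected value is dropped rather than forwarded. Even so, the result is still derived from the user’s own environment, which is the line this post is wary of crossing.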

7 Likes

Similar to looking at the locale, it doesn’t feel entirely kosher to look at the presence or absence of files on the user’s file system. I guess if people are running Julia processes inside of docker containers and doing package operations in them, that will just look like a lot of ephemeral clients.

1 Like

Stefan, all,

Thanks for the thoughtful responses. If nothing else, the telemetry is an interesting experiment; I look forward to the results.

10 Likes