Pkg.jl telemetry should be opt-in

That is how it read to me. I’m glad to hear you did not intend it, but I think it did sound as if you were implying that they are deliberately balancing right on the edge of breaking the law.

Edit: Anyway, some community pushback is good, it helps the process stay healthy, so I don’t think that’s a bad thing. I simply reacted to the phrasing.

1 Like

After @StefanKarpinski’s explanation of how lawyers see the IP issue, it seems reasonable to me. There could be a remark regarding the IP address in https://julialang.org/legal/data/, but all in all it doesn’t look wrong to me, and I am pretty sure that it helps a lot to improve Julia.

What exactly do you think is wrong? That it isn’t opt-in, just opt-out? As there is no personal data, I would say opt-in is really not necessary. Looking at the data, it is hard to imagine how this could be abused. And it seems quite minimal, as it should be: no unnecessary data as far as I can tell.

7 Likes

Thinking about this, I can imagine the following “nudge opt-in” mechanism:

  1. users have to opt-in to the telemetry explicitly

  2. until they do this, they get a friendly message at, say, each Pkg.update():

    pkg> update
    [packages get updated]
    Please consider participating in the anonymous
    package telemetry survey with
    
        pkg> telemetry enable
    
    To disable this message, use
    
        pkg> telemetry disable
    
    For more information, see
    
        pkg> telemetry info
    
    
  3. after disabling it, the message is not shown again until the next major release.
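
The three steps above can be sketched as a small state machine. This is only an illustration of the proposal, not anything that exists in Pkg.jl: `TelemetryChoice`, `TelemetryPrefs`, `set_choice!`, `sends_telemetry` and `shows_nudge` are all hypothetical names, and tracking the major version at which the user decided is an assumption about how step 3 might be implemented.

```julia
# Hypothetical sketch of the "nudge opt-in" flow; none of these names exist in Pkg.jl.

# The three states a user's telemetry preference can be in.
@enum TelemetryChoice UNDECIDED ENABLED DISABLED

# The preference plus the major version that was running when the user decided.
mutable struct TelemetryPrefs
    choice::TelemetryChoice
    decided_at_major::Int
end

# A fresh installation starts out undecided.
TelemetryPrefs() = TelemetryPrefs(UNDECIDED, 0)

# Stands in for `pkg> telemetry enable` and `pkg> telemetry disable`.
function set_choice!(prefs::TelemetryPrefs, choice::TelemetryChoice, major::Int)
    prefs.choice = choice
    prefs.decided_at_major = major
    return prefs
end

# Step 1: telemetry is sent only after an explicit opt-in.
sends_telemetry(prefs::TelemetryPrefs) = prefs.choice == ENABLED

# Steps 2 and 3: show the message until the user decides; after a
# `disable`, stay silent until a newer major release is running.
function shows_nudge(prefs::TelemetryPrefs, current_major::Int)
    prefs.choice == UNDECIDED && return true
    prefs.choice == DISABLED && return current_major > prefs.decided_at_major
    return false
end
```

With this sketch, a user who runs `telemetry disable` on major version 1 sees no message again until version 2, and a user who enables telemetry is never nudged again.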

23 Likes

I considered that but having a nag screen could be quite annoying and there are potential issues with incorrectly prompting the user in a non-interactive situation, which would effectively hang the Julia process. It does not seem worth making Julia potentially less reliable and annoying people. Furthermore, telemetry data can also be useful for helping to figure out what’s going on with CI and other automated systems (both for abuse prevention and to understand usage); if this required a manual opt-in during an interactive session, we wouldn’t get telemetry from any automated systems.

18 Likes

I understand your point about CI and non-interactive use, but given the reactions above and that the primary goal is to collect information about actual user installations, perhaps a “nudge opt-in” framework could just disable telemetry (& nagging) altogether when !Base.isinteractive(), since interactive use is bound to happen at some point for users anyway.

This is just a suggestion for a compromise, I am actually fine with telemetry as implemented.
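
A minimal sketch of that compromise, assuming the user's decision is stored as `nothing` (undecided), `true` or `false`. Only `Base.isinteractive()` is a real function here; `nudge_allowed` and `telemetry_allowed` are hypothetical helpers and not part of Pkg.jl.

```julia
# Hypothetical helpers; neither function is part of Pkg.jl.
# `opted` is `nothing` until the user has explicitly chosen, then `true` or `false`.

# The nudge is never shown outside an interactive session, so scripts and
# CI jobs are never prompted and can never hang waiting for input.
function nudge_allowed(opted::Union{Nothing,Bool}; interactive::Bool = Base.isinteractive())
    interactive || return false
    return opted === nothing   # interactive: nudge only the undecided
end

# Telemetry is sent only from interactive sessions after an explicit opt-in.
telemetry_allowed(opted::Union{Nothing,Bool}; interactive::Bool = Base.isinteractive()) =
    interactive && opted === true
```

The `interactive` keyword defaults to the real session state and exists only so the behaviour can be checked without starting a REPL.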

13 Likes

While that’s one of the top priorities, it’s not the only reason. Serving requests to CI processes is expensive—network bandwidth is the primary cost of running a pkg server, not compute. Telemetry data from CI systems helps understand what people are doing in those automated processes and mitigate those expenses. For example, by deploying package servers that are colocated with CI services (so bandwidth is cheaper or even free). That’s why we check all those CI indicator variables: to try to help understand what services are making requests. If we see a huge deluge of new traffic (this is realistic and does happen already for services we host) and all we have is IP addresses, it’s much harder to figure out what’s going on than if we also have CI indicator variables, Julia version numbers, and client UUIDs, which allow us to figure out which requests are coming from the same instance and which are coming from different ones. Debugging these kinds of situations is hard and doing it completely blind is much harder, so having more context when this happens really helps.

Knowing which CI services people are using is also helpful for prioritizing quality of support for those CI services. Right now we collectively are good at supporting Travis and AppVeyor because that’s what Julia itself uses, but if we find out from CI variables that a ton of people are using Azure Pipelines, for example, then it may be worth the time and effort to make sure that works really flawlessly in the Julia ecosystem. Without those telemetry headers, we can’t know to spend time and energy on that.
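
To make the two ideas in this post concrete, here is a rough sketch of detecting a CI service from indicator variables and of grouping requests by client UUID. The environment variables are the well-known ones those services set, but the exact list that Pkg checks is not given in this thread, so `CI_INDICATORS` is an assumption, and `detect_ci` and `requests_per_client` are hypothetical names.

```julia
# Well-known variables set by CI services. The exact list Pkg checks is an
# assumption here; this table is only for illustration.
const CI_INDICATORS = [
    "TRAVIS"         => "Travis CI",
    "APPVEYOR"       => "AppVeyor",
    "TF_BUILD"       => "Azure Pipelines",
    "GITHUB_ACTIONS" => "GitHub Actions",
]

# Return the name of the first CI service whose indicator variable is set,
# "unknown CI" if only the generic `CI` variable is set, else `nothing`.
function detect_ci(env = ENV)
    for (var, service) in CI_INDICATORS
        haskey(env, var) && return service
    end
    return haskey(env, "CI") ? "unknown CI" : nothing
end

# Count requests per client UUID: many requests from one UUID point to one
# busy instance, many UUIDs with few requests each to many separate instances.
function requests_per_client(uuids)
    counts = Dict{String,Int}()
    for id in uuids
        counts[id] = get(counts, id, 0) + 1
    end
    return counts
end
```

Passing a plain `Dict` instead of `ENV` makes it easy to try `detect_ci` without actually running on a CI service.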

27 Likes

Thanks, this is a very useful explanation and clarifies a lot of the motivation. It would be great to include these additional reasons in the announcement of the new telemetry feature.

6 Likes

The second image in the Telemetry article on Wikipedia shows a crocodile with a GPS and radio on its head. This radio is collecting valuable information for scientists, yet I guess this device was placed there without the crocodile’s consent. The Pkg telemetry will likewise collect undeniably valuable information that will be used in fundraising, at present without the users’ explicit consent. I expect this sort of tracking and monetization from Facebook and Google, yet find it surprising and distasteful that Pkg.jl telemetry is monetizing Julia users.

Whether or not we keep the telemetry opt-out, I’d prefer that the Pkg telemetry page on Julialang.org mention fundraising; i.e. not hide the fact that the telemetry is monetizing Julia users.

4 Likes

Saying “Julia has approximately $n users according to telemetry, therefore it is a viable platform” is very different than “we have collected detailed identifiable data on $n users that we will use to target advertisements if you pay us.”

20 Likes

Just so I understand, your argument is:

  • With telemetry it is possible to count the number of Julia users.
  • The count of Julia users is information that could possibly be useful when applying for grants and fundraising.
  • Therefore, Pkg.jl is “monetizing Julia users”.

Is that a roughly accurate implication chain?

12 Likes

Yes; see data monetization on Wikipedia. I’d prefer that we not hide the fact that this information will be used for fundraising.

2 Likes

Don’t you think that this is already clearly explained in the following statement?

5 Likes

Yes, I apologize for not reading more closely.

1 Like

I don’t feel monetized! I feel grateful that Julia exists and that there are people who work hard to make Julia happen. As a user, complaining about being monetized would really feel wrong to me (if I were to complain).

I know this is not the best argument: by the same logic I could feel grateful for Google’s great search engine and so be happy to give all my data to the ads machine.

What’s real is that it is always a balance that needs to be evaluated. Just arguing that we are monetized and that this is bad per se is not balanced; it is just following some kind of zeitgeist in which, because collecting data has been bad before, it must always be bad.

13 Likes

I am hugely conservative in my data footprint. I regularly purge all cookies, use privacy browser plugins, and have even gone so far as to block (and manually whitelist) javascript at times. I use DDG, have bailed on much social media, and such. I’ve gotten into many arguments with family members saying they “have nothing to hide” because they definitely do. This discussion, though, is puzzling to me.

What is the threat model? The telemetry is extremely conservative, really only adding three things above and beyond what is required for any package server in any language. The biggest one is that persistent client ID that is unique to Julia. Unlike an IP address or a browser fingerprint, you cannot connect it with any other service or action.

Unlike a typical TOS, the data page is extraordinarily transparent, understandable, and legible.

37 Likes

By the way
https://julialang.org/legal/data/#opting_out
is either outdated or doesn’t hold on Windows systems:

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.1 (2020-04-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> import Pkg; Pkg.telemetryinfo()
ERROR: MethodError: no method matching get_telemetry_headers(::Nothing)
Closest candidates are:
  get_telemetry_headers(::AbstractString) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\PlatformEngines.jl:770
Stacktrace:
 [1] telemetryinfo(::Base.TTY) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Pkg.jl:30 (repeats 2 times)
 [2] top-level scope at REPL[1]:1

~/.julia/servers
doesn’t exist, and there is no telemetry* file anywhere in .julia.

Does this mean only Linux users are monetized? :wink:

It’s in 1.5.

Oh yes. And I didn’t want to opt out. I was looking for the client_uuid to see whether this ID is really persistent. My guess is that it is persistent as long as you don’t delete it; after that it is created again and is then different from before.
From time to time I remove or rename my .julia directory for various reasons.
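
The behaviour guessed at here can be sketched as a load-or-create helper. This is not how Pkg actually stores the ID, which is not described in this thread; `load_or_create_client_uuid` and the plain-text file format are hypothetical, and only `uuid4()` from the `UUIDs` standard library is real.

```julia
using UUIDs

# Hypothetical helper: read the client UUID stored at `path`, or create and
# store a fresh random one if the file is missing. How Pkg really stores
# the ID is not shown in this thread.
function load_or_create_client_uuid(path::AbstractString)
    if isfile(path)
        return UUID(String(strip(read(path, String))))
    end
    id = uuid4()                      # random (version 4) UUID
    dir = dirname(path)
    isempty(dir) || mkpath(dir)       # create parent directories if needed
    write(path, string(id))
    return id
end
```

Under this sketch the ID stays the same across sessions for as long as the file exists; deleting or renaming the directory that holds it yields a new, unrelated ID on the next call.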

I’m not certain I’m answering your question, yet I do have some questions that could be related:

  • What mechanism will stop a Sybil attack on Pkg.jl telemetry or any other package that has similar anonymous telemetry?
  • Now that Pkg.jl has paved the way, can we expect other packages to also send even more detailed opt-out telemetry? I’m sure all package developers would love to know more information about how their packages are used.
  • If the norm is to allow an open-source package to have opt-out telemetry, are we going to ask that the server code also be open-source?
  • Will there be a norm that open-source packages cannot require telemetry that is not anonymous?
4 Likes

Completely agree; I have never encountered such an explicit and clear explanation of user data processing in any other service.

Moreover, the data requested is unusually minimal. Actually, I was surprised not to read about other data that could legitimately be requested, for instance the locale. That would be very useful for knowing how much Julia is used by people whose primary language is not English, and to what extent it would be worth investing in internationalization of documentation, location of servers, etc.

9 Likes