Pkg.jl telemetry should be opt-in

giordano · June 29, 2020, 1:08am

Honestly, I think you don’t even need to apologise for bringing up this discussion. We may have diverging opinions, but it’s important and healthy to discuss this openly.

johnmyleswhite · June 29, 2020, 1:32am

No, this is not correct. The Julia proposal as I understand it is maximally privacy-safe under the constraint that it enables computing the number of unique UUIDs that have downloaded an artifact. Traditional server logs have much more information than this; see the example log string here: Log Files - Apache HTTP Server Version 2.4

No, the core team is small and doesn’t share this information with R package maintainers to my knowledge.

FWIW, I think your request that telemetry be opt-in is very reasonable. I’m just trying to make sure that comparisons with other projects don’t end up making Julia look worse than its peers when the reality seems to be that Julia is doing much better than they are.

DNF · June 29, 2020, 7:22am

I think ‘evil plans’ are rarely the main concern in these sorts of privacy matters. Therefore I find that line of argument a bit irrelevant. It’s rather ‘the law of unintended consequences’, that the data should somehow be abused by other parties or end up being used for purposes that are currently unanticipated.

Personally, I’m not really concerned about this sort of telemetry, and find the explanations in this thread adequate; but it would be good if a notice about it is easy to find, without having to actively search it out, not buried deep in some text that most users will never look at.

oheil · June 29, 2020, 7:22am

There is a crucial part in this telemetry, which is the IP address of the client. The IP address can constitute personal data.
Sending the anonymous (not personal) data over HTTP does include the IP address of the client; it is probably logged automatically and, by chance, deleted after some time. Together, this is automatic processing of personal data.

Some quotes:

The conclusion is, all IP addresses should be treated as personal data, in order to be GDPR compliant.

In the GDPR, processing operations include automated and non-automated operations, with a broad definition of processing and many types of operations included.

Quoted from here: Personal data, identifiers, subjects and types of data

In my opinion, to be on the safe side (legally) it should be opt-in (in the EU).

StefanKarpinski · June 29, 2020, 1:59pm

That’s right, Pkg prints the legal notice when creating the telemetry file that saves the UUID. So users who used 1.4 and opted into using the pkg protocol and users of 1.5 development versions that didn’t have the legal notice yet will not see the legal notice because they already have telemetry files. Users who have only used official Julia releases (and release candidates) and have never set the JULIA_PKG_SERVER variable—which should be 99.99% of users—will see the legal notice the first time they do a package operation with Julia 1.5. You can also see the legal notice by calling Pkg.PlatformEngines.telemetry_notice().

mcabbott · June 29, 2020, 2:11pm

FWIW, I can’t print this notice with Pkg.PlatformEngines.telemetry_notice(). EDIT: calling Pkg.telemetryinfo() does print details, as on the /legal/data/ page.

The file ~/.julia/servers/pkg.julialang.org/telemetry.toml exists; if I remove it and run Pkg.update(), it is re-created (with different values) without any notice being printed.

StefanKarpinski · June 29, 2020, 2:22pm

We did consult with lawyers who specialize in GDPR issues via NumFocus on this matter and gave them a detailed list of what data we send. This was their conclusion:

The inquiry as to which operating system is used, the fact that the IP-address is recognized by the server and that the used Julia version is surveyed is justified – without consent.

However, these surveys represent processing operations that must be disclosed to the user. The creation of the user-ID of the users seems to us to be justifiable as well (given that further information about the purpose / benefit of the processing will be sustainable within the consideration of the interests).

We recommend implementing a link on the Julia start display that lets users know about the data surveys in question and informs about the facts as stated above, so that the duty to inform subjects is fulfilled in accordance with Art. 13 GDPR.

As you can see, we are following this recommendation by printing a legal notice the first time the user does a package operation which sends telemetry data, informing the user that data is sent to the server and linking to a page with a detailed account of what data is sent, why it is collected, what it is used for, and how to opt out.

Regarding the quote you posted from “i-scoop.eu”, there is important context necessary: sites like that generally assume that the service in question also collects personally identifying information about users and that the IP address can be linked with that data. In such a context, since the IP address can be tied back to the user’s identity, it becomes personal data as well. On the other hand, if no personally identifying data is collected, then the IP address lacks significance as personal data. Otherwise the GDPR would require notifying anyone who connects to any server on the internet in any capacity. At least that is my understanding as a non-lawyer who has read quite a lot on the subject and gone back and forth with lawyers about this quite a bit.

StefanKarpinski · June 29, 2020, 2:29pm

If you’ve already done some package operation in the Julia process when there was a telemetry file and then you delete the file and do another package operation, you will not get the notice (since there was a telemetry file, implying that you’ve gotten the notice already). There’s logic to ensure that the legal notice is printed at most once per process, because otherwise the Pkg.jl CI logs were getting spammed with lots of legal notices. If you want to trigger showing the legal notice, you can:

Delete the telemetry file.
Start a fresh Julia process.
Do a package operation that connects to the server.
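
As a rough illustration of the behaviour described above, here is a minimal sketch of "print the notice at most once per process, and only when no telemetry file exists yet". It is not the actual Pkg.jl code: the names notice_handled and package_operation!, the notice wording, and the placeholder file contents are all assumptions made up for this example.

```julia
# Hypothetical sketch; names and file contents are illustrative, not Pkg.jl internals.

# Per-process state: false in a fresh Julia process.
const notice_handled = Ref(false)

function package_operation!(telemetry_file::AbstractString; io::IO = stdout)
    printed = false
    if !isfile(telemetry_file)
        if !notice_handled[]
            println(io, "LEGAL NOTICE: this operation sends telemetry data to the package server.")
            printed = true
        end
        # Re-create the file; the real one stores a client UUID among other values.
        write(telemetry_file, "placeholder = true\n")
    end
    # Whether or not a file existed, this process never prints the notice again.
    notice_handled[] = true
    return printed
end
```

Under these assumptions, the first call in a fresh process with no file prints the notice. Deleting the file and calling again in the same process re-creates the file silently, which matches the behaviour reported earlier in the thread.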

mcabbott · June 29, 2020, 2:39pm

Yes I restarted Julia. However after freshly downloading 1.5.0-rc1.0, I can now trigger it, so I guess all is well, sorry about the noise!

anon67531922 · July 1, 2020, 12:20am

It is one thing to do what lawyers say is acceptable within the law and another thing to do what is right. It may not be illegal, but it doesn’t feel right. I understand the desire and utility and I believe there is a right way to do this, but this way doesn’t seem right. Maybe you can tie telemetry data to having a JuliaHub(.com) account or something?

dlakelan · July 1, 2020, 12:27am

How about every time someone presses the ] key to get into the package manager, the REPL prints a single line above the new prompt with whether or not it’s enabled or disabled…

Pkg telemetry enabled
(@1.5.1) pkg>

and add a simple command there, telemetry disable or telemetry enable, to switch your status.

Tamas_Papp · July 1, 2020, 6:15am

Requiring a JuliaHub (or any other) account would

link this data to a person,
reach only a fraction of people and thus pretty much defeat the purpose.

I am not sure I understand what “doesn’t seem right” to you about the proposed opt-out approach with a warning, but if you have concerns about privacy, then they also apply for your proposal.

I also feel icky about various companies collecting data about the way I use my computer, but in this particular case I feel that the data is as anonymous as it gets, and the benefits (quantifiable user base ⇒ more funding and support for Julia) overwhelm the mostly theoretical concerns.

The change of course should be announced via the usual channels (this forum, the official Julia blog, …) and feature prominently in the release notes so that those who want to opt out can do it.

DNF · July 1, 2020, 8:02am

I think you are painting this in a light which is unfair, as if properly checking law compliance is something that is done to find loopholes, and generally be a bit weaselly. GDPR is strict; complying with it is nothing to sneeze at.

anon67531922 · July 1, 2020, 8:35am

I never said or implied that checking GDPR compliance is done to find loopholes (I am the former Head of Risk Management for one of the largest insurers in Asia so I know a thing or two about compliance). Kudos for checking with lawyers. I’m saying the way this seems to be getting done doesn’t feel right and there is probably a better way with less reputation risk. Humans screw up. I screw up. The telemetry thing is going to screw up somehow. When it does, there can potentially be some real repercussions. I am not sure the benefits outweigh the risks with the information I have so far (which is limited to this forum post - the first I’m hearing about it - which is another sign it isn’t being done right since v1.5 rc is already in the wild).

If I bother to comment here, it is because I care. I don’t want to see Julia on the front page of a newspaper with some glaring negative headline about data privacy issues.

DNF · July 1, 2020, 8:40am

That is how it read to me. I’m glad to hear you did not intend it, but I think it did sound as if you were implying that they are deliberately balancing right on the edge of breaking the law.

Edit: Anyway, some community pushback is good, it helps the process stay healthy, so I don’t think that’s a bad thing. I simply reacted to the phrasing.

oheil · July 1, 2020, 9:06am

After @StefanKarpinski’s explanation about how lawyers see the IP issue, it seems reasonable to me. There could be a remark regarding the IP address on the https://julialang.org/legal/data/ page, but all in all, it doesn’t look wrong to me and I am pretty sure that it helps a lot to improve Julia.

What exactly do you think is wrong? That it isn’t opt-in, just opt-out? As there is no personal data, I would say opt-in is really not necessary. Looking at the data, it is hard to imagine how this could be abused. And it seems quite minimal, as it should be: no unnecessary data as far as I can tell.

Tamas_Papp · July 1, 2020, 11:41am

Thinking about this, I can imagine the following “nudge opt-in” mechanism:

users have to opt-in to the telemetry explicitly

until they do this, they get a friendly message at, say, each Pkg.update():

pkg> update
[packages get updated]
Please consider participating in the anonymous
package telemetry survey with

    pkg> telemetry enable

To disable this message, use

    pkg> telemetry disable

For more information, see

    pkg> telemetry info

after disabling it, the message is not shown again until the next major release.

StefanKarpinski · July 1, 2020, 1:20pm

I considered that but having a nag screen could be quite annoying and there are potential issues with incorrectly prompting the user in a non-interactive situation, which would effectively hang the Julia process. It does not seem worth making Julia potentially less reliable and annoying people. Furthermore, telemetry data can also be useful for helping to figure out what’s going on with CI and other automated systems (both for abuse prevention and to understand usage); if this required a manual opt-in during an interactive session, we wouldn’t get telemetry from any automated systems.

Tamas_Papp · July 1, 2020, 1:46pm

I understand your point about CI and non-interactive use, but given the reactions above and that the primary goal is to collect information about actual user installations, perhaps a “nudge opt-in” framework could just disable telemetry (& nagging) altogether when !Base.isinteractive(), since interactive use is bound to happen at some point for users anyway.

This is just a suggestion for a compromise, I am actually fine with telemetry as implemented.
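
To make the suggested compromise concrete, here is a minimal sketch of the decision logic. It is only an illustration, not anything that exists in Pkg.jl: the TelemetryChoice enum and the functions send_telemetry and show_nudge are hypothetical names. The only real API used is Base.isinteractive().

```julia
# Hypothetical sketch of the "nudge opt-in" idea; none of these names exist in Pkg.jl.

# The user's recorded choice; Undecided until they run an enable/disable command.
@enum TelemetryChoice Undecided Enabled Disabled

# Telemetry is sent only in interactive sessions where the user has opted in.
send_telemetry(choice::TelemetryChoice; interactive::Bool = Base.isinteractive()) =
    interactive && choice == Enabled

# The reminder is shown only in interactive sessions where no choice was made yet,
# so a non-interactive process (CI, scripts) is never prompted and cannot hang.
show_nudge(choice::TelemetryChoice; interactive::Bool = Base.isinteractive()) =
    interactive && choice == Undecided
```

The interactive keyword is there so the logic can be checked without starting a REPL, for example send_telemetry(Enabled; interactive = false) returns false.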

StefanKarpinski · July 1, 2020, 2:17pm

While that’s one of the top priorities, it’s not the only reason. Serving requests to CI processes is expensive—network bandwidth is the primary cost of running a pkg server, not compute. Telemetry data from CI systems helps understand what people are doing in those automated processes and mitigate those expenses, for example by deploying package servers that are colocated with CI services (so bandwidth is cheaper or even free). That’s why we check all those CI indicator variables: to try to help understand what services are making requests. If we see a huge deluge of new traffic (this is realistic and does happen already for services we host) and all we have is IP addresses, it’s much harder to figure out what’s going on than if we also have CI indicator variables, Julia version numbers, and client UUIDs, which allow us to figure out which requests are coming from the same instance and which are coming from different ones. Debugging these kinds of situations is hard and doing it completely blind is much harder, so having more context when this happens really helps.
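
As a small illustration of why client UUIDs help here, the sketch below counts distinct clients overall and per IP address. The record layout, field names, and values are invented for the example (the IP addresses come from the reserved documentation ranges); this is not the package server's actual log format or analysis code.

```julia
# Hypothetical request records; field names and values are illustrative only.
requests = [
    (ip = "203.0.113.7",  client_uuid = "a1", ci = "travis"),
    (ip = "203.0.113.7",  client_uuid = "a1", ci = "travis"),
    (ip = "203.0.113.7",  client_uuid = "b2", ci = "travis"),
    (ip = "198.51.100.4", client_uuid = "c3", ci = "none"),
]

# Number of distinct clients, regardless of how many requests each one made.
distinct_clients(reqs) = length(Set(r.client_uuid for r in reqs))

# Number of distinct clients behind each IP address.
function clients_per_ip(reqs)
    by_ip = Dict{String,Set{String}}()
    for r in reqs
        push!(get!(() -> Set{String}(), by_ip, r.ip), r.client_uuid)
    end
    return Dict(ip => length(ids) for (ip, ids) in by_ip)
end
```

With only IP addresses, the three requests from 203.0.113.7 are indistinguishable. With the UUIDs, clients_per_ip(requests) shows that they come from two separate instances.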

Knowing which CI services people are using is also helpful for prioritizing quality of support for those CI services. Right now we collectively are good at supporting Travis and AppVeyor because that’s what Julia itself uses, but if we find out from CI variables that a ton of people are using Azure Pipelines, for example, then it may be worth the time and effort to make sure that works really flawlessly in the Julia ecosystem. Without those telemetry headers, we can’t know to spend time and energy on that.
