Pkg.jl telemetry should be opt-in

New “Pkg & Storage protocols” and an accompanying centralized service to host packages have been merged and are present in Julia v1.5.0-rc1. The new Pkg sends telemetry consisting of a user-specific UUID and other information to the server, where it is used to count the number of users and compute other stats. The goal is to answer the question “How many Julia users are there?” for fundraising purposes. The current protocol is opt-out, meaning that these stats are collected unless a user changes a configuration file.

I would like to see one of two changes made:

  • Make the Pkg.jl telemetry opt-in by default for the Julia binaries. In February it was implied on GitHub that the opt-in nature of v1.4 would remain in v1.5. I do not think it is appropriate for the Julia open-source project to be collecting a user identifier along with info on that user’s packages. I believe that even this minimal data is a “toxic asset” and is more appropriate for a for-profit product such as JuliaPro. It feels odd that while Apple is taking steps to prevent user tracking, Julia is adopting it. The HyperLogLog technique seems more reasonable for opt-out tracking.

  • If we are to keep the opt-out behavior in v1.5, I would like to remove “anonymous” from the Pkg.jl warning or at least change it to “pseudonymous”. UUIDs (like a browser cookie) are only anonymous until they are not (say via a data leak or correlation with information from another source as done by browser trackers).

Stefan said (in a conversation beyond the Slack horizon) that knowing how many Julia users there are would aid in fundraising. I understand the attraction of knowing this marketing number. Even so, is the Julia project so strapped for cash that they need to monetize Julia users?

I enjoy Julia and the community very much; for example, I’m grateful for the diversity and inclusion efforts. The Pkg.jl opt-out telemetry is the first thing in the Julia code or community that I have found distasteful. I hope you’ll forgive me for sharing this on Discourse; I believe others may be interested.

38 Likes

“is the Julia project so strapped for cash that they need to monetize Julia users?”

The first question JuMP gets asked when applying for funding/awards/etc is how many users we have. Here is a verbatim quote from an email I received today, which asked for “substantiated estimates of the number of installations of a software package.” Should I reply “No idea. Somewhere between 1k and 100k.”?

Moreover, at present, we have no idea how many people use each solver (and on which platform!). Knowing how many people installed which solver would allow us to prioritize support from our finite developer time.

This would also allow us to lobby the commercial solver developers to provide official support (or $$). To quote one company “We’ll want to provide official support at some point, but it looks like the scales haven’t tilted quite yet.” It’d be nice to know whether 100, 1000, 10000, or 100000 people per month use their software; that might change their mind.

Finally, if it is opt-in, the vast majority of users will not opt in. This leaves us no better off than we were before. Opt-out is a good compromise.

To summarize, at the cost of sending pseudonymous UUIDs (which you can opt out of), we get easier access to sustained funding for Julia ecosystem development and more efficient usage of developer time. That seems like a good trade-off to me.

38 Likes

In order for a trade-off to (possibly) be good, users need to know that they are making a trade-off in the first place.

How are users going to be informed of the trade-off and the mechanisms to opt out?

I ask because I follow the Julia issue tracker and read Discourse almost every day, and this is the first time I have heard about telemetry in Pkg.

An additional thought: how are you going to use install numbers to estimate usage numbers?

13 Likes

Note that the UUID you are talking about isn’t like a Google advertising ID. It isn’t linked with any personally identifying information (other than that you are a Julia user with … Julia packages installed). There isn’t a conceivable way for this number to tell anyone anything about you that they couldn’t find out more easily another way.

9 Likes

I understand that, and I trust Julia’s developers. The reason for my comment is my concern for openness: this is something that shouldn’t be done without telling users about it, on principle.

I just found https://github.com/JuliaLang/Pkg.jl/pull/1544#issuecomment-565160856, where Stefan says this is going to be documented. It’s not there yet, but I trust it will be before 1.5.0 is released.

While I think it should be documented and/or shown in a prompt the first time a user runs it, it’s not overreaching at all; just think about how this very forum probably logs your IP and the pages you visit. It’s even less aggressive than cookies! This is not against open-source (FOSS) philosophy at all.

Especially considering that Pkg usage cannot be connected to virtually anything about one’s identity.

3 Likes

“How are users going to be informed of the trade-off and the mechanisms to opt out?”

From https://julialang.org/legal/data/#opting_out: “The first time you connect to a new server, Julia will print a brief legal notice with a link to this page.”

“An additional thought: how are you going to use install numbers to estimate usage numbers?”

Installs ~ users. You could also look at the packages updated within the last 30 days. It’s always going to be an approximate metric. The goal is to have something that is better than nothing (what we have now).
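A rough sketch of the “active within the last 30 days” idea, in Python. The record layout (a pseudonymous client ID paired with a timestamp) and the function name are assumptions made for illustration; this is not the package server's actual log format or code.

```python
from datetime import datetime, timedelta

def active_clients(requests, now, window_days=30):
    """Count distinct client IDs seen within the trailing window.

    `requests` is an iterable of (client_id, seen_at) pairs; this layout
    is hypothetical, chosen only to illustrate the counting step.
    """
    cutoff = now - timedelta(days=window_days)
    # A set collapses repeated requests from the same client into one entry.
    return len({cid for cid, seen_at in requests if cutoff <= seen_at <= now})

# Made-up example data: client "a" appears twice, "c" is outside the window.
log = [
    ("a", datetime(2020, 7, 19)),
    ("a", datetime(2020, 7, 1)),
    ("b", datetime(2020, 6, 25)),
    ("c", datetime(2020, 5, 1)),
]
print(active_clients(log, now=datetime(2020, 7, 20)))  # 2
```

This still counts installations that made requests, not people, so it stays an approximation either way.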

7 Likes

I don’t dispute the utility of knowing how many users Julia or JuMP has. It may also be useful to know how often JuMP is used or included! Why not have Julia count how often each method is used and send a daily report to the package server? (Edit: to be clear, I’m not in favor of this 🙂)

What do you think of HyperLogLog tracking, which doesn’t require sending a UUID?
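For readers unfamiliar with it, here is a minimal Python sketch of how HyperLogLog-style counting could work. Everything in it is an assumption for illustration (the names `client_report` and `HllCounter`, the use of SHA-256, the 1024 registers); it is not what Pkg or the package server does. The point is that each client reports only a register index and a small “rank”, a pair that many different clients share, and the server can still estimate the number of distinct clients.

```python
import hashlib
import math

P = 10          # number of index bits (assumed for this sketch)
M = 1 << P      # 1024 registers

def client_report(client_secret: bytes):
    """What a client would send: (register index, rank), never the ID itself."""
    h = int.from_bytes(hashlib.sha256(client_secret).digest()[:8], "big")
    index = h >> (64 - P)                    # top P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 54 bits
    # Rank = 1-based position of the leftmost 1-bit in `rest`
    # (55 if all remaining bits are zero).
    rank = (64 - P) - rest.bit_length() + 1
    return index, rank

class HllCounter:
    """Server side: keeps only the maximum rank seen per register."""

    def __init__(self):
        self.registers = [0] * M

    def add(self, index, rank):
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / M)     # standard bias constant for large M
        raw = alpha * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * M and zeros:
            # Small-range correction: fall back to linear counting.
            return M * math.log(M / zeros)
        return raw

counter = HllCounter()
for i in range(5000):
    counter.add(*client_report(f"client-{i}".encode()))
print(round(counter.estimate()))
```

With 1024 registers the typical relative error is about 3%, and repeated reports from the same client do not change the estimate. The trade-off is that such a sketch answers only “how many distinct clients?”; per-client questions cannot be asked of it later.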

No one is claiming that there is any personally identifying information tied to the UUID at present; I’m sure that all the people doing in-browser tracking claimed the same to start with. Even so, what happens when some other package decides to require an email for use and ties that to the UUID? Or when there is a data breach?

3 Likes

This forum is definitely opt-in, which is what I’m suggesting 🙂

1 Like

If you need to find answers or view the docs, then it is not, unless someone only uses the docs that come with the source code; but even then, I’m sure GitHub tracks the IP and query for git clone / download too.

My point is, I support the ideal / spirit, but not all telemetry is the same, and this one is probably fine (truly anonymous).

GitHub (Microsoft) is for-profit, Julia is not. Can you point to another open-source, non-profit project that tracks users? For example, I do not believe that Python or R track users.

2 Likes

Let me emphasize my point: personally, I’m not against the proposed Julia package telemetry. However, I expect an open project to let users know that it is happening, and provide instructions to opt in or out.

Also, as I said above, Stefan already said this will be documented, and it looks like opting out is painless.

4 Likes

There already is some documentation at https://julialang.org/legal/data/

It looks like there is functionality for this recently merged to Pkg: “telemetry: print legal notice first time talking to each pkg server” (Pull Request #1871 in JuliaLang/Pkg.jl, by StefanKarpinski).

I didn’t see the notice when trying out the 1.5 release candidate, although Pkg.PlatformEngines.telemetry_notice() prints the notice. The PR says the notice is printed the first time you talk to the pkg server; I have probably already talked to the pkg server via the 1.5 beta and maybe also by opting in on 1.4 (I forget what I’ve done on this computer), so that might be why it didn’t print for me on the 1.5 rc.

3 Likes

It looks like if you remove ~/.julia/servers/pkg.julialang.org/telemetry.toml, the notice would print again (I didn’t try it; I’m guessing from the code).
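If someone wants to try that guess without losing the file, a small Python sketch like the following could move it aside instead of deleting it. The helper name, the `.bak` suffix, and the assumption that the depot lives at `~/.julia` are mine; whether Julia then reprints the notice is untested, as said above.

```python
from pathlib import Path

def set_aside_telemetry_file(depot: Path, server: str = "pkg.julialang.org") -> bool:
    """Rename <depot>/servers/<server>/telemetry.toml to telemetry.toml.bak.

    Returns True if a file was moved, False if there was nothing to move.
    Hypothetical helper; the path layout is taken from the post above.
    """
    path = depot / "servers" / server / "telemetry.toml"
    if not path.is_file():
        return False
    # replace() overwrites an older backup, so the call can be repeated safely.
    path.replace(path.with_name("telemetry.toml.bak"))
    return True

# Usage, assuming the default depot location:
# set_aside_telemetry_file(Path.home() / ".julia")
```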

2 Likes

With the difference that you don’t have much control over what cookies do, whereas here you can

  1. opt out of the telemetry in Julia, which, unlike using incognito mode in browsers, doesn’t have any drawback
  2. read the code of Pkg to see what it’s doing with this data

I don’t see any benefit in profiling single users in Julia (unless the evil plan is to do targeted advertising of packages in the REPL!), only in getting aggregate usage statistics, which, as I understand it, will be shared with the public. If there was any evidence of an evil plan looming, I’m sure many people, likely including me, would either stop using Julia (not a great outcome for anybody) or hack Pkg to stop it from doing something evil.

8 Likes

To my knowledge, this is not true: the CRAN servers historically produced and maintained traditional de-anonymizable server logs that included information like IP addresses, but the maintainers were historically unwilling to share those logs with anyone outside the core team. I’ve been on direct e-mail threads with the CRAN maintainers where they’ve declined to share that kind of data, but acknowledged its existence.

Things may have changed since then, but I don’t think you’re representing the broad state of the art accurately. AFAICT Julia differs from Python and R primarily because Julia uses GitHub servers for hosting most artifacts, so GitHub has all of the truly private information for Julia users, but Python and R have that data for their communities because they host most artifacts directly.

13 Likes

Also, Firefox is an example of an open-source project that tracks at least as much data.

4 Likes

Yeah, and they disclose what they do: Mozilla’s Data Privacy FAQ — Mozilla

But that’s beside the point, in my opinion; what other projects do or don’t do doesn’t change my (personal) belief that a project like Julia should be up front about their user telemetry. There is even a perfectly legitimate reason to do it! How many projects have that luxury?

I mean, I get that some people may believe undisclosed telemetry is OK, or just plain don’t care, but pointing to what other projects do as justification is unconvincing. Two wrongs don’t make a right.

7 Likes

Thank you for sharing this. So my understanding is that Julia is proposing sharing with Julia package owners the sort of information that the CRAN maintainers are not willing to share with R package owners; is that right? Does the CRAN “core team” include R package maintainers?

It looks like no one but me opposes UUID tracking. There’s no need to change things just for me. I hope you all will forgive me for sharing a few thoughts on it.

2 Likes