Pkg.jl telemetry should be opt-in

StefanKarpinski · July 2, 2020, 4:17pm

First of all, @ninjin, thank you for the post—it’s fair, reasonable, gracious and well written.

CI services do tend to have predictable IP blocks. (Various people object to us logging IP addresses, so I guess for the most hardcore privacy-concerned, even that is an issue.) However, CI services sometimes change IP blocks, which is hard to detect and respond to unless there are other reliable indicators that some requests are CI. It also happens pretty regularly that someone spins up a new public or private CI setup/service that uses the free, public package servers. The question is how does one detect when one of these things has happened and respond appropriately?

With CI indicators, you can see things like “oh hey, there’s a large number of requests with some CI indicator set that are hitting this package server and costing us a ton of money; guess the IP blocks changed or someone spun up their own system and is using the free public package server to feed it.” Keep in mind (as you’ve already mentioned yourself, @ninjin) that this service is operated on a volunteer basis by the same people who develop Julia for free. We are not full-time sysadmins whose job is to stay on top of which IP blocks every possible CI system is using, or to pore over the logs looking for this kind of thing to keep costs under control. We need to be able to automate as much as possible.
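To make the kind of automation described above concrete, here is a minimal sketch, not the package servers' actual tooling. It assumes each log record has already been reduced to an `(ip, is_ci)` pair; the record shape, the `KNOWN_CI_BLOCKS` list, the `/24` grouping and the threshold are all hypothetical choices for illustration, and the addresses are documentation ranges rather than real CI blocks.

```python
import ipaddress
from collections import Counter

# Hypothetical list of already-known CI address blocks (documentation range, not real CI IPs).
KNOWN_CI_BLOCKS = [ipaddress.ip_network("192.0.2.0/24")]


def unknown_ci_sources(records, known_blocks=KNOWN_CI_BLOCKS, threshold=3):
    """Count CI-flagged requests that come from outside the known blocks.

    `records` is an iterable of (ip_text, is_ci) pairs (an assumed log shape).
    Returns {"/24 network": count} for networks with at least `threshold` hits.
    This sketch only handles IPv4 sensibly because of the fixed /24 grouping.
    """
    counts = Counter()
    for ip_text, is_ci in records:
        if not is_ci:
            continue  # interactive or unflagged traffic is not what we are hunting for
        ip = ipaddress.ip_address(ip_text)
        if any(ip in block for block in known_blocks):
            continue  # already accounted for as a known CI service
        # Group by the enclosing /24 so one busy service shows up as one entry.
        net = ipaddress.ip_network(f"{ip_text}/24", strict=False)
        counts[net] += 1
    return {str(net): n for net, n in counts.items() if n >= threshold}


records = [
    ("192.0.2.10", True),     # known CI block, ignored
    ("198.51.100.5", True),   # CI-flagged, unknown block
    ("198.51.100.6", True),
    ("198.51.100.7", True),
    ("203.0.113.9", False),   # not CI-flagged
    ("203.0.113.10", True),   # CI-flagged but below the threshold
]
print(unknown_ci_sources(records))
```

Without a CI indicator on the request, the `is_ci` column would not exist and the only signal left would be raw request volume per address.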

> My suspicion is that we care less about non-interactive installations – in particular CI – in terms of their hardware, operating system, packages installed, etc. So maybe it is fair to say that we can expect that the installations we are primarily interested in will be run interactively at least at some point?

While that’s true, I also think that automated systems have less of a right to privacy than actual human users do. A CI system that is using the free public package server does not have a strong right to privacy protections, imo. The user who requests that CI run is a different story, but nothing about them is exposed: the client UUID is ephemeral in this case and doesn’t relate to the user in any way.

> However, there is an option here of a user report rather than full-on telemetry. Collecting data locally is perfectly acceptable in my book, then after a given time frame one can present the concrete report in an interactive session and ask: “Pardon the interruption, but locally on your machine we have compiled the following report (here is an excerpt) which would be useful for the community (mention desiderata?). Would you be happy to share it? A part of it? Submit automatically next time? Ask again next time? Never ask again and stop any collection?”. I may be naive, but I feel that this will allow us to increase the number of opt-ins, while not having to resort to opt-out as is evidently anathema to at least a subset of us, as I know for a fact that as much of a privacy nut that I am even I have agreed to filing user reports like this.

That’s certainly a way things could be done, but it seems quite complicated and hard to implement. You need to aggregate user data locally somewhere and constantly update it without corrupting it, even when multiple Julia processes might be accessing the database of usage data concurrently. This has to work across all kinds of user file systems, which, let me tell you, is a constant source of shenanigans. (“Ah, but did you think of someone using Linux but mounting an NTFS drive?!?” That is a real problem we’ve encountered in Pkg recently.)

The only sane way I can think of to do that would be to use something like SQLite to maintain this data, since it handles concurrent access and works everywhere. So that’s possible, but then we’ve made SQLite a dependency of Pkg just to collect user data, which doesn’t seem great. Also, how does that look to users: are they really going to believe that it’s perfectly innocent that we’re maintaining a literal database about things they’ve done that we want to upload to a server periodically? That seems way more likely to freak people out than sending a few well-documented headers with each request to a server that you’re already talking to anyway.
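For a sense of what the SQLite route would involve, here is a minimal sketch in Python using the standard-library `sqlite3` module. It is an illustration only, not anything Pkg does: the `usage` table, the event names and the `record_event` / `read_counts` helpers are hypothetical. Each update runs in its own transaction, and the `timeout` makes a writer wait for another process's lock rather than fail right away.

```python
import os
import sqlite3
import tempfile


def record_event(db_path, event):
    """Increment a local counter for `event` (hypothetical schema)."""
    # timeout: wait up to 5 s if another process currently holds the write lock.
    conn = sqlite3.connect(db_path, timeout=5.0)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS usage "
                "(event TEXT PRIMARY KEY, n INTEGER NOT NULL)"
            )
            conn.execute(
                "INSERT OR IGNORE INTO usage (event, n) VALUES (?, 0)", (event,)
            )
            conn.execute("UPDATE usage SET n = n + 1 WHERE event = ?", (event,))
    finally:
        conn.close()


def read_counts(db_path):
    """Return the locally aggregated counters as a dict."""
    conn = sqlite3.connect(db_path, timeout=5.0)
    try:
        return dict(conn.execute("SELECT event, n FROM usage"))
    finally:
        conn.close()


# Usage: the database lives in a throwaway temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "usage.sqlite")
for ev in ["add", "add", "update"]:
    record_event(db_path, ev)
counts = read_counts(db_path)
print(counts)
```

Even this small version depends on the file system's locking behaving correctly, which is exactly where unusual mounts tend to cause trouble.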

The UI also seems very hard to get right. How does one present that data to the user to ask them if they want to share it? As a lot of raw records? That seems like an overwhelming amount of data to show them. Or should it be distilled down to a summary? In that case are we really being fully transparent with them about what’s being shared? In the current scheme, we show the user exactly what’s sent to the server if they want to see it and the first time they connect to a package server, we tell them how to print that information with a link to a page explaining what it means.

Finally, I suspect that very few people would share this data. Yes, this is how bug reports from application crashes work: “Here’s a crash report. Are you willing to share it with the developers to help them improve the application?” But for crash reports you just need one person who encountered a bug to submit a report for it, so getting an unrepresentative sampling is totally fine (and if you don’t get a bug report for a bug, then you just don’t fix it). For the purpose of understanding representative Julia usage, however, that kind of unrepresentative smattering of reports does not seem effective. I don’t believe that we could, with any real confidence, claim that such reports tell us how many Julia users there are.
