Pkg.jl telemetry should be opt-in

We had 23k users that have opted in to the VS Code Julia extension telemetry in the last 90 days (not crash reporting). The VS Code marketplace says that we had 120k unique installs over the lifetime of the extension. So the absolute lower bound on the opt-in rate is something like 20%. But given that the 120k is the cumulative number of something like four years (as far as I understand it), the opt-in rate is probably much higher than that.
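As a quick check of that arithmetic, here is a small sketch in Python. The function name is purely illustrative, and it assumes the number of people who ever saw the prompt is at most the lifetime install count, which is what makes the ratio a lower bound.

```python
def opt_in_lower_bound(opted_in: int, lifetime_installs: int) -> float:
    """Lower bound on the opt-in rate, assuming active users <= lifetime installs."""
    return opted_in / lifetime_installs


# Figures quoted above: 23k opted in over 90 days, 120k lifetime installs.
rate = opt_in_lower_bound(23_000, 120_000)
print(f"{rate:.1%}")  # prints 19.2%
```

If the real number of active users over those 90 days is smaller than 120k, the denominator shrinks and the rate goes up accordingly.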

We have two distinct opt-in choices: one for telemetry and one for crash reports.

10 Likes

How does VS Code prompt for opt in? Also, how do you know the number of unique installs if VS Code isn’t using unique client ids for all users?

8 Likes

Isn’t the point that the PkgServer stats could be used to help obtain grants? I confess that it does indeed seem to me that the proposed stats would influence financial decisions and that the package stats would constitute a metric that is (at least) like a reputation system. I.e. there is an incentive for some package owner to make a Sybil attack. What am I missing?

It seems to me that even if we get the person detecting Sybil attacks to have the right incentives (e.g. so that the person is not inclined to process the data so as to result in the highest number of Julia users) and make sure that a Sybil-detection system could detect known attacks, there would still be some possibility of a smart, hungry attacker whose livelihood depends on high PkgServer stats for a package. I don’t claim any expertise on this; is there an obvious reason that I should not be a bit skeptical about the quality of the stats that would come from an opt-out system that collects IPs or IPs+UUIDs?

Edit: Above I was assuming an attacker could use a VPN or other tools so as to display multiple IPs. Is this tough or easily detectable?

2 Likes

It costs more than you can gain. You’re talking about secondary or even tertiary effects here. It’s like saying you can boost Instagram followers in anticipation of making money; it’s not that easy. I don’t think it’s cost effective to:

  1. write a somewhat complex and usable package in Julia
  2. spam it with some user stats
  3. go on to apply for a grant, which comes with no guarantee whatsoever regardless of your effort anyway (&& it’s such a pain to apply for a grant even if you love your job)
  4. pretend to be working on the project onwards because you have a grant and people watch you?

also btw, people won’t just use a package because of stars / stats.

3 Likes

And don’t forget the emotional/egoist factor. When an application crashes, that’s annoying, and telemetry does not feel so much like a charitable action on my part, giving away my data for the benefit of the community; it feels more like a lazy way of complaining and nagging developers with reports that they should look into in order to improve my user experience. So many people who would not opt in to Julia’s telemetry would be more than happy to click “yes” to sending crash reports.

4 Likes

I’m not really concerned about Julia’s package developers doing something this unethical. If they do this and use artificially inflated numbers for fundraising, they’re likely to lose their funding and ruin their careers. If it’s discovered, we can simply discount any usage numbers for the relevant packages during the affected period. It also seems strange that you’re trusting these developers to execute arbitrary code on your machine, but you can’t trust them not to cook up fake download numbers?

9 Likes

Agree, this is pretty far-fetched.

1 Like

Thank you for humoring and responding to my strange and far-fetched idea :slight_smile: I guess you’re right in that the Julia community does not have anyone who would take the steps I discussed.

It’s of course possible but I think not incentivized like it is in public marketplaces, so not that likely, and I think the remedy is social, not technical.

5 Likes

@StefanKarpinski I agree that many of the issues are best dealt with at a social level. However, I made several technical suggestions above which I hope you have some time to take a look at. To collect them together in one place:

  1. A very small reminder in the Pkg prompt, plus a command mentioned in the help menu of the Pkg prompt that lets you easily turn telemetry off and on, so you don’t have to find a file and edit it. You can just type ] pkgtelemetry off or ] pkgtelemetry on.

  2. Placing the telemetry database server behind a proxy so that whatever machine is collecting the telemetry has no access to the IP from which it comes.

  3. Storing no logs on the proxy except temporary logs in volatile storage (RAM disk), and keeping logs on these machines for no more than, say, 1 or 2 days, solely for the purpose of detecting abusive / attacker machines.

  4. Randomly regenerating UUIDs on the user’s own machine using a known random schedule so that UUIDs can’t be used for long-term tracking that connects an individual across every single package they’ve ever installed, uninstalled, modified, etc.

All of those would keep the data collected just as useful, and make it even less personally identifying than the current “proposal” (which, as I understand it, isn’t really set in stone yet, which is why I bring this up).
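To make the last suggestion above concrete, here is a minimal sketch of a client-side ID that regenerates itself on a randomized schedule. It is written in Python for illustration only; the class name, the 30 to 90 day window, and the in-memory storage are all assumptions of mine, not anything Pkg actually does.

```python
import random
import time
import uuid

DAY = 86_400.0  # seconds


class RotatingClientId:
    """Hypothetical client ID that is replaced after a random lifetime."""

    def __init__(self, min_days: float = 30, max_days: float = 90, rng=None):
        self._rng = rng or random.Random()
        self._min_lifetime = min_days * DAY
        self._max_lifetime = max_days * DAY
        self._value = None
        self._expires_at = 0.0

    def get(self, now=None) -> uuid.UUID:
        """Return the current ID, regenerating it first if it has expired."""
        now = time.time() if now is None else now
        if self._value is None or now >= self._expires_at:
            self._value = uuid.uuid4()  # fresh random ID, unlinkable to the old one
            lifetime = self._rng.uniform(self._min_lifetime, self._max_lifetime)
            self._expires_at = now + lifetime
        return self._value


client_id = RotatingClientId()
first = client_id.get(now=0.0)
print(client_id.get(now=29 * DAY) == first)  # still within the minimum lifetime
print(client_id.get(now=91 * DAY) == first)  # past the maximum lifetime, so rotated
```

Within one lifetime the server can still count unique clients, but it has nothing that ties the old ID to the new one after a rotation.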

WDYT?

4 Likes

Regarding your first point, see https://github.com/JuliaLang/Pkg.jl/pull/1895.

4 Likes

Love it.

VS Code itself uses opt-out. I wish they used opt-in instead. The Julia extension opt-in prompt looks like this:

[image: screenshot of the Julia extension’s opt-in prompt]

VS Code is using a unique client id.

I don’t like how VS Code handles telemetry and wish they didn’t do it this way. But it does give us a way to say something about the opt-in rate for the Julia extension, and it is pretty clear that it is nowhere near 0%.

9 Likes

It’s a bit ironic that you couldn’t make this argument in the first place without the benefit of VS Code’s opt-out client id. I also find the opt-in UI for the Julia plugin pretty misleading: it looks like you have no choice but to click “I agree to usage data collection”. If the user were presented with a clear “yes, please send data” versus “no, don’t send anything” choice, your opt-in rate would be much lower. I don’t really think that an opt-in that looks like it’s not optional is any more responsible or transparent than what Julia 1.5 does with printing a notice with details on how to opt out, which is at least very clear about what’s happening and that it’s optional.

25 Likes

Frankly, I think a lot of people are getting pretty tired of these “click here to make this ass-covering privacy notice go away” popups that now litter the experience of using a computer. I would much rather that websites and applications collect a minimal, reasonable amount of data and use it transparently and responsibly instead of disingenuously going through the motions of asking for my approval—which is actually just so that they can cover their butts legally and collect more data than the law allows them to without getting me to click that button.

So, that’s what we’re doing in Pkg: rather than presenting the user with a privacy prompt that looks like they have to click yes, the usage data is minimal enough that it is legal to collect without permission under even the most restrictive privacy legislation. There is a clear notification about what is sent and how to opt out. That seems like the least annoying, minimally intrusive approach which still allows the project to collect vital usage data needed to support the ecosystem.

If you are privacy-concerned and don’t want to share your usage data, you can turn telemetry headers off as detailed on the legal data page. Or, if you’re that privacy concerned, you can turn off the package protocol altogether by doing export JULIA_PKG_SERVER="". That causes Pkg to fall back to installing packages directly from GitHub (of course Microsoft is collecting way more data on you, but :man_shrugging:). If you feel that Pkg’s opt-out telemetry morally obligates you to warn people before they use Julia, that’s fair too—please show them the legal data page. We are not hiding anything here.

63 Likes

Can we put you in charge of the internet, please?

(Edit: I just realized that my post may come across as sarcastic, not serious, or just be unclear. What I meant is complete support for Stefan and the rest of the Julia team, and admiration of the way they’ve handled this thorny issue – I can only hope all software developers learn from their example).

18 Likes

Is there a reason unique IDs are necessary? If not, there would be the option to make it fuzzy: give every user a random (e.g.) UInt16 number. There will be clashes (that’s the whole point of it) but you can still estimate that 65400 unique numbers (of 65536 possible) corresponds to 400k ± 15k unique users, and 65500 unique numbers corresponds to 500k ± 30k unique users. At that point each person shares their number (on average) with 8 other people. If the estimate gets too fuzzy, the bits of the random number can be increased.

I’m aware that this method (if it works at all, I haven’t thought it through to the end) would require quite a bit of additional work, so with 1.5 being so close this idea is mostly theoretical, but I’m interested in thoughts.

Edit: For the estimate of unique users I was using mean ± 3*stddev. Since the number of unique IDs can’t be normally distributed (due to the upper bound), here is mean ± 90%-quantile, which gives roughly 400k ± 10k and 500k ± 20k.
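One way to see how such an estimate could work is the classic occupancy calculation: if N users each draw one of M equally likely IDs, the expected number of distinct IDs seen is M * (1 - (1 - 1/M)^N), and that relation can be inverted for N. The sketch below is in Python rather than Julia, the function names are made up for illustration, and it only reproduces the point estimates, not the error bars.

```python
import math

ID_SPACE = 2**16  # number of distinct UInt16 values; draws assumed uniform and independent


def expected_distinct(users: int, id_space: int = ID_SPACE) -> float:
    """Expected number of distinct IDs observed when `users` each draw one ID."""
    return id_space * (1.0 - (1.0 - 1.0 / id_space) ** users)


def estimate_users(distinct_ids: float, id_space: int = ID_SPACE) -> float:
    """Invert expected_distinct: how many users would yield this many distinct IDs?"""
    if not 0 <= distinct_ids < id_space:
        raise ValueError("distinct_ids must be in [0, id_space); a full ID space gives no estimate")
    return math.log(1.0 - distinct_ids / id_space) / math.log(1.0 - 1.0 / id_space)


for seen in (65_400, 65_500):
    # Lands near the 400k and 500k figures quoted above.
    print(seen, round(estimate_users(seen)))
```

The estimate gets steeply more sensitive as the ID space fills up: going from 65400 to 65500 distinct values moves it by roughly 90k users, which is why the uncertainty grows and why more bits would eventually be needed.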

1 Like

yeah it’s not UInt16 it’s UUID, probably enough to give each grain of sand on earth without a collision

1 Like

How do you estimate a number of users larger than 65k with only that many unique values?

4 Likes

Reading through this discussion was a good experience just now.

There are good points being made.

With my risk management hat on though, I do feel this hasn’t been thought through nearly enough. v1.5 rc is already in the wild and there isn’t even a draft privacy document for us to have a look at. I’m afraid that releasing this opt-out telemetry thing to the world coincident with the official v1.5 release comes with some potential reputation risk. Can it wait until v1.6 (which makes sense since it is supposedly the next LTS)?

As @tbeason pointed out, not a lot of people have commented here, but enough have that I am a little concerned. It would only take a small group of highly vocal people to turn this into a significant issue and an unfortunate distraction.

The reputation of Julia does have financial consequences for me and others. If I am in the middle of negotiating a project with a financial services company and they are deciding between Julia and Python, it would suck if a big negative headline about Julia’s data privacy policies hit the newspapers.

I believe this can be done right. I believe even an opt-out solution can be done right, but I also think this needs to be thought through more thoroughly. Lawyers have been consulted. How about risk managers and compliance officers? I would not trust my reputation to lawyers. @ninjin made some excellent points. What efforts are you making to keep IP addresses and UUIDs separate? Understanding that would help put me at ease.

While thinking about this, I had an idea that might help increase the number of people opting in. As a package maintainer, I would love to see the stats for my package. What if the price for me to see the stats about my packages is that I opt in to telemetry? You can make high level summary statistics about Julia itself freely available to everyone, but to see stats about my packages, I need to opt in to telemetry and access that data through JuliaHub or something. It’s just a thought. Maybe a bad idea, but thought I’d share it.

I do trust you all to do the right thing, and posting here and listening to us reinforces that trust.

9 Likes