Pkg.jl telemetry should be opt-in

Will there be any public information about the popularity of registered packages, in number of installs? Or will that information be delivered to package authors somehow?

2 Likes

As a fellow “privacy nut”, it is not about the threat model, whether the invasion is minimal, or telemetry beneficial for the community, but rather about how consent is extracted from the user. To me, any opt-out telemetry is inherently non-consensual as it violates the expectations I have for FOSS software. But perhaps this just shows me being out of touch with the times or that the software utopia I strive for is different from that of the majority of the community? Perhaps nutters like me are best ignored as we are possibly but a rounding error in the community and not contributing much while causing obstruction? I honestly do not know.

What I can say is that when I face my class of a hundred or so students this autumn and introduce Julia I now feel that I morally need to add “…and this is how you disable the opt-out telemetry” so as not to be complicit in forcefully extracting their consent. Likewise, when I bundle Julia for my package manager I now feel that I need to add code to disable the telemetry and add an opt-in flag so as to “right a wrong”. It saddens me greatly as I feel that it all distracts from the greatness that is Julia and its community – but perhaps that is due to my own misguided, antiquated idealism?

22 Likes

My view on this is simple: I think this should be opt-in, not opt-out. I don’t mind nudges to opt-in, but I generally feel that the “right” thing to do with telemetry is to never do it unless a user opted in. In my view a desire to count users doesn’t trump that principle.

24 Likes

I think the view when this was made is that since the info is less than that given when you make the HTTP request to get the package, it’s hard to believe that users would care more about the Julia community getting the information than Microsoft (github).

4 Likes

If it’s opt-in, I can guarantee that approximately 0% of people will opt in, so it would be approximately 0% useful.

24 Likes

I also think that reactions to telemetry practices are often the reverse of what they should be: people remain passive when they become aware that their privacy rights have been silently abused, and then demand more control when they encounter some service that does better and provides users with that opportunity. (Note: I’m generalizing, not referring to any specific person that participated in this conversation.)

That being said, I don’t think that the GitHub-Microsoft argument is right here. It’s true that right now Pkg interacts constantly with GitHub, but it can be otherwise; Pkg might potentially interact only with servers, registries, etc. under control of the user.

Regarding opt-in and opt-out: If the problem is that Pkg must be able to work noninteractively, and opt-in in such a context means losing the vast majority of users, would it make sense to move the decision to another point where interactive use is the norm? E.g. prompt the users with the question when installing from binaries? CI might use different, opt-in binaries.

4 Likes

I don’t think so. There is also telemetry on VSCode and the Julia extension and I opted in. @davidanthoff can probably say something about the usefulness of the data they gain and perhaps on how many people opt in?

2 Likes

That is different, because in VSCode the telemetry is used to send crash reports, if I remember correctly.

Edit: As it turns out there are two options, one for crash-reports and one for telemetry.

It is sufficient to just get a subsample of all crash reports. For Julia the attractiveness lies in knowing something about the whole population of Julia users.

1 Like

It could still give an answer as to how many people would opt in. Sure, a single crash report is still good information, whereas a single opt-in for Pkg usage would be useless.

I just think that enough people would opt in. But if the VSCode opt-in rate is very low I am probably wrong.

1 Like

Perhaps you can add it to the section where you already talked about the importance of spoofing your IP address so as not to send it to GitHub when using the package manager.

12 Likes

Please do not do this, this is simply below the belt. I respect you, your contributions, and this topic too much to respond “in kind”. Sure, if I had a magic wand we would all be using only free software, use sourcehut, etc. But nowhere have I attempted to enforce this vision on anyone and as I am willing to use GitHub, this website despite Google Analytics, etc. I am evidently perfectly willing to compromise. All I ever said was that I feel that opt-out, in my view, is not a valid form of extracting consent and that I would feel obliged to voice this concern – take that or leave it, but I feel that mockery is unwarranted. I will try to respond to @StefanKarpinski’s comment when I have the time as I think a compromise is very much possible, but there are so many aspects of this mixed together that would need to be unpacked.

Edit: Better link showing the whole thread and lifting in the quote for clarity as the system removed it.

14 Likes

Something to be aware of when using data like this:

  • data could be forged by randomized requests (might even be recurrent with same uuid)
  • data is skewed towards those people who did not opt-out
  • ci uuids might be useless since the environment usually gets recreated every time
  • detecting unknown ci services (unknown env variable) might be hard with the current data (but is one of your goals)

The second point is interesting in that you might think you are catering to your users by making data-guided decisions when in reality by doing this you are gradually shifting your target audience towards people who care less about privacy.
I’m not going to condemn this move, but considering that many people do care about this (even if underrepresented in a forum like this) there might be other options, especially for the CI-usecase.
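
To make the second point concrete, here is a minimal sketch of how opt-out skew can distort a data-guided decision. All numbers are made up for illustration (the group sizes, opt-out rates, and the hypothetical "wants_feature" preference are assumptions, not measurements of Julia users); it only shows the arithmetic of the effect.

```python
def feature_shares(groups):
    """Return (true_share, seen_share) of users who want some feature.

    true_share counts everyone; seen_share counts only the users who
    did not opt out and are therefore visible in telemetry.
    """
    total = sum(g["users"] for g in groups)
    true_share = sum(g["users"] * g["wants_feature"] for g in groups) / total

    # Users that remain visible after opt-outs, per group.
    visible = [(g["users"] * (1 - g["opt_out_rate"]), g["wants_feature"])
               for g in groups]
    visible_total = sum(n for n, _ in visible)
    seen_share = sum(n * w for n, w in visible) / visible_total
    return true_share, seen_share


# Hypothetical population: a large group that rarely opts out and a
# smaller, privacy-conscious group that mostly does.
groups = [
    {"users": 800, "opt_out_rate": 0.05, "wants_feature": 0.20},
    {"users": 200, "opt_out_rate": 0.80, "wants_feature": 0.90},
]

true_share, seen_share = feature_shares(groups)
print(f"true share: {true_share:.3f}, share seen in telemetry: {seen_share:.3f}")
```

With these assumed numbers the telemetry would understate demand for the feature (roughly 0.235 seen versus 0.34 in the full population), simply because the group that wants it most is the group that mostly opted out.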

2 Likes

Insisting on opt-in is certainly a valid position, but perhaps claiming that this form of transparent, well-documented, and anonymous telemetry will put you in a difficult moral position when you ask students to use Julia is a bit overboard.

I don’t think that either the definition or the examples there apply to a case where the data is neither sold nor used for a market advantage, just to demonstrate a user base with the intention to get funding and recognition; the connection between the data and the advantage is very indirect.

7 Likes

There is no mockery intended and you are of course free to express whatever concerns you have to whoever you want. You did, however, choose to use strong terms like “complicit in forcefully extracting consent” and if you feel like that and that you have to make a special mention to a class for this, then it only stands to reason that there are many, many more points you should have on that particular slide in your lecture. Otherwise, to me, it just looks like you’re picking on the guy that is honest.

21 Likes

Fair point, it could have been a lot more elegant and I do in hindsight feel that was a bit undue. My position still stands though.

I do not, however, feel that I am “picking on the honest”; rather, I am holding Julia to a much higher standard than GitHub, et al. in how consent is extracted (just like I feel more aghast when hearing about atrocities in a parliamentary democracy than in a dictatorship). That is why I think it is not just a matter of consent, but consent mixed in with expectations.

Apologies again for a short response, I will try to unwrap it all somehow now, but it may very well take a few hours of on-and-off writing.

Edit: Grammar correction.

5 Likes

As I’ve mentioned earlier in this thread, I continue to believe it’s important to note that, AFAICT, none of the existing popular languages are meeting this expectation if they run their own hosting servers and those servers store standard server logs: Python and R are almost surely not meeting this expectation for that reason – to say nothing of other OSS ecosystems. With that comparison against the standard of practice in mind, I think a more charitable view of what @kristoffer.carlsson was trying to say is that it feels like your current boundary line on this topic changed (or at least became actionable) when you became aware of what data Julia wants to collect (and that seems to have happened precisely because it was explicitly disclosed in their plan), but you aren’t making clear how you will translate your boundary into a law that will be enforced uniformly across all parties – instead you’re arguing the principle in one specific forum in a way that will affect only one OSS project. It seems reasonable that, as a result, anyone working on a new language that’s trying to keep up with incumbent languages that aren’t being held to the same expectations would feel your approach is both very well-intentioned and a little unfair in its application. Does that sense of unfairness through inconsistent application of principles make sense to you as the reason why you’re getting push back?

19 Likes

Just a very quick response: as I think comes across much more clearly in my initial post and what I wrote on Hacker News, I feel conflicted about this and I am not yet 100% sure how to formulate this. Yes, one aspect of it is consent, but perhaps as I wrote in my first point this is some misleading idealism that guides me here and it should be disregarded? I am not sure, but what I am absolutely sure of is that I am not alone in feeling uneasy about creating and sending out a uniquely identifiable ID without opt-in consent.

Just to foreshadow what I think is at work here: explicit/implicit financial pressure from external parties, a wish for a more solid ecosystem through analysis, better guidance in terms of what the community desires, the concept of what constitutes consent, expectations of privacy, the state of “surveillance” today, unavoidable identity exposures through existing protocols, simplicity of implementation vs. other aspects, bundling together CDN/CI usage with end-user statistics, etc. It is all very complicated, at least in my eyes, although I may be wrong, and I feel we may easily talk past each other as we probably only see parts of the puzzle. But I am an eternal optimist in the sense that I believe in the good will of the community to discuss this and that both an ideological and technological compromise is possible.

6 Likes

I think this is the crux of the issue: you’re already doing that across the Internet since your IP address is part of many (most?) normal HTTP requests. It’s not perfectly uniquely identifiable, but it’s not so far away from being that and it’s being submitted without even the possibility of opt-out in most cases / for most people.

So I think the core issue this thread should resolve is: would it be better for Julia to just do everything via logging IP addresses? That’s what everyone else in OSS is already doing (seemingly with almost no concerns raised), so perhaps the problem is just that Julia is talking about how to best do things rather than just doing them? That feels quite perverse to me, but it’s my big fear after reading this thread.

24 Likes

This I think captures the issue. @StefanKarpinski and others went out of their way to build a system that captured less information than IP addresses and is very hard to fingerprint in order to ensure as much privacy as possible, but because they have done something special and so clearly documented it, people are in uproar because everyone knows it exists and it’s different. Meanwhile, all of the other languages’ package managers (R, Python, etc.) are just silently collecting IP addresses from their servers, so they know where your house is, but most people don’t realize they know this about you so no one is upset. I think there’s a law for the new age of the internet:

The more you tell people about potential privacy issues and make explicit what you are doing, the more people will complain about the privacy issues.

That’s not to say that we shouldn’t care about privacy issues, but I think we should internalize the gradation of possible amounts of data collection and understand the personal effects given the amount that’s collected by a given party. Right now, it looks like anyone who explicitly tells the public about what’s being collected, how it’s being collected, why it’s being collected, and how to opt out is… going to be punished more? Over time, that reaction will have the opposite effect of what I believe those who are concerned wish to see.

36 Likes

Perhaps it would be useful for this thread to focus on sharp comparisons with other systems like condastats or dlstats?

3 Likes