Pkg.jl telemetry should be opt-in

Unfortunately, you are not in a position to decide this. One needs to play by the given rules or get into a position to change them. Open source is not free software; open source developers need bread and butter to keep coding.

I really don’t understand all of you who are against it. Many have pointed out that you give away more data just by using the internet. Basically, if you were really concerned about your privacy, you couldn’t participate in this discussion, because it happens to be on the internet.

4 Likes

The point I am making about funding is not related to privacy. What I’m saying is that funding should not be determined by usage stats. What if someone invents a highly valuable piece of software used only by a small handful of people? Usage stats (which can also be faked) should never be the primary value indicator for a piece of software.

You are mistaken if you believe that my point about that is related to privacy.

I do have concerns about privacy, but they are separate from my points about funding.

1 Like

I think I understood your point. Did you understand mine? My point is that it is pointless to complain about the criteria someone other than you uses to decide to whom they give their own money (the Gates Foundation, for example). If an organization is distributing taxpayers’ money, then there are normal democratic ways to affect the criteria and decisions (slow and difficult, close to impossible in the short term).

2 Likes

It’s not pointless. If I think you are doing a stupid job at allocating financial resources in a free software community like Julia, then I will speak up about it. There are a lot of flaws with relying primarily on usage stats for funding free software. This is especially the case for scientific and technical software, which should not be driven by usage stats, but by scientific value.

We’re not talking about Bill Gates here, we are talking about funding for mostly scientific free software.

You seem to be missing that this isn’t about the Julia project funding packages, but outside foundations funding the Julia project. You should take this up with NumFOCUS and other funding agencies.

7 Likes

The telemetry information is necessary for much more beyond Julia Computing. Reliable information about packages is extremely useful to package developers, many of whom have chimed in. Such information helps figure out the size of the user base, its growth over time, decisions about deprecations and support for older versions, and so much more. Every package ecosystem collects these stats. For example, here’s numpy stats, but it is unclear what those numbers mean.

There has been some misunderstanding about funding in the thread, and many have addressed it. I will echo it again. Pkg stats form a small but useful part of the picture to persuade a reviewer about a project, which is often funded on the basis of the innovation proposed, the submitter’s track record, and a bunch of other things. If funds were allocated primarily on the basis of downloads, open source software would not have a sustainability problem.

However, this is not simply about funding. There is a larger problem, especially in academia, where it is hard to justify the impact of your software. Unless you are building something recognizable that a huge number of people use (say Jupyter, numpy, lapack, or even Julia itself), it is extremely hard to convince your department, your dean, your boss, or your colleagues about the impact of your contributions and work. Having quantifiable information makes it part of your CV and your career advancement. This is all the more useful for niche packages, which often do not, by design, have large userbases.

-viral

44 Likes

This seems like a perfectly reasonable explanation to me. As a user of open-source software, the least I can do is help the maintainers lower the cost of serving me software for free (though of course, as a GitHub sponsor, I do hope I’m at least covering my bandwidth used every month).

16 Likes

Given the information collected (e.g., environment variables about CI platforms), it will be very easy to disaggregate CI from non-CI. Making this claim suggests you haven’t actually read the very clear documentation.
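To make the CI point concrete, here is a minimal sketch of how CI and non-CI traffic could be told apart from environment variables. The variable list and the function name `looks_like_ci` are my own assumptions for illustration (these are variables that common CI services are known to set), not the list Pkg actually reports.

```python
import os

# Assumption: a hand-picked set of variables that common CI services set.
# This is illustrative, not the list used by Pkg.
CI_VARIABLES = (
    "CI",
    "CONTINUOUS_INTEGRATION",
    "GITHUB_ACTIONS",
    "GITLAB_CI",
    "TRAVIS",
    "CIRCLECI",
    "APPVEYOR",
    "TF_BUILD",
)


def looks_like_ci(environ=None):
    """Return True if any known CI variable is set to a truthy-looking value."""
    env = os.environ if environ is None else environ
    for name in CI_VARIABLES:
        value = env.get(name, "").strip().lower()
        # Treat unset, empty, "0" and "false" as "not CI".
        if value not in ("", "0", "false"):
            return True
    return False
```

With something like this on the reporting side, each request can be labelled as CI or interactive, and the two populations can be counted separately.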

I don’t follow issues and PRs enough to know when this started to be discussed, but

  1. Presumably Stefan didn’t do this by fiat. Given that a number of other core language maintainers on this thread have been speaking out in support of the move, it seems like many people in a position to decide agreed.
  2. The engagement on this topic, the fact that development happened in the open, that there’s really clear documentation written (all before this thread started), and that a warning is printed the first time you use the new system are all pretty clear indications that no duplicity is intended.
  3. Despite the real collaboration that is fostered by the engagement of the core developers with the community, and the solicitation of feedback and contribution, open source projects are not democracies.

I haven’t seen any evidence that remotely supports this kind of statement. To the contrary, the clarity of purpose, ease of opt-out, and the conversation here are pretty strong evidence against such a claim.

This betrays a rather naive understanding of human behavior. Cf. https://science.sciencemag.org/content/302/5649/1338.full

You are of course entitled to your opinion, but you do not have a monopoly on what is right. I am satisfied that this decision, especially in light of the implementation and discussion of it, is the right call, nicely balancing competing interests.

Slippery slope arguments, especially of this nature, are pretty lazy. As a vegetarian, I don’t go around saying “well if you’re willing to eat animals, next week you’ll probably start eating people.”

32 Likes

I agree with you, but I wonder if it would have been better to have this discussion after a tentative decision had been reached, but before the PR was merged.

Yes, I know, the code is not necessarily final. But hearing about the telemetry (which, again, I fully support and find innocuous) first in a topic like this is just different.

18 Likes

Given the reaction, I think you’re probably right. Certainly, it would be nice to be able to point to a more open discussion that happened beforehand. At the same time, if the decision was made, giving the illusion that the community had veto power might actually have been duplicitous. On the other hand, it seems like there might be legit design decisions that could have benefited from early community feedback (e.g., keeping IP addresses separate).

14 Likes

This telemetry business has been done on other open-source (CLI) systems, like Homebrew, and was never well-received. See, e.g., https://news.ycombinator.com/item?id=11566720

To be clear — I get the need to raise money, so I sympathize with the plight of the Julia team. Nevertheless, I do not support opt-out tracking, no matter how well-intentioned, privacy-oriented, or well-designed. If this telemetry will be opt-out-only, I would like to request that Julia at least respect the global DO_NOT_TRACK environment variable proposed on https://consoledonottrack.com
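For what it’s worth, honoring such a variable is cheap. Below is a minimal sketch of the check, assuming (as I read the proposal) that `DO_NOT_TRACK=1` signals the opt-out. The function name `telemetry_allowed` and the tool-specific variable `MYTOOL_NO_TELEMETRY` are hypothetical, invented for this example, and are not anything Pkg defines.

```python
import os


def telemetry_allowed(environ=None, tool_opt_out_var="MYTOOL_NO_TELEMETRY"):
    """Decide whether a CLI tool may send telemetry.

    Assumptions: DO_NOT_TRACK=1 means a global opt-out, and
    `tool_opt_out_var` is a hypothetical per-tool opt-out switch.
    """
    env = os.environ if environ is None else environ
    # Global opt-out shared by every tool that honors the convention.
    if env.get("DO_NOT_TRACK", "").strip() == "1":
        return False
    # Tool-specific opt-out, accepted in a few common spellings.
    if env.get(tool_opt_out_var, "").strip().lower() in ("1", "true", "yes"):
        return False
    return True
```

The point of the global variable is that a user sets it once in their shell profile and every cooperating tool stays quiet, instead of hunting down a separate switch per tool.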

8 Likes

Why do I get a feeling that at least some of the loud voices here opposed to the tracking have happily signed over their data to FB, Google, Microsoft, Twitter, …?

6 Likes

So, it’s hard to even have a phone without at least Google or Apple involved in your life. However, the fact that they are leveraging their position as dominant market-cornering organizations, and that people have accepted that it’s necessary to allow this in order to even have a phone in the modern world, is not necessarily evidence that people are happy about it.

I don’t let any FB or Twitter apps on my phone. I have turned off location history tracking in Google’s phone settings. I look carefully at the permissions requested by phone apps, and I use the Firefox browser with the FB container and temporary containers. I use DNS over TLS for my home network, I run an open source router, and I don’t use SMS, favoring Signal instead. I use startpage.com as my search engine. I don’t allow Chrome on any of my machines, though I install Chromium for when certain sites need it, and I turn off all the Google-specific features in Chromium. I don’t use Zoom, Google Meet, or MS Teams, preferring https://meet.jit.si. I don’t use my ISP’s or Google’s DNS resolver, preferring the privacy-focused Cloudflare resolver (1.1.1.1). I have some IP cameras, but they are on a totally separate VLAN on my home LAN, and that VLAN is 100% firewalled and can’t send or receive packets from the internet. I have a VPN running on my home server, so when I’m at a cafe or visiting a campus I can turn on the VPN and 100% privatize my traffic… etc.

So no, I’m not happily signing over data. I’m OK with Julia doing this tracking if they take a few steps to ensure the UUID and IP address are stored separately. Otherwise I’ll opt out.

oh, and I donate cash annually to Julia

18 Likes

consider this:

Pkg.jl shows you its source code, whereas… IDK, I use startpage and the Firefox FB container too, but do you trust that Google options and Android permissions do what they say, and that there are no other ways to leak identifiable info?

Do you trust 1.1.1.1 to never collect anything you didn’t explicitly opt in to? Do you trust the Chromium source code more because it’s from Google? And did Chromium ever ask you to opt in to collection, or did you have to turn some of it off?

I think ~50% noise here is unfair to Julia, as if doing it openly and politely only attracts more ‘discussion’ because ~million other sites / software don’t even have lines of code you can point to (for tracking).

9 Likes

Yes, clearly Julia is better than all that… My point was just that there are things you can do to reduce data collection, and I do a lot of them. The argument that somehow most people offering criticism here are happily giving away their stuff elsewhere doesn’t hold water, at least not for me and probably not for others. Even IP addresses. As a frequent contributor to OpenWrt’s forum I can tell you it’s a daily event for us to get questions about how to set up VPNs so people can hide their IP addresses.

I have one specific agenda here which is to ensure that Julia does a good job of actually designing their data collection to achieve their goals… I take the response so far to be constructive, we’ve identified areas of concern such as collecting a UUID and an IP address together being a bad idea. I don’t think this was previously considered in its full implications.

In my opinion, the purpose of doing things openly and politely is to get discussion. It’s a bit of a concern that this was designed and inserted with no prior discussion in the forum. The discussion has been mostly productive I think.

8 Likes

FB, Google, Microsoft, Twitter

Other wrongdoings don’t justify one’s own actions.
Also, compared to web services or social media, you normally wouldn’t expect a
locally installed development tool to need to collect and report your personal
information.

3 Likes

I haven’t been involved in the telemetry discussion or design, but it seems that the Julia-HyperLogLog header is for counting the number of users without resorting to a unique identifier. See https://julialang.org/legal/data/#hyperloglog and the Wikipedia article on HyperLogLog.

HyperLogLog seems attractive because the numbers sent have a high probability of collision (as uint16 IDs would, but even more so given the small number of buckets, 1024, in the current implementation). It’s also designed to estimate the cardinality of a set while never storing the elements of that set (that is, it’s an approximate, streaming algorithm for set cardinality). The streaming aspect doesn’t seem entirely necessary here, but it does reduce the attack surface even further if the server avoids ever logging that header and does the aggregation online. I don’t think this is necessary, mind you, given the high probability of collision.
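To illustrate why the values collide so heavily and yet still allow counting, here is a small textbook-style HyperLogLog sketch with 1024 buckets. It is an assumption-laden illustration, not the actual Pkg or server code: the names `client_value` and `estimate`, the use of SHA-256 as the hash, and the exact bit layout are all my own choices.

```python
import hashlib
import math

BUCKET_BITS = 10
NUM_BUCKETS = 1 << BUCKET_BITS   # 1024 buckets, as mentioned above
HASH_BITS = 64


def client_value(client_secret):
    """Return the (bucket, rank) pair a client would send.

    Assumption: the pair is derived from a 64-bit hash of a per-client secret.
    """
    digest = hashlib.sha256(client_secret.encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    bucket = h & (NUM_BUCKETS - 1)            # low 10 bits choose the bucket
    rest = h >> BUCKET_BITS                   # remaining 54 bits
    rest_bits = HASH_BITS - BUCKET_BITS
    rank = rest_bits - rest.bit_length() + 1  # leading zeros in `rest`, plus one
    return bucket, rank


def estimate(values):
    """Estimate the number of distinct clients from their (bucket, rank) pairs."""
    registers = [0] * NUM_BUCKETS
    for bucket, rank in values:
        # Streaming update: only the per-bucket maximum is ever kept.
        registers[bucket] = max(registers[bucket], rank)
    m = NUM_BUCKETS
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction (linear counting).
        return m * math.log(m / zeros)
    return raw
```

Two things to notice: many different clients end up sending the same pair (half of all clients have rank 1, spread over only 1024 buckets), and the server only needs 1024 small integers, so seeing the same client twice changes nothing. The expected relative error with 1024 buckets is about 1.04/sqrt(1024), roughly 3%.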

I personally think opt-out data collection here is completely reasonable as it’s likely to do good for the entire community (in indirectly helping provide development resources which make the ecosystem better), and because it’s minimal enough that nobody has been able to present a credible harm that could come to users even if an attacker were to compromise all telemetry data.

17 Likes

I disagree — I think it is a relevant argument.

If someone is willing to accept a lot of sophisticated black-box tracking as the cost of using various social media and web services, I find it somewhat disingenuous to complain so loudly about some minimal, transparent, and anonymous telemetry from something that is not interested in your shopping habits, love life, or social network, just the Julia packages you use, even if it technically does represent a marginal increase in the data collection you are subject to.

Don’t get me wrong, I think it is perfectly fine to grouch a bit about yet another darn thing collecting data. But if one is resigned to all the tracking that big web companies are doing, it becomes difficult to present it as a matter of principle. A single change in the algorithms they are using (that you, of course, won’t know about) will probably increase the collection of relevant personal data (i.e., data you would probably prefer not to be collected) about you orders of magnitude more than this telemetry.

9 Likes

it seems that the Julia-HyperLogLog header is for counting the number of
users without resorting to a unique identifier.

HyperLogLog is a great concept, but its purpose is somewhat defeated when you
send a unique identifier along with it by default.
The opt-out HyperLogLog key would be more reasonable if the UUID were opt-in.

3 Likes