Pkg.jl telemetry should be opt-in

see above: Pkg.jl telemetry should be opt-in - #149 by dlakelan

Sites that collect your IP have no way of knowing who is using that IP… particularly as a machine travels around the world. But sites that collect an IP and a UUID do know when a particular machine moves around the world.

GitHub might collect IP address and userid data in the same database/table, but if they are concerned about privacy they shouldn’t do that. Since Julia is concerned enough about privacy that it’s trying to do the right thing here, it’s important that they don’t collect UUID and IP address in a correlated way.

4 Likes
  1. keeping IP logs separate from the telemetry,
  2. with restricted persistence (compatible with basic security)

Yes, these are certainly valid ways to reduce the risks of misuse.

But on a related note: from a user point-of-view all of the data is still sent
to pkg.julialang.org and since one is in no position to validate how the data
is being separated or handled, the user needs to put a certain amount of trust
in that service.
A good way to design a trustworthy service is to only require the information that
is strictly needed for operation (GDPR calls this “essential information”). If
more data than this is sent without explicit consent, then it seems natural that
people begin to raise concerns.

4 Likes

Then we need to rework it. We are not trying to trick users, we are trying to nag them. I should say, though, that the screenshot I posted appears in the area for notifications, where a lot of notifications appear and then disappear if you just ignore them. So it is not something like a modal dialog box that gives you the impression that you need to do anything or interact with it. But I take your point, we should probably add another half sentence that makes it more clear that one can just ignore the whole prompt without problems.

I have a couple other questions around the project hash, mostly just understanding what that is about. Is the general idea that you want to be able to reverse engineer the Manifest.tomls for each individual user? Or at least those parts of the Manifest.toml that make up packages that are on a given package server? That is possible with the information that you are collecting, I think?

Or is that actually not really possible? Say I have two environments: I instantiate the first (and you’ll be able to reconstruct the content of my environment from the telemetry) and then I instantiate the second one, but now the package server will only get requests for those packages in the second env that haven’t already been downloaded as part of the instantiation of the first env. So will the “picture” of my second env that the package server gets be incomplete there?

So I guess right now I’m a bit confused what the goal of the whole project hash thing is. To understand what environments users are using?

7 Likes

Alright, so JuliaComputing wants analytics on their customers/users to get more funding/improve features - good for them. I also recall this being discussed on here before, and I don’t see it as a concern and I’m a pretty paranoid person. What are they honestly going to do with it - target more free open source projects to me? Sounds great…

Keep in mind Julia is open source - they aren’t doing something evil, and even still it is optional. Maybe say “Here’s some legal crap, but you’re a software person click here to see the telemetry code”. At least it’s transparent :). If it was something evil do you know how fast it’d be on HackerNews, or like… anywhere?

I say let’s all relax and remember, JuliaComputing is a really nice inclusive group pushing OSS to its limits. Let’s try to work with them to mitigate concerns, but not freak out. Trust me, your cellphone and laptop have multiple concurrent processes doing worse things than this as you’re reading this…

8 Likes

@StefanKarpinski please please please take my data and use it to make julia better! thank you!!

16 Likes

I think that this needs clarification to avoid noise in the conversation: it’s not Julia Computing who will get the data, but “a limited subset of core Julia developers” (there may be coincidences in the people, but legally this is different).

JC may enable other telemetry options in their products (Julia Pro, etc.), which have their own terms of service.

6 Likes

As far as I know, the only way to not send the telemetry is to either disable it manually in the settings or click the small x on the notification (which just hides it once; it will show up again). If you don’t notice the x or don’t know about the settings page, you have no choice but to accept the telemetry, since that notification doesn’t just vanish, iirc. I’d personally prefer all choices (accept, deny once, deny always), including an explanation of what is sent.

I think the project hash is used for determining the number of different projects (see here):

This hash value uniquely identifies the path of the active project without revealing any information about that path. Having this value allows determining when packages are dependencies of the same project, as opposed to being used in different projects on the same client: if two requests have the same project hash value, they are used by the same project; if they have different project hashes, they are not.

the hash function is applied to the client UUID, the secret salt value, and the active project path.

The salt ensures that without the Server and the corresponding requests, you can’t reconstruct which packages are used in the same project just by observing that hash. You can’t realistically create a hash collision here.
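
To make the quoted description concrete, here is a rough sketch of how such a hash could be computed. It is not the actual Pkg implementation: the thread does not say which hash function is used, so SHA-256, the separator byte, the function name `project_hash`, and the example salt and paths are all assumptions made for illustration.

```python
import hashlib
import uuid


def project_hash(client_uuid: str, salt: str, project_path: str) -> str:
    """Hash the three inputs named in the quoted documentation.

    SHA-256 and the NUL separator are assumptions for this sketch;
    the real hash function is not stated in this thread.
    """
    h = hashlib.sha256()
    for part in (client_uuid, salt, project_path):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so ("ab", "c") and ("a", "bc") hash differently
    return h.hexdigest()


client = str(uuid.uuid4())      # stands in for the per-client UUID
salt = "example-secret-salt"    # hypothetical salt value

# Two requests made from the same project produce the same hash.
same_a = project_hash(client, salt, "/home/me/projects/A")
same_b = project_hash(client, salt, "/home/me/projects/A")

# A different project path produces a different hash.
other = project_hash(client, salt, "/home/me/projects/B")

# Without the right salt, an observer who guesses the path gets another value.
guessed = project_hash(client, "wrong-salt", "/home/me/projects/A")
```

The hash only says "same project" or "different project"; the path itself never leaves the machine, and guessing the path is not enough to reproduce the value without the salt.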

3 Likes

On the contrary, this only goes to prove that you don’t really care about getting reliable usage data. If the usage data also counts CI usage, then you are not actually counting human users, and the usage stats will be skewed towards CI pipelines.

This is another reason why the telemetry should be opt in, if you care about getting actual human user data.

Also, I’m going to have to consider switching away from Julia if this telemetry thing continues. You say it’s okay without first consulting the community. You should have held a public discussion before even making a decision on what to do.

So you decided it is okay to automatically collect data on users, where does it stop? In Julia 1.6 you’ll probably keep adding more telemetry. I’m a bit disgusted by all this, and doubt you would be getting reliable data in the first place with an opt-out approach, since the CI pipelines are constantly downloading fresh installs of Julia.

Terrible idea, and what abuse are you attempting to prevent? Are you the package police now?

1 Like

If 0% of users want to opt in, maybe that means it is a feature most people would rather not have. Another reason to not have it or make it opt in.

2 Likes

Let me guess, in Julia v1.6 or beyond, @StefanKarpinski will decide without discussion that he needs to collect crash data from all Julia users, and make it opt out as well.

This opt-out telemetry is a breach of trust for the Julia community, since it’s not clear where this will stop. I’m sure that as this is normalized, your investors will keep asking for more data, and you’ll want their money, so you weakly give in to their invasive demands.

This is kind of weak minded, and shows Julia cannot stand up for what is right.

Funding should not be primarily based on usage stats, since those stats can be skewed anyways. What’s to stop someone from automating and gaming the system with a VPN and MAC address randomizer? Then they could get more funding for their package by creating fake usage stats.

1 Like

A piece of well-intentioned advice: I can see this topic matters to you, but, as I’ve seen happen in previous threads on other topics, the intensity of your rhetoric and the use of a tone that suggests a personal resentment weakens your message.

32 Likes

Unfortunately you are not in a position to decide this. One needs to play by the given rules or get into a position to change the rules. Open source is not free software; open source developers need bread and butter to keep coding.

I really don’t understand all of you who are against it. Many pointed out that you are giving away more data just by using the internet. Basically, if you were really concerned about your privacy, you couldn’t participate in this discussion, because this discussion happens to be on the internet.

4 Likes

The point I am making about funding is not related to privacy. What I’m saying is that funding should not be determined by usage stats. What if someone invents a highly valuable software only used by a small handful of people? Usage stats (which can be faked also) should never be the primary value indicator for a piece of software.

You are mistaken if you believe that my point about that is related to privacy.

I do have concerns about privacy, but they are separate from my points about funding.

1 Like

I think I understood your point. Did you understand mine? My point is that it is pointless to complain about the criteria someone else uses to decide whom to give their own money to. (The Gates Foundation is an example.) If an organization is distributing taxpayers’ money, then there are normal democratic ways to affect the criteria and decisions (slow and difficult, close to impossible in the short term).

2 Likes

It’s not pointless. If I think you are doing a stupid job at allocating financial resources in a free software community like Julia, then I will speak up about it. There are a lot of flaws with relying primarily on usage stats for funding free software. This is especially the case for scientific and technical software, which should not be driven by usage stats, but by scientific value.

We’re not talking about Bill Gates here, we are talking about funding for mostly scientific free software.

You seem to be missing that this isn’t about the Julia project funding packages, but about outside foundations funding the Julia project. You should take this up with NumFOCUS and other funding agencies.

7 Likes

The telemetry information is necessary for much more than Julia Computing. Reliable information about packages is extremely useful to package developers, many of whom have chimed in. Such information helps figure out the size of the user base, its growth over time, decisions about deprecations and support for older versions, and so much more. Every package ecosystem collects these stats. For example, here are numpy’s stats, but it is unclear what those numbers mean.

There has been some misunderstanding about funding in the thread, and many have addressed it. I will echo it again. Pkg stats form a small but useful part of the picture to persuade a reviewer about a project, which is often funded on the basis of the innovation proposed, the submitter’s track record, and a bunch of other things. If funds were allocated primarily on the basis of downloads, open source software would not have a sustainability problem.

However, this is not simply about funding. There is a larger problem, especially in academia, where it is hard to justify the impact of your software. Unless you are building something recognizable that a huge number of people use (say Jupyter, numpy, lapack, or even Julia itself), it is extremely hard to convince your department, your dean, your boss, or your colleagues about the impact of your contributions and work. Having quantifiable information makes it part of your CV and your career advancement. This is all the more useful for niche packages, which often do not, by design, have large userbases.

-viral

43 Likes

This seems like a perfectly reasonable explanation to me. As a user of open-source software, the least I can do is help the maintainers lower the cost of serving me software for free (though of course, as a GitHub sponsor, I do hope I’m at least covering the bandwidth I use every month).

16 Likes

Given the information collected (e.g. environment variables about CI platforms), it will be very easy to disaggregate CI from not-CI. Making this claim suggests you haven’t actually read the very clear documentation.
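
As a rough illustration of what that disaggregation could look like on the receiving end: the sketch below flags a request as CI when any well-known CI environment variable is set. The variable list, the `looks_like_ci` helper, and the shape of the `requests` records are assumptions for this example, not the actual list or format that Pkg reports.

```python
from typing import Mapping

# Illustrative list only: these variables are set by common CI services,
# but this is not claimed to be the exact set that Pkg sends.
CI_VARIABLES = ("CI", "GITHUB_ACTIONS", "TRAVIS", "APPVEYOR", "GITLAB_CI", "CIRCLECI")


def looks_like_ci(env: Mapping[str, str]) -> bool:
    """Return True if any known CI variable is set to a truthy value."""
    for name in CI_VARIABLES:
        value = env.get(name, "").strip().lower()
        if value not in ("", "0", "false"):
            return True
    return False


# Hypothetical request records, each carrying the reported CI variables.
requests = [
    {"env": {"CI": "true", "GITHUB_ACTIONS": "true"}},  # a CI pipeline
    {"env": {}},                                        # an interactive user
    {"env": {"CI": "false"}},                           # explicitly not CI
]

ci_count = sum(looks_like_ci(r["env"]) for r in requests)
non_ci_count = len(requests) - ci_count
```

With a flag like this on every request, CI traffic can be counted separately instead of skewing the human usage numbers.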

I don’t follow issues and PRs enough to know when this started to be discussed, but

  1. Presumably Stefan didn’t do this by fiat. Given that a number of other core language maintainers on this thread have been speaking out in support of the move, it seems like many people in a position to decide agreed.
  2. The engagement on this topic, the fact that development happened in the open and that there’s really clear documentation written (all before this thread started), and the warning that is printed the first time you use the new system are all pretty clear indications that no duplicity is intended.
  3. Despite the real collaboration that is fostered by the engagement of the core developers with the community, and the solicitation of feedback and contribution, open source projects are not democracies.

I haven’t seen any evidence that remotely supports this kind of statement. To the contrary, the clarity of purpose, ease of opt-out, and the conversation here are pretty strong evidence against such a claim.

This betrays a rather naive understanding of human behavior. Cf. https://science.sciencemag.org/content/302/5649/1338.full

You are of course entitled to your opinion, but you do not have a monopoly on what is right. I am satisfied that this decision, especially in light of the implementation and discussion of it, is the right call, nicely balancing competing interests.

Slippery slope arguments, especially of this nature, are pretty lazy. As a vegetarian, I don’t go around saying “well if you’re willing to eat animals, next week you’ll probably start eating people.”

32 Likes