Pkg.jl telemetry should be opt-in

There should be some legal liability to the individuals holding the full data. What are the legal terms users agree to when the data is shared?

Why would I trust these individuals to report accurate aggregate data? What if they dislike me and want to prevent me from getting funding for my work by manipulating the usage data for my packages?

I already no longer trust the core developers due to various other reasons involving gossip.

This data is going to be tied to financial outcomes for Julia developers, so there must be legal liability, aside from the liability associated with security breaches.

To add one more to a sea of perspectives… I think it’s almost impossible for the average user to assess the risk telemetry could pose to them. They therefore have to rely on more knowledgeable people making a good default decision for them. Separated from the purely technical concerns, this is largely a trust issue. Because Julia is open source, I believe any qualified person with a stake in the matter can help reach a technically sound solution. You can see many good suggestions here already. But there is no real solution for broken trust, and people lose trust quickly, especially when keywords like “telemetry”, “data collection”, “unique identifiers”, and “IP addresses” come up a lot, which we have all learned to associate with various kinds of nefarious business practices. So my recommendation is to avoid, as much as we can, any appearance of “hiding it”. Opting out with a special REPL command is maybe not enough.

My suggestion would be that the first time you start Julia (interactively) on a new machine you could be asked “Would you like to share anonymous data about the packages you are installing to help the growth of the Julia ecosystem? yes / no” with a link to a detailed explanation of what it entails. For all usage without interactivity it’s opt-out by default. That way nobody could ever be surprised later to find out that information had been sent. And I think it’s totally fine if a majority of people opt out when presented with the choice this way, even if they misunderstand what’s being recorded. The trust of the community is much more valuable in the long term than usage statistics, in my opinion.
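The first-run prompt suggested above could look roughly like the following. This is only a sketch in Python, not anything Pkg.jl actually does: the function name `resolve_telemetry_consent`, the `stored_choice` argument, and the `ask` callback are all hypothetical, and it reads “opt-out by default” for non-interactive use as “nothing is sent unless the user has already said yes”.

```python
from typing import Callable, Optional

# Wording taken from the suggestion above.
PROMPT = (
    "Would you like to share anonymous data about the packages you are "
    "installing to help the growth of the Julia ecosystem? yes / no: "
)


def resolve_telemetry_consent(
    stored_choice: Optional[bool],
    interactive: bool,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True only if the user has explicitly agreed to telemetry.

    Hypothetical helper: `stored_choice` is a previously saved answer
    (None if the user has never been asked).
    """
    # A previously recorded answer always wins, so the user is asked once.
    if stored_choice is not None:
        return stored_choice
    # Assumption: non-interactive sessions (scripts, CI) never prompt
    # and never send anything.
    if not interactive:
        return False
    answer = ask(PROMPT).strip().lower()
    # Anything other than an explicit yes counts as a refusal.
    return answer in ("y", "yes")
```

A caller would persist the returned value and pass it back as `stored_choice` on later runs; something like `sys.stdin.isatty()` is one plausible way to decide `interactive`.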

6 Likes

Why trust anyone? How do you know GitHub isn’t falsifying the number of stars on a repo? Also, we currently have no data. How is having no data at all going to help you get funding?

9 Likes

GitHub has legal terms of service, and I can see the GitHub profiles of people who starred me. You haven’t explained your legal terms yet.

I would like to submit a “comment” that is intended to be productive and diffusive. I have shifted to feeling that the telemetry feature should initially ship as an opt-in feature in the 1.5 release cycle with a notice. This will allow the generation of additional user feedback (same incremental delivery/incremental adoption philosophy as multiprocessor primitives and other important features which are gradually matured). I recognize that Pkg.jl is not Julia, and that I have no ultimate control, but I feel and think this path will be best from a PR standpoint, especially with our shared desires to release Julia 1.5, to increase the amount of review the telemetry system gets (to ensure it is done right), and ultimately our shared desire to have a model, sustainable, well-designed method of tracking impact in “wide” and competitive package systems such as Julia’s. People are starting to raise pitch"fork" ideas and call into question nonprofit organizations, and I feel we need to take this slower.

To use a strained pseudo-analogy, the frequency response of human emotions is, unfortunately, really right-sided. These issues need some more representation at the lower bandwidths.

10 Likes

If you think somebody has financially harmed you through deliberate action, you can always sue them. No special agreement is needed for that.

5 Likes

And this is what I object to. I didn’t sign up to “help find funding support for open source development…” with my registration of Julia code. No matter HOW noble the cause, doing this without talking to the creators of the data and allowing them to say, “Hey, you know, the perception of user privacy might trump any value we could get from turning on tracking for all packages” – and giving them the option to decline to participate in money-making efforts – is what I find horribly objectionable.

I wish the creators of other packages the success they desire. I don’t want to support it through user tracking – not because this particular implementation is especially egregious, but because I don’t believe this is in the best interests of the user community as a whole and I don’t want to be associated with it, even by proxy. As someone else pointed out above, if it’s so important for certain individuals to know who’s downloaded their code, why don’t they put the telemetry in the individual packages? I think part of the answer is that it smells wrong to do so. If that’s the case, then why does making this language-wide legitimize the behavior except by reframing the supporters’ side as an argumentum ad populum?

3 Likes

  1. Who will have access to the data and how data would be processed is very plainly explained in the top linked document.

Data sent to pkg.julialang.org is only accessible to a limited subset of core Julia developers and is not made public or shared with any third party.

What am I missing here? How is this plainly explained, and who is in this “limited subset of core Julia developers”?

If it’s limited to core Julia developers, how is the data supposed to be used by package maintainers and others seeking to use it to obtain funding?

I see an awful lot of speculation about what is and is not legal here. I would like to remind everyone that this system was set up the right way, namely, after consulting a lawyer. That lawyer should be willing to defend that position (for a price, of course; nothing in this world is free).

If you have legal questions that concern yourself, you should retain your own lawyer. They will advise you on how to proceed.

I am not a lawyer. However, my father was, so I know a bit about how early legal advice can save a lot of trouble down the road.

1 Like

The data doesn’t say who downloaded the code.

We already have various counts that can be used e.g. in grant proposals — for example number of julia downloads, number of stars, number of website hits. Do you also object to that? GitHub also already has all of this data plus much more. So it’s ok as long as it’s not used to help anything julia-related apply for funding, or help people list impact on their CVs?

8 Likes

Do they? I realize Pkg is probably downloading from GitHub. But I don’t recall giving the Julia package manager my GitHub login. Is the Julia package manager generating and sending GitHub a UUID? Or do they have nothing but an IP address, which in my case is going to be the randomly rolling IPv6 privacy address of a proxy server that serves my entire small office network, and/or in other people’s case the IP address of a VPN service?

I would like to ask a different set of questions. Because I need to apply for federal and private grants as part of my day job, I understand the need for accurate usage data in the application.

If I have several machines, should I share that UUID among them? Or is that sort of long tail not much of a concern? Should I reinstall current packages in some way after this is live so that my current package usage is captured in the aggregate reporting?

4 Likes

Fair enough, assuming that deanonymization is not a risk. s/who has/how many people have.

We already have various counts that can be used e.g. in grant proposals — for example number of julia downloads, number of stars, number of website hits. Do you also object to that?

Not at all, because those are all transactions where the user giving up the statistic and the person providing the file or service each gain something from the transaction when it’s necessary, or where the user decides to provide this gift of the statistic as a show of appreciation. Neither of these things happens in this system-wide proposal.

GitHub also already has all of this data plus much more

I already responded that I don’t find this argument compelling.

So it’s ok as long as it’s not used to help anything julia-related apply for funding, or help people list impact on their CVs?

No. It’s not ok at all, and it misses the point: If some people want these metrics to help them apply for funding or to list impact on their CVs, or decide how many cookies to eat, they should place the metric-gathering in their own code (edit: or sign up to a central service that does this, and notify their users of it) rather than forcing others to participate in this against their will.

4 Likes

You think that it’s better to have hundreds of people import telemetry when messing it up can potentially reveal sensitive information rather than have one group do it right?

5 Likes

No, I think it’s better that we not have it at all. But if we do, then there should be a way for developers who don’t want to participate to be able to opt out. The same courtesy given to users should be given to the people who write the code that makes the Julia ecosystem what it is.

I’ve edited my previous reply to clarify.

I think it’s better to have Julia do the telemetry, to not use a permanent static UUID, and also to ban the use of opt-out telemetry in any package submitted to the general registry.

3 Likes

You object to even having download counts for packages? That seems rather extreme.

5 Likes

That seems rather extreme.

OK.

My frustration is coming from the fact that I don’t understand why the people who decided to go forward with this – brilliant engineers – can’t grasp that the way it was handled is objectionable. I’m trying to explain it the best I can, but perhaps taking a different view would help:

When I signed up to register my package, I did so under agreements both tacit and explicit. These agreements spelled out the kind of relationship I would have with the Julia registry maintainers and that they would have with me.

The dynamic has changed over the years, unilaterally, with little opportunity for input. While I understand the motivation, the idea that the project (I don’t know who runs it, so I don’t want to malign Julia Computing again) is sweeping up all my data to serve against my will in the event I no longer wish to support the community, and then making the changes described above, means that I have no control over how my code is presented to the world. Legally, this is all on the up and up – but it means, at least for me, that I will have to revisit my relationship with the community to decide whether or not I want to continue providing a service under terms that can change out from under me.

I think I have decided that there is no benefit, and only detriment, to this kind of one-sided relationship. I am hoping that someone can either explain to me why this is in my interest to continue, or can persuade whoever implemented this idea (again, without input from people I would consider stakeholders) to re-evaluate it.

As it stands, however, I don’t see much hope for a continued mutually-beneficial dynamic.

2 Likes

First of all, thank you to everyone that has contributed constructively. It has been a struggle to keep up and I made a promise to myself to at least try to maintain “the big perspective” as it is all too easy to fall back on sloppy thinking and emotions.

Before I go on to write yet another long post I do want to state that I only speak for myself and make no attempt to unify every voice on the “opt-in side”. As such I will completely ignore some topics that have been discussed at length over the weekend: possible attack vectors, arbitrary code execution, etc., as they do not interest me personally and have nothing to do with my insistence on opt-in as a matter of consent. Although ultimately this is beyond the scope of this discussion, as I did mention on Hacker News, I do feel the underlying issue causing this mess is that we lack a good way to ask for and give consent online and that we are left with awful proxies (do contact me by e-mail if you know good research on this matter, as it does interest me). Now, on to the matter at hand.

Thank you @StefanKarpinski for your response to my initial “mega post”, it did make me think and the technical feedback was very useful. I see it as a satisfactory rebuttal of my “survey idea” and would only like to clarify one thing. I do agree that automated systems most likely have less of a right to privacy, but what I was trying to aim at in my initial post was that they also may be less interesting to us from a telemetry standpoint and that without them we could possibly safely assume that any Julia installation we are interested in will be run interactively at least once. I will not allude to this again though.

@jeff.bezanson really strikes home when he states that the UUIDs are really what causes the strongest opposition; this is absolutely true for me. (I also want to apologise to Jeff, as I feel somewhat guilty of causing this statement: “…if you only ‘warn’ people about [Julia’s] package manager and not anything else, you are sending the message that [Julia] is somehow uniquely nefarious[.]”. Sorry, my intention was never to scream from the rooftops to my students, just to state factually and objectively what the telemetry entailed and allow them to make their own decision.) As I stated previously, the UUIDs add capabilities that IP and HTTP lack in that they persist and, perhaps even more importantly, they pretty much eliminate the plausible deniability that is inherent in IP and HTTP, since for example NAT is no longer a possible explanation. Generating a UUID on someone’s device and transmitting it really is where the rubber hits the road for me. This is good: now I finally have some clarity with regard to why I felt so strongly about this when I encountered the topic, and hopefully I can work back from there to something actionable to “get me on board”, as I do want to be on board and I agree with the desiderata.

The clarification regarding IP logs and how they are to be separated from the Pkg logs is one of the greatest accomplishments of this thread so far, with @c42f also hinting that the UUID logs could be considered a “toxic asset”. Now, a couple of naive technical questions that, if answered, could actually persuade me that opt-out can be justified in this specific case:

1.) Must one retain the 128-bit UUIDs in the logs in order to reach the desiderata? Is a lower-bound estimate in terms of usage numbers with some controllable confidence interval not sufficient so as to preserve plausible deniability and break the link between what is on the end-user system and the log? HyperLogLog springs to mind, but I am sure those of you less awful with stats and more familiar with the technical landscape know better than me.
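To make the HyperLogLog idea in question 1 concrete, here is a minimal textbook-style sketch in Python. It is not what the Pkg servers do; the class and parameter names are illustrative, and the hash (SHA-256 truncated to 64 bits) is just a convenient assumption. The point it shows is that the server keeps only a small array of registers, not the identifiers themselves, and still gets a distinct-count estimate with a known relative error of roughly 1.04/sqrt(m).

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog sketch; names and parameters are illustrative."""

    def __init__(self, p: int = 12) -> None:
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 for this sketch")
        self.p = p
        self.m = 1 << p  # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Assumption: a truncated SHA-256 stands in for a 64-bit hash.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)  # first p bits select a register
        rest = x & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw
```

Adding the same identifier twice leaves the registers unchanged, so repeat requests from one client do not inflate the count. With `p=12` the sketch holds 4096 small integers and the expected relative error is about 1.6%. Whether that gives enough deniability is exactly the question asked above.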

2.) Is there any argument in favour of privacy that makes a persistent 128-bit UUID preferable to a transient 32-bit IPv4 address? Let us ignore IPv6 with its 128 bits, as it still seems to be at least a decade out…

If it can be argued that there are privacy benefits to the current UUID approach, under the assumption that it is never stored in association with an IP, I am willing to concede. You will have won me over gradually and fairly. As I have a piece of software that is closely related to this issue, I really want this to be the case, as I would otherwise be perceived to be making a “political statement” when releasing it (hopefully before JuliaCon), and that really is not how I wanted my “return” to the community to look…

I guess ending with arguments I think we can leave by the wayside is now a tradition of mine:

@PetrKryslUCSD said: “[E]ach Julia executable would have a unique ID, and the telemetry would report usage tied to the executable. There would be no link between the user and the executable, hence complete privacy.” I am really sorry to pick on this one, but is it not the same as saying: “We did not track him, we tracked his car!” or “I did not kill him, the bullet did!” – it did make me smile and a judge would have a field day with this one. Again, you are a fine contributor and I am not trying to pick on you, just the argument itself.

“My code released as FOSS is served as a part of this and I object to the telemetry”: While I do respect this point of view, it is as you state indeed the case that FOSS does not take a stance on the purpose for which code is used – “The Software shall be used for Good, not Evil” springs to mind, which makes a license not FOSS. Thus, while I am sorry to hear that you feel uncomfortable about it, I am not sure how this adds much to the discussion.

“Pkg.jl is not ‘baked into’ the language as it is used for third-party code”: It comes with any release in the tarball; if that does not count as “baked in”, I honestly do not know what counts.

Lastly, a lot of nonsense is written on Hacker News, but here is a comment I think many members deserve to feel proud over – assuming none of you wrote it of course…

31 Likes

Not sure what you mean by your comment, @ninjin. How is the “unique id” generated for your executable, from let’s say the date the exe ran for the first time, able to tie you as a user to any of the telemetry data?

To expound a bit:

  1. The user downloads the exe. It is completely anonymous.
  2. The user runs the exe for the first time. The exe generates an id from data
    that cannot id the user or his/her machine etc. For instance the date.
  3. The telemetry that records which packages are used is tied to this id.
    This simply ensures (to a large degree) that accurate usage statistics can be compiled. (The exe is “unique” in a statistical sense.)
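
The steps above could be sketched as follows, in Python. This is only an illustration of the scheme as described in this post, under the assumption that the id is a hash of the first-run timestamp and nothing else; `install_id_from_first_run` is a hypothetical helper, not part of Julia or Pkg.jl.

```python
import hashlib
from datetime import datetime, timezone


def install_id_from_first_run(first_run: datetime) -> str:
    """Derive an id from nothing but the first-run timestamp.

    Hypothetical helper: no user name, host name, or hardware data goes in.
    """
    # Normalise to UTC so the same instant always gives the same id.
    stamp = first_run.astimezone(timezone.utc).isoformat(timespec="microseconds")
    # Keep 32 hex characters (128 bits) of the hash.
    return hashlib.sha256(stamp.encode("ascii")).hexdigest()[:32]
```

Two installations that first ran at exactly the same microsecond would share an id, which is the sense in which the exe is only “unique” statistically. The id contains no personal data by construction; whether a persistent id of this kind still links a user's requests together is the point being debated in this thread.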
2 Likes