Pkg.jl telemetry should be opt-in

But every time you or anyone else has used git with GitHub (Microsoft), you have sent your (unique) public keys, and every time someone has downloaded one of your packages, their IP address has been sent to GitHub (Microsoft). Was this never a problem for you before, even though these things are more intrusive than the potential UUID that will be sent? Could you elaborate on that a bit?

6 Likes

That argument doesn’t persuade me. I knew GitHub tracked activity on its site when I signed up. I knew going in what the deal was, and I accepted that in exchange for the service they provided.

I can also delete my repository at any time and stop the collection of data by hosting it in my own git repository.

The same cannot be said for the General Registry.

3 Likes

As a package developer, if I thought telemetry data would help me, I could simply add (opt-in) telemetry to my package. Does it really need to be baked into the language? The only organization that will benefit from telemetry baked into the language is JC, so it appears to me the language was modified for the benefit of JC. I’m not sure how I feel about that.

One solution that would make me feel a little better is if Pkg were moved out of Julia and hosted by Julia Computing. The data is valuable and I do think JC should benefit from it in some way. As I said in my first comment here, I think there is a right way to do this, but I don’t think we’ve found it yet.

Another solution: why not just host special / curated Julia packages on JuliaHub as an alternative to GitHub? Telemetry then becomes a non-issue, because of course you’ll have most of the data you need in that case. Package maintainers get the benefit of being “special” with some added visibility, and JC (or whoever runs JuliaHub.com) benefits from the data.

1 Like

What about an individual (maybe a data journalist?) protecting and separating his work identity (via VPN/Tor/…) and his regular online identity? If he overlooked the telemetry warning and installed packages both with and without VPN/Tor/…, the user UUID would link both online identities.
I guess one would be able to deanonymize a VPN/Tor/… user with access to such statistics.
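
To make the concern concrete, here is a minimal sketch of how someone with access to raw request logs could do this. The log format and all values are invented for illustration; this is not how any actual package server stores data.

```python
from collections import defaultdict

# Hypothetical server-side log: (client_uuid, source_ip). All values are made up.
requests = [
    ("uuid-A", "203.0.113.7"),    # e.g. a home connection
    ("uuid-A", "198.51.100.42"),  # e.g. a Tor/VPN exit node
    ("uuid-B", "192.0.2.15"),
]

def ips_per_uuid(log):
    """Group the source IPs seen for each client UUID."""
    seen = defaultdict(set)
    for uuid, ip in log:
        seen[uuid].add(ip)
    return dict(seen)

# Any UUID seen from more than one address ties those addresses together.
linked = {u: ips for u, ips in ips_per_uuid(requests).items() if len(ips) > 1}
```

Here `linked` contains only `uuid-A`, with both of its addresses. The sketch assumes the UUID and the IP end up in the same record, which is exactly the kind of detail worth pinning down.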

Another deanonymization scenario involves package developers. Say a developer installs a semi-obscure package and then, within e.g. an hour, adds it as a dependency to one of his public projects on GitHub. If a developer does this 3 times, there is a good chance of linking the UUID to the GitHub account.
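
A rough sketch of that timing correlation, under the assumption that the attacker has install timestamps per UUID and can read public commit history. The data, the package names, and the threshold of 3 matches are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical data; UUIDs, accounts, packages and timestamps are invented.
installs = [  # (client_uuid, package, time of install request)
    ("uuid-A", "ObscurePkg1", datetime(2020, 7, 1, 10, 0)),
    ("uuid-A", "ObscurePkg2", datetime(2020, 7, 3, 15, 30)),
    ("uuid-A", "ObscurePkg3", datetime(2020, 7, 8, 9, 10)),
    ("uuid-B", "ObscurePkg1", datetime(2020, 7, 2, 12, 0)),
]
commits = [  # (GitHub account, package added as dependency, commit time)
    ("dev-alice", "ObscurePkg1", datetime(2020, 7, 1, 10, 40)),
    ("dev-alice", "ObscurePkg2", datetime(2020, 7, 3, 16, 5)),
    ("dev-alice", "ObscurePkg3", datetime(2020, 7, 8, 9, 55)),
]

def count_matches(installs, commits, window=timedelta(hours=1)):
    """Count, per (uuid, account) pair, installs followed by a commit
    adding the same package within `window`."""
    counts = {}
    for uuid, pkg, t_install in installs:
        for account, dep, t_commit in commits:
            if pkg == dep and timedelta(0) <= t_commit - t_install <= window:
                key = (uuid, account)
                counts[key] = counts.get(key, 0) + 1
    return counts

matches = count_matches(installs, commits)
# Pairs with at least 3 coincidences are treated as probable links.
likely_links = [pair for pair, n in matches.items() if n >= 3]
```

With this toy data, only `("uuid-A", "dev-alice")` reaches the threshold. The more popular the package, the weaker the signal, which is why the scenario depends on semi-obscure packages.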

What about specifying who will have access to the full telemetry data (I guess it will be Julia Computing; maybe it is already mentioned but I did not see it) and declaring that they will not make any deanonymization attempt? I trust Julia Computing to abide by this statement. Additionally, they could include a warrant canary statement saying that so far they have not been requested by law enforcement to hand over the data, and remove this statement when necessary (see [1]).

But I am wondering whether we could count the install base of a package without this user UUID. Maybe the issue is preventing double-counting when a user upgrades a package from v1.0 to v1.1, since those are two “install events” but just one user. But could this not be solved by counting “upgrade events” differently from “fresh install events”?
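
A minimal sketch of what I mean, assuming the client can tell locally whether an older version is already installed and reports that as part of the event. The event fields (`package`, `version`, `kind`) are names I made up, not anything Pkg actually sends.

```python
from collections import Counter

# Hypothetical events as a client might report them, with no user identifier.
# The client decides locally whether this is a fresh install or an upgrade.
events = [
    {"package": "Example", "version": "1.0", "kind": "fresh_install"},
    {"package": "Example", "version": "1.1", "kind": "upgrade"},
    {"package": "Example", "version": "1.1", "kind": "fresh_install"},
    {"package": "Other", "version": "0.3", "kind": "fresh_install"},
]

def install_base(events):
    """Estimate installs per package by counting only fresh installs,
    so an upgrade does not count the same user twice."""
    return Counter(e["package"] for e in events if e["kind"] == "fresh_install")

base = install_base(events)
```

Here `Example` is counted twice and `Other` once; the upgrade event is ignored. This still over-counts someone who installs the same package on several machines or in several environments, so it is an estimate of installs rather than of users.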

Sorry if I sound paranoid, but I would really hate it if the telemetry in privacy-respecting open software like julia revealed the identity of a whistleblower, as did the printer IDs (encoded as a grid of dots) in the case of Reality Winner [2]. Probably this is unrealistic, but maybe not.

[1] https://www.reuters.com/article/us-usa-cyber-reddit-idUSKCN0WX2YF
[2] https://www.theatlantic.com/technology/archive/2017/06/the-mysterious-printer-code-that-could-have-led-the-fbi-to-reality-winner/529350/

1 Like

To be clear, telemetry is used only when installing third-party packages. Using only Base and the standard libraries doesn’t send anything, so the claim “baked into the language” isn’t very accurate. It’s rather baked into the default package manager.

3 Likes

I think another option would be for Telemetry to be a separate package. If you install Telemetry, you know that you are going to share some of your usage info. In return, you can use Telemetry to run some high-level queries such as the Top 10 or Top 100 packages for this week or month, packages with a high usage rate, least used or unused packages (to orphan), etc. Telemetry can help discover new packages and show other stats that the user or developer may need in the future. These benefits, given in exchange for some information, will encourage its installation. I suggest that by default, any other package that collects information should be rejected until it is justified and deliberated.

2 Likes

True, but the default package manager is a standard library that is currently inseparable from Base, so I think my statement is accurate enough. If standard libraries were truly separate, I think the telemetry in a separate Pkg would be better, but in that case, I’d probably still prefer opt-in or maybe even a separate Telemetry package (like @ppalmes is suggesting) that could be installed (the ultimate opt in :slight_smile:).

This thread has hit the record of receiving the greatest number of replies in the history of this forum - over discussions on time to first plot and scoping rules. :wink: And that in just one week!

Whatever opinion the participants have on the different aspects of this feature, that may be taken as a sign of its importance.

4 Likes

I’m not sure where you got this notion from. Julia Computing is completely unrelated to any of this aside from the coincidence that some of the same people who work there are also Julia core developers. If you are confused about this issue, I wrote a long, detailed blog post about that as well:

Usage data would be used to help find funding support for open source development of Julia itself and projects like JuMP, Flux, MLJ, DataFrames and others, many of which are developed primarily by people at universities and other research institutions where being able to demonstrate impact with concrete usage numbers is crucial to getting funding. Such funding would not go through Julia Computing or directly benefit the company.

18 Likes

Again:

  1. Who will have access to the data and how data would be processed is very plainly explained in the top linked document.
  2. That is not Julia Computing.

That’s in the last paragraph. Maybe worth copy and paste, so that everyone can read without even clicking on the link:

(EDIT: see also the previous post by @StefanKarpinski and the linked blog post about Julia developers vs. Julia Computing, etc.)

5 Likes

Re-reading the section about data access, analysis and retention, I have noticed that it should also mention that there will be means for users to request the deletion of their data, or a copy of the individual records taken from them (rights to data erasure and data portability, enforced at least by the GDPR).

That might be possible thanks to the user unique code, which users should also be able to know. A bonus point towards sending that information, in my opinion.

(EDIT: What I wrote here is not right. See this later post.)

1 Like

That’s just regular OpSec on the end of the journalist though and hardly specific to using julia… Quite frankly any outside connection could in theory lead to a metadata leak, with some of them making the task of deanonymization MUCH easier than the minimal telemetry of Pkg, simply because they send a much bigger footprint home…

Also, fighting over telemetry and fears of deanonymization when talking about a free, public service designed to serve arbitrary code which you then trust and run really feels like barking up the wrong tree. A much better alternative to worrying about the telemetry would be to set up a private package server, vet the packages served through it, and then use that instead of the official one, since you can’t trust a third party anyway. This has been mentioned multiple times already.

11 Likes

In contrast, the number of views of this topic relative to other posts (some of which have much fewer replies) is very low. This could indicate that only relatively few, very vocal voices care to participate in this discussion or read about this topic at all.


Just for comparison:

17.200 views in the last 5 days on a topic about bringing julia to iOS and Android, compared to the ~13.000 here. To me, this indicates that there are a whole lot of people wanting to use julia on iOS and Android, so maybe development effort should be spent to make that happen.

23.000 for OpenBLAS vs MKL. Maybe I’m missing something here…

There’s A LOT of topics with many more views, though I don’t put them here since they’re also older than these two.

6 Likes

I have been on holiday and only read discourse (and the julia/Pkg issue tracker) sparsely, so I might be missing something from outside this thread. But a few observations/comments.

a) an IP number is (still) a relatively weak identifier, a UUID is a strong one

b) while it’s true that github has strong identification (moving towards 2FA, which is even mandatory for contributing to some projects/packages), there is (still) quite good functionality with anonymous git access for using packages

c) IANAL, but in my reading of the GDPR (I had to look closer at this in my day job), creating a user id and using it to store user-specific data is the classical case for ‘informed consent’, i.e. opt-in. I don’t doubt that some lawyers might take the view that the anonymity of a random user id is good enough here, but I’d rather like to see a second opinion, e.g. from the EFF.

d) While my personal opinion is clearly “opt-in” here, I think the developers have set a good example of how to do this technically and transparently.

e) Better documentation is needed.

f) The claim that this discussion is premature because “1.5 isn’t released” is weak, as 1.5-rc1 is already used by interested parties (and opt-out telemetry with it).

8 Likes

Just to be precise, the threads that you cited are not that recent: both were started in 2017, so those many views have been accumulated for a much longer time. Also, I wouldn’t say that the participation is limited to a few vocal voices: the number of users marked on this thread (over 50) is not small at all.

But anyway, you’re completely right: the number of replies, users, etc. is only a sign, and may be very biased. I don’t want to give much importance to these usage statistics.

@anon94023334 You seem to have misunderstood who owns the telemetry. It is not owned by Julia Computing. As a matter of fact it is community data, and will be available to the community at large.

Julia Computing raises revenue through the services and products that are listed on its website. I can see your points of view, and respect them - except this one.

-viral

10 Likes

Sorry to be nitpicky, but I wouldn’t put it exactly like that: if there are individual user codes, the owners of the data records are those individuals; the Julia community may be given aggregated, anonymous results of the data analysis. But the full data records shall be managed by a closed, well-defined group of data controllers, liable for ensuring that they are processed according to the conditions stated in the legal notice (be it opt-out, -in, or whatever).

1 Like

Yes, my only point there was that it is not owned by Julia Computing. Julia Computing will see whatever everyone else sees.

-viral

5 Likes

Honestly I think the discussion here has been productive. I recognize that it’s been heated at times, but that goes to show that the issue is important, perhaps more important than was recognized by those who designed the system in the first place.

I’ve filed bug reports on a few issues I think are important: storing IP addresses separately, which I think should be actively acknowledged somewhere other than the forum… and the use of a rolling identifier rather than a UUID (i.e. an ID that changes over time on a rolling schedule, so that it’s not possible to track an individual over months, years, decades…)

Given what I know about how it works… I will probably opt out due to the UUID… I would probably not opt out if instead it was a rolling identifier changing every 60 days or so… I still think that would give good usage stats.
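
For what it’s worth, here is a minimal sketch of one way a rolling identifier could work. This is my own illustration, not what Pkg does or what was proposed in the bug reports: the client keeps a random secret locally and sends only an HMAC of the current time window. The 60-day period comes from my guess above; `rolling_id` and `ROTATION_DAYS` are made-up names.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timezone

ROTATION_DAYS = 60  # assumed rotation period ("every 60 days or so")

def rolling_id(secret: bytes, now: datetime, rotation_days: int = ROTATION_DAYS) -> str:
    """Derive an identifier that is stable within one time window
    and unrelated to the identifier of any other window."""
    window = int(now.timestamp()) // (rotation_days * 86400)
    digest = hmac.new(secret, str(window).encode(), hashlib.sha256).hexdigest()
    return digest[:32]

secret = secrets.token_bytes(32)  # stays on the user's machine, never sent

id_a = rolling_id(secret, datetime(2020, 7, 1, tzinfo=timezone.utc))
id_b = rolling_id(secret, datetime(2020, 7, 20, tzinfo=timezone.utc))
id_c = rolling_id(secret, datetime(2020, 12, 1, tzinfo=timezone.utc))
```

In this sketch `id_a` and `id_b` are equal because both dates fall in the same 60-day window, while `id_c` differs. Without the secret, the server cannot connect one window’s ID to the next, yet unique users within a window can still be counted. One caveat: windows aligned to the Unix epoch make everyone rotate at the same moment; a per-user offset would be a possible refinement.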

6 Likes