Pkg.jl telemetry should be opt-in

Any comment on a suggestion like this one? A very short message (maybe only the first time in a session a Pkg operation that would send telemetry information is performed) that constantly reminds the user that telemetry is active would keep them aware of this.

11 Likes

Where is the official announcement on this? I would like to read the current wording… Also, I agree with the point about IP addresses, but if Julia is logging both UUIDs and IP addresses, then Julia can track an individual as they move from connection to connection. This does seem more intrusive than just logging IPs, where they wouldn’t know that it’s me at home, at work, at friends’ houses, at coffee shops, at school, etc.

2 Likes

I think you misread something: IP addresses will not be collected.

2 Likes

Well, that’s why I wanted to see the official announcement… to see exactly what info will be collected… Also, it usually takes deliberate action on the part of an organization to NOT collect IPs, because most server software logs them by default.

There is no announcement since Julia 1.5 has not been released yet.

4 Likes

I trust that will happen when 1.5 is released.

Not disregarded, but used as a starting point for a rational analysis about the issue that compares costs and benefits (both are hard to quantify, I know).

Specifically, one key question is whether the telemetry solution is on a Pareto frontier of collecting the minimum amount of information to achieve some given goal. Looking at it, I find the UUID-based approach neat and much better than collecting IP addresses, perhaps demonstrating an approach that other scientific programming language communities will adopt in due course. From this perspective, the implementation actually sends a very strong message about how much Julia cares about privacy.

7 Likes

Perhaps a draft of this announcement could be posted here so specific questions and feedback would be possible? It might alleviate the problem of posting after the official release and then needing to walk it back or modify the system.

As it is now I’m not sure what the real proposal IS.

2 Likes

Though maybe evident from the discussion here, having prominent members of the julia community somewhat downplay the issue might also send a very strong message about how much Julia cares about privacy.

4 Likes

IP addresses are logged. It’s not mentioned in the telemetry document because it didn’t occur to me that it needed to be mentioned—that’s how all internet services work and it’s not an additional header that we do anything special to send. For the sake of completeness, it should probably be listed as well. In practice it is not feasible to run an open internet service without recording IP addresses since without them you lose the very necessary ability to detect and filter abusive IP addresses, and your service would just be a sitting duck for all kinds of attacks. I would be shocked if there’s a single operator of any public internet service that isn’t logging IP addresses.

16 Likes

@StefanKarpinski, thanks for this. So it seems like you need to make some very explicit statements about how long those IP addresses are retained, and whether they will ever be correlated with UUIDs. Otherwise the UUID becomes a method whereby Julia could extract information about the movements of individuals around the globe.

2 Likes

The last section here:

https://julialang.org/legal/data/#data_access_analysis_amp_retention

promises exactly that—once we’re actually receiving and processing data, we will not only explain how it’s used, but we will publish the code that does the processing and the aggregated results. We will also publish an official data retention policy where that paragraph is now. All that has yet to be worked out, but it will be done with the same level of transparency and responsibility towards our users and developers.

17 Likes

There definitely are some, like Posteo.

2 Likes

My take-away so far from the conversation is that the Julia developers do care about privacy, but that they need information about how often Julia is used in order for Julia development to continue. They don’t need information about what the name is of the person using it, what location they are using it from, what they had for dinner, whether they like certain watches or clothing or shoes… etc

Therefore it seems the goal is to collect unique identifiers for julia installs, and nothing more… However in the course of doing business it’s normal for IP logs to be retained so that abusive machines can be filtered out etc… So to achieve the goal it will be necessary to make some specific separation between the IP logs retained for anti-abuse and the UUIDs for usage statistics.

Perhaps a specific machine could be put on the net which serves to collect the telemetry and does not have IP logging. Then the UUID can’t be connected to the IP address, and tracking across connections is not possible… something like that.

I’ll also point to the Cloudflare DNS resolver policy as an example of a service focused on privacy. It points to the “1.1.1.1 Public DNS Resolver · Cloudflare 1.1.1.1 docs” and “Application Privacy Policy” pages, and it starts:

The 1.1.1.1 public DNS resolver was designed for privacy first, and Cloudflare commits to the following:

  1. Cloudflare will not sell or share Public Resolver users’ personal data with third parties or use personal data from the Public Resolver to target any user with advertisements.
  2. Cloudflare will only retain or use what is being asked, not information that will identify who is asking it. Except for randomly sampled network packets captured from at most .05% of all traffic sent to Cloudflare’s network infrastructure, Cloudflare will not retain the source IP from DNS queries to the Public Resolver in non-volatile storage. These randomly sampled packets are solely used for network troubleshooting and DoS mitigation purposes.
  3. A Public Resolver user’s IP address (referred to as the client or source IP address) will not be stored in non-volatile storage. Cloudflare will anonymize source IP addresses via IP truncation methods (last octet for IPv4 and last 80 bits for IPv6). Cloudflare will delete the truncated IP address within 25 hours.
  4. Cloudflare will retain only the limited transaction and debug log data (“Public Resolver Logs”) set forth below, for the legitimate operation of our Public Resolver and research purposes, and Cloudflare will delete the Public Resolver Logs within 25 hours.
  5. Cloudflare will not share the Public Resolver Logs with any third parties except for APNIC pursuant to a Research Cooperative Agreement. APNIC will only have limited access to query the anonymized data in the Public Resolver Logs and conduct research related to the operation of the DNS system.

Frankly, we don’t want to know what any one person is doing on the Internet — it’s none of our business — and we’ve taken the technical steps to ensure we can’t.

We wanted to put our money where our mouth was, so we retained one of the top four accounting firms to audit our practices and publish a public report confirming we’re doing what we said we would. The report is available here.

Perhaps ideas from those techniques could be helpful.
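To make the truncation technique from point 3 of the quoted policy concrete, here is a minimal sketch in Python. It assumes the truncation simply zeroes the trailing bits (the last octet for IPv4, the last 80 bits for IPv6); the function name `truncate_ip` is made up for illustration and is not taken from Cloudflare or from any Julia package server code.

```python
import ipaddress


def truncate_ip(addr: str) -> str:
    """Zero the last octet of an IPv4 address or the last 80 bits of an IPv6 address."""
    ip = ipaddress.ip_address(addr)
    # Bits to keep: 32 - 8 = 24 for IPv4, 128 - 80 = 48 for IPv6.
    keep_bits = 24 if ip.version == 4 else 48
    # strict=False lets ip_network mask off the host bits for us.
    network = ipaddress.ip_network(f"{ip}/{keep_bits}", strict=False)
    return str(network.network_address)


print(truncate_ip("203.0.113.77"))                            # 203.0.113.0
print(truncate_ip("2001:db8:1234:5678:9abc:def0:1234:5678"))  # 2001:db8:1234::
```

A truncated address still identifies a rough network neighbourhood, which may be enough for abuse filtering, while no longer pointing at a single connection.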

5 Likes

I am not sure what you are referring to. It is my impression that all the code and discussion are in the open (but of course, if it wasn’t, I wouldn’t know about it :wink:), and all questions about details are answered very promptly.

I think that the main difference in opinion is between

  1. those who want absolutely no telemetry if it is not opt-in, and

  2. those who recognize the benefits of the data collected and then focus on implementing it in a way that’s open, transparent, accountable, and is practically useful (the assumption is that opt-in isn’t, as very few people would opt in).

I think that it is understandable that some of the people who are involved with funding and infrastructure for Julia put a larger weight on the benefits that would come from having this kind of data. As a user, it is easier to lose sight of these things, because they are provided as free services. But someone has to get funding to pay for them, and having some of this data would make their life easier.

6 Likes

All right, finally that “mega post” is taking shape and I feel that I potentially have some sort of bird’s-eye view of the whole thing. Apologies in advance for any sloppy writing (and thinking) as it is getting late here.

Firstly, let us look at the desiderata as @StefanKarpinski et al. laid them out in #1377. I will however take the liberty to remove those that in no way relate to telemetry and unless I missed something it boils down to:

  1. (Exact) installation numbers
    • Can guide development: “Do I really need to still support v0.7?”
    • Can justify raising funds: “How popular are you lot really?”
  2. Installation details such as: architecture, OS, packages, etc.
    • Can guide development: “Is this package a priority? Does anyone in their right mind use musl?”
  3. Information on the origin of heavy users such as CI for content delivery.
    • Can save the community big bucks that can be spent on more “interesting” things – more JuliaCon travel stipends or CI speedups anyone?

As a side note, I strongly recommend reading #1377. It is fascinating from a technical standpoint. I also feel that reading it should lower the antagonism that those of us coming from the “privacy” side of things may feel, as it explicitly lists benefits for those of us who want to see a world with less reliance on proprietary services. Proprietary services are clearly far worse from a privacy standpoint than this proposal is.

I am wholeheartedly supportive of all the desiderata, they are all reasonable and I think as a community we should strive towards every single one of them. So this is clearly a question of “means”, rather than “ends”. So let us look at the “means” as they are laid out in the data overview:

  1. Pkg Protocol
  2. Version
  3. System
  4. Client UUID
  5. Project Hash
  6. CI Variables
  7. HyperLogLog
  8. Interactive

Honestly, #1 is necessary for any “living” protocol. #2 and #3 are pretty much a part of HTTP so I really do not feel strongly about them. #4 (probably also #5) while clearly contributing to getting closer to the desiderata are problematic to me as they expose more private information than IP and HTTP would do on their own and I thus would like opt-in, explicit consent for them. #6 is unlikely to cause any harm as it directly targets CI where the expectation of privacy is very low. #7 fascinates me, but I feel that it still requires consent. #8 is most likely unproblematic.

Now, Stefan has stated that opt-in is a non-starter as it will never yield enough adoption to go towards the desiderata, and I absolutely agree with him on this. However, I think we can get to the same goals without resorting to opt-out – do correct me though, as I undoubtedly have less technical chops and have had a lot less time to consider these matters.

Let us start with desideratum #3. Correct me if I am wrong, but will the CI services not have their own IP blocks? If so, what is stopping us from simply binning IPs as they arrive to detect large-scale traffic that can then be addressed via a CDN? This feels too obvious, so I am reluctant to even mention it, but it would solve the problem by relying only on what is already available at the protocol level and “letting go” of the IP as soon as possible.
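As a rough illustration of the binning idea, the sketch below counts requests per address block and keeps only the per-block counters, so the individual address can be dropped right after it is seen. The prefix lengths (/24 and /48), the threshold, and the function names are assumptions picked for the example, not anything the package servers are known to use.

```python
import ipaddress
from collections import Counter


def bin_requests(addresses, v4_prefix=24, v6_prefix=48):
    """Count requests per network block; only block-level counters are retained."""
    counts = Counter()
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        prefix = v4_prefix if ip.version == 4 else v6_prefix
        block = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(block)] += 1  # the full address is not stored anywhere
    return counts


def heavy_blocks(counts, threshold):
    """Blocks with at least `threshold` requests, e.g. candidates for a CDN."""
    return sorted(block for block, n in counts.items() if n >= threshold)


seen = ["198.51.100.1", "198.51.100.2", "198.51.100.200", "203.0.113.5"]
print(heavy_blocks(bin_requests(seen), threshold=3))  # ['198.51.100.0/24']
```

This only finds heavy traffic that happens to cluster in one block. It says nothing about whether that traffic is CI.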

Assuming that what I wrote above is true, that leaves us with desiderata #1 and #2. My suspicion is that we care less about non-interactive installations – in particular CI – in terms of their hardware, operating system, packages installed, etc. So maybe it is fair to say that we can expect that the installations we are primarily interested in will be run interactively at least at some point? Despite this, I suspect that most users will hit “No” if we ask for their consent when they first launch Julia. However, there is an option here of a user report rather than full-on telemetry. Collecting data locally is perfectly acceptable in my book, then after a given time frame one can present the concrete report in an interactive session and ask: “Pardon the interruption, but locally on your machine we have compiled the following report (here is an excerpt) which would be useful for the community (mention desiderata?). Would you be happy to share it? A part of it? Submit automatically next time? Ask again next time? Never ask again and stop any collection?”. I may be naive, but I feel that this will allow us to increase the number of opt-ins, while not having to resort to opt-out as is evidently anathema to at least a subset of us, as I know for a fact that as much of a privacy nut that I am even I have agreed to filing user reports like this.

Is my line of thinking in error somewhere in those last two paragraphs? My hope is that we now have something concrete and I really do hope that I have not overly misrepresented or overlooked anyone’s position.

Lastly, some arguments that I think we can safely move away from:

“This is in a way user monetisation”: No, it really is not. Comparing this in any way to what data brokers do is a huge stretch. We are not trying to find out if your daughter is pregnant like in that Target privacy horror story from a few years ago, we want to be able to say “We can confidently say that we have X users and are growing more relevant” so that we can put food on the table for those producing wonderful FOSS software – preferably with as little invasion into your privacy as possible.

“Just simply use opt-in, maybe even without a ‘nag screen’”: We all know that the response rate with this approach will be next to zero and we will thus fail to meet most if not all of the desiderata that we hopefully agree are desirable and laudable. Digging in your heels here is likely to lead to less desirable outcomes in the end as Chris already rightfully pointed out.

“This feature is just catching flak because we are doing this in the open”: No, it is catching flak because some of us (probably a minority) have very high expectations when it comes to privacy and consent. Especially for a project that we feel that we are deeply a part of and care for.

“The client UUID is less invasive than your IP, as it can not be mapped to your location”: In the trivial sense this is indeed true, but it is also more invasive than your IP as you can now track users despite moving between locations and when dynamic IPs change.

“Other parties (FOSS and proprietary) respect your privacy/consent even less”: There is no denying this, but I think it is perfectly fair to strive to be as good as we can be rather than simply better than the in many ways awful competition. Besides, does this argument not remind you of the old “Finish your food, other children are starving” argument? I do not expect this to work on my 11-month-old daughter in a few years, so I see no reason why it should work here either.

Finally, let me just say in closing that Pkg3 is amazing. Only Rust comes close in my experience to what it achieves and several of the people behind it are in this thread. I never stated this at a JuliaCon talk as it makes less good of a story, but it was not just the community that convinced me of Julia back in 2014, but also the solid engineering behind it. Every single change coming in v1.5 apart from the telemetry looks amazing to me, thank you for the work you have done and the work that you do. Hopefully this reply can lead to something constructive and I am sorry for not joining the discussion earlier – or even contributing code for that matter.

28 Likes

Thought: what if the UUID changed every so often on a random schedule… Like you store the UUID and the date it was created, and when a new one is created you also generate, using a crypto RNG, an expiration date (let’s say exponentially distributed with a mean of 4 months). Then when Julia starts up the package manager, if it’s beyond its date, it generates a new one and a new expiration date…

This potentially helps break the individual user tracking to some extent, but also keeps UUIDs stable enough that they’re useful. Furthermore, there is an accurate model for their turnover, which allows you to do statistics and back out the stable behavior of the community (growth etc.).
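A minimal sketch of that rotation scheme, written in Python for brevity rather than as a patch to Pkg: the identity is a UUID plus a creation date and an expiry date, and the lifetime is drawn from an exponential distribution using the operating system's cryptographic randomness source. The 120-day mean, the dictionary layout, and the function names are all assumptions made up for this example.

```python
import random
import uuid
from datetime import datetime, timedelta

MEAN_LIFETIME_DAYS = 120  # assumed: roughly four months

_rng = random.SystemRandom()  # backed by os.urandom, not the default PRNG


def new_identity(now):
    """Create a fresh UUID together with an exponentially distributed expiry date."""
    lifetime_days = _rng.expovariate(1.0 / MEAN_LIFETIME_DAYS)
    return {
        "uuid": str(uuid.uuid4()),
        "created": now,
        "expires": now + timedelta(days=lifetime_days),
    }


def current_identity(stored, now):
    """Return the stored identity, or a new one if none exists or it has expired."""
    if stored is None or now >= stored["expires"]:
        return new_identity(now)
    return stored


identity = current_identity(None, datetime(2020, 6, 1))
print(identity["uuid"], identity["expires"])
```

Because the exponential distribution is memoryless, the expected remaining lifetime of any live UUID is always the same, which is what makes the turnover easy to model when estimating the underlying number of installs.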

just a thought to be thrown into the mix.

For example the statistics collection database machine could be put behind a proxy so that it always receives UUIDs only from the proxy… so it doesn’t know the IP addresses that originated the UUID.

This is a very long thread, but it seems to be mostly the same small set of people replying back and forth (compared to what I perceive as the number of daily users of the forum).

Maybe I’m reaching here, but it seems if this was THAT big of a deal, more people would chime in. As one data point, I’ve kept out mostly because I don’t care. As has been pointed out several times, the minimal data that Julia PkgServer collects and the transparency with which it takes place is really somewhat above the standards set by other OSS projects (barring those which exclusively focus on privacy and anonymity). There is nothing there to link this back to any information that “actually” has value about you, like your name, address, what you look like, etc. Sure, maybe the ideal situation would require no telemetry at all. But in the ideal world people would also take global pandemics seriously, participate in elections, etc.

TL;DR Opt-out is beyond fine in this case.

19 Likes

Data is valuable, as pointed out. Maybe opt-out shouldn’t be free. Maybe one should make a donation to Julia development to receive a key to opt out of the telemetry. I think there are plenty of technical options to achieve this kind of open source funding model, but I might be in a very small minority in supporting it, and there is no point designing technology unless it is used.

First of all, @ninjin, thank you for the post—it’s fair, reasonable, gracious and well written.

CI services do tend to have predictable IP blocks. (Various people object to us logging IP addresses, so I guess for the most hardcore privacy-concerned, even that is an issue.) However, CI services sometimes change IP blocks, which is hard to detect and respond to unless there are other reliable indicators that some requests are CI. It also happens pretty regularly that someone spins up a new public or private CI setup/service that uses the free, public package servers. The question is how does one detect when one of these things has happened and respond appropriately?

With CI indicators, you can see things like “oh hey, there’s a large number of requests with some CI indicator set that are hitting this package server and costing us a ton of money; guess the IP blocks changed or someone spun up their own system and is using the free public package server to feed it.” Keep in mind (as you’ve already mentioned yourself, @ninjin) that this service is operated on a volunteer basis by the same people who develop Julia for free. We are not full time sysadmins whose job is to be on top of what IP blocks all possible CI systems are using or pore through the logs looking for this kind of thing to keep costs under control. We need to be able to automate as much as possible.
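The kind of automated check described here could look roughly like the following sketch. It assumes each logged request has already been reduced to a record with a `server` name and a list of `ci_flags`; both field names, the 50% threshold, and the function names are hypothetical and do not reflect the real log format.

```python
from collections import Counter


def ci_share_by_server(requests):
    """Fraction of requests per package server that carry any CI indicator."""
    totals, ci_hits = Counter(), Counter()
    for req in requests:
        totals[req["server"]] += 1
        if req["ci_flags"]:  # non-empty list means some CI indicator was set
            ci_hits[req["server"]] += 1
    return {server: ci_hits[server] / totals[server] for server in totals}


def servers_to_review(requests, max_ci_share=0.5):
    """Servers where CI traffic exceeds the given share of all requests."""
    shares = ci_share_by_server(requests)
    return sorted(s for s, share in shares.items() if share > max_ci_share)


log = [
    {"server": "us-east", "ci_flags": ["CI"]},
    {"server": "us-east", "ci_flags": ["CI", "GITHUB_ACTIONS"]},
    {"server": "us-east", "ci_flags": []},
    {"server": "eu-central", "ci_flags": []},
]
print(servers_to_review(log))  # ['us-east']
```

A check like this needs no list of known CI address ranges, so it keeps working when a provider moves to new IP blocks or someone spins up a new CI service.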

My suspicion is that we care less about non-interactive installations – in particular CI – in terms of their hardware, operating system, packages installed, etc. So maybe it is fair to say that we can expect that the installations we are primarily interested in will be run interactively at least at some point?

While that’s true, I also think that automated systems have less of a right to privacy than actual human users do. A CI system that is using the free public package server does not have a strong right to privacy protections, imo. The user who requests that CI run is a different story, but nothing about them is exposed—the client UUID is ephemeral in this case and doesn’t relate to the user in any way.

However, there is an option here of a user report rather than full-on telemetry. Collecting data locally is perfectly acceptable in my book, then after a given time frame one can present the concrete report in an interactive session and ask: “Pardon the interruption, but locally on your machine we have compiled the following report (here is an excerpt) which would be useful for the community (mention desiderata?). Would you be happy to share it? A part of it? Submit automatically next time? Ask again next time? Never ask again and stop any collection?”. I may be naive, but I feel that this will allow us to increase the number of opt-ins, while not having to resort to opt-out as is evidently anathema to at least a subset of us, as I know for a fact that as much of a privacy nut that I am even I have agreed to filing user reports like this.

That’s certainly a way things could be done, but it seems quite complicated and hard to implement. You need to aggregate user data locally somewhere and constantly update it without corrupting it even when multiple Julia processes might be accessing the database of usage data concurrently. This has to work across all kinds of user file systems, which, let me tell you, is a constant source of shenanigans. (“Ah, but did you think of someone using Linux but mounting an NTFS drive?!?”. Real problem we’ve encountered in Pkg recently.)

The only way I can think of that seems sane to do that would be to use something like SQLite to maintain this data since it handles data concurrency and works everywhere. So that’s possible, but then we’ve made SQLite a dependency of Pkg just to collect user data which doesn’t seem great. Also, how does that look to users: are they really going to believe that it’s perfectly innocent that we’re maintaining a literal database about things they’ve done that we want to upload to a server periodically? That seems way more likely to freak people out than sending a few well-documented headers with each request to a server that you’re already talking to anyway.
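For a sense of what the SQLite route would involve, here is a small Python sketch of a local usage counter. The table layout, the event names, and the function names are invented for the example, and it glosses over exactly the file-system issues mentioned above; it only shows that each update can be a single short transaction, which is what lets SQLite serialize concurrent writers.

```python
import sqlite3


def open_usage_db(path):
    """Open (or create) the local usage database."""
    # timeout: how long to wait for another process holding the write lock.
    conn = sqlite3.connect(path, timeout=30.0)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS usage ("
        "event TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def record_event(conn, event):
    """Increment the counter for one event inside a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO usage (event, count) VALUES (?, 0)", (event,)
        )
        conn.execute(
            "UPDATE usage SET count = count + 1 WHERE event = ?", (event,)
        )


def report(conn):
    """The aggregated data a user could be shown before deciding to share it."""
    return dict(conn.execute("SELECT event, count FROM usage ORDER BY event"))


conn = open_usage_db(":memory:")  # a real tool would use a file on disk
for name in ["add", "update", "add"]:
    record_event(conn, name)
print(report(conn))  # {'add': 2, 'update': 1}
```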

The UI also seems very hard to get right. How does one present that data to the user to ask them if they want to share it? As a lot of raw records? That seems like an overwhelming amount of data to show them. Or should it be distilled down to a summary? In that case are we really being fully transparent with them about what’s being shared? In the current scheme, we show the user exactly what’s sent to the server if they want to see it and the first time they connect to a package server, we tell them how to print that information with a link to a page explaining what it means.

Finally, I suspect that very few people would share this data. Yes, this is how bug reports from application crashes work: “Here’s a crash report. Are you willing to share it with the developers to help them improve the application?” But for crash reports you just need one person who encountered a bug to submit a report for it, so getting an unrepresentative sampling is totally fine (and if you don’t get a bug report for a bug, then you just don’t fix it). For the purpose of understanding representative Julia usage, however, that kind of unrepresentative smattering of reports does not seem effective. I don’t believe that we could, with any real confidence, claim that such reports tell us how many Julia users there are.

20 Likes