Pkg.jl telemetry should be opt-in

Marc.Cox · July 6, 2020, 9:02pm

I am not aware of any issue with a truly random UUID, but I suggest reviewing the heads-up and caveats in the Medium article quoted below. However, I do believe that instead of an opt-out design, which requires reading all the fine print to get comfortable with, an explicit opt-in, possibly with an occasional nag, is a more direct way to eventually get informed consent that everyone is more comfortable with.

Version four is completely random and unpredictable and this is the version I would recommend using now.
If you don’t want to have collision on multiple instances of databases, you don’t want IDs to be predictable or give information about the system, consider using UUIDs.
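
To make the "version four" point concrete, here is a minimal sketch in Python (chosen for illustration only; it is not what Pkg.jl does internally). The helper name `new_client_id` is hypothetical. A version 4 UUID carries 122 random bits; the remaining 6 bits encode the version and variant, so nothing about the machine is embedded in it.

```python
import uuid


def new_client_id() -> uuid.UUID:
    """Return a random (version 4) UUID.

    Hypothetical helper: the name is an assumption for this sketch.
    """
    return uuid.uuid4()


a = new_client_id()
b = new_client_id()

# The version field is the only predictable part of the value.
print(a.version)
# Two independently generated IDs are, for all practical purposes, distinct.
print(a != b)
```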

HTH,
Marc

Sukera · July 6, 2020, 9:29pm

And my point is that because all of the described attacks are already possible and used in the wild, having those UUIDs doesn’t expose you further than connecting to GitHub or PkgServer as it is. You’re arguing with a whole bunch of assumptions here that, as far as I can tell, simply aren’t true or, at the very least, don’t quite hold.

For example, you assert that having UUIDs allows tracking for an extended period of time, while in reality the data will be deleted after a specified timeframe:

We will also establish a data retention policy: all individual request records will be kept for no more than a fixed period of time for analysis and will be securely deleted after the specified period.

I do agree that this really has to be specified, but pretending that the idea doesn’t already exist is disingenuous.

Ok? It’s great when people try to inform themselves, but the arguments you’re presenting here are, to me at least, reading like you’re against julia collecting UUIDs out of principle and not about informing people that they should care about their opsec practices, think about the tools they’re using and then decide whether that’s fine with them or not (and what/where they can do something). As I’ve tried to explain multiple times in this thread, nothing about having those UUIDs is exposing you to a larger threat than you’re already exposed to just by using the internet itself. I’ll give you that there is a potential increase in risk here, but that in and of itself is not an increased threat (at least not one you as an individual can do anything about, short of isolating yourself and going off the grid, or in this case, using your own instance of the package server and vetting packages on your own, as has been suggested multiple times already).

One of the only ways UUIDs would lead to a noticeable increase in risk would be if regular internet traffic was not node-based direct routing but slow propagation with built-in intentional latency, to throw off any timing correlations on the network level. That’s not feasible to do, since the internet has to be fast. Ain’t nobody got time for asynchronous communication when watching a new kitten video.

Great! Since we’re all maximally doomed anyway, UUIDs won’t increase our risk of being doomed, since it’s all over already anyway </sarcasm>. In all seriousness though, not having that telemetry does nothing but make it harder for the honest devs here to do their work.

I get it, telemetry and phoning home and cybersecurity and evil hackers from another (or your own!) country are scary. But please, let’s stay level here and take a look from the attackers’ POV: Attacking Pkg, especially with the minimal amount of data that’s proposed to be collected here, is just not a worthwhile target. People who care about their OpSec (Financial Institutes, High-Security Environments, Hobbyists who take it too far, …) vet all access and build huge, complex firewalls precisely because they want to mitigate any potential risk, so the really juicy and profitable targets are not likely to be found in those logs in the first place (and if they are, they failed at their own OpSec - this is not the fault of julia). So we can rule out those kinds of attacks from a hosting perspective. What’s left is mostly individuals, small groups or organizations that don’t mind it being known what they’re using internally. All of those are more easily targeted directly (or by corrupting the packages served by the server, but that’s a different class of attack and not suited for this discussion). You already know they’re using julia and communicating with its package servers, you don’t gain knowledge about them by knowing, e.g., the salted hash of the filepath/project. The salt isn’t sent or stored, so you can’t even reconstruct or brute-force a match.
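
A rough sketch of the salted-hash idea, in Python for illustration only. It assumes the salt is a random secret that stays on the client; the function name `salted_project_hash`, the choice of SHA-256, and the example path are assumptions of this sketch, not the actual Pkg.jl scheme.

```python
import hashlib
import secrets


def salted_project_hash(project_path: str, salt: bytes) -> str:
    """Hash a project path together with a client-local secret salt.

    Hypothetical helper: SHA-256 and the salt-then-path layout are assumptions.
    """
    return hashlib.sha256(salt + project_path.encode("utf-8")).hexdigest()


# Assumption: the salt is generated once on the client and never transmitted.
local_salt = secrets.token_bytes(32)

same_1 = salted_project_hash("/home/alice/secret-project", local_salt)
same_2 = salted_project_hash("/home/alice/secret-project", local_salt)

# Someone guessing the path without the salt gets an unrelated digest.
guess = salted_project_hash("/home/alice/secret-project", secrets.token_bytes(32))

print(same_1 == same_2)  # stable for the same client and project
print(same_1 == guess)   # a different salt does not reproduce the digest
```

The point of the sketch is that whoever holds only the digest cannot confirm a guessed path, because every guess would also have to include the 32 unknown salt bytes.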

I’m sorry if this post comes off as tongue in cheek or annoyed, that wasn’t my intention.

wsphillips · July 6, 2020, 9:56pm

Regarding the notion of package maintainers that don’t want downloads tracked and following on the previous related comments:

Would there be a roadblock to having the package registration process include a “do not track my package” option? Unlike the issue of opt-in/opt-out for users, I would be surprised if most package maintainers didn’t voluntarily click yes, provided they had the ability to view their own package retrieval data through something like JuliaHub. This would in effect be an “official” minimal telemetry add-on (i.e. “adding the code to their own package”) and would give people like Seth the opportunity to elect not to participate. Additionally, if it’s a flag in the package metadata, it would give highly privacy-minded individuals the opportunity to select only those package dependencies that opt out.

dlakelan · July 6, 2020, 10:16pm

It doesn’t even require attacking Pkg. A simple subpoena would do in a patent infringement case.

And yes, there’s currently a document that says that at some point a plan will be put in place for data retention etc… That’s not quite as reassuring as a document with a specific plan.

I think my arguments have been somehow mixed up with other people’s arguments here. I’m not even arguing that this should be opt-in! I want Julia to move forward with more funding. That’s why I donate CASH to Julia each year. I’m just arguing that what’s collected should maybe not have a long-term identifiable UUID. My point is to treat this as a security risk analysis exercise of the same kind that might be done by say a client in a bank or a utility company… they want to know what information is being collected, and they want to know what will be done with it, and what’s the worst case of what could happen if the information store were breached or compromised by subpoena or whatever.

If there’s a plan for data retention expiration, then it seems that persisting the UUID over longer than the data retention period isn’t even valuable to Julia… and yet NOT persisting it beyond that lifetime reduces the risk of the user. So let’s give the user by default the same protection on the client side that we give them on the server side right?

giordano · July 6, 2020, 10:45pm

I did some research about popcon. I couldn’t find a page dedicated to privacy like https://julialang.org/legal/data/. I only found this excerpt from https://popcon.debian.org/README

!!!

SECURITY NOTE: it’s impossible to make a submission completely anonymous,
since Internet servers tend to add headers and log messages along the way.
Our receiver program at debian throws away this information as soon as
possible so no one will see it, but if you’re really paranoid you might not
want to participate.

!!!

and this other from https://popcon.debian.org/FAQ

Q) What are the privacy considerations for popularity-contest ?

A) Each popularity-contest host is identified by a random 128bit uuid
(MY_HOSTID in /etc/popularity-contest.conf). This uuid is used to
track submissions issued by the same host. It should be kept secret.
The reports are sent by email or HTTP to the popcon server. The
server automatically extracts the report from the email or HTTP and
stores it in a database for a maximum of 20 days or until the host
sends a new report. This database is readable only by Debian
Developers. The emails are readable only by the server admins.
Every day, the server computes a summary and posts it on
https://popcon.debian.org/all-popcon-results.txt.gz. This summary
is a merge of all the submissions and does not include uuids.

Known weaknesses of the system:

Someone who knows that you are very likely to use a particular package
reported by only one person (e.g. you are the maintainer) might infer you
are not at home when the package is not reported anymore. However this is
only a problem if you are gone for more than two weeks if the computer is
shut-down and 23 days if it is let idle.

Unofficial and local packages are reported. This can be an issue
due to 2) above, especially for custom-built kernel packages.
We are evaluating how far we can alleviate this problem.

I don’t think any of these weaknesses apply to the proposed system in Julia. I may not have searched hard enough, but I don’t think the code of the popcon server is available, contrary to PkgServer.

Do note also the goal of popcon is to get an idea of the relative popularity of packages, rather than absolute numbers.

ppalmes · July 6, 2020, 11:30pm

here’s the privacy policy of debian including popcon: Debian -- Privacy Policy

c42f · July 7, 2020, 3:13am

The problem with this scenario is that the attacker now has very likely caught the dissident red handed with those very packages installed on their computer; no need for the UUID at all! Furthermore, they very likely have access to all sorts of other very personal information which are kept on the machine and which are likely to be far more incriminating. (Alternatively, the dissident has excellent security; they’ve encrypted their hard drive and haven’t been threatened with sufficiently convincing rubber hose attack; in which case the attacker gets nothing, including not getting the UUID.)

As I’ve said further up, if you consider the attacker having root on the device containing the UUID, the victim has already lost in ways far worse than exposing the UUID.

Exactly; very well stated. An attacker will go for the simplest straightforward attack which gets what they want with minimal effort — that’s good engineering and just common sense. Proposing byzantine attacks which yield strictly less information than can be had by simpler means just confuses the issue.

c42f · July 7, 2020, 3:22am

I think the main point which is useful here is to note that long term server logs can be a toxic data asset in incriminating someone for actions they’ve taken in the past. I think this is a valid concern in general, but still consider the effort/reward tradeoff to be very unfavourable for the attacker in this particular case.

dlakelan · July 7, 2020, 3:52am

Of course I’m exaggerating the scenario for effect. The point is that the average person starts out saying “This is harmless” and “Julia isn’t going to do anything bad with this info” but I can easily describe a scenario where this isn’t harmless. The more plausible scenario of course is something like what the RIAA used to do which is try to bankrupt single mothers because their tweenage kids shared music or movies online with their friends… only this time instead of a plausibly deniable IP address that’s recycled from user to user, there’s a nearly provably unique UUID to connect the actions to the particular machine. When the RIAA serves the subpoena and the court orders you to allow them to byte-for-byte dump your hard drive… you’ll be unhappy when the forensic investigator manages to connect your machine to a particular IP address and a particular action you took at a particular time because of some unrelated action you did updating your julia packages. I think 35 year old single mothers shouldn’t have to read the Julia docs and understand intimately the details of what their 14 year old nerd kids do when they’re involved in their online coding camp and know that they need to tell their kids to turn off package telemetry.

So, as long as we agree that long term server logs are toxic, and that connecting a client machine to a long history of online actions causes problems for people, and that lots and lots of people will have a lot less insight into how this all works than the nerds like us who care enough to have this discussion, I think we might as well acknowledge that for the client to on some schedule of a few tens of days, maybe 30 or 60 or 100, generate a new UUID automatically… this is rather harmless to the purpose of seeing a pattern of usage and connecting it over a moderate timescale… while also protecting the users from having a toxic asset on their computer that they really don’t understand.
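
The rotation idea could look roughly like the following sketch (Python for illustration; nothing here is Pkg.jl's implementation). The 30-day period is just one of the values floated above, and `ClientId`, `current_client_id`, and `ROTATION_PERIOD` are hypothetical names.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: 30 days, one of the periods suggested in the discussion.
ROTATION_PERIOD = timedelta(days=30)


@dataclass
class ClientId:
    value: uuid.UUID
    created_at: datetime


def current_client_id(
    stored: Optional[ClientId],
    now: datetime,
    period: timedelta = ROTATION_PERIOD,
) -> ClientId:
    """Return the stored ID, or a fresh random one if it is missing or expired."""
    if stored is None or now - stored.created_at >= period:
        return ClientId(value=uuid.uuid4(), created_at=now)
    return stored


t0 = datetime(2020, 7, 7, tzinfo=timezone.utc)
first = current_client_id(None, t0)
still_first = current_client_id(first, t0 + timedelta(days=29))
rotated = current_client_id(first, t0 + timedelta(days=30))

print(still_first is first)          # unchanged within the period
print(rotated.value != first.value)  # replaced once the period has elapsed
```

With this shape, the client never holds an identifier older than the rotation period, so the server can still link requests over a moderate timescale but not indefinitely.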

Oscar_Smith · July 7, 2020, 3:57am

Given that this UUID isn’t connected to anything other than your Julia install, it would seem to me like any subpoena which caught the UUID would also catch the packages. Especially since proving the behavior would presumably require the code which lists what packages you installed anyway.

c42f · July 7, 2020, 4:13am

But I think this has the opposite effect than you intend; when I see wild exaggeration I’m likely to dismiss a post out of hand. Conversely, a well-reasoned and plausible scenario which could directly lead to user harm and with low effort/complexity for the attacker would immediately catch my notice.

I’ll add that I’m completely confident this is true for other core contributors commenting on this thread who are certainly among the most skilled engineers I’ve ever had the pleasure of working with.

dlakelan · July 7, 2020, 4:26am

I think you have to agree that the scenario where it’s time for little Billy’s coding camp, so he fires up his torrent client and downloads the latest hit movie, and then while it downloads he signs into his coding camp and they tell him to install packages xyz, and now some forensic investigator working for the MPAA is able to show that the same IP address accessed the torrent site and the julia site at the same time, and that the UUID used with julia proves that the IP address was in use by little billy at the time and not some other computer, is not entirely farfetched and outlandish.

In the early 2000s multiple rounds of unfortunate parents lost their house over less than this.

Oscar_Smith · July 7, 2020, 4:27am

This would only work if the proposal stored UUID and IP (and time) cross-linked, which we’ve gotten specific confirmation that it won’t happen.

dlakelan · July 7, 2020, 4:37am

It won’t happen intentionally by Julia. Suppose I’m the MPAA and I am looking to set an example of some children and their families (ugh). I get cooperation from FBI etc to monitor the torrent site since it’s a major lawbreaker, so I have logs of IP addresses that persistently infringe… so then I go to the ISP that provides the service and I get a subpoena to monitor users… I get a packet dump of the flows that various people who access the torrent site use… I see that packets go to julia at certain times, and then I subpoena julia to provide UUID logs. I can link it all together no problem. That’s exactly the kind of thing they did in the early 2000s.

Then I sue the mother, and I get a court order to dump the hard drive, and I find the UUID. I can now prove to the jury that this particular computer at this particular time was using this particular IP address and downloading these particular files…

It doesn’t even require julia to do anything “wrong” or to do the IP address linking.

Basically hidden UUIDs are kind of toxic, because they completely foil plausible deniability of much of anything.

Keno · July 7, 2020, 4:54am

While it’s not un-true for me personally, I am a bit of a sucker for the use of hypotheticals in legal analysis and line drawing arguments. I will say that I appreciate @dlakelan’s point of view, not because I think any of it is particularly likely, but because I think that it comes from a slightly different direction than previous commentary, as well as clearly highlighting one particular aspect (one needs to be careful about non-deniable long-term server logs).

All that said, I think it might be time for everyone to take a step back and breathe for a week or two. I think there’ve been a lot of good suggestions among the comments here, but I know that I personally have spent much of the past several days talking to people about this in various fora both public and private and have had very little time to actually take a step back and consider the issue as a whole. And I know others have shouldered far more of the interaction burden here, so I imagine the same is true for them. I think it’s fair to say that you have been heard and that the subject has been brightly illuminated from a plethora of perspectives. Let’s let the folks working on this take a few days away from all this, and then re-approach the topic with fresh eyes.

dlakelan · July 7, 2020, 5:00am

Thanks for that acknowledgement, it was effective in convincing me that someone involved understands the issue. I’m happy to step away.

Sukera · July 7, 2020, 5:31am

Except that argument about IP addresses being plausibly deniable hasn’t been true in decades, since ISPs are required by law (both European and American) to keep logs of IP ↔ endpoint pairings for threat mitigation and law enforcement purposes. Your whole scenario with subpoenaing julia to establish a link between two endpoints is moot, since ISPs are required to keep logs for far longer than the UUID would (presumably) be saved for and past case law has shown time and again that courts happily accept ISP logs as damning evidence already. There’s no need to subpoena one of the endpoints when you’re already reading everything that’s coming through the pipe, it’s just extra hassle with no gain.

Just because Copyright law and associated institutions are broken to hell and back, does not mean that making this telemetry available to julia is as bad a thing as you make it out to be. You’re trying to apply a technical solution to a social problem, which never works out in practice.

Don’t get me wrong, I agree that UUIDs should rotate at least with the same frequency that old entries on the server get deleted. I do not agree with the inflammatory and hyperbolic reasoning.

c42f · July 7, 2020, 6:21am

Out of all the discussion in the last day or two, I think this is one of the best questions which has been asked. I don’t have a precise answer but I feel like you’re right and it should be possible to design more inherently privacy-preserving estimators for the quantities of interest. The main problem is the sheer amount of technical complexity this would add in designing and validating the right estimators, ensuring they’re robust against abuse and dealing with the compatibility between client and server versions in the future.

Tero_Frondelius · July 7, 2020, 9:56am

Sorry for the stupid question: would it help at all if we used a linear map to convert the UUIDs before saving them to the database? This way the UUIDs in the database and the UUIDs on people’s computers wouldn’t match. Still, there would be all the benefits of using the linear map to match each UUID to its database UUID and get accurate statistics.

StefanKarpinski · July 7, 2020, 1:51pm

That really just kicks the can down the road since now anyone who has the transformed UUIDs and the map can recover the original UUIDs, so in terms of security it would be window dressing. Same thing with encrypting UUIDs on the client side: if that’s done in a consistent way (which it needs to be in order to be useful), then the encrypted UUID becomes the equivalent of a UUID itself.
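
A small sketch of why a consistent map is only window dressing (Python, illustration only). The map here is an affine transform modulo 2^128; `MULTIPLIER`, `OFFSET`, `transform`, and `invert` are all made-up names and values for this example, not anything proposed for PkgServer.

```python
import uuid

# Hypothetical constants: any odd multiplier is invertible modulo 2**128.
MULTIPLIER = 0x9E3779B97F4A7C15F39CC0605CEDC835
OFFSET = 12345
MODULUS = 1 << 128


def transform(u: uuid.UUID) -> uuid.UUID:
    """Map a client UUID to the value that would be stored in the database."""
    return uuid.UUID(int=(MULTIPLIER * u.int + OFFSET) % MODULUS)


def invert(t: uuid.UUID) -> uuid.UUID:
    """Recover the client UUID; possible for anyone who knows the map."""
    inverse = pow(MULTIPLIER, -1, MODULUS)  # modular inverse, Python 3.8+
    return uuid.UUID(int=((t.int - OFFSET) * inverse) % MODULUS)


client_id = uuid.uuid4()
stored_id = transform(client_id)

print(stored_id == transform(client_id))  # consistent, so it still links requests
print(invert(stored_id) == client_id)     # and the original is recoverable
```

Because the map has to be deterministic to keep the statistics accurate, the stored value identifies a client just as well as the original did, and holding the map undoes it entirely.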
