Pkg.jl telemetry should be opt-in

Can I use the existing technology to track usage data for my private package server? The same data and analysis might become very handy in the future if our company's Julia usage grows. Currently we use GitLab and full addresses to install private packages, because we don’t have that many. Setting up a package server seems like a good idea.

2 Likes

But if what you say could actually happen, then any UUID, no matter how fleeting, is a problem.

1 Like

I Googled all around for you, and it appears Continuum (Anaconda) has done a pretty good job of obscuring the details in legalese, such as what @mbauman found, so if you have a real need for details it would probably require a packet sniffer like Wireshark. But back to my main point: in my experience, every piece of software installed in a financial institution will definitely get an IT security review, and any “calling home to mama” will definitely be uncovered and will need to be explained. So it would be much better to have an explicit opt-in or a configuration flag handy to turn that off when that security inspection inevitably occurs. As discussed elsewhere in this (insanely loooong :joy:) thread, doing so is likely to make Julia a much easier “sale” to large corporations, financial institutions, banks, etc. I won’t list details here because they definitely don’t want their names showing up in Google searches associated with such issues, for obvious reasons.

Does Debian have a clear policy on APT logs? And does Ubuntu follow the same? I ask mainly for my own information - since I actually used to be a Debian developer a very long time ago. I haven’t searched online yet - so please excuse me if it is easily found.

7 Likes

Why is it that we assume that the benefit of counting uses of a package can only be done on the client side? Why is the server-side not reliable enough? Meaning: why can’t the point from which the package is served not count how many times it was delivered? I understand that the system could be gamed, but is that a big concern?

4 Likes

True, obviously people doing those things should definitely OPT OUT. But reducing risk by rolling the ID helps, right? I mean, it’s one thing to prove someone once did something in the last few weeks, but it’s another to prove that they spent the last 10 years actively working against their government’s encryption policies.

4 Likes

Who exactly are the people handling the raw data and processing the aggregate? With the finances at stake here, I believe all the individuals processing and handling this data should be scrutinized to not manipulate it. In the case someone needs to be sued, it should be 100% transparent who the exact individuals are handling this information.

Regarding Debian, they do have a system to track the popularity of packages: popcon. But it’s opt-in; you need to install the popularity-contest package.

3 Likes

Financial institutions (and pharma companies, the government, semiconductor people, etc., etc.) already have trouble with downloading things from the internet at all, particularly things that can execute arbitrary code. Often these rules are bespoke and require tons of handholding. We work with companies deploying Julia all the time to help them set it up right, but all that is somewhat orthogonal to this conversation.

7 Likes

Correct. There is never any data collected to be sent later. The only data sent is about the current request and the client making it.

6 Likes

I think an implementation similar in spirit to popcon would be nice. Results such as these provide great insights into different packages: Debian Popularity Contest

2 Likes

I don’t think this is an accurate description. Julia/Pkg is not “phoning home”; you are just making requests to a package server. If you point to a different package server, then all the traffic goes there. If you don’t download packages, nothing happens. To me “phoning home” evokes contacting a specific “home” URL, and possibly in such a way that the user is not aware they’re sending any network traffic.

14 Likes

I’m coming in late to this discussion, and the thread is quite lengthy, so I may be missing some information here.

If we were to count on the server side the number of times a package was delivered, how would we uniquely identify users? It might not seem like that big of a thing to just count the total number of package downloads, but what about a scenario like a CI system?

Usually you would have a couple of environments, Linux and OSX. In each of those you would be testing Julia v1.0-1.4. You run your CI pipeline multiple times per day: a nightly build and a couple of merge requests. You’re looking at 4-12 package downloads per day. Over the course of a year that’s ~1,500-4,500 artificial downloads for a package. In reality, however, it’s just two users (the Linux and OSX environments).
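The yearly range quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the pipeline runs every day of a 365-day year, which the post does not state; the function name `yearly_downloads` is made up for illustration.

```python
DAYS_PER_YEAR = 365  # assumption: CI runs every calendar day


def yearly_downloads(downloads_per_day: int) -> int:
    """Scale a daily download count to a full year."""
    return downloads_per_day * DAYS_PER_YEAR


low = yearly_downloads(4)    # lower bound of the 4-12 per day range
high = yearly_downloads(12)  # upper bound of the 4-12 per day range
print(low, high)  # 1460 4380
```

The results, 1,460 and 4,380, match the rounded ~1,500-4,500 figure, all of it generated by what are effectively two clients.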

Fold in other scenarios like this and you can see how badly these numbers would be artificially inflated. You’ll never get a truly accurate number of users, but I think approximating it as best you can, and accounting for systems like this, is important. A lot of the internet is becoming robots just talking to other robots.

3 Likes

That’s precisely one of the motivating factors here. See how the existence of CI variables is noted in the headers to address this.
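To illustrate why a CI marker on each request helps, here is a rough sketch of a server-side tally that reports raw downloads next to distinct non-CI clients. The record type and its fields (`client_id`, `package`, `is_ci`) are hypothetical stand-ins for illustration, not Pkg's actual headers or log format.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Request(NamedTuple):
    client_id: str  # hypothetical per-client identifier
    package: str
    is_ci: bool     # hypothetical flag, e.g. set when CI variables were seen


def summarize(requests: Iterable[Request]) -> Dict[str, Tuple[int, int]]:
    """Map each package to (raw download count, distinct non-CI clients)."""
    raw = Counter()
    clients = defaultdict(set)
    for req in requests:
        raw[req.package] += 1
        if not req.is_ci:
            clients[req.package].add(req.client_id)
    return {pkg: (raw[pkg], len(clients[pkg])) for pkg in raw}


# Twelve CI downloads from two environments, plus three downloads by two people.
log = (
    [Request("ci-linux", "Example", True)] * 6
    + [Request("ci-osx", "Example", True)] * 6
    + [
        Request("alice", "Example", False),
        Request("alice", "Example", False),
        Request("bob", "Example", False),
    ]
)
print(summarize(log))  # {'Example': (15, 2)}
```

A plain download counter would report 15 here, while the CI flag and a client identifier together bring the estimate down to 2 interactive users.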

4 Likes

I’m not sure if I’ve stated it further up in the thread, but if your threat model includes nation-state attackers, you’re best off not communicating at all, since at that point the threat in question is certainly willing to fake records to make you guilty if they want to make you go away. There’s no practical defense against APTs; using them as an argument against this very minimal data is very far-fetched.

This is ridiculous. This is already possible by sitting in the ISP and tracking which servers your dissidents connect to; no UUID and no advanced data exfiltration with persistence in foreign systems are needed. As I said earlier in this thread, for this kind of deanonymization, easier and more practical attacks that scale better already exist, so using them as an argument against the UUID is basically a strawman.

11 Likes

For my taste the best way in this case would be to have telemetry on by default with the message showing each time as suggested above and one simple command to switch it off.

However it could be a viable compromise (meaning >70% of users would enable it) to have it off by default and a lot of nagging. Like:

Pkg telemetry disabled. Consider helping the Julia community by enabling statistics (more text)
(@1.5.1) pkg>

as a standard Pkg prompt, and half a page of more detailed information (including a link to this discussion) each time one asks Pkg to download a package.

8 Likes

Honestly I often try to avoid even civil public debate because I prefer teamwork and agreement.
But you asked, so I’ll answer:

So what happens “If you DON’T point to a different package server …”
Pkg would contact a specific “home” URL, right? And opt-out implies
“… possibly in such a way that the user is not aware they’re sending any network traffic.”

At any rate, I believe that the Julia team has good intentions and such,
but at least at first glance the scenario above appears
to be more surreptitious than a straight up opt-in request
and could easily be a sticking point for IT Security departments.

So why not proceed in a manner such as Petr @PetrKryslUCSD described here:

Why is it that we assume that the benefit of counting uses of a package can only be done on the client side? Why is the server-side not reliable enough? Meaning: why can’t the point from which the package is served not count how many times it was delivered? I understand that the system could be gamed, but is that a big concern?

And/Or

@giordano here: Regarding Debian, they do have a system to track popularity of packages: popcon. But it’s opt-in, you need to install the popularity-contest package. (Pkg.jl telemetry should be opt-in - #310 by giordano)

HTH,
Marc

Do you only object to the inclusion of a UUID in the request, or something more? It sounds like you may be saying that a user is unaware they are sending a network request when they type ]add Package.

Anyway, to me adding a package is like visiting a particular URL in my browser. But if my browser also sent a record of that to Google or Mozilla, then that would be “phoning home”. I’m just pointing out that we’re not making a separate connection to a separately-configured “home” server; there is only the package server request itself.

17 Likes

I think this should be clearly highlighted on the webpage where this telemetry is described. Before understanding this part, I personally thought Julia always calls “home servers”, and my attitude to Julia telemetry was 50/50. After reading your message I completely support the Julia approach, and this part is probably also not clear to some of those arguing against it.

15 Likes

My point is just that our main worry shouldn’t be what Julia will do with the information; we need to worry about the worst thing that could happen if this database of unique IDs and actions taken is leaked, exfiltrated, or misused by someone else. If the unique IDs don’t exist, or are rolling so that you can’t connect long periods of time together, the potential threat is much smaller.
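One way a rolling ID could work, sketched under stated assumptions: the client keeps a random secret locally and derives a fresh identifier for each rotation period, so requests within a period share an ID but IDs from different periods cannot be linked without the secret. The 7-day period, the HMAC construction, and the name `rolling_id` are all illustrative choices, not anything Pkg is known to do.

```python
import hashlib
import hmac
import secrets
from datetime import date

ROTATION_DAYS = 7  # hypothetical rotation period


def rolling_id(secret: bytes, day: date, rotation_days: int = ROTATION_DAYS) -> str:
    """Return an ID that is stable within one rotation period and changes in the next."""
    # Integer index of the rotation period that contains `day`.
    period = day.toordinal() // rotation_days
    digest = hmac.new(secret, str(period).encode("ascii"), hashlib.sha256)
    return digest.hexdigest()[:32]


# The secret stays on the client and is never sent; only the derived ID is.
secret = secrets.token_bytes(32)
print(rolling_id(secret, date.today()))
```

A simpler variant would skip the long-lived secret entirely and generate a new random ID each period, discarding the old one; that avoids the case where a leaked secret lets someone link every period after the fact.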

We’ve already seen several people here come to some realization that there’s a broader question involved because of the examples I’ve given. I think that’s enough. Any one of those examples is sufficient; whether it’s a patent lawsuit, a divorce proceeding, a criminal investigation, or an advanced persistent threat doesn’t matter as much.

Also, I should say that anyone with a machine on the internet is subject to advanced persistent threats. Working with the OpenWrt team, it’s become clear that APTs do things like design software worms that move from router to router and exfiltrate people’s information or do other nefarious stuff. There are examples of this with respect to MikroTik products, if I remember correctly. So if you’re running a MikroTik router on your home network, you could be compromised by an APT. It’s not just about targeted compromises. See: Over 200,000 MikroTik Routers Compromised in Cryptojacking Campaign - Security News

1 Like