Pkg.jl telemetry should be opt-in

Hi Viral @viralbshah, Stefan @StefanKarpinski, and the Julia community :slight_smile:

Ok so I was just about to post this when I read Viral’s
post making essentially the same points but more elegantly :slight_smile:
At any rate I suppose I’ll reemphasize the points …

TL;DR, summarizing my point up front:
If the telemetry and other usage data are used to find ONLY the most historically successful projects, and there is no policy in place to redistribute some of the funding secured on the strength of those older projects to INCUBATE new, innovative community projects, there is a higher risk that Julia will lose its cutting-edge community projects and soon suffer from a lack of innovation and actual progress.

IOW, as my economics professor wisely said, “Incentives matter,”
so we should be careful about exactly what is incentivized.

So I’m suggesting Julia be sure to fund an incubation period for new innovative packages like
EmpiricalCDFs.jl as they are just getting started. Below are some reasons why.

I’m not as concerned with the privacy aspect here (I’m sure it will be handled correctly) as with the fact that, even if Pkg.jl telemetry succeeds in counting up Package-XYZ users for funding and support, there is no clear mechanism to support ALL the novel, cutting-edge Packages-ABC (MUCH more interesting to me), given their INITIALLY and NECESSARILY lower volume of usage.

To be specific, I see EmpiricalCDFs.jl from John @jlapeyre, with 14 stars, as a novel cutting-edge package,
and so I have starred it here: https://github.com/jlapeyre/EmpiricalCDFs.jl
where John @jlapeyre also notes the following:

I’m surprised that this module is not more popular (if stars are a good measure) because it’s rather generic, I use it frequently for new projects, and the functionality is not available elsewhere.

EmpiricalCDFs implements empirical CDFs; building, evaluating, random sampling, evaluating the inverse, etc. It is useful especially for examining the tail of the CDF obtained from streaming a large number of data, more than can be stored in memory. For this purpose, you specify a lower cutoff; data points below this value will be silently rejected, but the resulting CDF will still be properly normalized.
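
For readers unfamiliar with the idea, here is a minimal sketch of the lower-cutoff behaviour described in that quote. It is written in Python purely for illustration and is not the EmpiricalCDFs.jl API; the class name `TailECDF` and its methods are hypothetical. The assumption is that "properly normalized" means rejected points still count toward the total, so CDF values above the cutoff match those of the full data set.

```python
import bisect


class TailECDF:
    """Empirical CDF that stores only points >= lower_cutoff (hypothetical helper)."""

    def __init__(self, lower_cutoff):
        self.lower_cutoff = lower_cutoff
        self.kept = []      # retained points, maintained in sorted order
        self.rejected = 0   # number of points seen below the cutoff

    def push(self, x):
        """Consume one streamed data point."""
        if x < self.lower_cutoff:
            self.rejected += 1  # not stored, but still counted
        else:
            bisect.insort(self.kept, x)

    def __call__(self, x):
        """Fraction of ALL points seen so far (kept or rejected) that are <= x."""
        if x < self.lower_cutoff:
            raise ValueError("the CDF is not tracked below the cutoff")
        total = self.rejected + len(self.kept)
        if total == 0:
            raise ValueError("no data seen yet")
        at_or_below = self.rejected + bisect.bisect_right(self.kept, x)
        return at_or_below / total


# Stream the integers 1..10 but store only the tail from 8 upward.
ecdf = TailECDF(lower_cutoff=8)
for value in range(1, 11):
    ecdf.push(value)

tail_values = [ecdf(8), ecdf(9), ecdf(10)]  # 0.8, 0.9, 1.0
```

Only three of the ten points are held in memory here, yet the tail values agree with the CDF of the full stream, because the rejected count enters the denominator.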

This ability to process and filter data ONLINE AT SCALE is ABSENT (emphasis from @marc.cox) in StatsBase, which I note has 277 stars and is therefore likely to win out over EmpiricalCDFs IF Pkg.jl telemetry succeeds and Julia.org doesn’t intentionally redistribute (any needed) funds from " … funding/awards/etc from knowing how many users we have."

HTH,
Marc

PS: Separately, I checked with John @jlapeyre about reposting his public quotes
and found out he doesn’t require funding at the moment for
https://github.com/jlapeyre/EmpiricalCDFs.jl … but I believe
the principle of the matter is still the same: Julia should continue to
focus on, and fund if necessary, innovation and actual progress, even for specialized customers.

1 Like

I’m curious to know if @anon94023334 has any thoughts on this subject.

What are some comparable packages in other languages? Has anyone been successful getting any of those packages installed inside a national lab or large financial institution?

It was probably already difficult getting Julia into some enterprises, and I don’t see this making things easier. In some borderline cases, I can imagine this crossing a line on some IT administrator’s checklist, resulting in denial even if it can be disabled.

Is there a write up somewhere explaining what is actually happening? I am still confused.

1 Like

This is what is happening: https://julialang.org/legal/data/
When you make a request to the package server (to be used by default in v1.5), it will send some headers just like any other HTTP request. That page details exactly what is sent.

Obviously I can’t speak to others’ specific IT policies, but all of this only comes into play when you interact with the package server — in other words, downloading and running arbitrary code. That is already just about the most dangerous thing you can do from an IT security standpoint, so it’s hard to believe that tiny bit of extra info in the request header would be the deal-breaker. Of course, an IT admin might object to anything whatsoever, so without more specific information about what the problem would be there’s not much we can do.

9 Likes

A security issue for example might be if a large company is using Julia to provide say an internet facing web service… a bug is identified in which pkg A and pkg B when used together cause an exploitable buffer overflow condition in some C code they are wrapping, and this enables remote root access to the machine…

Now, suppose this is installed on thousands of servers around the world. If someone can get access to information about what versions of the package each machine is running, say by targeting the Julia pkg server and exfiltrating the database… they can then execute a targeted attack on this entire network, gaining root access to thousands of machines deep inside data centers at the heart of this multinational corporation.

That doesn’t sound particularly nice.

2 Likes

Daniel @dlakelan and Stefan @StefanKarpinski ( because I recall your interest in security )

I’ve worked at a large bank in IT security, and your hypothetical scenario
is a legitimate concern: they are very sensitive to any information about their infrastructure being exfiltrated. So in essence, unless they are somehow absolutely convinced it’s secure, there may be more liability to Julia in retaining this data than there is to gain from it.

Stefan, I wonder: why not just count the number of package downloads instead?

HTH,
Marc

1 Like

In order for this to be a realistic attack, you need to also suppose that someone has reverse engineered a mapping from the server fingerprint => individual UUIDs of the developer(s) within the company. Where will they get this mapping from? If it’s from compromising individual developer machines, the attacker already has a much more valuable foothold in the target organization from which they can extract all sorts of detailed knowledge of the organization.

If you compose this with the UUID => package set mapping (which you get from attacking the package server), you will now end up with a server fingerprint => package set mapping.

This is an awful lot of work for the attacker which can be mitigated by your supposed multinational company by enforcing either or both of two simple policies:

  • Use an internal package server (companies with sensitive infrastructure often prevent any access to the open internet, so enforcing this could be very easily done at the network layer).
  • Ensure developers opt-out of usage data.

In all, I don’t think this is a credible attack scenario: it presents a lot of work and complexity for the attacker for uncertain gain and is easily mitigated with standard IT policies.
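
The composition argument above can be sketched in a few lines of Python. All fingerprints, UUIDs, and package names below are made up, and `compose` is a hypothetical helper; the point is only that an attacker needs both mappings before a server fingerprint can be tied to a package set.

```python
def compose(fingerprint_to_uuids, uuid_to_packages):
    """Join fingerprint => UUIDs with UUID => packages into fingerprint => packages."""
    result = {}
    for fingerprint, uuids in fingerprint_to_uuids.items():
        packages = set()
        for uuid in uuids:
            # A UUID missing from the second mapping contributes nothing.
            packages |= uuid_to_packages.get(uuid, set())
        result[fingerprint] = packages
    return result


# Made-up example data: the first mapping would have to come from somewhere
# other than the package server, the second from the package server itself.
fingerprint_to_uuids = {"server-A": ["uuid-1", "uuid-2"], "server-B": ["uuid-3"]}
uuid_to_packages = {"uuid-1": {"PkgA"}, "uuid-2": {"PkgA", "PkgB"}}

joined = compose(fingerprint_to_uuids, uuid_to_packages)

# With only the package-server data and no first mapping, nothing can be joined.
without_first_mapping = compose({}, uuid_to_packages)
```

Opting out removes entries from the second mapping, and an internal package server keeps it from existing on the public server at all; either way the join comes back empty for that organization.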

6 Likes

Since there isn’t yet an official readable document about how the whole thing works… I’m going on the basis that every machine gets its own UUID when it first runs Pkg. So the server comes online, it runs the install program, that install program goes and gets all the necessary Pkgs and at this point the UUID is set up. From now on, every package installed on that machine is somehow in a database.

If this is the case, then the database held by the Julia project is a valuable target for exfiltration. It potentially gives you information about the software installed on millions and millions of servers (hopefully Julia becomes used by millions of servers!).

I don’t see how we can pooh-pooh that, particularly if Julia is also tracking IP addresses. It becomes a directory of vulnerable machines and their IP addresses as soon as a vulnerability is discovered that can be exploited.

At the very least, it should be actively recommended that people looking to install Julia on internet-facing servers use their own Pkg server. A good how-to on that should be written, and pressing ] should get you a help item with info on using your own Pkg server.

For example, suppose you gain a tap on packets going in and out of a given organization by compromising some ISP infrastructure… now you can maybe see UUIDs going to and fro… You have a list of a few hundred machines and their IPs and UUIDs… What software is on those machines that you can compromise? Let’s download the Julia database off a darkweb site and find out…

That seems like a very legit concern.

1 Like

OK, I think that might be fair: the log of UUIDs can be considered a possibly toxic data asset which may be used against a wide variety of target servers.

But we should also frame this debate by comparing to the damage that release of a fairly normal server log might do in the absence of UUIDs. I don’t know what the package server will log currently, but let’s suppose it keeps:

  • IP address
  • Packages requested by that IP address

Now, suppose you get hold of those logs. You already have the information to attack servers in exactly the way you mention regardless of UUIDs being included or not. As others have mentioned further up the thread, the UUIDs are less valuable to an attacker than data which is already inherently known as part of normal package server operation.

(Side note: I’d point out that any well-developed workflow for server deployment would involve downloading packages to the build machine, not on production servers. So if anything, the UUID and IP would be associated to the build machines, not the public IP of a production server. Naturally some people will update their packages on production servers and we should consider that as an expected use case. But I’d generally discourage this and not expect to see it in a modern deployment scenario.)

11 Likes

How many times does someone have to send a link to this excruciatingly detailed description of exactly what is sent and why before people will stop claiming that this isn’t documented?

Continuing to pretend this doesn’t exist and hasn’t been linked to a dozen times in this thread really undermines any credibility your critiques might have.

18 Likes

Sorry I was still relying on this

As you can see, the way that Discourse modifies and displays that link is extremely generic. By itself there is no obvious indicator of what that content is. I will take a look.

2 Likes

My understanding is that, while the server will log IP addresses as a matter of course, they will not be stored with the header data, and the IP logs will probably be deleted much more frequently. So it is not easy to construct server fingerprints from the dataset. But the developers of Pkg and PkgServer can probably say more.

10 Likes

That is correct. By sequestering IP addresses from other request data, what is collected is actually much weaker from a privacy perspective than the way IP logs are normally collected and used to identify people.
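
To make the difference concrete, here is a rough Python sketch contrasting a conventional joined log with a sequestered one. It illustrates the idea only and is not the actual PkgServer implementation; the class names and the `client-uuid` header are hypothetical, and the IP addresses come from the reserved documentation ranges.

```python
from collections import Counter


class JoinedLog:
    """Conventional log: each row ties an IP address to its request headers."""

    def __init__(self):
        self.rows = []

    def record(self, ip, headers):
        self.rows.append((ip, dict(headers)))


class SequesteredLog:
    """IPs and headers go into separate tallies with no shared key or ordering."""

    def __init__(self):
        self.ip_counts = Counter()
        self.header_counts = Counter()

    def record(self, ip, headers):
        self.ip_counts[ip] += 1
        # Headers are tallied on their own; the IP is never stored beside them.
        self.header_counts[tuple(sorted(headers.items()))] += 1


# Made-up requests for illustration.
requests = [
    ("192.0.2.10", {"client-uuid": "uuid-1"}),
    ("198.51.100.7", {"client-uuid": "uuid-2"}),
]

joined, sequestered = JoinedLog(), SequesteredLog()
for ip, headers in requests:
    joined.record(ip, headers)
    sequestered.record(ip, headers)
```

From `joined.rows` an attacker can read off which IP sent which UUID. From the sequestered tallies they learn that two IPs and two UUIDs were seen, but not which belongs to which. A real deployment would also have to avoid side channels such as matching timestamps, which this sketch does not model.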

11 Likes

How many times does someone have to send a link to this excruciatingly detailed description of exactly what is sent and why before people will stop claiming that this isn’t documented?

Come on Stefan. Be a better steward. And shame on those who “liked” it, but I’m sadly not surprised.

Continuing to pretend this doesn’t exist and hasn’t been linked to a dozen times in this thread really undermines any credibility your critiques might have.

What is more likely?

  1. @dlakelan didn’t see the link in the sea of responses in this thread or
  2. @dlakelan is actively ignoring information so that he can argue with people on the internet?

I don’t know @dlakelan personally, but the fact he is taking time out of his schedule (who knows how busy he is, how many kids he has, etc) asking good questions and raising good points indicates that he cares. To shoot him down like that is kind of shitty. How long do you think you can behave like that before losing good people and good partners?

Instead of attacking people who care for not diligently reading every post in this long discussion, why hasn’t anybody made an actual announcement with the link prominently visible instead of buried in a long conversation? Why isn’t this topic pinned at the top of the forum? Why did it have to come up from a concerned person posting on Discourse? Why wasn’t the community consulted before making this significant decision? It makes me wonder, what is the governance of Julia as a language (I don’t really see any to be honest)?

I’ve had some sad conversations with people recently (including @anon94023334 who has essentially given up on Julia) and we are losing other good people.

I have to say this telemetry thing does seem ill-conceived. After everything I’ve read here, I feel this effort should be driven by Julia Computing (possibly via JuliaHub) since JC seems to be the one with the most to gain. That is not a bad thing and I would do the same thing. However, blurring the lines between what is good for the language and what is good for JC appears, to me, to be causing more harm than good. Putting it in a standard library was a mistake. The language seems to be doing just fine without telemetry.

Is it time to have a Julia Foundation with people not involved with Julia Computing?

6 Likes

I think there already is one: http://www.thejuliafoundation.org/. Consider donating.

5 Likes

This may not be the only time we encounter this kind of issue. I would like to suggest a developer community similar to the Debian developers, where the constitution and bylaws are driven by the community, with each member having one vote on important decisions, etc. I think we also need some kind of formal constitution that we can abide by: https://www.debian.org/devel/constitution

3 Likes

That’s a joke, isn’t it? I like humour as much as anybody else, but given the tone of the conversation I think that at least without a disclaimer this reply is a bit out of place.

3 Likes

Sounds good. Are you volunteering? :slight_smile:

More seriously, I am invested enough in the language to do what I can to help out with something like this.

I’m just so used to the Debian way of doing things. Debian has survived so many issues because it has very well-defined steps for resolving conflicts written into its constitution. Because of its well-written community and packaging standards, Debian is a very well-regarded OS, and so is its community. I suggest that those with a background in law help with this constitution.

2 Likes

I’m not going to put a disclaimer that a link to a foundation that “provide assistance to those in need while creating awareness of the power of art to heal and inspire” that happens to also be called “The Julia Foundation” is not a serious suggestion to an alternative non-profit organization for the Julia language. It was a fun coincidence that I found, lighten up a little.

7 Likes

Yeah. But I think we need someone who is well-versed in organizing and in getting everyone to mostly agree on what the Julia developer community should stand for.