Pkg.jl telemetry should be opt-in

kevbonham · July 3, 2020, 1:44am

I think the idea is that you use statistics. If everyone is getting a random number you would not expect 65k users to all have a different number. I’m sure there’s some fancy math that @Karajan is referring to that gives more precision, but here’s my naive attempt

function bootstrap_users(ids, users)
    fraction_ids = Float64[]
    for i in 1:500
        u = rand(1:ids, users)
        push!(fraction_ids, length(unique(u)) / ids)
    end
    return fraction_ids
end

maximum(bootstrap_users(65_000, 65_000))  # 0.635200
maximum(bootstrap_users(65_000, 200_000)) # 0.956615
maximum(bootstrap_users(65_000, 500_000)) # 0.999754

So even with 500k users and only 65k numbers, the largest number of IDs in use in any of the 500 iterations was 64,984.
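For comparison, the expected value has a simple closed form: if each of `users` people draws one of `ids` numbers uniformly at random, a given number is missed by everyone with probability (1 - 1/ids)^users. A minimal sketch, assuming independent uniform draws as in the simulation (the name `expected_fraction` is made up for illustration):

```julia
# Expected fraction of the `ids` available numbers that are used at least once
# when `users` people each draw one number uniformly at random.
# Assumption: independent, uniform draws (same model as `rand(1:ids, users)`).
expected_fraction(ids, users) = 1 - (1 - 1 / ids)^users

for users in (65_000, 200_000, 500_000)
    println(users, " users: ", round(expected_fraction(65_000, users), digits = 6))
end
```

These come out at roughly 0.632, 0.954 and 0.9995, a bit below the simulated maxima, as you would expect for a mean compared with the maximum over 500 runs.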

EDIT: because I couldn’t resist:

code

function bootstrap_users(ids, users)
    fraction_ids = Float64[]  # fraction of the ID space in use, one entry per iteration
    for i in 1:500
        u = rand(1:ids, users)  # every user draws a random ID
        push!(fraction_ids, length(unique(u)) / ids)
    end
    return fraction_ids
end

u65k = bootstrap_users(65_000, 65_000)
u200k = bootstrap_users(65_000, 200_000)
u500k = bootstrap_users(65_000, 500_000)
u600k = bootstrap_users(65_000, 600_000)
u800k = bootstrap_users(65_000, 800_000)
u1m = bootstrap_users(65_000, 1_000_000)

using StatsPlots

plot(histogram(u65k, primary=false, title="65k"),
     histogram(u200k, primary=false, title="200k"),
     histogram(u500k, primary=false, title="500k"),
     histogram(u600k, primary=false, title="600k"),)

kevbonham · July 3, 2020, 1:59am

If the majority of users that needed to be counted had public packages, presumably they could just count contributors to the packages in General.

Oscar_Smith · July 3, 2020, 2:01am

I’m not 100% sure, but I think that the error bounds of this grow quite large as the number of users increases.

kevbonham · July 3, 2020, 2:07am

Based on the plots (just added in an edit), it seems pretty tight, though I guess the real thing I should do is model how the fraction of IDs predicts the true number of users rather than the other way around as I’m doing (I’m not a very good statistician). I presume there’s some application of the normal distribution that leads to this conclusion:

But I think the point is that, once the error bounds get too big, you can just increase the maximum. You’d still have enough overlapping IDs that they wouldn’t be useful as identifiers for people, but presumably your error bounds would shrink again.
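A rough way to check that intuition is to simulate the estimate itself for two sizes of ID space. This is only a sketch under the same uniform-draw assumption; `users_from_unique` and `relative_spread` are hypothetical helper names, and the 200k ID space is an arbitrary choice for illustration, not a proposal:

```julia
using Random, Statistics

# Invert E[unique] = ids * (1 - (1 - 1/ids)^users) to get a user estimate
# from the number of distinct IDs seen. Breaks down (returns Inf) if every
# single ID has been seen.
users_from_unique(unique_ids, ids) = log1p(-unique_ids / ids) / log1p(-1 / ids)

# Standard deviation of the estimate relative to the true user count,
# measured over `samples` simulated populations.
function relative_spread(ids, users; samples = 30, rng = MersenneTwister(1))
    estimates = [users_from_unique(length(unique(rand(rng, 1:ids, users))), ids)
                 for _ in 1:samples]
    return std(estimates) / users
end

small = relative_spread(65_000, 500_000)    # ID space almost completely used up
large = relative_spread(200_000, 500_000)   # larger ID space, same number of users
println("65k IDs: ", small, "   200k IDs: ", large)
```

With the ID space nearly saturated, a handful of unseen IDs decides the whole estimate, so the spread is much wider than with the larger ID space. How much overlap between users remains in the larger space is a separate question that this sketch does not answer.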

EDIT: again, just 'cuz.

code

function stupid_bootstrap(ids_taken, uids; samples = 200)
    users = Int[]
    for i in 1:samples
        ids = Set(Int[])
        n_users = 0
        while length(ids) < ids_taken
            n_users += 1
            push!(ids, rand(1:uids))
        end
        push!(users, n_users)
    end
    return users
end

using Statistics  # for mean and std in the plot labels below

i60k = stupid_bootstrap(60_000, 65_000)
i63k = stupid_bootstrap(63_000, 65_000)
i64k = stupid_bootstrap(64_000, 65_000)

plot(histogram(i60k, primary=false, title="60k taken", xlabel="mean=$(round(mean(i60k), digits=2)) std=$(round(std(i60k), digits=2))"),
     histogram(i63k, primary=false, title="63k taken", xlabel="mean=$(round(mean(i63k), digits=2)) std=$(round(std(i63k), digits=2))"),
     histogram(i64k, primary=false, title="64k taken", xlabel="mean=$(round(mean(i64k), digits=2)) std=$(round(std(i64k), digits=2))"))
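What this loop simulates is the classic coupon-collector waiting time, so the mean and standard deviation in those plot labels can also be computed directly: going from j to j+1 distinct IDs takes a geometric number of new users with success probability (uids - j)/uids. A sketch for cross-checking, under the same uniform-draw assumption (`draws_to_collect` is a name made up here):

```julia
# Mean and standard deviation of the number of users needed until
# `ids_taken` distinct IDs (out of `uids` possible) have been drawn.
# Each step from j to j+1 distinct IDs is geometric with success
# probability p_j = (uids - j) / uids, and the steps are independent.
function draws_to_collect(ids_taken, uids)
    p = [(uids - j) / uids for j in 0:ids_taken-1]
    mu = sum(1 ./ p)                         # sum of geometric means 1/p_j
    sigma = sqrt(sum((1 .- p) ./ p .^ 2))    # sum of geometric variances (1-p_j)/p_j^2
    return mu, sigma
end

for taken in (60_000, 63_000, 64_000)
    m, s = draws_to_collect(taken, 65_000)
    println(taken, " taken: mean=", round(m, digits = 2), " std=", round(s, digits = 2))
end
```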

helgee · July 3, 2020, 5:03am

Slight tangent: As an EU citizen, I feel that the EU’s data privacy regulations are a good idea but the implementation leaves a lot to be desired and has a few nasty side effects. Cookie prompts are now even worse than before and stuffed to the brim with dark UI patterns to trick people into opting in, e.g. “necessary cookies” is preselected but the highlighted button will accept all categories and not just the selected one.

TL;DR: I fully agree with Stefan that it is much better to only collect the data that is essential to provide your service and be upfront about it.

Tamas_Papp · July 3, 2020, 5:30am

I am not sure about this; “Julia package manager collects anonymous package usage data in a transparent and open way — unless you opt out” is hardly headline material even for certified nerds.

dlakelan · July 3, 2020, 5:35am

you’re assuming accuracy… the headline would read “Open source software spies on users by default” or such.

Tamas_Papp · July 3, 2020, 5:43am

Outlandish claims can be made about anything; this does not necessarily mean that they will get traction.

Implicitly, I am assuming that the target audience of Julia consists of mostly rational, level-headed people who would investigate such claims themselves. As long as we make this easy to do, it will be hard to spin this in a malicious way.

dlakelan · July 3, 2020, 5:45am

and their boss, their consulting clients, insurance companies, lawyers…

heliosdrm · July 3, 2020, 5:48am

This has been cited several times in this thread:
https://julialang.org/legal/data/

As others have pointed out, that would be a fallacious and ill-intentioned exaggeration. If someone is willing to ruin Julia’s reputation by spreading lies, even removing all telemetry could be spun into a negative headline.

anon67531922 · July 3, 2020, 6:31am

I am not sure about this; “Julia package manager collects anonymous package usage data in a transparent and open way — unless you opt out” is hardly headline material even for certified nerds.

The risk could be close to zero - I have no idea - but I fear there could be a risk of something real happening, e.g. some data breach or some as yet unforeseen issue that is real and headline-worthy. If a hacker somehow accessed all the collected telemetry data, is there something that could be done with it? I don’t know. What could happen if UUIDs linked to IP addresses were released accidentally? I have worked on projects linking “anonymous” data to personally identifying data. It isn’t that hard. If there is no risk at all, that is great. If someone could explain the potential risks, if any, that would be good.

nilshg · July 3, 2020, 6:39am

I second Eric’s question from above - how is the separation of IP addresses and UUID handled? It seems to me that the combination of both is the potentially (not saying actually!) most problematic aspect of this - apologies if I have missed an explanation somewhere in this long thread, but I can’t find IP addresses mentioned in the data policy section of the Julialang website linked above.

Tamas_Papp · July 3, 2020, 6:54am

… can be directed to the details, read through them, and move on (in addition, the lawyers will be delighted to send an invoice, thank you so much).

Dilbert cartoons notwithstanding, the private sector is not completely staffed with gullible idiots. If/when the question of Julia’s package telemetry comes up, it will be reviewed by people who have seen quite a few data collection policies, 99% of which are orders of magnitude less transparent and innocuous.

Karajan · July 3, 2020, 7:01am

Yes, pretty much what @kevbonham wrote: some statistics, relying on the fact that it is highly unlikely not to get clashes (birthday paradox) and therefore not to use up all the available random numbers.
I took N random UInt16s and looked at how many unique ones are in there on average, plus a 90% quantile. Then I calculated N back from the number of unique IDs the server would see.

Code

using StatsBase
using Plots

uniquerands(i) = rand(UInt16, i) |> unique |> length

function collect_stats(rng)
    len = length(rng)
    means = zeros(len)
    low = zeros(len)
    up = zeros(len)
    Threads.@threads for i in 1:len
        ur = Int[]
        for _ in 1:500
            push!(ur, uniquerands(rng[i]))
        end
        means[i] = mean(ur)
        l, u = quantile(ur, (0.05, 0.95))
        low[i] = l
        up[i] = u
    end
    means, low, up
end

i = 300_000:10_000:600_000 |> collect
s = collect_stats(i)

plot(i, s[1], ribbon = (s[1] .- s[2], s[3] .- s[1]), label = "mean with 90% quantile")
hline!([typemax(UInt16)], label = "typemax(UInt16)")
scatter!([405_000, 495_000], [65400, 65500], xerr = [10_000, 20_000], label = "estimate for unique users")
plot!(legend = :bottomright, ylims = (65_000, 65600), xlabel = "unique users", ylabel = "unique IDs")
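The “calculate back” step can also be done without simulation by inverting the expected number of unique IDs. A hedged sketch: `estimate_users` and `N_IDS` are hypothetical names, not part of the code above, and this assumes uniformly random UInt16 IDs, i.e. 65536 possible values:

```julia
# Number of distinct UInt16 values.
const N_IDS = Int(typemax(UInt16)) + 1   # 65536

# Solve unique_ids = N_IDS * (1 - (1 - 1/N_IDS)^users) for `users`.
# This is a point estimate only; it says nothing about the error bars.
estimate_users(unique_ids) = log1p(-unique_ids / N_IDS) / log1p(-1 / N_IDS)

println(round(Int, estimate_users(65_400)))
println(round(Int, estimate_users(65_500)))
```

This gives roughly 405k and 492k users for 65400 and 65500 unique IDs, in line with the two points in the scatter plot.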

oheil · July 3, 2020, 7:48am

As long as the collected data is as minimal as it is and contains no personal data, I would say that the IP address together with the set of telemetry data is no problem, even if it stays connected to the telemetry data. In other scenarios the IP address would be the link between an individual and, e.g., anonymized personal data, and that would be a problem.

But despite that, I think the IP address should be mentioned in the data document, as it is clearly part of the gathered data even if it is not stored together with the telemetry data.

stev47 · July 3, 2020, 8:54am

Reliable reidentification has been performed with less unique data than IP address and telemetry. The key is that linking data together gives you much more information than you can imagine. Just link the data about fetched packages together with time and public git commits to Project.toml files on github and maybe the user’s public bug reports where they posted their versioninfo() and I’m sure you can infer the person whom a UUID belongs to in quite a few cases.

This discussion is less about whether people at Julia are to be trusted or not, but about whether collecting not strictly necessary data that has the potential to be misused is a good idea or not.

Per · July 3, 2020, 9:02am

You use a non-uniform distribution for the random IDs. For example, the HyperLogLog distribution, as pointed out by @chrisvwx above.

(This is not perfect. Some users will get an ID that is so rare that it effectively identifies that user uniquely, and if those users are more likely to opt out (or get a new ID) than the average user, then that will skew the statistics.)
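To make the non-uniform idea concrete, here is a rough sketch. It is not HyperLogLog itself (which hashes into many buckets and combines them with a harmonic mean), only the single-register idea underneath it: each client draws a “level” k with probability 2^-(k+1), and the highest level the server ever sees grows like log2 of the number of clients. The function names and the use of `leading_zeros` on a random UInt64 are assumptions for illustration:

```julia
using Random

# Geometrically distributed ID: level k occurs with probability 2^-(k+1).
random_level(rng) = leading_zeros(rand(rng, UInt64))

# Crude Flajolet-Martin-style estimate from the highest level observed.
# Very noisy: it can easily be off by a factor of two or more, which is
# why real HyperLogLog combines many such registers.
crude_estimate(levels) = 2.0^maximum(levels)

rng = MersenneTwister(42)
levels = [random_level(rng) for _ in 1:100_000]

# Count how many clients share each level.
counts = Dict{Int,Int}()
for l in levels
    counts[l] = get(counts, l, 0) + 1
end

println("highest level: ", maximum(levels), "  crude estimate: ", crude_estimate(levels))
println("levels held by exactly one client: ", count(==(1), values(counts)))
```

The last line is the caveat in action: any client that is alone on its level carries an ID that is effectively unique to them, and this tends to happen at the highest levels.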

oheil · July 3, 2020, 9:40am

I am with you. I don’t think that the IP address should be collected. But it is inevitably collected at first as part of any protocol; that is a fact. Therefore it should be mentioned in the data document. And it should be discarded, and how this is done should also be part of the description.

This is easy to do which makes it even more important.

But what I also think: from a privacy viewpoint it is not such a big deal in this case, as there is actually no personal data. An IP address can be personal data, but it isn’t by definition.

Just link the data about fetched packages together with time and public git commits to Project.toml files on github

And this is not part of the telemetry we are talking about. It can be done, sure. But by that argument we would all have to go offline right now!

Tamas_Papp · July 3, 2020, 10:12am

So, if I understand correctly, the worst case scenario is that people who have access to the disaggregated data can learn which packages from the general registry I have installed.

Sorry to be dense, but I still don’t understand why this is a big deal. Yes, I can imagine misuse scenarios (e.g. math book publishers getting this data for targeted advertising based on my preferences for various numerical methods), but they all seem pretty far-fetched given the payoffs. Technically, when de-anonymized this would constitute personal data, but only in the sense of some people worrying about citric acid as a food additive.

stev47 · July 3, 2020, 10:57am

Not saying these are relevant to most people, but a few things I have heard to be possible:

IP and timings give away your location, maybe down to your employer and people close to you (when linking)
timings give away your usual working hours and might disclose when you are on vacation (maybe letting someone infer when your house is vacant)
Version/system information together with package versions gives away information about potential vulnerabilities that an attacker can use (there are databases linking software versions with corresponding exploits)