I want to get new Array of Tuples (x,y) where x is unique, and y = mean(z), where z is all elements from previous table with corresponding x:
[
(1, 20);
(2, 25);
(3, 27)
]
I’m interested in beautiful functional style solution(not ugly nested loops, although i know, that it will be as efficient as one-line functional styled).
Array of tuples can be changed at 2d Array.
I managed to write only next solution, but it looks too long
# convert to 2d Array
B = zeros(length(A),2)
for x in LinearIndices(A)
B[x,:] = [A[x][1], A[x][2]]
end
# Create unique array composed of 1st column values
C = unique(B[:,1])
out = zeros(length(C),2)
for i in LinearIndices(B[:,1])
out[i,:] = [ C[i], Statistics.mean( findall( C[i] .== B[:,2]) ) ]
end
A = [
(1,10);
(1,20);
(1,30);
(2,15);
(2,25);
(2,35);
(3,17);
(3,27);
(3,37);
]
using DataFrames
using Statistics
df = DataFrame(Dict(:x => [a[1] for a in A], :y => [a[2] for a in A]))
by(df -> DataFrame(m = mean(df[:x])), df, :x)
Of course you can do the same thing manually using 2 dicts - for count and sum of elements, using loops or array/dict comprehension. E.g. using loops (which are clearer in this case IMO):
# make a dict with first element of each tuple as key and a list of corresponding second elements as value
d = Dict()
for (k, v) in A
if !haskey(d, k)
d[k] = []
end
push!(d[k], v)
end
# d now contains:
# Dict{Any,Any} with 3 entries:
# 2 => Any[15, 25, 35]
# 3 => Any[17, 27, 37]
# 1 => Any[10, 20, 30]
# calculate means
dm = Dict(k => Statistics.mean(vs) for (k, vs) in d)
# which gives:
# Dict{Int64,Float64} with 3 entries:
# 2 => 25.0
# 3 => 27.0
# 1 => 20.0
This is quite head-spinning, but I hate reading the docs of a zillion packages just for doing a small task.
julia> A = [
(1,10);
(1,20);
(1,30);
(2,15);
(2,25);
(2,35);
(3,17);
(3,27);
(3,37);
];
julia> B = unique( A[i][1] for i in 1:length(A) );
julia> C = [ mean( i[2] for i in filter(M -> M[1] == j, A) ) for j in B ];
julia> [(B[i],C[i]) for i in 1:length(B)]
3-element Array{Tuple{Int64,Float64},1}:
(1, 20.0)
(2, 25.0)
(3, 27.0)
Note that your solution is O(N) + O(K * N) + O(K) = O(K * N) where N is length of array A, K - number of unique keys (i.e. length of B). At the same time solution with dict is only O(2N) = O(N), i.e. you see each element only twice instead of K times. Even more efficient algorithm would be to calculate averages on the fly, which would allow to make only 1 pass through each element in A.
There is a beautiful integration between IndexedTables and OnlineStats to compute statistics “on the fly” (with O(1) memory and doing only one pass). Here for example it’d be:
using IndexedTables, OnlineStats
groupreduce(Mean(), table(A), 1, select = 2)
It’s very useful for distributed datasets and OnlineStats has some sophisticated statistics (not only Mean() and co. but also linear models for example).