Functional style table processing

I have an Array of Tuples:

A = [
(1,10);
(1,20);
(1,30);
(2,15);
(2,25);
(2,35);
(3,17);
(3,27);
(3,37);
]

I want to get new Array of Tuples (x,y) where x is unique, and y = mean(z), where z is all elements from previous table with corresponding x:

[
(1, 20);
(2, 25);
(3, 27)
]

I’m interested in beautiful functional style solution(not ugly nested loops, although i know, that it will be as efficient as one-line functional styled).
Array of tuples can be changed at 2d Array.

Thx!

1 Like

I managed to write only next solution, but it looks too long

# convert to 2d Array
B = zeros(length(A),2)
for x in LinearIndices(A)
    B[x,:] = [A[x][1], A[x][2]]
end

# Create unique array composed of 1st column values
C = unique(B[:,1])
out = zeros(length(C),2)

for i in LinearIndices(B[:,1])
    out[i,:] = [ C[i], Statistics.mean( findall( C[i] .== B[:,2]) ) ]
end

Sounds like group by operation:

A = [
(1,10);
(1,20);
(1,30);
(2,15);
(2,25);
(2,35);
(3,17);
(3,27);
(3,37);
]

using DataFrames
using Statistics

df = DataFrame(Dict(:x => [a[1] for a in A], :y => [a[2] for a in A]))
by(df -> DataFrame(m = mean(df[:x])), df, :x)

Of course you can do the same thing manually using 2 dicts - for count and sum of elements, using loops or array/dict comprehension. E.g. using loops (which are clearer in this case IMO):

# make a dict with first element of each tuple as key and a list of corresponding second elements as value
d = Dict()
for (k, v) in A
    if !haskey(d, k)
        d[k] = []
    end
    push!(d[k], v)
end

# d now contains: 
# Dict{Any,Any} with 3 entries:
#  2 => Any[15, 25, 35]
#  3 => Any[17, 27, 37]
#  1 => Any[10, 20, 30]

# calculate means
dm = Dict(k => Statistics.mean(vs) for (k, vs) in d)

# which gives:
# Dict{Int64,Float64} with 3 entries:
#  2 => 25.0
#  3 => 27.0
# 1 => 20.0
4 Likes

In Julia 0.6 with IndexedTables:

using IndexedTables
groupby(mean, table(A), 1, select = 2)

Unfortunately is not ported to 1.0 yet, so this only works if you’re still on Julia 0.6

2 Likes

This is quite head-spinning, but I hate reading the docs of a zillion packages just for doing a small task.

julia> A = [
       (1,10);
       (1,20);
       (1,30);
       (2,15);
       (2,25);
       (2,35);
       (3,17);
       (3,27);
       (3,37);
       ];

julia> B = unique( A[i][1] for i in 1:length(A) );

julia> C = [ mean( i[2] for i in filter(M -> M[1] == j, A) ) for j in B ];

julia> [(B[i],C[i]) for i in 1:length(B)]
3-element Array{Tuple{Int64,Float64},1}:
 (1, 20.0)
 (2, 25.0)
 (3, 27.0)

Wow, so far, this is the closest solution.

Note that your solution is O(N) + O(K * N) + O(K) = O(K * N) where N is length of array A, K - number of unique keys (i.e. length of B). At the same time solution with dict is only O(2N) = O(N), i.e. you see each element only twice instead of K times. Even more efficient algorithm would be to calculate averages on the fly, which would allow to make only 1 pass through each element in A.

2 Likes

There is a beautiful integration between IndexedTables and OnlineStats to compute statistics “on the fly” (with O(1) memory and doing only one pass). Here for example it’d be:

using IndexedTables, OnlineStats
groupreduce(Mean(), table(A), 1, select = 2)

It’s very useful for distributed datasets and OnlineStats has some sophisticated statistics (not only Mean() and co. but also linear models for example).

3 Likes