Functional style table processing

Andrey.Borzunov · September 22, 2018, 11:52am

I have an Array of Tuples:

A = [
(1,10);
(1,20);
(1,30);
(2,15);
(2,25);
(2,35);
(3,17);
(3,27);
(3,37);
]

I want to get new Array of Tuples (x,y) where x is unique, and y = mean(z), where z is all elements from previous table with corresponding x:

[
(1, 20);
(2, 25);
(3, 27)
]

I’m interested in beautiful functional style solution(not ugly nested loops, although i know, that it will be as efficient as one-line functional styled).
Array of tuples can be changed at 2d Array.

Thx!

Andrey.Borzunov · September 22, 2018, 12:42pm

I managed to write only next solution, but it looks too long

# convert to 2d Array
B = zeros(length(A),2)
for x in LinearIndices(A)
    B[x,:] = [A[x][1], A[x][2]]
end

# Create unique array composed of 1st column values
C = unique(B[:,1])
out = zeros(length(C),2)

for i in LinearIndices(B[:,1])
    out[i,:] = [ C[i], Statistics.mean( findall( C[i] .== B[:,2]) ) ]
end

dfdx · September 22, 2018, 12:57pm

Sounds like group by operation:

A = [
(1,10);
(1,20);
(1,30);
(2,15);
(2,25);
(2,35);
(3,17);
(3,27);
(3,37);
]

using DataFrames
using Statistics

df = DataFrame(Dict(:x => [a[1] for a in A], :y => [a[2] for a in A]))
by(df -> DataFrame(m = mean(df[:x])), df, :x)

Of course you can do the same thing manually using 2 dicts - for count and sum of elements, using loops or array/dict comprehension. E.g. using loops (which are clearer in this case IMO):

# make a dict with first element of each tuple as key and a list of corresponding second elements as value
d = Dict()
for (k, v) in A
    if !haskey(d, k)
        d[k] = []
    end
    push!(d[k], v)
end

# d now contains: 
# Dict{Any,Any} with 3 entries:
#  2 => Any[15, 25, 35]
#  3 => Any[17, 27, 37]
#  1 => Any[10, 20, 30]

# calculate means
dm = Dict(k => Statistics.mean(vs) for (k, vs) in d)

# which gives:
# Dict{Int64,Float64} with 3 entries:
#  2 => 25.0
#  3 => 27.0
# 1 => 20.0

piever · September 22, 2018, 2:25pm

In Julia 0.6 with IndexedTables:

using IndexedTables
groupby(mean, table(A), 1, select = 2)

Unfortunately is not ported to 1.0 yet, so this only works if you’re still on Julia 0.6

Seif_Shebl · September 22, 2018, 6:27pm

This is quite head-spinning, but I hate reading the docs of a zillion packages just for doing a small task.

julia> A = [
       (1,10);
       (1,20);
       (1,30);
       (2,15);
       (2,25);
       (2,35);
       (3,17);
       (3,27);
       (3,37);
       ];

julia> B = unique( A[i][1] for i in 1:length(A) );

julia> C = [ mean( i[2] for i in filter(M -> M[1] == j, A) ) for j in B ];

julia> [(B[i],C[i]) for i in 1:length(B)]
3-element Array{Tuple{Int64,Float64},1}:
 (1, 20.0)
 (2, 25.0)
 (3, 27.0)

Andrey.Borzunov · September 22, 2018, 6:41pm

Wow, so far, this is the closest solution.

dfdx · September 22, 2018, 7:36pm

Note that your solution is O(N) + O(K * N) + O(K) = O(K * N) where N is length of array A, K - number of unique keys (i.e. length of B). At the same time solution with dict is only O(2N) = O(N), i.e. you see each element only twice instead of K times. Even more efficient algorithm would be to calculate averages on the fly, which would allow to make only 1 pass through each element in A.

piever · September 22, 2018, 8:03pm

There is a beautiful integration between IndexedTables and OnlineStats to compute statistics “on the fly” (with O(1) memory and doing only one pass). Here for example it’d be:

using IndexedTables, OnlineStats
groupreduce(Mean(), table(A), 1, select = 2)

It’s very useful for distributed datasets and OnlineStats has some sophisticated statistics (not only Mean() and co. but also linear models for example).

Topic		Replies	Views
What's julia's solution to tapply or accumarray? General Usage question , dataframes , arrays , splitapplycombine	15	2074	January 3, 2022
How to calcule the mean of values considering their tuples of another value: General Usage dataframes	6	254	November 14, 2022
Groupby() on two individual arrays General Usage question , array	1	318	April 20, 2021
How Transducers.jl process table? General Usage question , transducers	7	546	May 11, 2022
Various by-group strategies compared Data	36	3935	January 30, 2018

Functional style table processing

Related topics