Split-Apply-Combine in arrays

Hello (first post on this forum)! I am new to Julia, transitioning from R. I am trying to implement a split-apply-combine strategy on a matrix without converting it to a DataFrame. The reason: I need to do this operation at every iteration of a Monte Carlo simulation, and I need to switch between arrays and their sparse-matrix representations, so I want to minimize type conversions (not sure if this makes sense).

The split-apply-combine works as follows: take a three-column matrix, split it by the unique combinations of values in the first two columns (which together define the groups), then append a fourth column that equals one if the value in the third column is the maximum of its group. Here is the comprehension I am working on; the array A has approximately the same dimensions as my data.

julia> using SplitApplyCombine, Random

julia> a, b = [150, 1000] # These can be made smaller for testing purposes

julia> A = hcat(repeat(1:a, outer = b), repeat(1:a, inner = b), randn(a*b))
150000×3 Matrix{Float64}:
   1.0    1.0  -0.837408
   2.0    1.0   0.793393
   3.0    1.0   0.297672
   4.0    1.0  -0.142228
   5.0    1.0   0.606502
   6.0    1.0   0.573417
    ⋮      ⋮        ⋮
 145.0  150.0  -0.474295
 146.0  150.0   0.816623
 147.0  150.0  -2.17863
 148.0  150.0   0.455542
 149.0  150.0  -0.868958
 150.0  150.0   0.519702

julia> idx = [argmax(A[findall(A[:, 1] .== b[1] .&& A[:, 2] .== b[2]), 3]) for b ∈ unique(splitdims(A[:, [1, 2]], 1))]

The idea is then to use the argmax indices to construct the binary column that I want. However, the last line takes about a minute on my laptop, which I take as a sign that I am doing something wrong: Julia is not supposed to be this slow for a loop over relatively small dimensions. My syntax also seems overly intricate, another sign that something is off.
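To make the target concrete, here is what I am after on a tiny toy input (toy data of my own, not my real dimensions), computed naively:

```julia
# Toy 3-column matrix: columns 1-2 define the group, column 3 holds values.
M = [1.0 1.0 0.5;
     1.0 1.0 0.9;   # maximum of group (1, 1)
     2.0 1.0 0.3]   # sole member (and maximum) of group (2, 1)

# Naive O(N*M) check, just to show the target output:
flags = [M[i, 3] == maximum(M[j, 3] for j in axes(M, 1)
                            if M[j, 1] == M[i, 1] && M[j, 2] == M[i, 2])
         for i in axes(M, 1)]

hcat(M, flags)   # 3×4 matrix; fourth column is [0.0, 1.0, 1.0]
```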

I looked for threads with similar issues but could not find a suitable strategy that does not resort to DataFrames. Thanks for helping!!!

You need to forget about syntax for a moment and think about complexity. Isn’t the algorithm you wrote O(NM) in cost, where N is the number of groups and M is the number of rows? You could do a lot better.

Have you considered writing a more primitive loop that makes fewer passes over the data? The following (which returns your 4th column; you can concatenate it with A via hcat if you want) takes about 0.01 seconds on my laptop with your A matrix above.

function maxgroups(A::AbstractMatrix{T}) where {T}
    size(A, 2) == 3 || error("a 3-column matrix is required")
    d = Dict{Tuple{T,T}, T}()
    for i in axes(A, 1)
        key = (A[i,1], A[i,2])
        d[key] = max(A[i,3], get(d, key, typemin(T)))
    end
    return [A[i,3] == d[A[i,1], A[i,2]] for i in axes(A, 1)]
end

Note that this first builds a dictionary mapping each group (keyed by the pair of values in the first two columns) to its maximum third-column value. (You might want to consider whether an Array is the best data structure, or whether you want to start from some other data structure entirely.)
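A self-contained toy run of the same two passes (toy data of mine), for anyone who wants to see it end to end:

```julia
# Pass 1: dictionary from group key (cols 1-2) to the group's running maximum.
A = [1.0 2.0 -0.4;
     1.0 2.0  1.3;
     3.0 1.0  0.7]

d = Dict{Tuple{Float64,Float64}, Float64}()
for i in axes(A, 1)
    key = (A[i, 1], A[i, 2])
    d[key] = max(A[i, 3], get(d, key, -Inf))
end

# Pass 2: flag rows whose value equals their group's maximum.
flags = [A[i, 3] == d[(A[i, 1], A[i, 2])] for i in axes(A, 1)]
hcat(A, flags)   # appends the binary column: [0.0, 1.0, 1.0]
```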

PS. As usual: put performance-sensitive code inside a function.

Sure, it makes sense, and it is possible in Julia. Unlike R or Python, arbitrary structures and custom code can be fast; you don't need to squeeze everything into nd-arrays of numbers or flat tables.
Not only fast, but also pretty convenient and straightforward, once you get somewhat used to the paradigm (or, more like "getting un-used to the flat-tables-only paradigm" (: ).

Simplifying @stevengj's answer:

# max for each group:
julia> maxes = map(group(r -> (r[1], r[2]), eachrow(A))) do gr
           maximum(r -> r[3], gr)
       end

# add the required column with flags:
julia> B = hcat(A, map(r -> r[3] == maxes[(r[1], r[2])], eachrow(A)))

Here you can find a discussion of a similar problem.

If a group contains more than one maximum, this approach "exposes" all of them.
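For instance (a quick sketch with made-up numbers):

```julia
using SplitApplyCombine

A = [1.0 1.0 0.9;
     1.0 1.0 0.9;   # tie: two maxima in group (1, 1)
     1.0 1.0 0.2;
     2.0 2.0 0.3]

maxes = map(gr -> maximum(r -> r[3], gr), group(r -> (r[1], r[2]), eachrow(A)))
flags = map(r -> r[3] == maxes[(r[1], r[2])], eachrow(A))
# both tied rows of group (1, 1) get flagged, plus the singleton group:
# flags == [true, true, false, true]
```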

This could be a way to do what DataFrames does(!?) without using DataFrames:

idx = Int.(A[:, 1] .* 1000 .+ A[:, 2])  # unique integer id per group (works since col 2 values stay ≤ 1000)
r = [A[:, 3]; -Inf]                     # pad with -Inf so the sentinel index always compares low

function testy2(s, r)
    res = fill(150001, maximum(s))      # 150001 points at the -Inf pad: "no row seen yet"
    @inbounds for i in eachindex(s)
        res[s[i]] = r[res[s[i]]] < r[i] ? i : res[s[i]]
    end
    @view A[filter(!=(150001), res), :] # rows holding each group's maximum
end

testy2(idx, r)

Thanks everyone for responding! That was really helpful. I must say that Julia's community is really nice, and I feel welcome in it. A good feeling! I took two days off for family reasons, hence my late reply.

I find @stevengj’s approach in @aplavin’s version fast and natural to me, perhaps because it’s close to how I would do a ‘lapply’ in R, though I don’t quite understand how @aplavin’s code manages to build a dictionary in between. So many things to learn.

Thanks @rocco_sprmnt21 for all the hints, links and ideas, though your “DataFrames-like” code is difficult for me to understand! :sweat_smile: Also, you’re right to point out that one may want to avoid catching two maxima, though this is very unlikely to occur in my data.

Looking forward to future productive discussion! I hope to become a productive member of the community at some point.

Edit: I get the dictionary thing, it’s from the group function. Still many things to learn…

OK, one last clarification question. I tried to adapt the original suggestions to make it all one-line (not that it matters for speed I know, but it helps me keep my code organized). Here is how I adapted my MWE, please see my comments before each line.

# Code from my OP
using SplitApplyCombine, Random
a, b = [150, 1000] # These can be made smaller for testing purposes
A = hcat(repeat(1:a, outer=b), repeat(1:a, inner=b), randn(a * b))

# @aplavin's code, first part: ca. 0.29 seconds
@time maxes = map(group(r -> (r[1], r[2]), eachrow(A))) do gr
    maximum(r -> r[3], gr)
end

# @aplavin's code above, in one line: ca. 0.22 seconds
@time a = map(b -> maximum(s -> s[3], b), group(r -> (r[1], r[2]), eachrow(A)))

# Check that I did it right, it returns true
maxes == a

# @aplavin's code, second part: ca. 0.07 seconds
@time B = hcat(A, map(r -> r[3] == a[(r[1], r[2])], eachrow(A)))

# This should be the same as the above, but it never ends, I stop it after a few minutes
@time C = hcat(A, map(r -> r[3] == map(b -> maximum(s -> s[3], b), group(c -> (c[1], c[2]), eachrow(A)))[(r[1], r[2])], eachrow(A)))

The only thing I did in the last line was to replace a with its full definition, thinking that this is how function composition works. Why doesn't it? This is probably a gap in my understanding of how the language evaluates code, stores things in memory, runs loops, etc.; it is not how I expect, say, R to work. Sorry for the comparison with another language.

All languages I’m familiar with work in exactly the same way in this regard.

f() = (sleep(1); 1)

# takes 1 second:
x = f()
for i in 1:100
   println(x + 1)
end

# replace x with f(), and it takes 100 seconds:
for i in 1:100
   println(f() + 1)
end

That's basically what happens in your example: the whole a is recomputed once again for every matrix row.
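If you want to keep it a single expression anyway, one option (a sketch, reusing the names from above on toy data) is a let block that binds the group maxima once and reuses them for every row:

```julia
using SplitApplyCombine

A = [1.0 1.0 0.2;
     1.0 1.0 0.8;
     2.0 2.0 0.5;
     2.0 2.0 0.1]

# the grouping runs exactly once; `maxes` is then reused for every row
C = let maxes = map(b -> maximum(s -> s[3], b), group(r -> (r[1], r[2]), eachrow(A)))
    hcat(A, map(r -> r[3] == maxes[(r[1], r[2])], eachrow(A)))
end
# C[:, 4] == [0.0, 1.0, 1.0, 0.0]
```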


I SEE. Maybe this does not happen with R's apply-family functions because they are implemented internally so as to avoid this kind of repetition.

Thanks a lot, now I understand. Sorry for the silly question.