Selecting entries of an array based on selection criteria provided by other array

Consider this example, which may run in a loop where selection is updated.

a = [1.0,1.2,0.9, 3.4, 4.5,4.5, 9.7, -1.5,-2.1]
b = [1,1,1,2,3,3,4,5,5]   # e.g. a grouping of a
selection = [1,3,5]    # these groups

# What I may naively want to do, and then e.g. median over it
julia> a[b.==selection]  
ERROR: DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 9 and 3

# My probably naive solution:
vcat([a[b.==s] for s in selection]...)

It must be a recipe for an excess of allocations. Is there a better way?

The most compact way to do this is probably

a[b .∈ (selection,)] # or a[b .∈ Ref(selection)]

but you have to use (type \in[tab] in REPL to get this symbol) in order to broadcast. The (,)or Ref means broadcast will treat selection as a scalar for broadcast.

This is not ideal for allocation/speed, in part because it creates an intermediate bit array. The fastest/least allocating solution I could come up with is not particularly pretty:

function myfun(a, b, selection)
    
    l = 0
    for bi ∈ b 
       if bi ∈ selection; l += 1; end
    end
    out = zeros(eltype(a), l)
    
    k = 1
    for i ∈ eachindex(a, b)
        if b[i] ∈ selection
            out[k] = a[i]
            k += 1
        end
    end
    
    return out
end
3 Likes

You could also do:

[a[i] for i in eachindex(a,b) if b[i] in selection]

(If efficiency matters, and you can rely on b and selection being sorted as in your example, then there are faster algorithms where you don’t have to repeat the in search for every element.)

4 Likes

I hope this isn’t derailing the thread too much, but I tried this and surprisingly it’s about half as fast as a[b .∈ (selection,)]on my machine for this specific example. My best guess is that it’s because the length of the array is not known in advance without first computing the intermediate bit array, but if anyone knows for sure I’d love to find out.

Thanks, very useful. I will try it in my code. I like the one-liner, which can be memorized easily, but also the looping function could see good use, as I will probably be throwing this into functions regularly.

On my laptop, when placed in a function, the above code performs like:

vcat([a[b.==s] for s in selection]...)
281.848 ns (11 allocations: 720 bytes)
3.334 ms (62 allocations: 1.71 MiB) on actual data (see below)
a[b .∈ Ref(selection)]
77.137 ns (3 allocations: 208 bytes)
4.472 ms (5 allocations: 149.33 KiB)  on actual data
loop function
60.285 ns (1 allocation: 112 bytes)
10.826 ms (1 allocation: 4.25 KiB) on actual data
[a[i] for i in eachindex(a,b) if b[i] in selection]
135.664 ns (3 allocations: 208 bytes)
4.537 ms (7 allocations: 11.12 KiB) on actual data

I also tried this shorter version of Jonas’ loop function:

function test5(a, b, selection)
   out = zeros(eltype(a), length(b))
   k = 1
   for i ∈ eachindex(a, b)
       if b[i] ∈ selection
           out[k] = a[i]
           k += 1
       end
   end
   filter!(x->x!=0.0,out)
   return out
end

56.548 ns (1 allocation: 128 bytes)
5.489 ms (3 allocations: 4.40 MiB)  on actual data 
(allocates unnecessarily for entire vector, which is why Jonas' first loop is needed, but it is faster)

On the other hand, my vectors may not be sorted at all (sorted by time, while vector a could be latitude, not lined up with the order of b but the same length). I still have to figure out eachindex(a,b) and its uses.

Update: I have now listed also the performance when applied to one iteration of my actual data, where a and b are of length 1153405, and selection of length 12.

See my updated numbers above! For large data arrays the relative results look quite different.

The two single line functions perform similarly, the Jonas loop takes twice as long, but allocates close to nothing, while my original list comprehension is surprisingly fast, allocating a lot, but still less than the single loop (the Jonas loop function I tried to simplify).

It seems easy to make loops, in hope of speed and allocation improvements, that instead allocate way more than needed, here 4 MB vs 4 kB! Good to consider including a pre-allocation loop from now on.

Those who are unicode-adverse can broadcast with

a[in.(b, (selection,))]

This doesn’t work correctly if you have zeros in a (which may or may not be a problem for you) but since you know that you only want to discard data at the end, you can do it more efficiently with

resize!(out, k - 1)

The density of the selection definitely matters for the relative efficiency of these approaches.
Edit: As does the size of selection. If it is big you certainly don’t want to check inclusion more than once per b element.

1 Like

Here’s another way to spell the broadcasting:

a[in(selection).(b)]

If selection is large and the set of all classes is a range from 1, it may be interesting to try some variation of

binary_selection = in(selection).(1:maximum(b))
a[binary_selection[b]]

Or in(b).(selection)

That gives a quite different result though, of little use for this problem.

1 Like

Ah sorry ignore, I just wanted to point to the Fix1 version of in, hadn’t seen that you already noted that in your later post (quite apart from the fact that I confused the argument order…)