Selecting entries of an array based on selection criteria provided by other array

oscarvdvelde · August 26, 2024, 4:28pm

Consider this example, which may run in a loop where selection is updated.

a = [1.0,1.2,0.9, 3.4, 4.5,4.5, 9.7, -1.5,-2.1]
b = [1,1,1,2,3,3,4,5,5]   # e.g. a grouping of a
selection = [1,3,5]    # these groups

# What I may naively want to do, and then e.g. median over it
julia> a[b.==selection]  
ERROR: DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 9 and 3

# My probably naive solution:
vcat([a[b.==s] for s in selection]...)

It must be a recipe for an excess of allocations. Is there a better way?

JonasWickman · August 26, 2024, 5:30pm

The most compact way to do this is probably

a[b .∈ (selection,)] # or a[b .∈ Ref(selection)]

but you have to use ∈ (type \in[tab] in REPL to get this symbol) in order to broadcast. The (,)or Ref means broadcast will treat selection as a scalar for broadcast.

This is not ideal for allocation/speed, in part because it creates an intermediate bit array. The fastest/least allocating solution I could come up with is not particularly pretty:

function myfun(a, b, selection)
    
    l = 0
    for bi ∈ b 
       if bi ∈ selection; l += 1; end
    end
    out = zeros(eltype(a), l)
    
    k = 1
    for i ∈ eachindex(a, b)
        if b[i] ∈ selection
            out[k] = a[i]
            k += 1
        end
    end
    
    return out
end

stevengj · August 26, 2024, 5:55pm

You could also do:

[a[i] for i in eachindex(a,b) if b[i] in selection]

(If efficiency matters, and you can rely on b and selection being sorted as in your example, then there are faster algorithms where you don’t have to repeat the in search for every element.)

JonasWickman · August 26, 2024, 6:07pm

I hope this isn’t derailing the thread too much, but I tried this and surprisingly it’s about half as fast as a[b .∈ (selection,)]on my machine for this specific example. My best guess is that it’s because the length of the array is not known in advance without first computing the intermediate bit array, but if anyone knows for sure I’d love to find out.

oscarvdvelde · August 26, 2024, 9:07pm

Thanks, very useful. I will try it in my code. I like the one-liner, which can be memorized easily, but also the looping function could see good use, as I will probably be throwing this into functions regularly.

On my laptop, when placed in a function, the above code performs like:

vcat([a[b.==s] for s in selection]...)
281.848 ns (11 allocations: 720 bytes)
3.334 ms (62 allocations: 1.71 MiB) on actual data (see below)

a[b .∈ Ref(selection)]
77.137 ns (3 allocations: 208 bytes)
4.472 ms (5 allocations: 149.33 KiB)  on actual data

loop function
60.285 ns (1 allocation: 112 bytes)
10.826 ms (1 allocation: 4.25 KiB) on actual data

[a[i] for i in eachindex(a,b) if b[i] in selection]
135.664 ns (3 allocations: 208 bytes)
4.537 ms (7 allocations: 11.12 KiB) on actual data

I also tried this shorter version of Jonas’ loop function:

function test5(a, b, selection)
   out = zeros(eltype(a), length(b))
   k = 1
   for i ∈ eachindex(a, b)
       if b[i] ∈ selection
           out[k] = a[i]
           k += 1
       end
   end
   filter!(x->x!=0.0,out)
   return out
end

56.548 ns (1 allocation: 128 bytes)
5.489 ms (3 allocations: 4.40 MiB)  on actual data 
(allocates unnecessarily for entire vector, which is why Jonas' first loop is needed, but it is faster)

On the other hand, my vectors may not be sorted at all (sorted by time, while vector a could be latitude, not lined up with the order of b but the same length). I still have to figure out eachindex(a,b) and its uses.

Update: I have now listed also the performance when applied to one iteration of my actual data, where a and b are of length 1153405, and selection of length 12.

oscarvdvelde · August 26, 2024, 11:35pm

See my updated numbers above! For large data arrays the relative results look quite different.

The two single line functions perform similarly, the Jonas loop takes twice as long, but allocates close to nothing, while my original list comprehension is surprisingly fast, allocating a lot, but still less than the single loop (the Jonas loop function I tried to simplify).

It seems easy to make loops, in hope of speed and allocation improvements, that instead allocate way more than needed, here 4 MB vs 4 kB! Good to consider including a pre-allocation loop from now on.

GunnarFarneback · August 27, 2024, 7:16am

JonasWickman:

The most compact way to do this is probably
a[b .∈ (selection,)] # or a[b .∈ Ref(selection)]
but you have to use ∈ (type \in[tab] in REPL to get this symbol) in order to broadcast.

Those who are unicode-adverse can broadcast with

a[in.(b, (selection,))]

This doesn’t work correctly if you have zeros in a (which may or may not be a problem for you) but since you know that you only want to discard data at the end, you can do it more efficiently with

resize!(out, k - 1)

The density of the selection definitely matters for the relative efficiency of these approaches.
Edit: As does the size of selection. If it is big you certainly don’t want to check inclusion more than once per b element.

GunnarFarneback · August 27, 2024, 1:24pm

Here’s another way to spell the broadcasting:

a[in(selection).(b)]

If selection is large and the set of all classes is a range from 1, it may be interesting to try some variation of

binary_selection = in(selection).(1:maximum(b))
a[binary_selection[b]]

nilshg · August 28, 2024, 4:27pm

Or in(b).(selection)

GunnarFarneback · August 28, 2024, 7:38pm

That gives a quite different result though, of little use for this problem.

nilshg · August 28, 2024, 8:35pm

Ah sorry ignore, I just wanted to point to the Fix1 version of in, hadn’t seen that you already noted that in your later post (quite apart from the fact that I confused the argument order…)

Topic		Replies	Views
Array comparison New to Julia	4	3044	November 8, 2020
Create a boolean mask between arrays of different size General Usage arrays , boolean	10	120	October 30, 2024
Understanding assignment without allocation New to Julia question	14	1696	November 12, 2019
Performance issue when searching for an occurrence in larger matrix Performance	3	346	July 18, 2020
Filter DataFrame by an Array New to Julia	8	5913	December 10, 2019

Selecting entries of an array based on selection criteria provided by other array

Related topics