```
using Statistics, LoopVectorization
using BenchmarkTools  # for @btime

function case1(predictor, threshold)
    ind_leq = predictor .<= threshold
    ind_grt = .!ind_leq
    return ind_leq, ind_grt
end

function case2(predictor, threshold)
    ind_leq = BitVector(undef, length(predictor))
    ind_grt = similar(ind_leq)
    @inbounds for i in eachindex(predictor)
        ind_leq[i] = predictor[i] <= threshold
        ind_grt[i] = !ind_leq[i]
    end
    return ind_leq, ind_grt
end

function case3(predictor, threshold)
    ind_leq = BitVector(undef, length(predictor))
    ind_grt = similar(ind_leq)
    @turbo for i in eachindex(predictor)
        ind_leq[i] = predictor[i] <= threshold
        ind_grt[i] = !ind_leq[i]
    end
    return ind_leq, ind_grt
end
# Generate data
n, m = 10^3, 5
features = randn(n, m)
weights = rand(-2:2, m)
labels = features * weights;
# Transpose data
targets = permutedims(labels);
predictor = view(features, :, 1);
threshold = quantile(predictor, 0.1);
case1(predictor, threshold) == case2(predictor, threshold) == case3(predictor, threshold)
@btime case1($predictor, $threshold);
@btime case2($predictor, $threshold);
@btime case3($predictor, $threshold);
```

I get

```
julia> case1(predictor, threshold) == case2(predictor, threshold) == case3(predictor, threshold)
true
julia> @btime case1($predictor, $threshold);
492.605 ns (5 allocations: 4.62 KiB)
julia> @btime case2($predictor, $threshold);
1.849 μs (4 allocations: 448 bytes)
julia> @btime case3($predictor, $threshold);
145.379 ns (4 allocations: 448 bytes)
```

`@turbo` batches the operations.

This is actually tricky to do at a low level, meaning it will take some time before the rewrite supports it. It is easy to batch at a high level, though.

At a high level, we're just indexing an array. At a low level, we're doing a ton of bit manipulations and then storing into the same memory address across many loop iterations, because a `BitVector` packs 64 elements into each `UInt64`.
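To make the high-level route concrete, here is a sketch (the `case_broadcast!` name and buffer-reuse setup are mine, not from the original code) that broadcasts into preallocated `BitVector`s, letting Base's `BitArray` broadcast machinery do the batching chunk-wise:

```julia
using Statistics

# Sketch: broadcast into preallocated BitVectors so Base fills them
# chunk-wise rather than one bit per iteration. Reusing the buffers
# also avoids fresh allocations on each call.
function case_broadcast!(ind_leq::BitVector, ind_grt::BitVector, predictor, threshold)
    ind_leq .= predictor .<= threshold  # fused, chunk-wise pass
    ind_grt .= .!ind_leq                # flips whole UInt64 chunks at a time
    return ind_leq, ind_grt
end

predictor = randn(100)
threshold = quantile(predictor, 0.1)
ind_leq = BitVector(undef, length(predictor))
ind_grt = similar(ind_leq)
case_broadcast!(ind_leq, ind_grt, predictor, threshold)
```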

You'd have to

- look at the individual bit assignments and realize the indices are incrementing, so separate iterations set separate bits, and
- recognize that you can batch these stores and, in so doing, actually delete all those bit manipulations.
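A hand-written version of that batching might look like the following sketch. It reaches into the `chunks` field of `BitArray` (an internal implementation detail, not public API, used here purely for illustration) to pack 64 comparison results into each `UInt64`, so the per-bit masking disappears from the inner loop:

```julia
# Sketch: manually batch 64 comparisons into one UInt64 store.
# Relies on the (internal) fact that BitArray stores bit i of a chunk
# at shift (i - 1), least-significant bit first.
function case_chunked(predictor, threshold)
    n = length(predictor)
    ind_leq = BitVector(undef, n)
    chunks = ind_leq.chunks
    for c in eachindex(chunks)
        word = zero(UInt64)
        base = (c - 1) * 64
        stop = min(64, n - base)          # partial final chunk
        for b in 1:stop
            word |= UInt64(predictor[base + b] <= threshold) << (b - 1)
        end
        chunks[c] = word                  # one store per 64 elements
    end
    ind_grt = .!ind_leq
    return ind_leq, ind_grt
end

predictor = randn(130)  # not a multiple of 64, to exercise the last chunk
leq, grt = case_chunked(predictor, 0.0)
```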

Eventually, I plan on having the rewrite handle this as well while working at a low level, but it still doesn't handle anything yet, so that's a ways away.

For now, this means either using something that works at a relatively high level – that is, `LoopVectorization.@turbo` or broadcasting – or using a `Vector{Bool}` instead.

```
julia> function case2bool(predictor, threshold)
           ind_leq = Vector{Bool}(undef, length(predictor))
           ind_grt = similar(ind_leq)
           @inbounds for i in eachindex(predictor)
               ind_leq[i] = predictor[i] <= threshold
               ind_grt[i] = !ind_leq[i]
           end
           return ind_leq, ind_grt
       end
case2bool (generic function with 1 method)

julia> @btime case2bool($predictor, $threshold);
  316.359 ns (2 allocations: 2.12 KiB)
```

`@turbo` would probably be the fastest (like it is here), but it doesn't work on all loops, and it is also a weighty dependency that introduces latency.
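If the two allocations in `case2bool` matter, a dependency-free refinement (the `case2bool!` name is my own) is an in-place variant that reuses caller-provided `Vector{Bool}` buffers, so repeated calls allocate nothing:

```julia
# Sketch: in-place case2bool that writes into preallocated buffers.
# eachindex over all three arrays throws if their axes disagree,
# which keeps the @inbounds accesses safe.
function case2bool!(ind_leq::Vector{Bool}, ind_grt::Vector{Bool}, predictor, threshold)
    @inbounds for i in eachindex(predictor, ind_leq, ind_grt)
        leq = predictor[i] <= threshold
        ind_leq[i] = leq
        ind_grt[i] = !leq
    end
    return ind_leq, ind_grt
end

predictor = randn(1000)
ind_leq = Vector{Bool}(undef, length(predictor))
ind_grt = similar(ind_leq)
case2bool!(ind_leq, ind_grt, predictor, 0.0)
```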