# Efficient (non-allocating, in-place) getindex! for BitArray?

I have an indexing operation of the form `y .= x[mask]` where `mask` is a `BitArray`. To my surprise, this is an allocating operation even though it could be done in place in compiled code. So I wrote a loop that is non-allocating, but it seems to be slower! I am a bit surprised there is no general `getindex!`, but lacking that I would be grateful for ideas of how to get fast non-allocating `getindex` at least for bit arrays. Here is my benchmark code:

``````using BenchmarkTools
using Random: seed!
seed!(0)

@inline function getindex1!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
@inbounds y .= x[mask] # the obvious way, but it allocates!
end

@inline function getindex2!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
count = 0
count += 1
y[count] = x[i] # non-allocating, but slower :(
end
end
end

N = (2^16,)
mask = rand(N...) .< 0.5
@show K = sum(mask)
T = ComplexF32
y = Array{T}(undef, K)
x = randn(T, N...)

@btime getindex1!(\$y, \$x, \$mask) #  35.9 Ī¼s (3 allocations: 256.36 KiB)
@btime getindex2!(\$y, \$x, \$mask) # 217.1 Ī¼s (0 allocations: 0 bytes)
``````

Itās faster if you omit the `if`:

``````function get4!(y, x, mask)
count = 1
@inbounds for i in eachindex(mask)
y[count] = x[i]
end
y
end

get4!(similar(y), x, mask) ā getindex1!(similar(y), x, mask)
``````

Edit ā this isnāt actually safe, if `last(mask) != 1` then it will try to write beyond the end of `y`. Hereās a better version, almost as quick:

``````function get7!(y, x, mask)
sum(mask) == length(y) || error("wrong length")
count = 1
@inbounds for i in 1:findlast(identity, mask)
y[count] = x[i]
end
y
end
``````
1 Like

Just to give people the times (on my machine, `Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz`):

``````julia> @btime getindex1!(\$y, \$x, \$mask)
41.783 Ī¼s (3 allocations: 256.36 KiB)

julia> @btime getindex2!(\$y, \$x, \$mask)
257.931 Ī¼s (0 allocations: 0 bytes)

julia> @btime getindex3!(\$y, \$x, \$mask) # mcabbott version
48.770 Ī¼s (0 allocations: 0 bytes)
``````
1 Like

Thanks for the fast reply and the fast code!
On my machine there is about 10% extra compute time for the safe nonallocating version `get7!` above, compared to my original allocating version `getindex1!`.
But when I increase problem size to `N=2^20` then your `get7!` is 15% faster than `getindex1`, presumably because it avoids the overhead of allocating a 8MB array. Yay!
The `sum(mask)` check adds only 1% overhead so it seems worth it.
I will be adding this to MIRT.jl but I wonder if it belongs elsewhereā¦

Edit: in case anyone wants a tested version of this later, it is now here:

A slight modification on @mcabbottās answer (2.5X faster on my machine than original `getindex1!`):

``````function getindex4!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
count = 0
y[count] = x[i]
end
end

N = (2^20,)
mask = rand(N...) .< 0.5
@show K = sum(mask)
T = ComplexF32
y = Array{T}(undef, K)
x = randn(T, N...)

@btime getindex1!(\$y, \$x, \$mask) # 2.117 ms (2 allocations: 4.00 MiB)
@btime getindex4!(\$y, \$x, \$mask) # 868.900 Ī¼s (0 allocations: 0 bytes)
``````

Unfortunately it does not quite return the correct answer:
`getindex4!(zeros(2), [1,-2.,3], [true,false,true])`
returns `[-2,3]` instead of `[1,3]`.
It would be nice to use the (presumably faster) `findfirst` but I donāt see how.

The code of @Seif_Shebl is wrong. Starting from zero and incrementing `count` before the attribution gives wrong results because every time a false is found it is written in the same position the last saved value was already stored. The algorithm depends on `count` being one index ahead when false values are found. This is also what made @mcabbott amend their code.

I tried a final version combining the other tries, but the results were not better (for the original `2^16` arrays).

``````julia> function getindex8!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
@assert sum(mask) <= length(y) && length(x) >= length(mask)
# optionally, instead of this assert, just remove the @inbounds below
count = 1
y[count] = x[i]
end
y
end

julia> @btime getindex8!(\$y, \$x, \$mask)
48.913 Ī¼s (0 allocations: 0 bytes)
``````

Thanks for the input. I actually needed this to work for general arrays, not just vectors, so here is my current version that requires that x and mask have the same size. I also need `AbstractVector` because often `y` is a `view` in my work.
The feedback on this forum was enormously helpful. Yay Julia community!

``````@inline function getindex!(y::AbstractVector, x::AbstractArray{T,D},
) where {T,D}
sum(mask) == length(y) || throw("wrong length")
count = 1
y[count] = x[i]
end
return y
end
``````

Edit: I just realized that the types enforce same dimensions but not same size so I do need to add a size consistency check between `x` and `mask`.

Are you sure you do not want `axes`?

1 Like

After looking into it, it seems like the way to go here is:
`for i in eachindex(x, mask)`
where `eachindex` uses `axes` internally I think.

I meant comparing the mask and x axis for your assert. But yes,
eachindex is a way to go, if you do not use @inbounds I imagine.

1 Like

Just a few points here:

• `A[I]` will allocate unless itās prefixed by `@view` or `@views`. The right hand side of `.=` isnāt special. In many cases using `@views y .= A[I]` will do exactly what you want and be efficient about it. Thatās by-and-large why Iāve not suggested adding `getindex!`.
• One exception, though, is with logical indexing. `@view A[mask]` unfortunately must allocate in order to create a usable array (it converts the mask into immediate indices).
• In general, Iāve found logical indexing to be a super-convenient shorthand, but you can often do things faster by avoiding the mask creation in the first place. If this is truly a hot loop in your program, I wouldnāt just optimize `y .= x[mask]`, Iād also seek to fuse the `mask` creation into one single loop.
3 Likes

The operation `y .= x[mask]` is fundamentally strange. Youāre asking to assign an array of variable length elementwise to an array of fixed length.

3 Likes

@Henrique_Becker, I misunderstood your suggestion about `axes`; thanks for clarifying. In looking into that more, I just discovered that my `eachindex()` version writes past the end of the array (doh). So I took your `axes` suggestion and went back to `lastindex`. Hereās the current version:

``````@inline function getindex!(
y::AbstractVector,
x::AbstractArray{T,D},
) where {T,D}
axes(y) == (Base.OneTo(sum(mask)),) || throw("y axes \$(axes(y))")
axes(mask) == axes(x) || throw(DimensionMismatch("x vs mask"))
count = 1
y[count] = x[i]
end
return y
end
``````

@mbauman, thanks for the points. I wish the Julia manual had an entry for `getindex!` that says: ānope, just use `@views y .= A[I]` in most casesā!
Unfortunately, searching for `getindex!` in this web site does not work because the search seems to discard the ā`!`ā so I donāt know if others will find this helpful discussion.

@benninkrs, my `mask` is provided by the user at the start of a long computation so it is not āvariable lengthā in that long computation, and I can preallocate `y` based on it at the start. While writing this, and thinking about @mbaumanās fuse" suggestion, I just realized that I could do `index = findall(mask)` before the long computation and then use the `@views` strategy.

Edit: I tried the `@views` way with `findall` and it was slower than `getindex!` for a small problem and the same speed for a larger problem.

Hereās an ugly implementation written to only support `Complex{Float32}` (but it should work for any 64-bit numbers that are stored inline).

### WARNING: DONāT RUN THIS UNLESS

``````(Base.libllvm_version ā„ v"7" && VectorizationBase.AVX512F) || Base.libllvm_version ā„ v"9"
``````

Otherwise, it will crash. LLVM 7 introduced the intrinsic and implemented it for AVX512, but not for other x86 CPUs. Starting with LLVM 9, it does work for the rest.
Therefore, if you want to try this code, use Julia nightly, or a CPU with AVX512. It should be fast if your CPU has it, but probably wonāt be otherwise.

``````using SIMDPirates
@inline function getindex_compress!(y::Vector{Complex{Float32}}, x::Vector{Complex{Float32}}, mask::BitVector)
@assert ((Base.libllvm_version ā„ v"7" && SIMDPirates.VectorizationBase.AVX512F) || Base.libllvm_version ā„ v"9") "Running this code will crash your Julia session."
y_ptr = Base.unsafe_convert(Ptr{Int64}, pointer(y));
x_ptr = Base.unsafe_convert(Ptr{Int64}, pointer(x));
x_max = x_ptr + 8length(mask)
GC.@preserve x y mask begin
while x_ptr < x_max - 7*8
SIMDPirates.compressstore!(y_ptr, vload(Vec{8,Int64}, x_ptr), m)
mask_ptr += 1; x_ptr += 8*8; y_ptr += count_ones(m) * 8;
end
if x_ptr < x_max
SIMDPirates.compressstore!(y_ptr, vload(Vec{8,Int64}, x_ptr, m), m)
end
end # GC.@preserve
end
``````

That said, it can pay off; versioninfo and setup:

``````julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

julia> using BenchmarkTools

julia> using Random: seed!

julia> seed!(0);

julia> @inline function getindex1!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
@inbounds y .= x[mask] # the obvious way, but it allocates!
end
getindex1! (generic function with 1 method)

julia> @inline function getindex2!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
count = 0
count += 1
y[count] = x[i] # non-allocating, but slower :(
end
end
end
getindex2! (generic function with 1 method)

julia> N = (2^16,)
(65536,)

julia> mask = rand(N...) .< 0.5;

julia> @show K = sum(mask)
K = sum(mask) = 32799
32799

julia> T = ComplexF32
Complex{Float32}

julia> y = Array{T}(undef, K);

julia> x = randn(T, N...);

julia> function getindex4!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
count = 0
y[count] = x[i]
end
end
getindex4! (generic function with 1 method)

julia> @inline function getindex!(y::AbstractVector, x::AbstractArray{T,D},
) where {T,D}
sum(mask) == length(y) || throw("wrong length")
count = 1
y[count] = x[i]
end
return y
end
getindex! (generic function with 1 method)
``````

Results:

``````julia> @btime getindex_compress!(\$y, \$x, \$mask); y == x[mask]
15.043 Ī¼s (0 allocations: 0 bytes)
true

julia> @btime getindex1!(\$y, \$x, \$mask); y == x[mask]
45.704 Ī¼s (3 allocations: 256.36 KiB)
true

julia> @btime getindex2!(\$y, \$x, \$mask); y == x[mask]
239.255 Ī¼s (0 allocations: 0 bytes)
true

julia> @btime getindex4!(\$y, \$x, \$mask); y == x[mask]
99.014 Ī¼s (0 allocations: 0 bytes)
false

julia> @btime getindex8!(\$y, \$x, \$mask); y == x[mask]
51.028 Ī¼s (0 allocations: 0 bytes)
true

julia> @btime getindex!(\$y, \$x, \$mask); y == x[mask]
50.978 Ī¼s (0 allocations: 0 bytes)
true
``````

@Elrod, impressive speedup!
I tried it on a server that satisfies the requirements:

``````Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, ivybridge)
Environment:
``````

`VectorizationBase.AVX512F` was `true`

And it crashed. Probably Iāll wait to try again with Julia 1.5 / llvm9. Thanks.

``````LLVM ERROR: Cannot select: 0x1644ed8: ch = masked_store<(store 32 into %ir.ptr.i7 + 32)> 0x1b33ec0, 0x11c5568, 0x1b34dc8, 0x1b34818, /n/ironwood/y/fessler/.julia/packages/SIMDPirates/lBzBX/src/SIMDPirates.jl:10 @[ /n/ironwood/y/fessler/.julia/packages/SIMDPirates/lBzBX/src/memory.jl:1170 @[ /n/ironwood/y/fessler/.julia/packages/SIMDPirates/lBzBX/src/memory.jl:1151 @[ /n/ire/Volumes/s2/fessler/src/julia/work/getindex-.jl:130 ] ] ]
``````

You can check Intel Ark and see that that CPU (sandybridge) doesnāt even support FMA or AVX2, let alone AVX512.

FWIW, Travisās Linux CPUs do support AVX512.

If `VectorizationBase.AVX512F == true`, that is a definite bug.
Mind double checking and showing `] st Vectorizationbase`?

I just installed it today so itās (presumably) the latest version:
` [3d5dd08c] VectorizationBase v0.12.16`

Looks like multiple bugs then:

``````julia-1.4.2> using VectorizationBase

julia-1.4.2> VectorizationBase.AVX512F
true

julia-1.4.2> VectorizationBase.FMA
true

julia-1.4.2> VectorizationBase.AVX2
true
``````

Do you mind submitting the issue since you know much more about this?

I just released VectorizationBase v0.12.17. Mind updating and trying again?
If you still get `true` on any of these, mind showing me the output of:

``````# using Pkg; Pkg.add("LLVM")
using LLVM
println(unsafe_string(LLVM.API.LLVMGetHostCPUFeatures()))
``````

Anyway, while the code works without AVX512 on a newer LLVM version, it needs AVX512 to be fast:

``````julia> @btime getindex_compress!(\$y, \$x, \$mask); y == x[mask]
174.600 Ī¼s (0 allocations: 0 bytes)
true

julia> @btime getindex1!(\$y, \$x, \$mask); y == x[mask]
43.300 Ī¼s (2 allocations: 256.33 KiB)
true

julia> @btime getindex2!(\$y, \$x, \$mask); y == x[mask]
268.800 Ī¼s (0 allocations: 0 bytes)
true

julia> @btime getindex4!(\$y, \$x, \$mask); y == x[mask]
100.100 Ī¼s (0 allocations: 0 bytes)
false

julia> @btime getindex8!(\$y, \$x, \$mask); y == x[mask]
52.000 Ī¼s (0 allocations: 0 bytes)
true

julia> @btime getindex!(\$y, \$x, \$mask); y == x[mask]
62.500 Ī¼s (0 allocations: 0 bytes)
true

julia> versioninfo()
Julia Version 1.6.0-DEV.355
Commit 955beb6a91* (2020-07-02 10:28 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-10.0.0 (ORCJIT, skylake)
``````

Because it works by using a special assembly instruction thatās only available on AVX512. Without it, that instruction has to be emulated.

They are all false after the update on that server.

2 Likes