Efficient (non-allocating, in-place) getindex! for BitArray?

I have an indexing operation of the form y .= x[mask] where mask is a BitArray. To my surprise, this is an allocating operation, even though it could be done in place in compiled code. So I wrote a loop that is non-allocating, but it seems to be slower! I am a bit surprised there is no general getindex!, but lacking that, I would be grateful for ideas on how to get a fast, non-allocating getindex, at least for bit arrays. Here is my benchmark code:

using BenchmarkTools
using Random: seed!
seed!(0)

@inline function getindex1!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
	@inbounds y .= x[mask] # the obvious way, but it allocates!
end

@inline function getindex2!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
	count = 0
	@inbounds for i=1:length(mask)
		if mask[i]
			count += 1
			y[count] = x[i] # non-allocating, but slower :(
		end
	end
end

N = (2^16,)
mask = rand(N...) .< 0.5
@show K = sum(mask)
T = ComplexF32
y = Array{T}(undef, K)
x = randn(T, N...)

@btime getindex1!($y, $x, $mask) #  35.9 μs (3 allocations: 256.36 KiB)
@btime getindex2!($y, $x, $mask) # 217.1 μs (0 allocations: 0 bytes)

It's faster if you omit the if:

function get4!(y, x, mask)
    count = 1
    @inbounds for i in eachindex(mask)
        y[count] = x[i]
        count += mask[i]
    end
    y
end

get4!(similar(y), x, mask) ≈ getindex1!(similar(y), x, mask)

Edit: this isn't actually safe. If the last element of mask is false (last(mask) != 1), it will try to write beyond the end of y. Here's a better version, almost as quick:

function get7!(y, x, mask)
    sum(mask) == length(y) || error("wrong length")
    count = 1
    @inbounds for i in 1:findlast(identity, mask)
        y[count] = x[i]
        count += mask[i]
    end
    y
end

Just to give people the times (on my machine, Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz):

julia> @btime getindex1!($y, $x, $mask)
  41.783 μs (3 allocations: 256.36 KiB)

julia> @btime getindex2!($y, $x, $mask)
  257.931 μs (0 allocations: 0 bytes)

julia> @btime getindex3!($y, $x, $mask) # mcabbott version
  48.770 μs (0 allocations: 0 bytes)

Thanks for the fast reply and the fast code!
On my machine there is about 10% extra compute time for the safe non-allocating version get7! above, compared to my original allocating version getindex1!.
But when I increase the problem size to N = 2^20, your get7! is 15% faster than getindex1!, presumably because it avoids the overhead of allocating an 8 MB array. Yay!
The sum(mask) check adds only 1% overhead, so it seems worth it.
I will be adding this to MIRT.jl, but I wonder if it belongs elsewhere…

Edit: in case anyone wants a tested version of this later, it is now here:
https://github.com/JeffFessler/MIRT.jl/blob/master/src/utility/mask.jl

A slight modification of @mcabbott's answer (2.5× faster on my machine than the original getindex1!):

function getindex4!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
    count = 0
    for i = findfirst(mask):length(mask)
        count += mask[i]
        y[count] = x[i]
    end 
end

N = (2^20,)
mask = rand(N...) .< 0.5
@show K = sum(mask)
T = ComplexF32
y = Array{T}(undef, K)
x = randn(T, N...)

@btime getindex1!($y, $x, $mask) # 2.117 ms (2 allocations: 4.00 MiB)
@btime getindex4!($y, $x, $mask) # 868.900 μs (0 allocations: 0 bytes)

Unfortunately it does not quite return the correct answer:
getindex4!(zeros(2), [1,-2.,3], [true,false,true])
returns [-2,3] instead of [1,3].
It would be nice to use the (presumably faster) findfirst, but I don't see how.

The code from @Seif_Shebl is wrong. Starting count at zero and incrementing it before the assignment gives wrong results: every time a false is found, x[i] is written to the same position where the last saved value is already stored. The algorithm depends on count being one index ahead when false values are encountered. This is also what made @mcabbott amend their code.
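To make the failure mode concrete, here is a step-by-step trace of that small example:

# getindex4!(zeros(2), [1, -2., 3], [true, false, true]) does:
#   i = 1: count = 0 + 1 = 1, y[1] = 1.0
#   i = 2: count = 1 + 0 = 1, y[1] = -2.0   # the false entry overwrites y[1]
#   i = 3: count = 1 + 1 = 2, y[2] = 3.0
# hence the result [-2.0, 3.0] instead of [1.0, 3.0]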

I tried a final version combining the previous attempts, but the results were not better (for the original 2^16 arrays).

julia> function getindex8!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
           @assert sum(mask) <= length(y) && length(x) >= length(mask)
           # optionally, instead of this assert, just remove the @inbounds below
           count = 1
           @inbounds for i = findfirst(mask):findlast(mask)
               y[count] = x[i]
               count += mask[i]
           end
           y
       end

julia> @btime getindex8!($y, $x, $mask)
  48.913 μs (0 allocations: 0 bytes)

Thanks for the input. I actually needed this to work for general arrays, not just vectors, so here is my current version, which requires that x and mask have the same size. I also need y to be an AbstractVector because in my work it is often a view.
The feedback on this forum was enormously helpful. Yay Julia community!

@inline function getindex!(y::AbstractVector, x::AbstractArray{T,D},
    mask::AbstractArray{Bool,D},
) where {T,D}
    sum(mask) == length(y) || throw("wrong length")
    count = 1
    @inbounds for i in 1:LinearIndices(mask)[findlast(mask)]
        y[count] = x[i]
        count += mask[i]
    end
    return y
end

Edit: I just realized that the types enforce the same dimensions but not the same size, so I do need to add a size-consistency check between x and mask.

Are you sure you do not want axes?


After looking into it, it seems like the way to go here is:
for i in eachindex(x, mask)
where eachindex uses axes internally, I think.
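For example (a small self-contained sketch; eachindex with two array arguments checks that they share the same indices before iterating):

x    = randn(4)
mask = [true, false, true, false]
for i in eachindex(x, mask)
    # i is a valid index for both x and mask here
end
# eachindex(randn(3), mask)   # would throw a DimensionMismatch: the axes differ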

I meant comparing the axes of mask and x for your assert. But yes,
eachindex is a way to go, if you do not use @inbounds, I imagine.
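That is, something like this in the check (just a sketch):

axes(x) == axes(mask) || throw(DimensionMismatch("x and mask must have the same axes"))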


Just a few points here:

  • A[I] will allocate unless it's prefixed by @view or @views. The right hand side of .= isn't special. In many cases using @views y .= A[I] will do exactly what you want and be efficient about it. That's by-and-large why I've not suggested adding getindex!.
  • One exception, though, is with logical indexing. @view A[mask] unfortunately must allocate in order to create a usable array (it converts the mask into immediate indices).
  • In general, I've found logical indexing to be a super-convenient shorthand, but you can often do things faster by avoiding the mask creation in the first place. If this is truly a hot loop in your program, I wouldn't just optimize y .= x[mask], I'd also seek to fuse the mask creation into one single loop (a rough sketch follows this list).
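To illustrate that last point, here is a rough sketch of fusing the mask test into the copy loop (copy_if! and keep are made-up names for this example, and y is assumed to be pre-sized large enough to hold every kept element):

function copy_if!(y, x, keep)
    count = 0
    for i in eachindex(x)
        if keep(x[i])       # test the condition directly; no mask array is built
            count += 1
            y[count] = x[i]
        end
    end
    return count            # number of elements actually written
end

# e.g. instead of  mask = real.(x) .> 0;  y .= x[mask]
# use              copy_if!(y, x, v -> real(v) > 0)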

The operation y .= x[mask] is fundamentally strange. You're asking to assign an array of variable length elementwise to an array of fixed length.
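For example (just to illustrate the point):

y = zeros(2)
x = [1.0, 2.0, 3.0]
y .= x[x .> 1.5]    # happens to work: the mask selects exactly 2 elements
y .= x[x .> 0.5]    # DimensionMismatch: x[mask] has 3 elements but y has 2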


@Henrique_Becker, I misunderstood your suggestion about axes; thanks for clarifying. In looking into that more, I just discovered that my eachindex() version writes past the end of the array (doh). So I took your axes suggestion and went back to lastindex. Here's the current version:

@inline function getindex!(
    y::AbstractVector,
    x::AbstractArray{T,D},
    mask::AbstractArray{Bool,D},
) where {T,D}
    axes(y) == (Base.OneTo(sum(mask)),) || throw("y axes $(axes(y))")
    axes(mask) == axes(x) || throw(DimensionMismatch("x vs mask"))
    count = 1
    @inbounds for i in 1:LinearIndices(mask)[findlast(mask)]
        y[count] = x[i]
        count += mask[i]
    end
    return y
end

@mbauman, thanks for the points. I wish the Julia manual had an entry for getindex! that says: "nope, just use @views y .= A[I] in most cases"!
Unfortunately, searching for getindex! on this site does not work because the search seems to discard the "!", so I don't know whether others will find this helpful discussion.

@benninkrs, my mask is provided by the user at the start of a long computation, so it is not "variable length" within that computation, and I can preallocate y based on it at the start. While writing this, and thinking about @mbauman's "fuse" suggestion, I just realized that I could do index = findall(mask) before the long computation and then use the @views strategy.
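Roughly this (a sketch; index is just an illustrative name):

mask  = rand(2^10) .< 0.5
x     = randn(ComplexF32, 2^10)
y     = Vector{ComplexF32}(undef, sum(mask))
index = findall(mask)     # integer indices, computed once before the long computation
# ... then inside the repeated/hot part:
@views y .= x[index]      # integer indexing through a view, reused many times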

Edit: I tried the @views way with findall and it was slower than getindex! for a small problem and the same speed for a larger problem.

Here's an ugly implementation written to only support Complex{Float32} (but it should work for any 64-bit numbers that are stored inline).

WARNING: DON'T RUN THIS UNLESS

(Base.libllvm_version ≥ v"7" && VectorizationBase.AVX512F) || Base.libllvm_version ≥ v"9"

Otherwise, it will crash. LLVM 7 introduced the intrinsic and implemented it for AVX512, but not for other x86 CPUs. Starting with LLVM 9, it does work for the rest.
Therefore, if you want to try this code, use Julia nightly, or a CPU with AVX512. It should be fast if your CPU has it, but probably won't be otherwise.

using SIMDPirates
@inline function getindex_compress!(y::Vector{Complex{Float32}}, x::Vector{Complex{Float32}}, mask::BitVector)
    @assert ((Base.libllvm_version ≥ v"7" && SIMDPirates.VectorizationBase.AVX512F) || Base.libllvm_version ≥ v"9") "Running this code will crash your Julia session."
    mask_ptr = Base.unsafe_convert(Ptr{UInt8}, pointer(mask.chunks))
    y_ptr = Base.unsafe_convert(Ptr{Int64}, pointer(y));
    x_ptr = Base.unsafe_convert(Ptr{Int64}, pointer(x));
    x_max = x_ptr + 8length(mask)
    GC.@preserve x y mask begin
        while x_ptr < x_max - 7*8
            m = unsafe_load(mask_ptr)
            SIMDPirates.compressstore!(y_ptr, vload(Vec{8,Int64}, x_ptr), m)
            mask_ptr += 1; x_ptr += 8*8; y_ptr += count_ones(m) * 8;
        end
        if x_ptr < x_max
            m = SIMDPirates.mask(Val(8), length(x)) & unsafe_load(mask_ptr)
            SIMDPirates.compressstore!(y_ptr, vload(Vec{8,Int64}, x_ptr, m), m)
        end
    end # GC.@preserve
end

That said, it can pay off; versioninfo and setup:

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

julia> using BenchmarkTools

julia> using Random: seed!

julia> seed!(0);

julia> @inline function getindex1!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
               @inbounds y .= x[mask] # the obvious way, but it allocates!
       end
getindex1! (generic function with 1 method)

julia> @inline function getindex2!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
               count = 0
               @inbounds for i=1:length(mask)
                       if mask[i]
                               count += 1
                               y[count] = x[i] # non-allocating, but slower :(
                       end
               end
       end
getindex2! (generic function with 1 method)

julia> N = (2^16,)
(65536,)

julia> mask = rand(N...) .< 0.5;

julia> @show K = sum(mask)
K = sum(mask) = 32799
32799

julia> T = ComplexF32
Complex{Float32}

julia> y = Array{T}(undef, K);

julia> x = randn(T, N...);

julia> function getindex4!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
           count = 0
           for i = findfirst(mask):length(mask)
               count += mask[i]
               y[count] = x[i]
           end
       end
getindex4! (generic function with 1 method)

julia> @inline function getindex!(y::AbstractVector, x::AbstractArray{T,D},
           mask::AbstractArray{Bool,D},
       ) where {T,D}
           sum(mask) == length(y) || throw("wrong length")
           count = 1
           @inbounds for i in 1:LinearIndices(mask)[findlast(mask)]
               y[count] = x[i]
               count += mask[i]
           end
           return y
       end
getindex! (generic function with 1 method)

Results:

julia> @btime getindex_compress!($y, $x, $mask); y == x[mask]
  15.043 μs (0 allocations: 0 bytes)
true

julia> @btime getindex1!($y, $x, $mask); y == x[mask]
  45.704 μs (3 allocations: 256.36 KiB)
true

julia> @btime getindex2!($y, $x, $mask); y == x[mask]
  239.255 μs (0 allocations: 0 bytes)
true

julia> @btime getindex4!($y, $x, $mask); y == x[mask]
  99.014 μs (0 allocations: 0 bytes)
false

julia> @btime getindex8!($y, $x, $mask); y == x[mask]
  51.028 μs (0 allocations: 0 bytes)
true

julia> @btime getindex!($y, $x, $mask); y == x[mask]
  50.978 μs (0 allocations: 0 bytes)
true

@Elrod, impressive speedup!
I tried it on a server that satisfies the requirements:

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, ivybridge)
Environment:
  JULIA_NUM_THREADS = 40

VectorizationBase.AVX512F was true

And it crashed. Probably I'll wait to try again with Julia 1.5 / LLVM 9. Thanks.

LLVM ERROR: Cannot select: 0x1644ed8: ch = masked_store<(store 32 into %ir.ptr.i7 + 32)> 0x1b33ec0, 0x11c5568, 0x1b34dc8, 0x1b34818, /n/ironwood/y/fessler/.julia/packages/SIMDPirates/lBzBX/src/SIMDPirates.jl:10 @[ /n/ironwood/y/fessler/.julia/packages/SIMDPirates/lBzBX/src/memory.jl:1170 @[ /n/ironwood/y/fessler/.julia/packages/SIMDPirates/lBzBX/src/memory.jl:1151 @[ /n/ire/Volumes/s2/fessler/src/julia/work/getindex-.jl:130 ] ] ] 

You can check Intel Ark and see that that CPU (ivybridge) doesn't even support FMA or AVX2, let alone AVX512.

FWIW, Travis's Linux CPUs do support AVX512.

If VectorizationBase.AVX512F == true, that is a definite bug.
Mind double-checking and showing ] st VectorizationBase?

I just installed it today so it's (presumably) the latest version:
[3d5dd08c] VectorizationBase v0.12.16

Looks like multiple bugs then:

julia-1.4.2> using VectorizationBase

julia-1.4.2> VectorizationBase.AVX512F
true

julia-1.4.2> VectorizationBase.FMA
true

julia-1.4.2> VectorizationBase.AVX2
true

Do you mind submitting the issue since you know much more about this?

I just released VectorizationBase v0.12.17. Mind updating and trying again?
If you still get true on any of these, mind showing me the output of:

# using Pkg; Pkg.add("LLVM")
using LLVM
println(unsafe_string(LLVM.API.LLVMGetHostCPUFeatures()))

Anyway, while the code works without AVX512 on a newer LLVM version, it needs AVX512 to be fast:

julia> @btime getindex_compress!($y, $x, $mask); y == x[mask]
  174.600 μs (0 allocations: 0 bytes)
true

julia> @btime getindex1!($y, $x, $mask); y == x[mask]
  43.300 μs (2 allocations: 256.33 KiB)
true

julia> @btime getindex2!($y, $x, $mask); y == x[mask]
  268.800 μs (0 allocations: 0 bytes)
true

julia> @btime getindex4!($y, $x, $mask); y == x[mask]
  100.100 μs (0 allocations: 0 bytes)
false

julia> @btime getindex8!($y, $x, $mask); y == x[mask]
  52.000 μs (0 allocations: 0 bytes)
true

julia> @btime getindex!($y, $x, $mask); y == x[mask]
  62.500 μs (0 allocations: 0 bytes)
true

julia> versioninfo()
Julia Version 1.6.0-DEV.355
Commit 955beb6a91* (2020-07-02 10:28 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.0 (ORCJIT, skylake)

Because it works by using a special assembly instruction that's only available on AVX512. Without it, that instruction has to be emulated.

They are all false after the update on that server.
