Efficient (non-allocating, in-place) getindex! for BitArray?

I have an indexing operation of the form y .= x[mask] where mask is a BitArray. To my surprise, this is an allocating operation even though it could be done in place in compiled code. So I wrote a loop that is non-allocating, but it seems to be slower! I am a bit surprised there is no general getindex!, but lacking that I would be grateful for ideas of how to get fast non-allocating getindex at least for bit arrays. Here is my benchmark code:

using BenchmarkTools
using Random: seed!
seed!(0)

@inline function getindex1!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
	@inbounds y .= x[mask] # the obvious way, but it allocates!
end

@inline function getindex2!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
	count = 0
	@inbounds for i=1:length(mask)
		if mask[i]
			count += 1
			y[count] = x[i] # non-allocating, but slower :(
		end
	end
end

N = (2^16,)
mask = rand(N...) .< 0.5
@show K = sum(mask)
T = ComplexF32
y = Array{T}(undef, K)
x = randn(T, N...)

@btime getindex1!($y, $x, $mask) #  35.9 μs (3 allocations: 256.36 KiB)
@btime getindex2!($y, $x, $mask) # 217.1 μs (0 allocations: 0 bytes)

It’s faster if you omit the if:

function get4!(y, x, mask)
    count = 1
    @inbounds for i in eachindex(mask)
        y[count] = x[i]
        count += mask[i]
    end
    y
end

get4!(similar(y), x, mask) ≈ getindex1!(similar(y), x, mask)

Edit – this isn’t actually safe: if the last element of mask is false, the loop will try to write beyond the end of y. Here’s a better version, almost as quick:

function get7!(y, x, mask)
    sum(mask) == length(y) || error("wrong length")
    count = 1
    @inbounds for i in 1:findlast(identity, mask)
        y[count] = x[i]
        count += mask[i]
    end
    y
end

Just to give people the times (on my machine, Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz):

julia> @btime getindex1!($y, $x, $mask)
  41.783 μs (3 allocations: 256.36 KiB)

julia> @btime getindex2!($y, $x, $mask)
  257.931 μs (0 allocations: 0 bytes)

julia> @btime getindex3!($y, $x, $mask) # mcabbott version
  48.770 μs (0 allocations: 0 bytes)

Thanks for the fast reply and the fast code!
On my machine, the safe non-allocating version get7! above takes about 10% more compute time than my original allocating version getindex1!.
But when I increase the problem size to N = 2^20, your get7! is 15% faster than getindex1!, presumably because it avoids the overhead of allocating an 8 MB array. Yay!
The sum(mask) check adds only 1% overhead, so it seems worth it.
I will be adding this to MIRT.jl but I wonder if it belongs elsewhere…

A slight modification of @mcabbott’s answer (2.5× faster on my machine than the original getindex1!):

function getindex4!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
    count = 0
    for i = findfirst(mask):length(mask)
        count += mask[i]
        y[count] = x[i]
    end 
end

N = (2^20,)
mask = rand(N...) .< 0.5
@show K = sum(mask)
T = ComplexF32
y = Array{T}(undef, K)
x = randn(T, N...)

@btime getindex1!($y, $x, $mask) # 2.117 ms (2 allocations: 4.00 MiB)
@btime getindex4!($y, $x, $mask) # 868.900 μs (0 allocations: 0 bytes)

Unfortunately it does not quite return the correct answer:
getindex4!(zeros(2), [1,-2.,3], [true,false,true])
returns [-2,3] instead of [1,3].
It would be nice to use the (presumably faster) findfirst but I don’t see how.

The code of @Seif_Shebl is wrong: starting count at zero and incrementing it before the assignment gives wrong results, because every time a false is found the value is written to the position where the last saved value is already stored. The algorithm depends on count being one index ahead when false values are encountered. This is also what made @mcabbott amend their code.

I tried a final version combining the other attempts, but the results were no better (for the original 2^16 arrays).

julia> function getindex8!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
           @assert sum(mask) <= length(y) && length(x) >= length(mask)
           # optionally, instead of this assert, just remove the @inbounds below
           count = 1
           @inbounds for i = findfirst(mask):findlast(mask)
               y[count] = x[i]
               count += mask[i]
           end
           y
       end

julia> @btime getindex8!($y, $x, $mask)
  48.913 μs (0 allocations: 0 bytes)

Thanks for the input. I actually needed this to work for general arrays, not just vectors, so here is my current version that requires that x and mask have the same size. I also need AbstractVector because often y is a view in my work.
The feedback on this forum was enormously helpful. Yay Julia community!

@inline function getindex!(y::AbstractVector, x::AbstractArray{T,D},
    mask::AbstractArray{Bool,D},
) where {T,D}
    sum(mask) == length(y) || throw("wrong length")
    count = 1
    @inbounds for i in 1:LinearIndices(mask)[findlast(mask)]
        y[count] = x[i]
        count += mask[i]
    end
    return y
end

Edit: I just realized that the types enforce same dimensions but not same size so I do need to add a size consistency check between x and mask.

Are you sure you do not want axes?


After looking into it, it seems like the way to go here is:
for i in eachindex(x, mask)
where eachindex uses axes internally I think.

I meant comparing the axes of mask and x in your assert. But yes, eachindex is a way to go, at least if you do not use @inbounds, I imagine.
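For reference, the multi-argument form of eachindex performs exactly that axes check (a minimal illustration; the variable names are mine):

```julia
# eachindex with several arguments verifies that all arrays share the same
# indices, so it catches size mismatches up front, before any loop body runs.
a = zeros(3)
b = zeros(3)
c = zeros(4)

eachindex(a, b)       # fine: both have indices 1:3

try
    eachindex(a, c)   # indices disagree
catch e
    @show e isa DimensionMismatch   # true
end
```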


Just a few points here:

  • A[I] will allocate unless it’s prefixed by @view or @views. The right hand side of .= isn’t special. In many cases using @views y .= A[I] will do exactly what you want and be efficient about it. That’s by-and-large why I’ve not suggested adding getindex!.
  • One exception, though, is with logical indexing. @view A[mask] unfortunately must allocate in order to create a usable array (it immediately converts the mask into integer indices).
  • In general, I’ve found logical indexing to be a super-convenient shorthand, but you can often do things faster by avoiding the mask creation in the first place. If this is truly a hot loop in your program, I wouldn’t just optimize y .= x[mask], I’d also seek to fuse the mask creation into one single loop.
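To make the first two points concrete, here is a minimal sketch (the names idx and y2 are illustrative):

```julia
using Random: seed!
seed!(0)

x = randn(Float32, 100)
mask = x .> 0                # logical mask (a BitVector)
idx = findall(mask)          # integer indices; allocated once, up front
y = similar(x, length(idx))

@views y .= x[idx]           # view over integer indices: no temporary copy of x's data

y2 = @view x[mask]           # logical index: the view must still allocate converted indices
```

Both y and y2 hold the same elements as x[mask]; the difference is where (and how often) the allocation happens.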

The operation y .= x[mask] is fundamentally strange. You’re asking to assign an array of variable length elementwise to an array of fixed length.


@Henrique_Becker, I misunderstood your suggestion about axes; thanks for clarifying. In looking into that more, I just discovered that my eachindex() version writes past the end of the array (doh). So I took your axes suggestion and went back to lastindex. Here’s the current version:

@inline function getindex!(
    y::AbstractVector,
    x::AbstractArray{T,D},
    mask::AbstractArray{Bool,D},
) where {T,D}
    axes(y) == (Base.OneTo(sum(mask)),) || throw("y axes $(axes(y))")
    axes(mask) == axes(x) || throw(DimensionMismatch("x vs mask"))
    count = 1
    @inbounds for i in 1:LinearIndices(mask)[findlast(mask)]
        y[count] = x[i]
        count += mask[i]
    end
    return y
end

@mbauman, thanks for the points. I wish the Julia manual had an entry for getindex! that says: “nope, just use @views y .= A[I] in most cases”!
Unfortunately, searching for getindex! in this web site does not work because the search seems to discard the “!” so I don’t know if others will find this helpful discussion.

@benninkrs, my mask is provided by the user at the start of a long computation, so it is not “variable length” within that computation, and I can preallocate y from it at the start. While writing this, and thinking about @mbauman’s “fuse” suggestion, I just realized that I could do index = findall(mask) before the long computation and then use the @views strategy.

Edit: I tried the @views way with findall and it was slower than getindex! for a small problem and the same speed for a larger problem.
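A minimal sketch of that findall-then-@views strategy (the names idx and run! are illustrative, not from MIRT.jl):

```julia
x = randn(ComplexF32, 2^16)
mask = real.(x) .> 0
idx = findall(mask)            # pay the findall allocation once, before the long computation
y = similar(x, length(idx))

function run!(y, x, idx)       # hot loop: reuses the precomputed integer indices
    @views y .= x[idx]         # no temporary from logical indexing
    return y
end

run!(y, x, idx) == x[mask]     # true
```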

Here’s an ugly implementation written to only support Complex{Float32} (but it should work for any 64-bit numbers that are stored inline).

WARNING: DON’T RUN THIS UNLESS

(Base.libllvm_version ≥ v"7" && VectorizationBase.AVX512F) || Base.libllvm_version ≥ v"9"

Otherwise, it will crash. LLVM 7 introduced the intrinsic and implemented it for AVX512, but not for other x86 CPUs. Starting with LLVM 9, it does work for the rest.
Therefore, if you want to try this code, use Julia nightly, or a CPU with AVX512. It should be fast if your CPU has it, but probably won’t be otherwise.

using SIMDPirates
@inline function getindex_compress!(y::Vector{Complex{Float32}}, x::Vector{Complex{Float32}}, mask::BitVector)
    @assert ((Base.libllvm_version ≥ v"7" && SIMDPirates.VectorizationBase.AVX512F) || Base.libllvm_version ≥ v"9") "Running this code will crash your Julia session."
    mask_ptr = Base.unsafe_convert(Ptr{UInt8}, pointer(mask.chunks))
    y_ptr = Base.unsafe_convert(Ptr{Int64}, pointer(y));
    x_ptr = Base.unsafe_convert(Ptr{Int64}, pointer(x));
    x_max = x_ptr + 8length(mask)
    GC.@preserve x y mask begin
        while x_ptr < x_max - 7*8
            m = unsafe_load(mask_ptr)
            SIMDPirates.compressstore!(y_ptr, vload(Vec{8,Int64}, x_ptr), m)
            mask_ptr += 1; x_ptr += 8*8; y_ptr += count_ones(m) * 8;
        end
        if x_ptr < x_max
            m = SIMDPirates.mask(Val(8), length(x)) & unsafe_load(mask_ptr)
            SIMDPirates.compressstore!(y_ptr, vload(Vec{8,Int64}, x_ptr, m), m)
        end
    end # GC.@preserve
end

That said, it can pay off; versioninfo and setup:

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

julia> using BenchmarkTools

julia> using Random: seed!

julia> seed!(0);

julia> @inline function getindex1!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
               @inbounds y .= x[mask] # the obvious way, but it allocates!
       end
getindex1! (generic function with 1 method)

julia> @inline function getindex2!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
               count = 0
               @inbounds for i=1:length(mask)
                       if mask[i]
                               count += 1
                               y[count] = x[i] # non-allocating, but slower :(
                       end
               end
       end
getindex2! (generic function with 1 method)

julia> N = (2^16,)
(65536,)

julia> mask = rand(N...) .< 0.5;

julia> @show K = sum(mask)
K = sum(mask) = 32799
32799

julia> T = ComplexF32
Complex{Float32}

julia> y = Array{T}(undef, K);

julia> x = randn(T, N...);

julia> function getindex4!(y::Vector{T}, x::Vector{T}, mask::AbstractVector{Bool}) where {T}
           count = 0
           for i = findfirst(mask):length(mask)
               count += mask[i]
               y[count] = x[i]
           end
       end
getindex4! (generic function with 1 method)

julia> @inline function getindex!(y::AbstractVector, x::AbstractArray{T,D},
           mask::AbstractArray{Bool,D},
       ) where {T,D}
           sum(mask) == length(y) || throw("wrong length")
           count = 1
           @inbounds for i in 1:LinearIndices(mask)[findlast(mask)]
               y[count] = x[i]
               count += mask[i]
           end
           return y
       end
getindex! (generic function with 1 method)

Results:

julia> @btime getindex_compress!($y, $x, $mask); y == x[mask]
  15.043 μs (0 allocations: 0 bytes)
true

julia> @btime getindex1!($y, $x, $mask); y == x[mask]
  45.704 μs (3 allocations: 256.36 KiB)
true

julia> @btime getindex2!($y, $x, $mask); y == x[mask]
  239.255 μs (0 allocations: 0 bytes)
true

julia> @btime getindex4!($y, $x, $mask); y == x[mask]
  99.014 μs (0 allocations: 0 bytes)
false

julia> @btime getindex8!($y, $x, $mask); y == x[mask]
  51.028 μs (0 allocations: 0 bytes)
true

julia> @btime getindex!($y, $x, $mask); y == x[mask]
  50.978 μs (0 allocations: 0 bytes)
true

@Elrod, impressive speedup!
I tried it on a server that satisfies the requirements:

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, ivybridge)
Environment:
  JULIA_NUM_THREADS = 40

VectorizationBase.AVX512F was true

And it crashed. Probably I’ll wait to try again with Julia 1.5 / llvm9. Thanks.

LLVM ERROR: Cannot select: 0x1644ed8: ch = masked_store<(store 32 into %ir.ptr.i7 + 32)> 0x1b33ec0, 0x11c5568, 0x1b34dc8, 0x1b34818, /n/ironwood/y/fessler/.julia/packages/SIMDPirates/lBzBX/src/SIMDPirates.jl:10 @[ /n/ironwood/y/fessler/.julia/packages/SIMDPirates/lBzBX/src/memory.jl:1170 @[ /n/ironwood/y/fessler/.julia/packages/SIMDPirates/lBzBX/src/memory.jl:1151 @[ /n/ire/Volumes/s2/fessler/src/julia/work/getindex-.jl:130 ] ] ] 

You can check Intel ARK and see that that CPU (Ivy Bridge) doesn’t even support FMA or AVX2, let alone AVX512.

FWIW, Travis’s Linux CPUs do support AVX512.

If VectorizationBase.AVX512F == true, that is a definite bug.
Mind double-checking and showing ] st VectorizationBase?

I just installed it today so it’s (presumably) the latest version:
[3d5dd08c] VectorizationBase v0.12.16

Looks like multiple bugs then:

julia-1.4.2> using VectorizationBase

julia-1.4.2> VectorizationBase.AVX512F
true

julia-1.4.2> VectorizationBase.FMA
true

julia-1.4.2> VectorizationBase.AVX2
true

Do you mind submitting the issue since you know much more about this?

I just released VectorizationBase v0.12.17. Mind updating and trying again?
If you still get true on any of these, mind showing me the output of:

# using Pkg; Pkg.add("LLVM")
using LLVM
println(unsafe_string(LLVM.API.LLVMGetHostCPUFeatures()))

Anyway, while the code works without AVX512 on a newer LLVM version, it needs AVX512 to be fast:

julia> @btime getindex_compress!($y, $x, $mask); y == x[mask]
  174.600 μs (0 allocations: 0 bytes)
true

julia> @btime getindex1!($y, $x, $mask); y == x[mask]
  43.300 μs (2 allocations: 256.33 KiB)
true

julia> @btime getindex2!($y, $x, $mask); y == x[mask]
  268.800 μs (0 allocations: 0 bytes)
true

julia> @btime getindex4!($y, $x, $mask); y == x[mask]
  100.100 μs (0 allocations: 0 bytes)
false

julia> @btime getindex8!($y, $x, $mask); y == x[mask]
  52.000 μs (0 allocations: 0 bytes)
true

julia> @btime getindex!($y, $x, $mask); y == x[mask]
  62.500 μs (0 allocations: 0 bytes)
true

julia> versioninfo()
Julia Version 1.6.0-DEV.355
Commit 955beb6a91* (2020-07-02 10:28 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.0 (ORCJIT, skylake)

Because it works by using a special instruction (the AVX-512 compress-store, e.g. vcompressps) that’s only available with AVX512. Without it, that instruction has to be emulated.
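For intuition, here is a scalar sketch of what an 8-lane compress-store does — this shows the semantics only, not the vectorized implementation, and the function name is mine:

```julia
# Pack the mask-selected lanes of `v` contiguously into `dest`, starting at
# index `start`; return the index of the next free slot.
function compressstore_scalar!(dest, v, m::UInt8, start)
    j = start
    for lane in 1:8
        if (m >> (lane - 1)) & 0x01 == 0x01   # lane selected by mask bit
            dest[j] = v[lane]
            j += 1
        end
    end
    return j                                  # next free slot in dest
end

dest = zeros(Int, 8)
nextslot = compressstore_scalar!(dest, collect(10:17), 0b10100110, 1)
dest      # → [11, 12, 15, 17, 0, 0, 0, 0]
```

This is exactly the per-chunk step in getindex_compress! above: advance the output pointer by count_ones(m) after each store.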

They are all false after the update on that server.
