Faster array indexing with a logical array

Hi everyone,

Pretty basic question…

I was experimenting with indexing arrays with logical arrays. For instance, say I have an array shaped (10000, 2) and I need to extract all the indices where the sum across the columns is less than a threshold. What would be the best Julian way to do it? I tried a list comprehension, but it is slower than the method below.

MWE:

using BenchmarkTools 

# what I could think of
function getindicesVectorized(x::Array)
    return x[getindex.(findall(sum(x, dims=2) .<= 1), 1), :];
end

function getindicesListComp(x::Array)
    ind = findall(sum(x, dims=2) .<= 1)
    temp = [x[ind[i][1], :] for i in 1:length(ind)]
    return hcat(temp...)';
end

julia> points = rand(10000, 2);

julia> @btime getindicesVectorized($points)
  32.201 μs (15 allocations: 279.70 KiB)

It seems to be fast and readable (see the equivalent numpy version below), but I was wondering if there is a faster way to do this?

What I would do in Python, naively:

import numpy as np
points = np.random.random((10000, 2))

In [8]: %timeit points[np.sum(points, axis=1) <= 1, :]
212 µs ± 2.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Thanks in advance :slight_smile: !!

I could write a version that’s fast if you have AVX512 and probably okay otherwise.
I should try to come up with a map-like API so I can put it in LoopVectorization next to vfilter.

1 Like

You don’t need findall; you can index with the Boolean array from broadcasting .< (as in the Python version). But even better is to avoid materialising the sum anywhere:

julia> gb1(x) = x[vec(sum(x,dims=2) .< 1), :];

julia> gb2(x) = x[(@views x[:,1] .+ x[:,2] .< 1), :];

julia> @btime gb1($points);
  30.618 μs (13 allocations: 160.41 KiB)

julia> @btime gb2($points);
  20.707 μs (5 allocations: 82.00 KiB)

julia> @btime getindicesVectorized($points); # to compare
  37.253 μs (15 allocations: 275.02 KiB)
3 Likes

AVX512-optimized version:

using VectorizationBase, SIMDPirates

function getindscompress(X::AbstractMatrix{T}) where {T<:Base.HWReal}
    M, N = size(X)
    @assert N == 2
    Y = similar(X)
    W, Wshift = VectorizationBase.pick_vector_width_shift(T)
    m = 0; my = 0
    ptrX = stridedpointer(X); ptrY = stridedpointer(Y);
    for _ ∈ 1:(M >>> Wshift)
        x₁ = vload(ptrX, (_MM{W}(m),0))
        x₂ = vload(ptrX, (_MM{W}(m),1))
        mask = (x₁ + x₂) <= 1
        SIMDPirates.compressstore!(gep(ptrY, (my,)), extract_data(x₁), mask)
        SIMDPirates.compressstore!(gep(ptrY, (my+M,)), extract_data(x₂), mask)
        my += count_ones(mask)
        m += W
    end
    if (M & (W -1)) > 0
        x₁ = vload(ptrX, (_MM{W}(m),0))
        x₂ = vload(ptrX, (_MM{W}(m),1))
        mask = ((x₁ + x₂) <= 1) & VectorizationBase.mask(T, M)
        SIMDPirates.compressstore!(gep(ptrY, (my,)), extract_data(x₁), mask)
        SIMDPirates.compressstore!(gep(ptrY, (my+M,)), extract_data(x₂), mask)        
    end
    view(Y, 1:my, :)
end
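For readers without AVX512 (or without these packages), the same row filter can be written as a plain-Julia loop. This is my own sketch for comparison, not part of the original code above; the function name is made up:

```julia
# Hypothetical scalar reference version of the same filter:
# keep the rows of an M×2 matrix whose row sum is <= 1,
# compacting them to the top of a preallocated output.
function getindsscalar(X::AbstractMatrix{T}) where {T}
    M, N = size(X)
    @assert N == 2
    Y = similar(X)
    my = 0  # number of rows kept so far
    @inbounds for m in 1:M
        if X[m, 1] + X[m, 2] <= 1
            my += 1
            Y[my, 1] = X[m, 1]
            Y[my, 2] = X[m, 2]
        end
    end
    view(Y, 1:my, :)
end
```

This is essentially what the compressed stores above do a whole vector at a time: the branch becomes a mask, and `my += count_ones(mask)` replaces the per-row increment.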

Benchmarks:

julia> X = rand(1000,2);

julia> getindicesVectorized(X) == getindscompress(X)
true

julia> @benchmark getindicesVectorized($X)
BenchmarkTools.Trial:
  memory estimate:  32.83 KiB
  allocs estimate:  11
  --------------
  minimum time:     4.207 μs (0.00% GC)
  median time:      4.600 μs (0.00% GC)
  mean time:        5.130 μs (9.27% GC)
  maximum time:     1.259 ms (97.40% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark getindscompress($X)
BenchmarkTools.Trial:
  memory estimate:  15.75 KiB
  allocs estimate:  1
  --------------
  minimum time:     681.347 ns (0.00% GC)
  median time:      881.409 ns (0.00% GC)
  mean time:        1.011 μs (10.37% GC)
  maximum time:     37.658 μs (88.79% GC)
  --------------
  samples:          10000
  evals/sample:     121

julia> @benchmark gb1($X)
BenchmarkTools.Trial:
  memory estimate:  20.69 KiB
  allocs estimate:  11
  --------------
  minimum time:     2.971 μs (0.00% GC)
  median time:      3.364 μs (0.00% GC)
  mean time:        3.723 μs (8.39% GC)
  maximum time:     1.178 ms (92.85% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> @benchmark gb2($X)
BenchmarkTools.Trial:
  memory estimate:  12.55 KiB
  allocs estimate:  4
  --------------
  minimum time:     1.847 μs (0.00% GC)
  median time:      2.149 μs (0.00% GC)
  mean time:        2.397 μs (7.13% GC)
  maximum time:     975.513 μs (92.04% GC)
  --------------
  samples:          10000
  evals/sample:     10

Using 10_000 points:

julia> points = rand(10000,2);

julia> @btime getindicesVectorized($points);
  30.121 μs (15 allocations: 279.58 KiB)

julia> @btime getindscompress($points);
  4.958 μs (2 allocations: 156.33 KiB)

julia> @btime gb1($points);
  23.194 μs (13 allocations: 162.22 KiB)

julia> @btime gb2($points);
  14.622 μs (5 allocations: 83.81 KiB)

To me, Julian means winning all the benchmarks. :wink:

4 Likes

It looks to me like you’re extracting the values, not the indices.

1 Like

I don’t think so, but frankly I don’t know whether I have any vector-extension instructions. The processor I have is an Intel Xeon(R), and simply looking it up under Instruction Set Extensions, I only see AVX2.

I can read more about it, as I don’t have a background in computer architecture :slight_smile:

A bit off the topic, but is there a way to check if my CPU supports AVX512 (I am on Windows) ?

The timings are great, Chris! I think this and the previous problem that I posted above both see significant performance gains from SIMD operations. I will read more on it.

The idea is to be able to use this in Finite Element Assembly, where surely one can take advantage of LoopVectorization.jl. Thanks!

Sadly, for the current problem, I don’t gain much by using it. :frowning:

julia> @btime getindscompress($points)
  34.701 μs (2 allocations: 156.33 KiB)

Thanks @mcabbott!! :slight_smile:

You are right, I meant extracting the elements of the array at those indices. :slight_smile:

1 Like

If the Xeon CPU you linked had AVX512, it would’ve been listed. You can also use:

julia> using VectorizationBase; VectorizationBase.AVX512F
true

But between the link and the fact that it wasn’t faster on your computer, we can already be quite sure that it’ll be false.

The code needs AVX512 to be fast, because AVX512F (F for “Foundation”, common to all AVX512 CPUs) added a “compressed store” instruction that’s perfect for accelerating things like this.
This will be emulated if you don’t have AVX512 so that the code still works, but that’s of course much slower.
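In scalar terms, a compressed store writes only the lanes whose mask bit is set, packed contiguously at the destination. Roughly, it behaves like this sketch (my own illustration of the semantics, not the actual intrinsic or any API of these packages):

```julia
# Rough scalar model of a masked compressed store:
# write the elements of `v` whose mask bit is set, contiguously
# starting at dest[offset+1], and return the updated offset.
function compressstore_model!(dest::Vector{T}, offset::Int,
                              v::NTuple{W,T}, mask::Unsigned) where {W,T}
    for i in 1:W
        if (mask >> (i - 1)) & 1 == 1
            offset += 1
            dest[offset] = v[i]
        end
    end
    return offset
end
```

With AVX512F the whole loop is a single instruction per vector, which is why the code above only needs `my += count_ones(mask)` to track where the next store begins.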

2 Likes

If you happen to have Python and numba installed, running numba -s works pretty well:

System info:
--------------------------------------------------------------------------------
__Time Stamp__
2020-07-10 09:46:04.439609

__Hardware Information__
Machine                                       : AMD64
CPU Name                                      : skylake
CPU count                                     : 8
CPU Features                                  : 
64bit adx aes avx avx2 bmi bmi2 clflushopt cmov cx16 f16c fma fsgsbase invpcid
lzcnt mmx movbe pclmul popcnt prfchw rdrnd rdseed sahf sgx sse sse2 sse3 sse4.1
sse4.2 ssse3 xsave xsavec xsaveopt xsaves

__OS Information__
Platform                                      : Windows-10-10.0.17134-SP0
Release                                       : 10
System Name                                   : Windows
Version                                       : 10.0.17134
OS specific info                              : 1010.0.17134SP0

[...omitted]

Code here:
https://github.com/numba/numba/blob/5b9d29b441a014793e3ad38f8cce2a16144a56ef/numba/misc/numba_sysinfo.py#L281

I couldn’t find a similar Julia utility by briefly googling (but one might be worthwhile?).

There is CpuId.jl.

julia> using CpuId

julia> cpufeaturetable()
  Cpu Feature Description
  ––––––––––– ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
  ACPI        Thermal monitor and software controlled clock facilities (MSR)
  ADX         Intel ADX (Multi-Precision Add-Carry Instruction Extensions)
  AES         AES encryption instruction set
  AHF64       LAHF and SAHF in PM64
  APIC        APIC on-chip (Advanced Programmable Interrupt Controller)
  AVX         256bit Advanced Vector Extensions, AVX
  AVX2        SIMD 256bit Advanced Vector Extensions 2
  AVX512BW    AVX-512 Byte and Word Instructions
  AVX512CD    AVX-512 Conflict Detection Instructions
  AVX512DQ    AVX-512 Doubleword and Quadword Instructions
  AVX512F     AVX-512 Foundation
  AVX512VL    AVX-512 Vector Length Extensions
  BMI1        Bit Manipulation Instruction Set 1

Or

julia> using LLVM

julia> s = unsafe_string(LLVM.API.LLVMGetHostCPUFeatures())
"+sse2,+cx16,+sahf,-tbm,-avx512ifma,-sha,-gfni,-fma4,-vpclmulqdq,+prfchw,+bmi2,-cldemote,+fsgsbase,-ptwrite,+xsavec,+popcnt,+mpx,+aes,-avx512bitalg,-movdiri,+xsaves,-avx512er,+avx512vnni,-avx512vpopcntdq,-pconfig,+clwb,+avx512f,-clzero,-pku,+mmx,-lwp,-rdpid,-xop,+rdseed,-waitpkg,-movdir64b,-sse4a,+avx512bw,+clflushopt,+xsave,-avx512vbmi2,+64bit,+avx512vl,+invpcid,+avx512cd,+avx,-vaes,+cx8,+fma,-rtm,+bmi,-enqcmd,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,+fxsr,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,-sgx,-shstk,+cmov,-avx512vbmi,-avx512bf16,+movbe,+xsaveopt,+avx512dq,+adx,-avx512pf,+sse3"

julia> filter(f -> occursin("avx512", f), split(s, ','))
14-element Array{SubString{String},1}:
 "-avx512ifma"
 "-avx512bitalg"
 "-avx512er"
 "+avx512vnni"
 "-avx512vpopcntdq"
 "+avx512f"
 "+avx512bw"
 "-avx512vbmi2"
 "+avx512vl"
 "+avx512cd"
 "-avx512vbmi"
 "-avx512bf16"
 "+avx512dq"
 "-avx512pf"

julia> Libc.free(s); # don't leak memory

A + means that it has the feature, while a - indicates it doesn’t.
So, for example, this CPU has AVX512F, but not AVX512ER (accurate reciprocals or exponentiation) or AVX512BF16 (for BF16 support).
This is basically what VectorizationBase does to define variables for each of these as true/false.

julia> using VectorizationBase

julia> VectorizationBase.AVX512IFMA # integer fused multiply add
false

julia> VectorizationBase.AVX512VNNI # vector neural net instructions
true

Note that LLVM is more complete. CpuId.jl seems to be missing VNNI, for example.

3 Likes

A bit off the topic, but is there a way to check if my CPU supports AVX512 (I am on Windows) ?

Use CPU-Z.
On its CPU tab you’ll have:

[screenshot of the CPU-Z CPU tab]

The Instructions section shows what your CPU supports.
If you see AVX512F there, then it is supported (you won’t see it, as yours doesn’t support it).

1 Like

Thanks Chris!

Indeed, it doesn’t have AVX512:

julia> cpufeaturetable()
  Cpu Feature Description
  ––––––––––– ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
  ACPI        Thermal monitor and software controlled clock facilities (MSR)
  ADX         Intel ADX (Multi-Precision Add-Carry Instruction Extensions)  
  AES         AES encryption instruction set
  AHF64       LAHF and SAHF in PM64
  APIC        APIC on-chip (Advanced Programmable Interrupt Controller)     
  AVX         256bit Advanced Vector Extensions, AVX
  AVX2        SIMD 256bit Advanced Vector Extensions 2
  BMI1        Bit Manipulation Instruction Set 1
  BMI2        Bit Manipulation Instruction Set 2
  .....