Zeroing elements of matrix A depending on matrix B. Fast?

performance

#1

Hi,

two matrices A, B, B has a bunch of zero elements, and I need to set

A[i,j] = 0 wherever B[i,j] == 0

I hoped to use that B .!= 0 is a boolean matrix, and try something like

(B .!= 0) .&& A

but that’s bad syntax (I don’t exactly know why.)

Any ideas, or hints?

Thank you, z.


#2

Do you want to modify A? And are the two matrices the same size? Then that’s a case for logical indexing:

A[B .!= 0] = 0

#3

Or a slightly uglier but faster version:
A .= ifelse.(B .!= 0, 0.0, A)


#4

Actually, both of these give the wrong answer. It should be the other way around, set to zero when B is zero. Anyway, this is both nicer and faster than either:

A .= .!iszero(B) .* A

Edit: I guess the logical indexing is nice enough, it’s just very slow.


#5

@mbauman I knew I had seen something like this. Thank you!

@sdanisch LOL. This even looks fast. I’ll do some speed measurements tomorrow :slight_smile: Thank you!


#6

@DNF Right :slight_smile: Thank you.


#7

I think you forgot a dot:
A .= .!iszero.(B) .* A
iszero(B) just returns false, so yours would just be true .* A


#8

Indeed I did. How strange, I could have sworn that I checked the output visually.

The @. macro really comes in handy to avoid situations like this:

@. A = !iszero(B) * A


#9

@DNF Grin. You don’t have to do all my work, so I’m fine :slight_smile:

@sdanisch Nevertheless thank you for reducing my irritation level.


#10

If performance is critical, the fastest thing will be to write a loop (in a function). Assuming the matrices are the same size:

for i in eachindex(B)
    if !iszero(B[i])
        A[i] = 0
    end
end

Vectorized notations like A[B .!= 0] = 0 are convenient and are reasonably fast, but aren’t generally the fastest way to do things. Making things faster in Julia is generally not a matter of digging through the standard library like you would do in other interactive languages; you just write a loop.


#11

@stevengj Thank you. I read about devectorisation, but seemingly didn’t know enough basics to get it right.


#12

If you add an @inbounds in front of the for loop that may improve performance further.


#13

Well, the performance difference is quite minimal though:

function test1(A, B)
    @inbounds for i in eachindex(B)
        if !iszero(B[i])
            A[i] = 0
        end
    end
end

function test2(A, B)
    A .= ifelse.(iszero.(B), 0.0, A)
end

julia> @benchmark test2($A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     520.308 μs (0.00% GC)
  median time:      564.948 μs (0.00% GC)
  mean time:        578.246 μs (0.00% GC)
  maximum time:     1.115 ms (0.00% GC)
  --------------
  samples:          8609
  evals/sample:     1

julia> @benchmark test1($A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     510.479 μs (0.00% GC)
  median time:      605.630 μs (0.00% GC)
  mean time:        659.029 μs (0.00% GC)
  maximum time:     1.571 ms (0.00% GC)
  --------------
  samples:          7548
  evals/sample:     1

Making things faster in Julia is generally not a matter of digging through the standard library like you would do in other interactive languages

I generally agree with that, but the fact, that it takes even a seasoned Julia programmer a couple of minutes to figure out the right combination of @inbounds and @simd or maybe even threading to get the fastest version also illustrates, why it’s nice to just find a reliable library function :wink:


#14

Interestingly the difference varies quite a bit with the size of the matrices (assuming my code is correct):

is zero ‘smarter’ (or faster) than eachindex (is that doing column major?)?


using Compat
using BenchmarkTools,DataFrames

function test1(A, B)
    @inbounds for i in eachindex(B)
        if !iszero(B[i])
            A[i] = 0.0
        end
    end
end

function test2(A, B)
    A .= ifelse.(iszero.(B), 0.0, A)
end

function testme(upto)
    res=DataFrame()
    n=1
    while n<upto #for n in [10 100 1000][:]
        a = rand(n,n);
        b = round.(rand(n,n));

        x1 = @benchmark test1($a,$b)
        x2 = @benchmark test2($a,$b)
        
        if size(res,1)==0
            res[:matrix_sizes]=[size(a)]
            res[:time_test1]=[median(x1).time]
            res[:time_test2]=[median(x2).time]
            res[:ratio]=[ratio(median(x1),median(x2)).time]
        else
            push!(res[:matrix_sizes],size(a))
            push!(res[:time_test1],median(x1).time)
            push!(res[:time_test2],median(x2).time)
            push!(res[:ratio],ratio(median(x1),median(x2)).time)    
        end
    n*=10
    end 
    return res
end

result=testme(1e5);

 Row │ matrix_sizes   │ time_test1 │ time_test2 │ ratio    │
├─────┼────────────────┼────────────┼────────────┼──────────┤
│ 1   │ (1, 1)         │ 3.647      │ 26.257     │ 0.138896 │
│ 2   │ (10, 10)       │ 125.722    │ 88.4512    │ 1.42137  │
│ 3   │ (100, 100)     │ 44855.0    │ 3373.25    │ 13.2973  │
│ 4   │ (1000, 1000)   │ 5.0808e6   │ 1.36752e6  │ 3.71534  │
│ 5   │ (10000, 10000) │ 5.22581e8  │ 1.66149e8  │ 3.14526  │
5×4 DataFrames.DataFrame


#15

It’s the ifelse, which likely enables simd optimization (~3-4x faster is what you would expect from that) :wink:
Not sure about the jump for (100, 100) - probably due to a better aligned array for that size.

julia> function test1(A, B)
           @inbounds for i in eachindex(B)
               A[i] = ifelse(iszero(B[i]), 0.0, A[i])
           end
       end
test1 (generic function with 1 method)

julia> result=testme(1e5)
5×4 DataFrames.DataFrame
│ Row │ matrix_sizes   │ time_test1 │ time_test2 │ ratio    │
├─────┼────────────────┼────────────┼────────────┼──────────┤
│ 1   │ (1, 1)         │ 3.605      │ 18.6251    │ 0.193556 │
│ 2   │ (10, 10)       │ 18.4724    │ 103.445    │ 0.178572 │
│ 3   │ (100, 100)     │ 1799.6     │ 2253.11    │ 0.798718 │
│ 4   │ (1000, 1000)   │ 833293.0   │ 7.85601e5  │ 1.06071  │
│ 5   │ (10000, 10000) │ 1.04362e8  │ 1.04433e8  │ 0.999316 │

#16

@sdanisch Is this an example of avoiding a branch? I mean like in https://www.juliabloggers.com/cpu-pipelines-when-more-is-less/ ?


#17

I tried some versions, and even if I did not use benchmark but just @time maybe the results are interesting to some of you.
My main surprise was how performant the vectorised version is. Faster than a loop without @inbounds.

Here my result, code is appended. Ehm. No can’t append Julia code. So code is below.
Kind regards, z.

> include("perf1.jl")
:perf1
 
> test()
zeroing with logical indexing   :  0.581205 seconds (9 allocations: 11.925 MiB)
zeroing vectorized              :  0.145941 seconds
zeroing loop branch             :  0.438066 seconds
zeroing loop no branch          :  0.195619 seconds
zeroing loop no branch inbounds :  0.145681 seconds
 
> test()
zeroing with logical indexing   :  0.583505 seconds (9 allocations: 11.925 MiB)
zeroing vectorized              :  0.150736 seconds
zeroing loop branch             :  0.439501 seconds
zeroing loop no branch          :  0.177093 seconds
zeroing loop no branch inbounds :  0.145801 seconds
 
> test()
zeroing with logical indexing   :  0.566651 seconds (9 allocations: 11.925 MiB)
zeroing vectorized              :  0.147940 seconds
zeroing loop branch             :  0.440264 seconds
zeroing loop no branch          :  0.204290 seconds
zeroing loop no branch inbounds :  0.146339 seconds
 

And the code:

# Logical Indexing

function zeroingLogicalIndexing(A, B)
    A[B .!= 0] = 0
    A
end

function testLogicalIndexing()
    print("zeroing with logical indexing   :")
    n1 = 100
    n2 = 100
    A  = rand(n1, n2)
    B  = rand(0:1, n1, n2) * 1.0
    # preHeating
    zeroingLogicaIndexing(A, B)
    n1 = 10000
    n2 = 10000
    A  = rand(n1, n2)
    B  = rand(0:1, n1, n2) * 1.0
    @time zeroingLogicaIndexing(A, B)
end


# Vectorized

function zeroingVectorized(A, B)
    # A .= .!iszero.(B) .* A
    @. A = !iszero(B) * A
end

function testVectorized()
    print("zeroing vectorized              :")
    n1 = 100
    n2 = 100
    A  = rand(n1, n2)
    B  = rand(0:1, n1, n2) * 1.0
    # preHeating
    zeroingVectorized(A, B)
    n1 = 10000
    n2 = 10000
    A  = rand(n1, n2)
    B  = rand(0:1, n1, n2) * 1.0
    @time zeroingVectorized(A, B)
end


# Loop

function zeroingLoopBranch(A, B)
    for i in eachindex(B)
        if iszero(B[i])
            A[i] = 0
        end
    end
end

function zeroingLoopNoBranch(A, B)
    for i in eachindex(B)
        A[i] = !iszero(B[i]) * A[i]
    end
end

function zeroingLoopNoBranchInbounds(A, B)
    for i in eachindex(B)
        @inbounds A[i] = !iszero(B[i]) * A[i]
    end
end

function testLoopBranch()
    print("zeroing loop branch             :")
    n1 = 100
    n2 = 100
    A  = rand(n1, n2)
    B  = rand(0:1, n1, n2) * 1.0
    # preHeating
    zeroingLoopBranch(A, B)
    n1 = 10000
    n2 = 10000
    A  = rand(n1, n2)
    B  = rand(0:1, n1, n2) * 1.0
    @time zeroingLoopBranch(A, B)
end

# https://www.juliabloggers.com/cpu-pipelines-when-more-is-less/
function testLoopNoBranch()
    print("zeroing loop no branch          :")
    n1 = 100
    n2 = 100
    A  = rand(n1, n2)
    B  = rand(0:1, n1, n2) * 1.0
    # preHeating
    zeroingLoopNoBranch(A, B)
    n1 = 10000
    n2 = 10000
    A  = rand(n1, n2)
    B  = rand(0:1, n1, n2) * 1.0
    @time zeroingLoopNoBranch(A, B)
end

function testLoopNoBranchInbounds()
    print("zeroing loop no branch inbounds :")
    n1 = 100
    n2 = 100
    A  = rand(n1, n2)
    B  = rand(0:1, n1, n2) * 1.0
    # preHeating
    zeroingLoopNoBranchInbounds(A, B)
    n1 = 10000
    n2 = 10000
    A  = rand(n1, n2)
    B  = rand(0:1, n1, n2) * 1.0
    @time zeroingLoopNoBranchInbounds(A, B)
end


function test()
    testLogicalIndexing()
    testVectorized()
    testLoopBranch()
    testLoopNoBranch()
    testLoopNoBranchInbounds()
end

:perf1


#18

Yes, I totally agree that an explicit loop would be the fastest and most readable, but only if you use ifelse instead of just if. So, basically, the following would be as fast as the fastest vectorized implementation:

function test3(A, B)
    for i in eachindex(B)
      @inbounds A[i] = ifelse(B[i] == 0, 0.0, A[i])
    end
end 

In fact, it is more of a ifelse vs if case, I don’t know why this happens, though.


#19

I usually do such things with something like:

function test4(A, B)
  @inbounds for i in eachindex(B)
    B[i] == 0 && (A[i] = 0)
  end
end

Is there any disadvantage on the use of && ?


#20

That’s effectively the same as using an explicit if statement. Both && and if do whats called “branching” — that is, the path the computer takes through your code changes depending upon the value of the conditional. This only really matters due to a quirk in how modern processors work:

https://stackoverflow.com/a/11227902/176071

That’s where the ifelse construct differs; it always evaluates both the true and false clauses, and simply changes which value gets returned. The code path is the same either way, so the processor always knows what its next instruction to execute will be.

Note that these benchmarks are all (I think) using random data as the test case, which of course the processor won’t predict accurately. If zeros occur in a predictable pattern, then I’d expect branching to perform just as well (or better!) than ifelse:

julia> function test1(A, B)
           @inbounds for i in eachindex(B)
               A[i] = ifelse(iszero(B[i]), 0.0, A[i])
           end
       end
test1 (generic function with 1 method)

julia> function test4(A, B)
           @inbounds for i in eachindex(B)
             B[i] == 0 && (A[i] = 0)
           end
       end
test4 (generic function with 1 method)

julia> A = zeros(10000, 100);

julia> B = rand(0.0:1.0, size(A));

julia> C = copy(B); sort!(@view C[:]);

julia> @btime test1($A, $B)
  2.601 ms (0 allocations: 0 bytes)

julia> @btime test4($A, $B)
  5.665 ms (0 allocations: 0 bytes)

julia> @btime test1($A, $C)
  2.590 ms (0 allocations: 0 bytes)

julia> @btime test4($A, $C)
  1.998 ms (0 allocations: 0 bytes)