Shouldn't `@views sum(data[mask])` be non-allocating?

I was surprised to see the following function allocates - is this expected?

using BenchmarkTools

function f_viewalloc(data, mask)
    @views sum(data[mask])
end

let N = 1_000_000
    data = fill(true, N)
    mask = fill(true, N)
    @btime f_viewalloc($data, $mask)
end

1.476 ms (2 allocations: 7.63 MiB)

I think the view is making a copy of the indexing array, but I am not sure.

Changing mask to a different form of indexing removes the allocations:

julia> let N = 1_000_000
           data = fill(true, N)
           mask = 1:N
           @btime f_viewalloc($data, $mask)
       end
  73.056 μs (0 allocations: 0 bytes)
1000000

You can however avoid constructing the view like this:

julia> function f_viewalloc2(data, mask)
           @views sum(((d,m),) -> ifelse(m, d, zero(eltype(data))), zip(data, mask))
       end

julia> let N = 1_000_000
           data = fill(true, N)
           mask = fill(true, N)
           @btime f_viewalloc2($data, $mask)
       end
  532.342 μs (0 allocations: 0 bytes)
1000000

Edit:
A better test of this hypothesis is benchmarking the creation of the view:

julia> let N = 1_000_000
           data = fill(true, N)
           mask = fill(true, N)
           @btime view($data,$mask)
       end
  797.884 μs (2 allocations: 7.63 MiB)

which indeed shows that this causes the allocations.

julia> @view data[mask]
1000000-element view(::Vector{Bool}, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  999991, 999992, 999993, 999994, 999995, 999996, 999997, 999998, 999999, 1000000]) with eltype Bool:

Notice that this converted the Bool index array to an array of indices. So @views is allocating on logical indexing. You could use an iterator or some other mechanism, however:

function f_viewalloc2(data, mask)
    sum(last, Iterators.filter(first, zip(mask,data)))
end
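As a quick check on that conversion, you can inspect the indices the view actually stores. A small sketch using parentindices from Base; if the Bool mask really is converted at construction time, the stored index vector should match findall(mask):

```julia
data = fill(true, 10)
mask = fill(true, 10)

v = @view data[mask]

# The Bool mask was converted to an integer index vector when the
# view was constructed -- that conversion is the allocation:
parentindices(v)[1] == findall(mask)
```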

Yes, that’s correct; you generally want arrays to have O(1) lookup and size performance, but with a logical array index, you’d have to count the number of trues to figure out which index goes where. So views do a findall on logical vectors when you create them.

If you want to do a masked sum without allocations, you can use a Iterators.filter or (or even a generator with an if) instead.
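For example, a minimal generator version (the name masked_sum is just for illustration) could look like:

```julia
# Masked sum with no intermediate array: the `if` inside the generator
# skips the false entries while iterating.
# Note: like sum over an empty collection, this errors if mask is all false.
masked_sum(data, mask) = sum(d for (d, m) in zip(data, mask) if m)

masked_sum([1, 2, 3, 4], [true, false, true, false])  # 1 + 3 = 4
```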


So following from the answers of @mbauman and @mikmoore, if you have an array of indices instead, there are no allocations:

julia> let N = 1_000_000
           data = fill(true, N)
           mask = rand(1:N,N) # array of random indices
           @btime view($data,$mask)
       end
  390.367 μs (0 allocations: 0 bytes)

Thank you all for looking into this! Does it look like a performance trap only to me? Neither “create a view” nor “iterate over a view” sounds like an allocating operation.


I’ll remark that logical indexing was something I used constantly in languages like MATLAB, but is seldom something I use in Julia. Usually, I can create the masked data in the first place rather than building the full array and then masking afterward. It often ends up a little cleaner this way, anyway.

The cost of allocating and converting the logical array to an index array is probably less than even a few random accesses into the view would be, since one cannot get O(1) random access into logical indices. So there’s a performance trap either way, but at least the convert-to-index version is one expensive build operation followed by cheap accesses (rather than a free build followed by many expensive accesses).


Interestingly, if we didn’t allocate, it’d be an even bigger performance trap.

Views are AbstractArrays, and sum has a specialized implementation for arrays that implicitly assumes that indexing and size calls are cheap. It effectively computes r += A[i] for all indices i in the array A. If we didn’t allocate here, computing A[1] would have to start at the beginning of the mask and scan for the first true value. A[2] starts at the beginning and scans for the second true. A[3] scans for the third. And so on. This is a classic O(n^2) “accidentally quadratic” performance trap, way worse than doing the allocation.
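To make that concrete, here is a sketch (a hypothetical helper, not actual Base code) of what a single non-allocating A[i] against a logical mask would have to do:

```julia
# Find the position of the i-th `true` in `mask` by scanning from the
# start: O(n) per lookup, so summing via A[1], A[2], ... is O(n^2).
function nth_true(mask, i)
    count = 0
    for (j, m) in pairs(mask)
        count += m          # Bool promotes to Int
        count == i && return j
    end
    error("mask has fewer than $i trues")
end

nth_true([false, true, false, true], 2)  # index of the second true: 4
```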

sum does have support for generic iterables that could, effectively, keep track of where you left off in that lookup, but if you’re using an indexing API you can’t hold onto that context or assume that all the accesses will be sequential. So you need to use a structure that’s not an array in order to hit that iterable method. And those are Iterators.filter or (a for a in A if _____).

Julia doesn’t do context-sensitive magic: your @view data[mask] doesn’t know that it’s getting passed to a sum, sum doesn’t know that your view was constructed with a logical mask, and the indexing calls don’t know that all accesses will be sequential. Arguments to functions are evaluated in exactly the same way they would be if they weren’t arguments, and functions are evaluated in exactly the same way regardless of the syntax used for their arguments.


Slightly faster and shorter:

julia> function f_viewalloc2(data, mask)
           @views sum(((d,m),) -> ifelse(m, d, zero(eltype(data))), zip(data, mask))
       end

julia> function f_viewalloc3(data, mask)
           mapreduce(*, +, data, mask)
       end

julia> let N = 1_000_000
           data = fill(true, N)
           mask = fill(true, N)
           @btime f_viewalloc3($data, $mask)
       end
  146.478 μs (2 allocations: 976.67 KiB)
1000000

julia> let N = 1_000_000
           data = fill(true, N)
           mask = fill(true, N)
           @btime f_viewalloc2($data, $mask)
       end
  292.176 μs (0 allocations: 0 bytes)
1000000

Your @views in f_viewalloc2 isn’t doing anything (except maybe confusing the situation). Both those options only work if you have numeric data types, but they’re indeed both great tricks. You can do even better:

# This is the generically correct thing: it works for all summable arrays
julia> f_generator_generic(data, mask) = sum(v for (i,v) in pairs(data) if mask[i])

julia> f_generator_fast(data, mask) = sum(v*mask[i] for (i,v) in pairs(data))

julia> @btime f_generator_generic($data, $mask)
  1.239 ms (0 allocations: 0 bytes)
1000000

julia> @btime f_generator_fast($data, $mask)
  79.125 μs (0 allocations: 0 bytes)
1000000

For reference, your methods were ~500µs and ~100µs, respectively.


Not sure why this seems to be faster than the equivalent mapreduce() above, by @roflmaostc:

f_dot_product(data, mask) = sum(data .* mask)

You’ll notice both the mapreduce(*, +, data, mask) and the sum(data .* mask) are allocating — in the latter case that’s because it’s what you literally asked for (data .* mask is creating a new array to use as the argument), and in the former it’s an open issue (julia#53417). I believe the performance discrepancy is just due to some peculiarities in how Julia happens to be doing that unnecessary intermediate allocation.
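A small sketch of the difference, assuming numeric inputs: the generator version fuses the multiply into the reduction, so no temporary array is built.

```julia
data = [1.0, 2.0, 3.0]
mask = [true, false, true]

s1 = sum(data .* mask)                         # materializes data .* mask first
s2 = sum(d * m for (d, m) in zip(data, mask))  # no intermediate array

s1 == s2  # both are 4.0
```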


What about simply data' * mask, or dot(data, mask)?


Both have 0 allocs, but seem to be 7x slower?

No, they aren’t. At least on my machine they both are as fast as the fastest sum(data[i] * mask[i] for i in eachindex(data)).

@Seif_Shebl, could you please try the following code, where the mask is a random sequence of true and false?

Benchmarks
using Random
Random.seed!(0)
N = 1_000_000
data = fill(true, N)
mask = rand((true, false), N)

# sum = 499261

f_viewalloc(data, mask) = @views sum(data[mask])
@btime f_viewalloc($data, $mask)            # 4.859 ms (2 allocs: 3.8 MiB)

f_viewalloc1(data, mask) = sum(last, Iterators.filter(first, zip(mask,data)))
@btime f_viewalloc1($data, $mask)           # 3.659 ms (0 allocs: 0 bytes)

f_viewalloc3(data, mask) = mapreduce(*, +, data, mask)
@btime f_viewalloc3($data, $mask)           # 153 μs (2 allocs: 977 KiB)

f_generator_fast(data, mask) = sum(v*mask[i] for (i,v) in pairs(data))
@btime f_generator_fast($data, $mask)       # 129 μs (0 allocs: 0 bytes)

f_dot_product(data, mask) = sum(data .* mask)
@btime f_dot_product($data, $mask)          # 106 μs (4 allocs: 126 KiB)

f_dot_product2(data, mask) = data' * mask
@btime f_dot_product2($data, $mask)         # 771 μs (0 allocs: 0 bytes)

f_dot_product3(data, mask) = dot(data, mask)
@btime f_dot_product3($data, $mask)         # 771 μs (0 allocs: 0 bytes)

And for reference, here is mine:

versioninfo()
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 Ă— 13th Gen Intel(R) Core(TM) i9-13900H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 14 default, 0 interactive, 7 GC (on 20 virtual cores)
Environment:
  JULIA_PKG_PRESERVE_TIERED_INSTALLED = true
  JULIA_PKG_USE_CLI_GIT = true
  JULIA_STACKFRAME_FUNCTION_COLOR = blue
  JULIA_WARN_COLOR = cyan
  JULIA_EDITOR = code.cmd -g
  JULIA_NUM_THREADS = 14
julia> @btime f_viewalloc1($data, $mask);
  3.055 ms (0 allocations: 0 bytes)

julia> @btime f_viewalloc3($data, $mask);
  148.400 μs (3 allocations: 976.62 KiB)

julia> @btime f_generator_fast($data, $mask);
  116.700 μs (0 allocations: 0 bytes)

julia> @btime f_dot_product($data, $mask);
  152.800 μs (4 allocations: 122.22 KiB)

julia> @btime f_dot_product2($data, $mask);
  117.500 μs (0 allocations: 0 bytes)

julia> @btime f_dot_product3($data, $mask);
  117.700 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.11.0-rc2
Commit 34c3a63147 (2024-07-29 06:24 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 Ă— Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

Here, data' * mask is as fast as the fastest alternative.

But on version 1.10.4, data' * mask takes ~780 μs.

I get roughly the same timings as @rafael.guerra:

julia> @btime f_viewalloc($data, $mask);  
  4.092 ms (2 allocations: 3.81 MiB)

julia> @btime f_viewalloc1($data, $mask);
  3.183 ms (0 allocations: 0 bytes)

julia> @btime f_viewalloc3($data, $mask);
  137.800 μs (2 allocations: 976.67 KiB)

julia> @btime f_generator_fast($data, $mask);
  111.200 μs (0 allocations: 0 bytes)

julia> @btime f_dot_product($data, $mask);
  122.400 μs (4 allocations: 126.39 KiB)

julia> @btime f_dot_product2($data, $mask);
  777.800 μs (0 allocations: 0 bytes)

julia> @btime f_dot_product3($data, $mask);   
  777.900 μs (1 allocation: 16 bytes)

I do get a single allocation at the end, though.

Maybe something changed in 1.11 for the better (as I now see @DNF also already suggested)?

versioninfo
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 Ă— Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto

The result changes when data is not Bool. I was not sure whether it was intentional that data is a Bool.

Code
using Random
using BenchmarkTools
using LinearAlgebra
Random.seed!(0)
N = 1_000_000
data = rand(Bool, N)  
mask = rand((true, false), N)

f_viewalloc(data, mask) = @views sum(data[mask])
@show @btime f_viewalloc($data, $mask)            # 4.859 ms (2 allocs: 3.8 MiB)

f_viewalloc1(data, mask) = sum(last, Iterators.filter(first, zip(mask,data)))
@show @btime f_viewalloc1($data, $mask)           # 3.659 ms (0 allocs: 0 bytes)

f_viewalloc3(data, mask) = mapreduce(*, +, data, mask)
@show @btime f_viewalloc3($data, $mask)           # 153 μs (2 allocs: 977 KiB)

f_generator_fast(data, mask) = sum(v*mask[i] for (i,v) in pairs(data))
@show @btime f_generator_fast($data, $mask)       # 129 μs (0 allocs: 0 bytes)

f_dot_product(data, mask) = sum(data .* mask)
@show @btime f_dot_product($data, $mask)          # 106 μs (4 allocs: 126 KiB)

f_dot_product2(data, mask) = data' * mask
@show @btime f_dot_product2($data, $mask)         # 771 μs (0 allocs: 0 bytes)

f_dot_product3(data, mask) = dot(data, mask)
@show @btime f_dot_product3($data, $mask)         # 771 μs (0 allocs: 0 bytes)

julia> include("/tmp/benchmark.jl") # data is Float64
  4.126 ms (2 allocations: 3.81 MiB)
#= /tmp/benchmark.jl:10 =# @btime(f_viewalloc($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  3.115 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:13 =# @btime(f_viewalloc1($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  1.390 ms (6 allocations: 7.63 MiB)
#= /tmp/benchmark.jl:16 =# @btime(f_viewalloc3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151224
  3.293 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:19 =# @btime(f_generator_fast($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  1.398 ms (2 allocations: 7.63 MiB)
#= /tmp/benchmark.jl:22 =# @btime(f_dot_product($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151224
  3.170 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:25 =# @btime(f_dot_product2($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  3.094 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:28 =# @btime(f_dot_product3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
249765.49111151227

julia> include("/tmp/benchmark.jl") # data is bool
  3.876 ms (2 allocations: 3.81 MiB)
#= /tmp/benchmark.jl:10 =# @btime(f_viewalloc($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  2.910 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:13 =# @btime(f_viewalloc1($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  137.107 μs (2 allocations: 976.67 KiB)
#= /tmp/benchmark.jl:16 =# @btime(f_viewalloc3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  104.426 μs (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:19 =# @btime(f_generator_fast($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  84.967 μs (4 allocations: 126.39 KiB)
#= /tmp/benchmark.jl:22 =# @btime(f_dot_product($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  652.136 μs (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:25 =# @btime(f_dot_product2($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  636.865 μs (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:28 =# @btime(f_dot_product3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
249502



Interesting, these are the timings on my machine, which is generations older than yours:

# YOUR NUMBERS:                 | MY NUMBERS:
#=====================================================================
# 4.859 ms (2 allocs: 3.8 MiB)  #   3.856 ms (3 allocations: 3.81 MiB)
# 3.659 ms (0 allocs: 0 bytes)  #   3.014 ms (0 allocations: 0 bytes)
# 153   μs (2 allocs: 977 KiB)  # 166.300 μs (3 allocations: 976.62 KiB)
# 129   μs (0 allocs: 0 bytes)  # 143.400 μs (0 allocations: 0 bytes)
# 106   μs (4 allocs: 126 KiB)  # 171.700 μs (4 allocations: 122.22 KiB)
# 771   μs (0 allocs: 0 bytes)  # 143.400 μs (0 allocations: 0 bytes)
# 771   μs (0 allocs: 0 bytes)  # 143.300 μs (0 allocations: 0 bytes)

But that could be a Julia-version thing rather than anything else:

julia> versioninfo()
Julia Version 1.12.0-DEV.1024
Commit e7e8768a77 (2024-08-08 23:44 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 Ă— Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, haswell)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)