# Shouldn't `@views sum(data[mask])` be non-allocating?

I was surprised to see the following function allocates - is this expected?

``````
using BenchmarkTools

function f_viewalloc(data, mask)
    @views sum(data[mask])
end

let N = 1_000_000
    data = fill(true, N)
    mask = trues(N)  # all-true logical mask
    @btime f_viewalloc($data, $mask)
end
``````

`1.476 ms (2 allocations: 7.63 MiB)`

I think the view is making a copy of the indexing array, but I am not sure.

Changing `mask` to a different form of indexing removes the allocations:

``````
julia> let N = 1_000_000
           data = fill(true, N)
           mask = 1:N  # integer range instead of a logical mask (assumed)
           @btime f_viewalloc($data, $mask)
       end
  73.056 µs (0 allocations: 0 bytes)
1000000
``````

You can however avoid constructing the view like this:

``````
julia> function f_viewalloc2(data, mask)
           @views sum(((d, m),) -> ifelse(m, d, zero(eltype(data))), zip(data, mask))
       end
f_viewalloc2 (generic function with 1 method)

julia> let N = 1_000_000
           data = fill(true, N)
           mask = trues(N)
           @btime f_viewalloc2($data, $mask)
       end
  532.342 µs (0 allocations: 0 bytes)
1000000
``````

Edit:
A better test of this hypothesis is benchmarking the creation of the view:

``````
julia> let N = 1_000_000
           data = fill(true, N)
           mask = trues(N)
           @btime @view($data[$mask])
           nothing
       end
  797.884 µs (2 allocations: 7.63 MiB)
``````

which indeed shows that this causes the allocations.
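You can also see the conversion directly by inspecting what the view stores. This is a small sketch with made-up toy data; `parentindices` is the Base accessor for a `SubArray`'s stored indices:

```julia
data = [10, 20, 30, 40]
mask = [true, false, true, true]

# Constructing the view runs the equivalent of findall(mask) and stores
# integer indices -- that conversion is where the allocation happens.
v = @view data[mask]

collect(v)                    # the three selected elements
stored = parentindices(v)[1]  # integer indices, not the original Bool mask
```
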

2 Likes
``````
julia> @view data[mask]
1000000-element view(::Vector{Bool}, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  999991, 999992, 999993, 999994, 999995, 999996, 999997, 999998, 999999, 1000000]) with eltype Bool:
``````

Notice that this converted the `Bool` index array to an array of indices. So `@views` is allocating on logical indexing. You could use an iterator or some other mechanism, however:

``````
function f_viewalloc2(data, mask)
    # one possibility: lazily skip masked-out entries -- no view, no index array
    sum(first, Iterators.filter(last, zip(data, mask)))
end
``````
3 Likes

Yes, that's correct; you generally want arrays to have O(1) lookup and size performance, but with a logical array index, you'd have to count the number of trues to figure out which index goes where. So views do a `findall` on logical vectors when you create them.

If you want to do a masked sum without allocations, you can use `Iterators.filter` (or even a generator with an `if`) instead.
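For instance (a sketch with made-up helper names; the `init` keyword is added so an all-false mask returns zero instead of throwing on an empty sum):

```julia
# Generator with an `if`: lazily yields only the selected values.
masked_sum(data, mask) =
    sum((d for (d, m) in zip(data, mask) if m); init = zero(eltype(data)))

# Same idea via Iterators.filter over (value, flag) pairs.
masked_sum_filter(data, mask) =
    sum(first, Iterators.filter(last, zip(data, mask)); init = zero(eltype(data)))

masked_sum([1.5, 2.5, 3.0], [true, false, true])    # 4.5
masked_sum([1.5, 2.5, 3.0], [false, false, false])  # 0.0
```

Both make a single pass and allocate nothing, at the cost of a data-dependent branch per element.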

7 Likes

So following from the answers of @mbauman and @mikmoore, if you have an array of indices instead, there are no allocations:

``````
julia> let N = 1_000_000
           data = fill(true, N)
           mask = rand(1:N, N)  # array of random indices
           @btime f_viewalloc($data, $mask)
       end
  390.367 µs (0 allocations: 0 bytes)
1000000
``````
1 Like

Thank you all for looking into this! Is it just me, or is this a performance trap? Neither "create a view" nor "iterate over a view" sounds like an allocating operation.

2 Likes

I'll remark that logical indexing was something I used constantly in languages like MATLAB, but it's seldom something I use in Julia. Usually, I can create the masked data in the first place rather than building the full array and then masking afterward. It often ends up a little cleaner this way, anyway.

The cost of allocating and converting the logical array to an index array is probably less than a few random-access indexes into the `view` would be, since one cannot get O(1) random access into logical indices. So there's a performance trap either way, but at least the convert-to-index version is one expensive build operation followed by cheap accesses (rather than one free build followed by many expensive accesses).
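A sketch of the create-it-masked style (toy data, made-up names): apply the predicate while constructing the result instead of building the full array and masking it afterward.

```julia
xs = [-2.0, 1.0, -0.5, 3.0]

# MATLAB-style: materialize a Bool mask, then copy out the selected elements.
kept_via_mask = xs[xs .> 0]

# Filter at construction time: one pass, no separate mask array.
kept_direct = filter(>(0), xs)  # or: [x for x in xs if x > 0]
```

Both produce `[1.0, 3.0]`; the second never holds the mask at all.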

2 Likes

Interestingly, if we didn't allocate, it'd be an even bigger performance trap.

Views are `AbstractArray`s, and `sum` has a specialized implementation for arrays that implicitly assumes that indexing and size calls are cheap. It effectively computes `r += A[i]` for all indices `i` in the array `A`. If we didn't allocate here, computing `A[1]` has to start at the beginning of the array and look for the first true value. `A[2]` starts at the beginning of the array and looks for the second true. `A[3]` starts at the beginning of the array and looks for the third true. And so on. This is a classic "accidentally quadratic" O(n^2) performance trap, far worse than doing the allocation.

`sum` does have support for generic iterables that could keep track of, effectively, where you left off in that lookup, but if you're using an indexing API you can't hold onto that context or assume that all the accesses will be sequential. So you need to use a structure that's not an array in order to hit that iterable method. And those are `Iterators.filter` or `(a for a in A if _____)`.

Julia doesn't do context-sensitive magic: your `@view data[mask]` doesn't know that it's getting passed to a `sum`, `sum` doesn't know that your view was constructed with a logical mask, and the indexing calls don't know that all accesses will be sequential. Arguments to functions are evaluated in exactly the same way they would be if they weren't arguments. And functions are evaluated in exactly the same way regardless of the syntax used for their arguments.
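To make the quadratic behavior concrete, here is a hypothetical sketch (not Base's actual implementation) of what indexed access would have to do if the view kept the Bool mask instead of converting it:

```julia
# Find the i-th selected element by scanning the mask from the start and
# counting trues -- O(n) work for a single access.
function nth_masked(data, mask, i)
    seen = 0
    for j in eachindex(mask)
        if mask[j]
            seen += 1
            seen == i && return data[j]
        end
    end
    throw(BoundsError(data, i))
end

nth_masked([10, 20, 30, 40], [false, true, false, true], 2)  # 40
```

An array-style `sum` would call this for `i = 1, 2, 3, ...`, rescanning from the start each time: O(n) per element, O(n^2) overall.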

7 Likes

Slightly faster and shorter:

``````
julia> function f_viewalloc2(data, mask)
           @views sum(((d, m),) -> ifelse(m, d, zero(eltype(data))), zip(data, mask))
       end
f_viewalloc2 (generic function with 1 method)

julia> f_viewalloc3(data, mask) = mapreduce(*, +, data, mask)
f_viewalloc3 (generic function with 1 method)

julia> let N = 1_000_000
           data = fill(true, N)
           mask = trues(N)
           @btime f_viewalloc3($data, $mask)
       end
  146.478 µs (2 allocations: 976.67 KiB)
1000000

julia> let N = 1_000_000
           data = fill(true, N)
           mask = trues(N)
           @btime f_viewalloc2($data, $mask)
       end
  292.176 µs (0 allocations: 0 bytes)
1000000
``````

Your `@views` in `f_viewalloc2` isn't doing anything (except maybe confusing the situation). Both those options only work if you have numeric data types, but they're indeed both great tricks. You can do even better:

``````
# This is the generically correct thing: it works for all summable arrays
julia> f_viewalloc1(data, mask) = sum(d for (d, m) in zip(data, mask) if m)
f_viewalloc1 (generic function with 1 method)

julia> @btime f_viewalloc1($data, $mask)
  1.239 ms (0 allocations: 0 bytes)
1000000

# If multiplying by the mask makes sense for the element type, this is branch-free
julia> f_generator_fast(data, mask) = sum(data[i] * mask[i] for i in eachindex(data, mask))
f_generator_fast (generic function with 1 method)

julia> @btime f_generator_fast($data, $mask)
  79.125 µs (0 allocations: 0 bytes)
1000000
``````

For reference, your methods were ~500 µs and ~100 µs, respectively.

1 Like

Not sure why this seems to be faster than the equivalent `mapreduce()` above, by @roflmaostc:

``````f_dot_product(data, mask) = sum(data .* mask)
``````

You'll notice both the `mapreduce(*, +, data, mask)` and the `sum(data .* mask)` are allocating. In the latter case that's because it's what you literally asked for (`data .* mask` creates a new array to use as the argument), and in the former it's an open issue (julia#53417). I believe the performance discrepancy is just due to some peculiarities in how Julia happens to be doing that unnecessary intermediate allocation.

1 Like

What about simply `data' * mask`, or `dot(data, mask)`?

1 Like

Both have 0 allocs, but seem to be 7x slower?

No, they aren't. At least on my machine they both are as fast as the fastest `sum(data[i] * mask[i] for i in eachindex(data))`.

@Seif_Shebl, could you please try the following code, where the mask is a random sequence of true and false?

Benchmarks
``````
using Random
Random.seed!(0)
N = 1_000_000
data = fill(true, N)
mask = rand(Bool, N)  # random sequence of true and false

# sum = 499261

@btime f_viewalloc($data, $mask)            # 4.859 ms (2 allocs: 3.8 MiB)

@btime f_viewalloc1($data, $mask)           # 3.659 ms (0 allocs: 0 bytes)

@btime f_viewalloc3($data, $mask)           # 153 µs (2 allocs: 977 KiB)

@btime f_generator_fast($data, $mask)       # 129 µs (0 allocs: 0 bytes)

@btime f_dot_product($data, $mask)          # 106 µs (4 allocs: 126 KiB)

@btime f_dot_product2($data, $mask)         # 771 µs (0 allocs: 0 bytes)

@btime f_dot_product3($data, $mask)         # 771 µs (0 allocs: 0 bytes)
``````

And for reference, here is my:

versioninfo()
``````julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 20 × 13th Gen Intel(R) Core(TM) i9-13900H
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 14 default, 0 interactive, 7 GC (on 20 virtual cores)
Environment:
JULIA_PKG_PRESERVE_TIERED_INSTALLED = true
JULIA_PKG_USE_CLI_GIT = true
JULIA_STACKFRAME_FUNCTION_COLOR = blue
JULIA_WARN_COLOR = cyan
JULIA_EDITOR = code.cmd -g
``````
``````
julia> @btime f_viewalloc1($data, $mask);
  3.055 ms (0 allocations: 0 bytes)

julia> @btime f_viewalloc3($data, $mask);
  148.400 µs (3 allocations: 976.62 KiB)

julia> @btime f_generator_fast($data, $mask);
  116.700 µs (0 allocations: 0 bytes)

julia> @btime f_dot_product($data, $mask);
  152.800 µs (4 allocations: 122.22 KiB)

julia> @btime f_dot_product2($data, $mask);
  117.500 µs (0 allocations: 0 bytes)

julia> @btime f_dot_product3($data, $mask);
  117.700 µs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.11.0-rc2
Commit 34c3a63147 (2024-07-29 06:24 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
``````

Here, `data' * mask` is as fast as the fastest alternative.

But on version 1.10.4, `data' * mask` takes 780 µs.

I get roughly the same timings as @rafael.guerra:

``````
julia> @btime f_viewalloc($data, $mask);
  4.092 ms (2 allocations: 3.81 MiB)

julia> @btime f_viewalloc1($data, $mask);
  3.183 ms (0 allocations: 0 bytes)

julia> @btime f_viewalloc3($data, $mask);
  137.800 µs (2 allocations: 976.67 KiB)

julia> @btime f_generator_fast($data, $mask);
  111.200 µs (0 allocations: 0 bytes)

julia> @btime f_dot_product($data, $mask);
  122.400 µs (4 allocations: 126.39 KiB)

julia> @btime f_dot_product2($data, $mask);
  777.800 µs (0 allocations: 0 bytes)

julia> @btime f_dot_product3($data, $mask);
  777.900 µs (1 allocation: 16 bytes)
``````

I do get a single allocation at the end, though.

Maybe something changed in 1.11 for the better (as I now see @DNF also already suggested)?

versioninfo
``````julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
``````

The result changes when `data` is not `Bool`. I was not sure whether it was intentional that `data` is `Bool`.

Code
``````
using Random
using BenchmarkTools
using LinearAlgebra
Random.seed!(0)
N = 1_000_000
data = rand(Bool, N)
mask = rand(Bool, N)

@show @btime f_viewalloc($data, $mask)            # 4.859 ms (2 allocs: 3.8 MiB)

@show @btime f_viewalloc1($data, $mask)           # 3.659 ms (0 allocs: 0 bytes)

@show @btime f_viewalloc3($data, $mask)           # 153 µs (2 allocs: 977 KiB)

@show @btime f_generator_fast($data, $mask)       # 129 µs (0 allocs: 0 bytes)

@show @btime f_dot_product($data, $mask)          # 106 µs (4 allocs: 126 KiB)

@show @btime f_dot_product2($data, $mask)         # 771 µs (0 allocs: 0 bytes)

@show @btime f_dot_product3($data, $mask)         # 771 µs (0 allocs: 0 bytes)
``````
``````
julia> include("/tmp/benchmark.jl") # data is Float64
  4.126 ms (2 allocations: 3.81 MiB)
#= /tmp/benchmark.jl:10 =# @btime(f_viewalloc($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  3.115 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:13 =# @btime(f_viewalloc1($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  1.390 ms (6 allocations: 7.63 MiB)
#= /tmp/benchmark.jl:16 =# @btime(f_viewalloc3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151224
  3.293 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:19 =# @btime(f_generator_fast($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  1.398 ms (2 allocations: 7.63 MiB)
#= /tmp/benchmark.jl:22 =# @btime(f_dot_product($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151224
  3.170 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:25 =# @btime(f_dot_product2($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  3.094 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:28 =# @btime(f_dot_product3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
249765.49111151227

julia> include("/tmp/benchmark.jl") # data is Bool
  3.876 ms (2 allocations: 3.81 MiB)
#= /tmp/benchmark.jl:10 =# @btime(f_viewalloc($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  2.910 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:13 =# @btime(f_viewalloc1($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  137.107 µs (2 allocations: 976.67 KiB)
#= /tmp/benchmark.jl:16 =# @btime(f_viewalloc3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  104.426 µs (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:19 =# @btime(f_generator_fast($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  84.967 µs (4 allocations: 126.39 KiB)
#= /tmp/benchmark.jl:22 =# @btime(f_dot_product($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  652.136 µs (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:25 =# @btime(f_dot_product2($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  636.865 µs (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:28 =# @btime(f_dot_product3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
249502
``````

Interesting! These are the timings on my machine, which is generations older than yours:

``````
# YOUR NUMBERS:                 | MY NUMBERS:
#=====================================================================
# 4.859 ms (2 allocs: 3.8 MiB)  #   3.856 ms (3 allocations: 3.81 MiB)
# 3.659 ms (0 allocs: 0 bytes)  #   3.014 ms (0 allocations: 0 bytes)
# 153   µs (2 allocs: 977 KiB)  # 166.300 µs (3 allocations: 976.62 KiB)
# 129   µs (0 allocs: 0 bytes)  # 143.400 µs (0 allocations: 0 bytes)
# 106   µs (4 allocs: 126 KiB)  # 171.700 µs (4 allocations: 122.22 KiB)
# 771   µs (0 allocs: 0 bytes)  # 143.400 µs (0 allocations: 0 bytes)
# 771   µs (0 allocs: 0 bytes)  # 143.300 µs (0 allocations: 0 bytes)
``````

But that could be more of a Julia-version thing than anything else:

``````julia> versioninfo()
Julia Version 1.12.0-DEV.1024
Commit e7e8768a77 (2024-08-08 23:44 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, haswell)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
``````
1 Like