Shouldn't `@views sum(data[mask])` be non-allocating?

I was surprised to see the following function allocates - is this expected?

using BenchmarkTools

function f_viewalloc(data, mask)
    @views sum(data[mask])
end

let N = 1_000_000
    data = fill(true, N)
    mask = fill(true, N)
    @btime f_viewalloc($data, $mask)
end

1.476 ms (2 allocations: 7.63 MiB)

I think the view is making a copy of the indexing array, but I am not sure.

Changing mask to a different form of indexing removes the allocations:

julia> let N = 1_000_000
           data = fill(true, N)
           mask = 1:N
           @btime f_viewalloc($data, $mask)
       end
  73.056 μs (0 allocations: 0 bytes)
1000000

You can however avoid constructing the view like this:

julia> function f_viewalloc2(data, mask)
           @views sum(((d,m),) -> ifelse(m, d, zero(eltype(data))), zip(data, mask))
       end

julia> let N = 1_000_000
           data = fill(true, N)
           mask = fill(true, N)
           @btime f_viewalloc2($data, $mask)
       end
  532.342 μs (0 allocations: 0 bytes)
1000000

Edit:
A better test of this hypothesis is benchmarking the creation of the view:

julia> let N = 1_000_000
           data = fill(true, N)
           mask = fill(true, N)
           @btime view($data,$mask)
       end
  797.884 μs (2 allocations: 7.63 MiB)

which indeed shows that this causes the allocations.

julia> @view data[mask]
1000000-element view(::Vector{Bool}, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  999991, 999992, 999993, 999994, 999995, 999996, 999997, 999998, 999999, 1000000]) with eltype Bool:

Notice that this converted the Bool index array to an array of indices. So @views is allocating on logical indexing. You could use an iterator or some other mechanism, however:

function f_viewalloc2(data, mask)
    sum(last, Iterators.filter(first, zip(mask,data)))
end
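As a quick check on that conversion, you can inspect the indices the view actually stores. A small sketch using parentindices from Base; if the Bool mask really is converted at construction time, the stored index vector should match findall(mask):

```julia
data = fill(true, 10)
mask = fill(true, 10)

v = @view data[mask]

# The Bool mask was converted to an integer index vector when the
# view was constructed -- that conversion is the allocation:
parentindices(v)[1] == findall(mask)
```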

Yes, that’s correct; you generally want arrays to have O(1) lookup and size performance, but with a logical array index, you’d have to count the number of trues to figure out which index goes where. So views do a findall on logical vectors when you create them.

If you want to do a masked sum without allocations, you can use a Iterators.filter or (or even a generator with an if) instead.
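For example, a minimal generator version (the name masked_sum is just for illustration) could look like:

```julia
# Masked sum with no intermediate array: the `if` inside the generator
# skips the false entries while iterating.
# Note: like sum over an empty collection, this errors if mask is all false.
masked_sum(data, mask) = sum(d for (d, m) in zip(data, mask) if m)

masked_sum([1, 2, 3, 4], [true, false, true, false])  # 1 + 3 = 4
```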


So following from the answers of @mbauman and @mikmoore, if you have an array of indices instead, there are no allocations:

julia> let N = 1_000_000
           data = fill(true, N)
           mask = rand(1:N,N) # array of random indices
           @btime view($data,$mask)
       end
  390.367 μs (0 allocations: 0 bytes)

Thank you all for looking into this! Does it look like a performance trap only to me? Neither “create a view” nor “iterate over a view” sounds like an allocating operation.


I’ll remark that logical indexing was something I used constantly in languages like MATLAB, but is seldom something I use in Julia. Usually, I can create the masked data in the first place rather than building the full array and then masking afterward. It often ends up a little cleaner this way, anyway.

The cost of allocating and converting the logical array to an index array is probably less than even a few random accesses into the view would be, since one cannot get O(1) random access into logical indices. So there’s a performance trap either way, but at least the convert-to-index version is one expensive build operation followed by cheap accesses (rather than a free build followed by many expensive accesses).


Interestingly, if we didn’t allocate, it’d be an even bigger performance trap.

Views are AbstractArrays, and sum has a specialized implementation for arrays that implicitly assumes that indexing and size calls are cheap. It effectively computes r += A[i] for all indices i in the array A. If we didn’t allocate here, computing A[1] would have to start at the beginning of the mask and scan for the first true value. A[2] starts at the beginning and scans for the second true. A[3] scans for the third. And so on. This is a classic O(n^2) “accidentally quadratic” performance trap, way worse than doing the allocation.
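To make that concrete, here is a sketch (a hypothetical helper, not actual Base code) of what a single non-allocating A[i] against a logical mask would have to do:

```julia
# Find the position of the i-th `true` in `mask` by scanning from the
# start: O(n) per lookup, so summing via A[1], A[2], ... is O(n^2).
function nth_true(mask, i)
    count = 0
    for (j, m) in pairs(mask)
        count += m          # Bool promotes to Int
        count == i && return j
    end
    error("mask has fewer than $i trues")
end

nth_true([false, true, false, true], 2)  # index of the second true: 4
```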

sum does have support for generic iterables that could, effectively, keep track of where you left off in that lookup, but if you’re using an indexing API you can’t hold onto that context or assume that all the accesses will be sequential. So you need to use a structure that’s not an array in order to hit that iterable method. And those are Iterators.filter or (a for a in A if _____).

Julia doesn’t do context-sensitive magic: your @view data[mask] doesn’t know that it’s getting passed to a sum, sum doesn’t know that your view was constructed with a logical mask, and the indexing calls don’t know that all accesses will be sequential. Arguments to functions are evaluated in exactly the same way they would be if they weren’t arguments, and functions are evaluated in exactly the same way regardless of the syntax used for their arguments.


Slightly faster and shorter:

julia> function f_viewalloc2(data, mask)
           @views sum(((d,m),) -> ifelse(m, d, zero(eltype(data))), zip(data, mask))
       end

julia> function f_viewalloc3(data, mask)
           mapreduce(*, +, data, mask)
       end

julia> let N = 1_000_000
           data = fill(true, N)
           mask = fill(true, N)
           @btime f_viewalloc3($data, $mask)
       end
  146.478 μs (2 allocations: 976.67 KiB)
1000000

julia> let N = 1_000_000
           data = fill(true, N)
           mask = fill(true, N)
           @btime f_viewalloc2($data, $mask)
       end
  292.176 μs (0 allocations: 0 bytes)
1000000

Your @views in f_viewalloc2 isn’t doing anything (except maybe confusing the situation). Both those options only work if you have numeric data types, but they’re indeed both great tricks. You can do even better:

# This is the generically correct thing: it works for all summable arrays
julia> f_generator_generic(data, mask) = sum(v for (i,v) in pairs(data) if mask[i])

julia> f_generator_fast(data, mask) = sum(v*mask[i] for (i,v) in pairs(data))

julia> @btime f_generator_generic($data, $mask)
  1.239 ms (0 allocations: 0 bytes)
1000000

julia> @btime f_generator_fast($data, $mask)
  79.125 μs (0 allocations: 0 bytes)
1000000

For reference, your methods were ~500µs and ~100µs, respectively.


Not sure why this seems to be faster than the equivalent mapreduce() above, by @roflmaostc:

f_dot_product(data, mask) = sum(data .* mask)

You’ll notice both the mapreduce(*, +, data, mask) and the sum(data .* mask) are allocating — in the latter case that’s because it’s what you literally asked for (data .* mask is creating a new array to use as the argument), and in the former it’s an open issue (julia#53417). I believe the performance discrepancy is just due to some peculiarities in how Julia happens to be doing that unnecessary intermediate allocation.
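A small sketch of the difference, assuming numeric inputs: the generator version fuses the multiply into the reduction, so no temporary array is built.

```julia
data = [1.0, 2.0, 3.0]
mask = [true, false, true]

s1 = sum(data .* mask)                         # materializes data .* mask first
s2 = sum(d * m for (d, m) in zip(data, mask))  # no intermediate array

s1 == s2  # both are 4.0
```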


What about simply data' * mask, or dot(data, mask)?


Both have 0 allocs, but seem to be 7x slower?

No, they aren’t. At least on my machine they both are as fast as the fastest sum(data[i] * mask[i] for i in eachindex(data)).

@Seif_Shebl, could you please try the following code, where the mask is a random sequence of true and false?

Benchmarks
using Random
Random.seed!(0)
N = 1_000_000
data = fill(true, N)
mask = rand((true, false), N)

# sum = 499261

f_viewalloc(data, mask) = @views sum(data[mask])
@btime f_viewalloc($data, $mask)            # 4.859 ms (2 allocs: 3.8 MiB)

f_viewalloc1(data, mask) = sum(last, Iterators.filter(first, zip(mask,data)))
@btime f_viewalloc1($data, $mask)           # 3.659 ms (0 allocs: 0 bytes)

f_viewalloc3(data, mask) = mapreduce(*, +, data, mask)
@btime f_viewalloc3($data, $mask)           # 153 μs (2 allocs: 977 KiB)

f_generator_fast(data, mask) = sum(v*mask[i] for (i,v) in pairs(data))
@btime f_generator_fast($data, $mask)       # 129 μs (0 allocs: 0 bytes)

f_dot_product(data, mask) = sum(data .* mask)
@btime f_dot_product($data, $mask)          # 106 μs (4 allocs: 126 KiB)

f_dot_product2(data, mask) = data' * mask
@btime f_dot_product2($data, $mask)         # 771 μs (0 allocs: 0 bytes)

f_dot_product3(data, mask) = dot(data, mask)
@btime f_dot_product3($data, $mask)         # 771 μs (0 allocs: 0 bytes)

And for reference, here is mine:

versioninfo()
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 Ă— 13th Gen Intel(R) Core(TM) i9-13900H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 14 default, 0 interactive, 7 GC (on 20 virtual cores)
Environment:
  JULIA_PKG_PRESERVE_TIERED_INSTALLED = true
  JULIA_PKG_USE_CLI_GIT = true
  JULIA_STACKFRAME_FUNCTION_COLOR = blue
  JULIA_WARN_COLOR = cyan
  JULIA_EDITOR = code.cmd -g
  JULIA_NUM_THREADS = 14
julia> @btime f_viewalloc1($data, $mask);
  3.055 ms (0 allocations: 0 bytes)

julia> @btime f_viewalloc3($data, $mask);
  148.400 μs (3 allocations: 976.62 KiB)

julia> @btime f_generator_fast($data, $mask);
  116.700 μs (0 allocations: 0 bytes)

julia> @btime f_dot_product($data, $mask);
  152.800 μs (4 allocations: 122.22 KiB)

julia> @btime f_dot_product2($data, $mask);
  117.500 μs (0 allocations: 0 bytes)

julia> @btime f_dot_product3($data, $mask);
  117.700 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.11.0-rc2
Commit 34c3a63147 (2024-07-29 06:24 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 Ă— Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

Here, data' * mask is as fast as the fastest alternative.

But on version 1.10.4, data' * mask takes ~780 μs.

I get roughly the same timings as @rafael.guerra:

julia> @btime f_viewalloc($data, $mask);  
  4.092 ms (2 allocations: 3.81 MiB)

julia> @btime f_viewalloc1($data, $mask);
  3.183 ms (0 allocations: 0 bytes)

julia> @btime f_viewalloc3($data, $mask);
  137.800 μs (2 allocations: 976.67 KiB)

julia> @btime f_generator_fast($data, $mask);
  111.200 μs (0 allocations: 0 bytes)

julia> @btime f_dot_product($data, $mask);
  122.400 μs (4 allocations: 126.39 KiB)

julia> @btime f_dot_product2($data, $mask);
  777.800 μs (0 allocations: 0 bytes)

julia> @btime f_dot_product3($data, $mask);   
  777.900 μs (1 allocation: 16 bytes)

I do get a single allocation at the end, though.

Maybe something changed in 1.11 for the better (as I now see @DNF also already suggested)?

versioninfo
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 Ă— Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto

The result changes when data is not Bool. I was not sure whether it was intentional that data is a Bool.

Code
using Random
using BenchmarkTools
using LinearAlgebra
Random.seed!(0)
N = 1_000_000
data = rand(Bool, N)  
mask = rand((true, false), N)

f_viewalloc(data, mask) = @views sum(data[mask])
@show @btime f_viewalloc($data, $mask)            # 4.859 ms (2 allocs: 3.8 MiB)

f_viewalloc1(data, mask) = sum(last, Iterators.filter(first, zip(mask,data)))
@show @btime f_viewalloc1($data, $mask)           # 3.659 ms (0 allocs: 0 bytes)

f_viewalloc3(data, mask) = mapreduce(*, +, data, mask)
@show @btime f_viewalloc3($data, $mask)           # 153 μs (2 allocs: 977 KiB)

f_generator_fast(data, mask) = sum(v*mask[i] for (i,v) in pairs(data))
@show @btime f_generator_fast($data, $mask)       # 129 μs (0 allocs: 0 bytes)

f_dot_product(data, mask) = sum(data .* mask)
@show @btime f_dot_product($data, $mask)          # 106 μs (4 allocs: 126 KiB)

f_dot_product2(data, mask) = data' * mask
@show @btime f_dot_product2($data, $mask)         # 771 μs (0 allocs: 0 bytes)

f_dot_product3(data, mask) = dot(data, mask)
@show @btime f_dot_product3($data, $mask)         # 771 μs (0 allocs: 0 bytes)

julia> include("/tmp/benchmark.jl") # data is Float64
  4.126 ms (2 allocations: 3.81 MiB)
#= /tmp/benchmark.jl:10 =# @btime(f_viewalloc($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  3.115 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:13 =# @btime(f_viewalloc1($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  1.390 ms (6 allocations: 7.63 MiB)
#= /tmp/benchmark.jl:16 =# @btime(f_viewalloc3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151224
  3.293 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:19 =# @btime(f_generator_fast($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  1.398 ms (2 allocations: 7.63 MiB)
#= /tmp/benchmark.jl:22 =# @btime(f_dot_product($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151224
  3.170 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:25 =# @btime(f_dot_product2($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
  3.094 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:28 =# @btime(f_dot_product3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249765.49111151227
249765.49111151227

julia> include("/tmp/benchmark.jl") # data is bool
  3.876 ms (2 allocations: 3.81 MiB)
#= /tmp/benchmark.jl:10 =# @btime(f_viewalloc($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  2.910 ms (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:13 =# @btime(f_viewalloc1($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  137.107 μs (2 allocations: 976.67 KiB)
#= /tmp/benchmark.jl:16 =# @btime(f_viewalloc3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  104.426 μs (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:19 =# @btime(f_generator_fast($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  84.967 μs (4 allocations: 126.39 KiB)
#= /tmp/benchmark.jl:22 =# @btime(f_dot_product($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  652.136 μs (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:25 =# @btime(f_dot_product2($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
  636.865 μs (0 allocations: 0 bytes)
#= /tmp/benchmark.jl:28 =# @btime(f_dot_product3($(Expr(:$, :data)), $(Expr(:$, :mask)))) = 249502
249502



Interesting, these are the timings on my machine, which is generations older than yours:

# YOUR NUMBERS:                 | MY NUMBERS:
#=====================================================================
# 4.859 ms (2 allocs: 3.8 MiB)  #   3.856 ms (3 allocations: 3.81 MiB)
# 3.659 ms (0 allocs: 0 bytes)  #   3.014 ms (0 allocations: 0 bytes)
# 153   μs (2 allocs: 977 KiB)  # 166.300 μs (3 allocations: 976.62 KiB)
# 129   μs (0 allocs: 0 bytes)  # 143.400 μs (0 allocations: 0 bytes)
# 106   μs (4 allocs: 126 KiB)  # 171.700 μs (4 allocations: 122.22 KiB)
# 771   μs (0 allocs: 0 bytes)  # 143.400 μs (0 allocations: 0 bytes)
# 771   μs (0 allocs: 0 bytes)  # 143.300 μs (0 allocations: 0 bytes)

But that could be a Julia-version thing rather than anything else:

julia> versioninfo()
Julia Version 1.12.0-DEV.1024
Commit e7e8768a77 (2024-08-08 23:44 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 Ă— Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, haswell)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)