Reduce allocations for shaped views


#1

tl;dr

Reduce the allocations in foo(u,...): Inside the function, given an array u of size m*1, take the first n=d*d*p elements of u, where n<m is guaranteed, and reshape them into an array f of d-by-d matrices (accessed as f[j], j in 1…p). f can be read-only. Is there a minimum number of allocations that cannot be avoided?


For example, I need to take, say, 40 elements of an array r of size 60, reshape them into 10 2-by-2 matrices f[...], and do calculations on these matrices later (e.g. the remaining 20 elements of r are used somewhere else). The matrices in real applications are much bigger, so StaticArrays is not a good fit. Sample code: [note that f[i] .* i here is a placeholder; in real code, it’s more like df[:,:,i] .= c[j,k].*f[k]*f[j]+d[j]*f[j], where c and d are coefficients, and there’s another sum over j and k inside the i loop]

df=zeros(Complex{Float64},2,2,10)
r=rand(1,60)+1im*rand(1,60)
f=[zeros(ComplexF64,2,2) for i in 1:10]

function d(df,f,r)
    @views f .= [reshape(r[(i-1)*4+1:4*i],2,2) for i in 1:10]
    for i in 1:10
        df[:,:,i] .= f[i] .* i
    end
    return 0
end

@btime shows 925.667 ns (53 allocations: 3.17 KiB); the df[:,:,i] loop itself does not allocate. This function gets called about 1e5 times, so reducing the allocations would help a lot. Thanks!


#2

There is no real need for f; you can loop through df and assign the values from r directly with zero allocations. Also note the ! convention in Julia for functions that modify their arguments.

function d!(df,r)
    for i in eachindex(df)
        df[i] = r[i] * (1 + (i-1)÷4)
    end 
    return 0
end

df = zeros(Complex{Float64},2,2,10)
r  = rand(Complex{Float64}, 60) 

julia> using BenchmarkTools

julia> @btime d!($df,$r)
  34.922 ns (0 allocations: 0 bytes)

#3

Just to add that if a function in Julia returns nothing, the convention is to write return nothing or simply omit the return statement completely.
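For illustration, here is the same allocation-free loop from the previous post, just following that convention (only the return style changes):

```julia
# Same as d! above, but returning nothing per the Julia convention
function d_nothing!(df, r)
    for i in eachindex(df)
        df[i] = r[i] * (1 + (i - 1) ÷ 4)
    end
    return nothing  # or omit entirely; the loop itself evaluates to nothing
end
```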


#4

Thanks! Sorry for not being exactly clear: the f[i] .* i example is not the actual calculation. The real calculation (too long to share here) is a complex matrix multiply/add, basically df[:,:,i] .= c[j,k].*f[k]*f[j]+d[j]*f[j], where c and d are predetermined coefficients, and there’s a sum over j and k.


#5

edit: ignore this post; it was an incorrect translation of the algorithm.

If you aren’t updating f on the LHS, I’m not sure why you’d use @view. Just use a temporary array:

using BenchmarkTools

df = zeros(Complex{Float64},2,2,10)
r  = zeros(ComplexF64, 60)
f0 = [zeros(ComplexF64,2,2) for i in 1:10]
f1 = zeros(ComplexF64,2,2)
C = D = rand(2,2)

function d0(df,f,r,c,d)
    @views f .= [reshape(r[(i-1)*4+1:4*i],2,2) for i in 1:10]
    @inbounds for k in 1:2, j in 1:2, i in 1:10
        df[:,:,i] .= c[j,k].*f[k]*f[j]+d[j]*f[j]
    end
    return 0
end

function d1(df,f,r,c,d)
    @inbounds for i in 1:10
        f .= reshape(r[(i-1)*4+1:4*i],2,2)
        for k in 1:2, j in 1:2
            df[:,:,i] .= c[j,k].*f[k]*f[j]+d[j]*f[j]
        end
    end
    return 0
end

@btime d0($df, $f0, $r, $C, $D) #   10.569 μs (233 allocations: 27.39 KiB)
@btime d1($df, $f1, $r, $C, $D) #   880.000 ns (30 allocations: 2.34 KiB)

#6

That solution seems broken (d1 doesn’t do the same thing as d0: in d1, f is a single 2-by-2 matrix, so f[k] indexes a scalar element rather than the k-th matrix). I’m sharing a post I recently wrote below, which I think is really important, and at the same time very simple to do.

As for how to speed this up: I often find that the easiest and fastest way is to just stick to for loops instead of views and reshapes. Just remember to use @inbounds, and to access arrays in memory order, along columns. (Note: due to caching/data locality, you might sometimes benefit from reorganizing your data.)
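A sketch of that approach for the placeholder example from #1 (the `* i` stands in for the real computation, and `d_loops!` is just an illustrative name): index r directly, with the innermost loop running along columns so the access pattern matches Julia’s column-major layout.

```julia
# Loop-based version: no views, no reshape, no temporary f.
# Element (j,k) of "matrix" i lives at r[(i-1)*4 + (k-1)*2 + j] (column-major 2x2 blocks).
function d_loops!(df, r)
    @inbounds for i in 1:10, k in 1:2, j in 1:2  # j varies fastest: memory order
        df[j, k, i] = r[(i - 1) * 4 + (k - 1) * 2 + j] * i
    end
    return nothing
end
```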


#7

Thanks! I agree with your method; in fact, I have the “golden master” program output to a csv file and check not a sum but the actual sha-256 checksum of the file to verify the outputs are indeed identical. While one does not normally compare floating-point numbers for equality, for the exact same algorithm a strict rule can be that even the floats are identical, or within a reasonable distance of, say, eps(Float64).

longer story

Since here I have coupled equations that deal with matrices, one vector, and one number, and the differential equation solver natively supports only arrays that are linearly indexed, I need to linearize for the solver and restore for the matrix operations, which means reshape. I’m in the process of reorganizing my matrices so that reshapes are no longer necessary, and when RecursiveArrayTools.jl in the Julia differential equation solver gets fully supported, this problem will be solved automatically. [already a known issue here]
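Until that refactoring lands, one pattern worth noting (a sketch, not my actual solver code) is that in Julia a reshape of a view is itself a view, so the restore step need not copy; just construct the wrappers once outside the hot loop, since creating them can still allocate a few bytes.

```julia
# Sketch: reinterpret the solver's flat state vector as matrices without copying.
# The first 40 elements are 10 2x2 matrices; the rest is other state.
u = rand(ComplexF64, 60)
mats = reshape(view(u, 1:40), 2, 2, 10)  # a view: writes go through to u
rest = view(u, 41:60)

mats[1, 1, 1] = 0.0  # also sets u[1], no copy involved
```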


#8

Sounds great! Hmm, yes, for the exact same algorithm, run under the exact same circumstances, floating-point results should be identical, but I think that can be a bit too strict. For example, vectorizing code (SIMD) is often the key to great performance for loops like this, but it can reorder operations so that the end results are not identical, or not within eps. A simple example illustrating how the order of additions can make a difference:

julia> sum(1/n for n=1:100000)
12.090146129863335

julia> sum([1/n for n=1:100000])
12.090146129863427

But yes… it’s not the exact same algorithm: the first version sums the numbers consecutively, while the second uses pairwise summation.

On the other hand, even if the code and data are identical, floating-point operations may differ slightly across CPU architectures (this is important to know if, for example, you test your code on Travis, which uses several different CPU architectures).

Btw, do you have a good method for comparing SHA checksums while allowing numbers to be inexact?


#9

Thanks for the info on different CPU architectures! Off-topic to the question here, but I don’t think SHA or other checksums can be used that way (except where you truncate the output to x digits): there’s an “avalanche effect”, where, as per this Wikipedia page, “a good checksum algorithm will usually output a significantly different value, even for small changes made to the input”. For arrays, I just use something like maximum(abs.(GM_value-currentValue))
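In code, that comparison looks like the snippet below (the golden-master values here are made up for illustration); Julia’s built-in isapprox with an explicit tolerance is an alternative:

```julia
# Elementwise comparison against golden-master values (illustrative data)
GM_value     = [1.0, 2.0, 3.0]
currentValue = GM_value .+ 1e-16     # a perturbation below eps(Float64)

worst = maximum(abs.(GM_value .- currentValue))
@assert worst <= eps(Float64)

# Built-in alternative with an explicit absolute tolerance
@assert isapprox(GM_value, currentValue; atol = eps(Float64))
```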


#10

If I understood right, f is essentially intended as pre-allocated space into which you copy the data from r. But I think your code makes views and then converts them to new Matrices (as that is what the container f can hold) and stores those. If you do want to copy into the individual f[i] matrices, then I think f[i] .= … needs the reshape, but copyto! does not. My version:

julia> @btime d($df,$f,$r)
  1.022 μs (53 allocations: 3.17 KiB)

julia> using UnsafeArrays

julia> function d2(df,f,r)
           for i in 1:10
               # copyto!(f[i], r[(i-1)*4+1:4*i]) # in d1()
               copyto!(f[i], uview(r,(i-1)*4+1:4*i))
           end
           for i in 1:10
               df[:,:,i] .= f[i] .* i
           end
           return 0
       end
d2 (generic function with 1 method)

julia> @btime d1($df,$f,$r)
  499.830 ns (10 allocations: 1.41 KiB)

julia> @btime d2($df,$f,$r)
  186.583 ns (0 allocations: 0 bytes)