Reduce allocations for shaped views


#1

tl;dr

Reduce the allocations in foo(u,...): Inside the function, given an array u of size m*1, take the first n=d*d*p elements of u, where n<m is guaranteed, and reshape them into an array f of d-by-d matrices (accessed as f[j], j in 1…p). f can be read-only. Is there a minimum number of allocations that cannot be avoided?


For example, I need to take, say, 40 elements of an array r of size 60, reshape them into 10 2-by-2 matrices f[...], and do calculations on these matrices later (e.g. the remaining 20 elements of r are used somewhere else). The matrices in real applications are much bigger, so StaticArrays is not a good fit. Sample code: [note that f[i] .* i here is a placeholder; in real code, it’s more like df[:,:,i] .= c[j,k].*f[k]*f[j]+d[j]*f[j], where c and d are coefficients, and there’s another sum over j and k inside the i loop]

df=zeros(Complex{Float64},2,2,10)
r=rand(1,60)+1im*rand(1,60)
f=[zeros(ComplexF64,2,2) for i in 1:10]

function d(df,f,r)
    @views f .= [reshape(r[(i-1)*4+1:4*i],2,2) for i in 1:10]
    for i in 1:10
        df[:,:,i] .= f[i] .* i
    end
    return 0
end

@btime shows 925.667 ns (53 allocations: 3.17 KiB); the df[:,:,i] loop itself does not allocate. This function gets called about 1e5 times, so reducing the allocations would help a lot. Thanks!


#2

There is no real need for f; you can loop through df and assign the values from r directly with zero allocations. Also note the ! convention in Julia for functions that modify their arguments.

function d!(df,r)
    for i in eachindex(df)
        df[i] = r[i] * (1 + (i-1)÷4)
    end 
    return 0
end

df = zeros(Complex{Float64},2,2,10)
r  = rand(Complex{Float64}, 60) 

julia> using BenchmarkTools

julia> @btime d!($df,$r)
  34.922 ns (0 allocations: 0 bytes)

#3

Just to add that if a function in Julia returns nothing, the convention is to write return nothing or simply omit the return statement completely.
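For illustration, here is the same allocation-free loop from the previous post, just following that convention (only the return style changes):

```julia
# Same as d! above, but returning nothing per the Julia convention
function d_nothing!(df, r)
    for i in eachindex(df)
        df[i] = r[i] * (1 + (i - 1) ÷ 4)
    end
    return nothing  # or omit entirely; the loop itself evaluates to nothing
end
```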


#4

Thanks! Sorry for not being exactly clear: the f[i] .* i example is not the actual calculation. The real calculation (too long to share here) is a complex matrix multiply/add, basically df[:,:,i] .= c[j,k].*f[k]*f[j]+d[j]*f[j], where c and d are predetermined coefficients, and there’s a sum over j and k.


#5

edit: ignore this post; it was an incorrect translation of the algorithm.

If you aren’t updating f on the LHS, I’m not sure why you’d use @view. Just use a temporary array:

using BenchmarkTools

df = zeros(Complex{Float64},2,2,10)
r  = zeros(ComplexF64, 60)
f0 = [zeros(ComplexF64,2,2) for i in 1:10]
f1 = zeros(ComplexF64,2,2)
C = D = rand(2,2)

function d0(df,f,r,c,d)
    @views f .= [reshape(r[(i-1)*4+1:4*i],2,2) for i in 1:10]
    @inbounds for k in 1:2, j in 1:2, i in 1:10
        df[:,:,i] .= c[j,k].*f[k]*f[j]+d[j]*f[j]
    end
    return 0
end

function d1(df,f,r,c,d)
    @inbounds for i in 1:10
        f .= reshape(r[(i-1)*4+1:4*i],2,2)
        for k in 1:2, j in 1:2
            df[:,:,i] .= c[j,k].*f[k]*f[j]+d[j]*f[j]
        end
    end
    return 0
end

@btime d0($df, $f0, $r, $C, $D) #   10.569 μs (233 allocations: 27.39 KiB)
@btime d1($df, $f1, $r, $C, $D) #   880.000 ns (30 allocations: 2.34 KiB)

#6

That solution seems broken (d1 doesn’t do the same thing as d0: in d1, f is a single 2-by-2 matrix, so f[k] indexes a scalar element rather than the k-th matrix). I’m sharing a post I recently wrote below, which I think is really important, and at the same time very simple to do.

As for how to speed this up: I often find that the easiest and fastest way is to just stick to for loops instead of views and reshapes. Just remember to use @inbounds, and to access arrays in memory order, along columns. (Note: due to caching/data locality, you might sometimes benefit from reorganizing your data.)
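A sketch of that approach for the placeholder example from #1 (the `* i` stands in for the real computation, and `d_loops!` is just an illustrative name): index r directly, with the innermost loop running along columns so the access pattern matches Julia’s column-major layout.

```julia
# Loop-based version: no views, no reshape, no temporary f.
# Element (j,k) of "matrix" i lives at r[(i-1)*4 + (k-1)*2 + j] (column-major 2x2 blocks).
function d_loops!(df, r)
    @inbounds for i in 1:10, k in 1:2, j in 1:2  # j varies fastest: memory order
        df[j, k, i] = r[(i - 1) * 4 + (k - 1) * 2 + j] * i
    end
    return nothing
end
```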


#7

Thanks! I agree with your method; in fact, I have the “golden master” program output to a csv file and check not a sum but the actual sha-256 checksum of the file to verify the outputs are indeed identical. While one does not normally compare floating-point numbers for equality, for the exact same algorithm a strict rule can be that even the floats are identical, or within a reasonable distance of, say, eps(Float64).

longer story

Since here I have coupled equations that deal with matrices, one vector, and one number, and the differential equation solver natively supports only arrays that are linearly indexed, I need to linearize for the solver and restore for the matrix operations, which means reshape. I’m in the process of reorganizing my matrices so that reshapes are no longer necessary, and when RecursiveArrayTools.jl in the Julia differential equation solver gets fully supported, this problem will be solved automatically. [already a known issue here]
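Until that refactoring lands, one pattern worth noting (a sketch, not my actual solver code) is that in Julia a reshape of a view is itself a view, so the restore step need not copy; just construct the wrappers once outside the hot loop, since creating them can still allocate a few bytes.

```julia
# Sketch: reinterpret the solver's flat state vector as matrices without copying.
# The first 40 elements are 10 2x2 matrices; the rest is other state.
u = rand(ComplexF64, 60)
mats = reshape(view(u, 1:40), 2, 2, 10)  # a view: writes go through to u
rest = view(u, 41:60)

mats[1, 1, 1] = 0.0  # also sets u[1], no copy involved
```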


#8

Sounds great! Hmm, yes, for the exact same algorithm, run under the exact same circumstances, floating-point results should be identical, but I think that can be a bit too strict. For example, vectorizing code (SIMD) is often the key to great performance for loops like this, but it can reorder operations so that the end results are not identical, or not within eps. A simple example illustrating how the order of additions can make a difference:

julia> sum(1/n for n=1:100000)
12.090146129863335

julia> sum([1/n for n=1:100000])
12.090146129863427

But yes… it’s not the exact same algorithm: the first version sums the numbers consecutively, while the second uses pairwise summation.

On the other hand, even if the code and data are identical, floating-point operations may differ slightly across CPU architectures (this is important to know if, for example, you test your code on Travis, which uses several different CPU architectures).

Btw, do you have a good method for comparing SHA checksums while allowing numbers to be inexact?


#9

Thanks for the info on different CPU architectures! Off-topic to the question here, but I don’t think SHA or other checksums can be used that way (except where you truncate the output to x digits): there’s an “avalanche effect”, where, as per this Wikipedia page, “a good checksum algorithm will usually output a significantly different value, even for small changes made to the input”. For arrays, I just use something like maximum(abs.(GM_value-currentValue))
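In code, that comparison looks like the snippet below (the golden-master values here are made up for illustration); Julia’s built-in isapprox with an explicit tolerance is an alternative:

```julia
# Elementwise comparison against golden-master values (illustrative data)
GM_value     = [1.0, 2.0, 3.0]
currentValue = GM_value .+ 1e-16     # a perturbation below eps(Float64)

worst = maximum(abs.(GM_value .- currentValue))
@assert worst <= eps(Float64)

# Built-in alternative with an explicit absolute tolerance
@assert isapprox(GM_value, currentValue; atol = eps(Float64))
```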


#10

If I understood right, f is essentially intended as pre-allocated space into which you copy the data from r. But I think your code makes views and then converts them to new Matrices (as that is what the container f can hold) and stores those. If you do want to copy into the individual f[i] matrices, then I think f[i] .= … needs the reshape, but copyto! does not. My version:

julia> @btime d($df,$f,$r)
  1.022 μs (53 allocations: 3.17 KiB)

julia> using UnsafeArrays

julia> function d2(df,f,r)
           for i in 1:10
               # copyto!(f[i], r[(i-1)*4+1:4*i]) # in d1()
               copyto!(f[i], uview(r,(i-1)*4+1:4*i))
           end
           for i in 1:10
               df[:,:,i] .= f[i] .* i
           end
           return 0
       end
d2 (generic function with 1 method)

julia> @btime d1($df,$f,$r)
  499.830 ns (10 allocations: 1.41 KiB)

julia> @btime d2($df,$f,$r)
  186.583 ns (0 allocations: 0 bytes)