Allocation of new variables vs modification of existing variables

Based on the performance tips in the documentation, I have been looking at unexpected memory allocations in my code and trying to clean them up. While I have had some success in a number of functions, I’m struggling with pre-allocation (http://docs.julialang.org/en/release-0.5/manual/performance-tips/#man-preallocation), in particular when doing element-wise operations.

Here is a minimal example of the unexpected behaviour I’m getting (test.jl):

function tester{T}(a :: Array{T}, b :: Array{T})
     dummy = 0
     c = a .* b
end


function tester!{T}(c :: Array{T}, a :: Array{T}, b :: Array{T})
     dummy = 0
     c[:] = a .* b
     nothing
end

versioninfo()

N = 10 * 1000 * 1000
a = randn(N)
b = randn(N)
c = randn(N)

@show sizeof(a)

# Call for warmups
d = tester(a, b)
tester!(d, a, b)

Profile.clear_malloc_data()

# Call for testing
print( "Calling version which creates a new variable\n")
@time d = tester(a, b)
print( "Calling version which modifies an existing variable\n")
@time tester!(d, a, b)

/Applications/Julia-0.5.app/Contents/Resources/julia/bin/julia --track-allocation=user test.jl 
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
sizeof(a) = 80000000
Calling version which creates a new variable
  0.065016 seconds (134 allocations: 76.302 MB, 23.16% gc time)
Calling version which modifies an existing variable
  0.102362 seconds (8 allocations: 76.294 MB, 58.38% gc time)

As can be seen, the tester! function does do fewer allocations (8 vs 134), but the amount of memory allocated is roughly the same (a complete new array), and in fact it is much slower.

What am I missing? Is there another notation I should use to indicate that I want to reuse the existing array instead of reallocating it?
I have had success with the answer[:] = foo(bar) notation for certain operations, including abs2 and matrix multiplication, in some functions; however, in other places, and in particular with element-wise operations, it still seems to allocate.
I saw mention of the .= notation, but I’m not sure what the difference is, and the results from my limited testing seem similar.

Thanks in advance

This works:

julia> function tester!{T}(c :: Array{T}, a :: Array{T}, b :: Array{T})
            dummy = 0
            c .= (*).(a,b)
            nothing
       end

In 0.6 you can write it as c .= a.*b. The problem with your version is that a.*b creates a new array which is then copied into c, thus the allocation.
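Roughly what the two forms boil down to (a sketch of the behaviour, not the exact lowering):

c[:] = a .* b      # first builds the temporary array a .* b,
                   # then copies that temporary element-by-element into c
c .= (*).(a, b)    # becomes broadcast!(*, c, a, b): writes straight into c, no temporary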

This also works:

julia> function tester!{T}(c :: Array{T}, a :: Array{T}, b :: Array{T})
            dummy = 0
            for i in eachindex(a)
              c[i] = a[i]*b[i]
            end
            nothing
       end

The Profile.clear_malloc_data() is not needed when using @time, just when writing a .mem file.
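For reference, this is the .mem workflow that the clear_malloc_data() call belongs to (a sketch, assuming the script is started with julia --track-allocation=user test.jl):

tester!(d, a, b)              # warm-up call, so compilation-related allocations happen first
Profile.clear_malloc_data()   # discard the allocation counts recorded so far
tester!(d, a, b)              # only allocations from here on are counted
# after the process exits, per-line totals appear next to the source as test.jl.mem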


Thanks, I see that it does seem to resolve the problem. I’ll try implementing it throughout; hopefully it will eliminate most of my unnecessary allocations.

I added the clear_malloc_data() since I was also generating .mem files to see if I could learn anything else there; I just didn’t want to include that in the post and make it even longer.

Hi

I seem to be finding that c .*= a also allocates memory. While I can work around it with c .= (*).(a,c), that is a lot less clear. Is this also something addressed in 0.6?

Thanks

Yes. In all earlier versions of Julia, .* was its own operator, matching MATLAB, so the . was not really a broadcasting dot in this case. However, in v0.6 .* as an operator is deprecated and replaced with the scalar * plus the . broadcast syntax, so .* looks the same but acts slightly differently. Because of this new behaviour it can fuse and thus avoid allocating.

What you’re doing with (*).(a,c) is writing f.(x), i.e. calling * in the broadcasted-function form instead of using the .* operator, to get around this issue. But that’s only a temporary workaround until the change lands in later versions (I think it still needs to be merged, but it’s coming).
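A minimal sketch of the difference (the fused forms assume Julia 0.6 or later):

c .*= a          # 0.5: goes through the .* operator, so a temporary array is allocated
                 # 0.6+: lowers (roughly) to broadcast!(*, c, c, a) and updates c in place
c .= a .* b      # 0.6+: the whole right-hand side fuses into one in-place broadcast
c .= (*).(a, b)  # already works on 0.5, as shown above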


Thanks for the help so far; I’m managing to cut down dramatically on wasted allocations and gaining significant speedups in the process.

However I still seem to be getting problems where I try to use subsets of a larger matrix:

function testerBlock!{T}(a :: Array{T}, b :: Array{T})
     dummy = 0

     a[:,:] .= b[5000000:5999999, :]
     nothing
end

function testerLoop!{T}(a :: Array{T}, b :: Array{T})
     dummy = 0
     outerdim = size(a,2)
     for outer=1:outerdim
       for inner=1:1000000
         @inbounds a[inner, outer] = b[5000000+inner, outer]
       end
     end
     nothing
end


function testerLoopSimd!{T}(a :: Array{T}, b :: Array{T})
     dummy = 0
     outerdim = size(a,2)
     for outer=1:outerdim
       @simd for inner=1:1000000
         @inbounds a[inner, outer] = b[5000000+inner, outer]
       end
     end
     nothing
end



versioninfo()

N = 1000000
a = randn(N, 10)
b = randn(10*N, 10)

@show sizeof(a)

# Call for warmups
testerBlock!(a, b)
testerLoop!(a, b)
testerLoopSimd!(a, b)

Profile.clear_malloc_data()

# Call for testing
print( "block copy \n")
@time testerBlock!(a, b)
print( "Loop copy \n")
@time testerLoop!(a, b)
print( "Loop copy with simd\n")
@time testerLoopSimd!(a, b)

With outputs:
/Applications/Julia-0.5.app/Contents/Resources/julia/bin/julia --track-allocation=user test.jl
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
sizeof(a) = 80000000
block copy
0.117563 seconds (135 allocations: 76.302 MB, 55.28% gc time)
Loop copy
0.058884 seconds (4 allocations: 160 bytes)
Loop copy with simd
0.164927 seconds (4 allocations: 160 bytes)

Why is the block copy doing all that extra allocation? Is there a way to prevent it without writing a manual loop?
Any idea why adding the @simd macro actually makes the loop version slower? (My initial version had it in and was slower than the block copy, though with less memory use; I only discovered by accident that removing @simd makes the loop faster than the block copy.)

Thanks

You should check out @view.

As @mauro3 noted, b[5000000:5999999, :] creates a copy, while @view b[5000000:5999999, :] creates a view. You likely don’t want to copy the array very often.
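For example, the block copy above could be written without the temporary along these lines (a sketch; testerView! is just an illustrative name):

function testerView!{T}(a :: Array{T}, b :: Array{T})
     a .= @view(b[5000000:5999999, :])   # the view wraps b without copying the slice
     nothing
end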

Thanks, the @view helped a lot. However, I still seem to have allocations due to the UnitRange variables; as an example:

inblock = randn(1000000)
outblock = randn(1000,1000)

for cnt=500:520
   @printf("Cntr %d", cnt)
   @time outblock[:,cnt] = @view inblock[((cnt*900+1):(cnt*900+1000))]
end

With output:
Cntr 500 0.000003 seconds (2 allocations: 80 bytes)
Cntr 501 0.000002 seconds (2 allocations: 80 bytes)
Cntr 502 0.000003 seconds (2 allocations: 80 bytes)
Cntr 503 0.000002 seconds (2 allocations: 80 bytes)
Cntr 504 0.000002 seconds (2 allocations: 80 bytes)
Cntr 505 0.000002 seconds (2 allocations: 80 bytes)
Cntr 506 0.000002 seconds (2 allocations: 80 bytes)
Cntr 507 0.000002 seconds (2 allocations: 80 bytes)
Cntr 508 0.000001 seconds (2 allocations: 80 bytes)
Cntr 509 0.000001 seconds (2 allocations: 80 bytes)
Cntr 510 0.000001 seconds (2 allocations: 80 bytes)
Cntr 511 0.000002 seconds (2 allocations: 80 bytes)
Cntr 512 0.000001 seconds (3 allocations: 96 bytes)
Cntr 513 0.000002 seconds (3 allocations: 96 bytes)
Cntr 514 0.000002 seconds (3 allocations: 96 bytes)
Cntr 515 0.000001 seconds (3 allocations: 96 bytes)
Cntr 516 0.000001 seconds (3 allocations: 96 bytes)
Cntr 517 0.000002 seconds (3 allocations: 96 bytes)
Cntr 518 0.000002 seconds (3 allocations: 96 bytes)
Cntr 519 0.000001 seconds (3 allocations: 96 bytes)
Cntr 520 0.000001 seconds (3 allocations: 96 bytes)

As far as I can determine, the allocations now come from creating the UnitRange variables every time. Any idea why, from 512 onwards, there is also an additional allocation?

Is there some trick to get rid of these allocations?

I’m not sure whether the remaining allocations really make a difference, but now and again I still see the GC kicking in. The fact that I still see many thousands of allocations if I @time the whole block is annoying and makes me worry about unnecessary impact on the GC. Maybe that is just a side effect of coming from C, where I had to worry about freeing all memory myself and there was no GC at all.

Remember that views also allocate some memory, even though much less than a copy. That might be what you’re seeing. Anyway, there doesn’t seem to be much to gain here, performance-wise, even if you get rid of these last allocations:


julia> @benchmark for cnt=500:520
           $outblock[:,cnt] = @view $inblock[((cnt*900+1):(cnt*900+1000))]
       end
BenchmarkTools.Trial: 
  memory estimate:  1.13 kb
  allocs estimate:  30
  --------------
  minimum time:     10.359 μs (0.00% GC)
  median time:      10.875 μs (0.00% GC)
  mean time:        11.332 μs (0.00% GC)
  maximum time:     49.736 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark for cnt in 500:520
           for i in 1:1000
               $outblock[i, cnt] = $inblock[900cnt + i]
           end
       end
BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.731 μs (0.00% GC)
  median time:      22.917 μs (0.00% GC)
  mean time:        24.535 μs (0.00% GC)
  maximum time:     97.092 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark for cnt in 500:520
           for i in 1:1000
               @inbounds $outblock[i, cnt] = $inblock[900cnt + i]
           end
       end
BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.642 μs (0.00% GC)
  median time:      10.041 μs (0.00% GC)
  mean time:        10.433 μs (0.00% GC)
  maximum time:     31.683 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

80 bytes of memory allocation won’t be seen in any real-world benchmarks of nontrivial algorithms.

Hi

That was an attempt at the smallest code snippet showing the issue (plus the weirdness that from 512 onwards there is suddenly an extra allocation).

In reality, my real-world figures look like:
1.292455 seconds (74.61 k allocations: 4.600 MB)
1.358962 seconds (76.88 k allocations: 4.859 MB)
1.334156 seconds (74.64 k allocations: 4.622 MB)
1.450134 seconds (79.16 k allocations: 5.131 MB, 4.82% gc time)
1.350128 seconds (77.63 k allocations: 4.954 MB)
1.309084 seconds (76.58 k allocations: 4.848 MB)
1.312494 seconds (78.97 k allocations: 5.113 MB)
1.347740 seconds (74.50 k allocations: 4.615 MB)

About 36k of the allocations per iteration I have traced to this. Another 33k per iteration I believe is related, but it is a slightly more complex case. In total, that is about 10x more allocations than I really want or use.
The GC effect reported at below 5% seems a bit of an underestimate if I compare that iteration to the others, but averaged over the longer term the GC overhead is now small.
At least I have managed to get rid of the temporary variables that were KBs or even 100s of MBs at a time, resulting in roughly a factor-2 speedup overall.

If the only option for getting this down further is to code the loops manually, I think I will live with some GC. I just wanted to check whether there was something else I had missed.