Based on the performance tips in the documentation, I have been looking at unexpected memory allocations in my code and trying to clean them up. While I have had some success with a number of functions, I'm struggling with pre-allocation (http://docs.julialang.org/en/release-0.5/manual/performance-tips/#man-preallocation), in particular when doing element-wise operations.
Here is a minimal example of the unexpected behaviour I’m getting (test.jl):
function tester{T}(a :: Array{T}, b :: Array{T})
dummy = 0
c = a .* b
end
function tester!{T}(c :: Array{T}, a :: Array{T}, b :: Array{T})
dummy = 0
c[:] = a .* b
nothing
end
versioninfo()
N = 10 * 1000 * 1000
a = randn(N)
b = randn(N)
c = randn(N)
@show sizeof(a)
# Call for warmups
d = tester(a, b)
tester!(d, a, b)
Profile.clear_malloc_data()
# Call for testing
print( "Calling version which creates a new variable\n")
@time d = tester(a, b)
print( "Calling version which modifies an existing variable\n")
@time tester!(d, a, b)
/Applications/Julia-0.5.app/Contents/Resources/julia/bin/julia --track-allocation=user test.jl
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
sizeof(a) = 80000000
Calling version which creates a new variable
0.065016 seconds (134 allocations: 76.302 MB, 23.16% gc time)
Calling version which modifies an existing variable
0.102362 seconds (8 allocations: 76.294 MB, 58.38% gc time)
As can be seen, the tester! function does make fewer allocations (8 vs. 134), but the amount allocated is roughly the same (a complete new array), and in fact it is much slower.
What am I missing? Is there another notation to indicate that I want to reuse the existing array instead of reallocating it?
I have had success with the answer[:] = foo(bar) notation for certain operations, including abs2 and matrix multiplication, in some functions; however, in other places, and in particular with element-wise operations, it still seems to allocate.
I saw mention of the .= notation, but I'm not sure what the difference is, and in my limited testing the result seems to be similar.
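To make the question concrete, here is the kind of fully in-place version I am after, sketched with an explicit broadcast! call (just an illustration: testerBroadcast! is a made-up name, and same-sized Float64 arrays are assumed; it writes the element-wise product straight into c, so no temporary is created for a .* b):
function testerBroadcast!{T}(c :: Array{T}, a :: Array{T}, b :: Array{T})
    broadcast!(*, c, a, b)   # c[i] = a[i] * b[i], written directly into c
    nothing
end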
Thanks, I see that it does seem to resolve the problem. I'll try implementing it throughout; hopefully that will remove most of my unnecessary allocations.
I added the Profile.clear_malloc_data() call because I was also running with allocation tracking to look at the .mem files and see if I could learn anything else there; I just didn't want to include that in the post and make it even longer.
I seem to find that c .*= a also allocates memory.
While I can work around it with c .= (*).(a,c), that is a lot less clear. Is this also something addressed in 0.6?
Yes. In all earlier versions of Julia, .* was its own operator, matching MATLAB, so the . was not really a broadcasting dot in this case. In v0.6, however, .* as a standalone operator is deprecated and is instead parsed as the scalar * combined with broadcasting ., so .* looks the same but behaves slightly differently. Because of this, it can fuse with the rest of the dotted expression and therefore avoid the allocation.
What you're doing with (*).(a,c) is writing f.(x), i.e. calling * in the broadcasted-function form instead of using the .* operator, to get around this. But that's only a temporary workaround for what is changed in later versions (I think the change still needs to be merged, but it's coming).
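To make the distinction concrete, here is a small sketch of the same in-place multiply written a few ways (assuming c and a are same-sized Float64 arrays; exactly which forms avoid the temporary depends on your Julia version, as described above):
a = randn(1000)
c = randn(1000)
broadcast!(*, c, c, a)   # explicit in-place broadcast: c[i] = c[i] * a[i], no temporary
c .= (*).(c, a)          # the dot-call workaround mentioned above
# c .*= a                # allocates a temporary while .* is still a separate operator;
#                        # on v0.6, where .* is broadcast syntax, it fuses and stops allocating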
Thanks for the help thus far; I'm managing to cut down on wasted allocations dramatically and gaining significant speedups in the process.
However, I still seem to run into problems when I try to use subsets of a larger matrix:
function testerBlock!{T}(a :: Array{T}, b :: Array{T})
dummy = 0
a[:,:] .= b[5000000:5999999, :]
nothing
end
function testerLoop!{T}(a :: Array{T}, b :: Array{T})
dummy = 0
outerdim = size(a,2)
for outer=1:outerdim
for inner=1:1000000
@inbounds a[inner, outer] = b[5000000+inner, outer]
end
end
nothing
end
function testerLoopSimd!{T}(a :: Array{T}, b :: Array{T})
dummy = 0
outerdim = size(a,2)
for outer=1:outerdim
@simd for inner=1:1000000
@inbounds a[inner, outer] = b[5000000+inner, outer]
end
end
nothing
end
versioninfo()
N = 1000000
a = randn(N, 10)
b = randn(10*N, 10)
@show sizeof(a)
# Call for warmups
testerBlock!(a, b)
testerLoop!(a, b)
testerLoopSimd!(a, b)
Profile.clear_malloc_data()
# Call for testing
print( "block copy \n")
@time testerBlock!(a, b)
print( "Loop copy \n")
@time testerLoop!(a, b)
print( "Loop copy with simd\n")
@time testerLoopSimd!(a, b)
Why is the block copy doing an extra allocation? Is there a way to prevent this without using a manual loop?
Any idea why adding the @simd macro actually makes the loop version slower? (My initial version had it in, and was then slower than the block copy but with less memory use; I only discovered by accident that removing it makes the loop faster than the block copy.)
As @mauro3 noted, b[5000000:5999999, :] creates a copy, while @view b[5000000:5999999, :] creates a view. You likely don't want to copy the array very often.
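For example, here is a sketch of the block copy above rewritten to read through a view (testerBlockView! is just an illustrative name, and the same shapes as in the earlier snippet are assumed):
function testerBlockView!{T}(a :: Array{T}, b :: Array{T})
    # Copy element-wise out of a view of b instead of materialising the slice first.
    a[:, :] = @view b[5000000:5999999, :]
    nothing
end
# Called the same way as testerBlock!: testerBlockView!(a, b)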
As far as I can determine, the remaining allocations now come from creating the UnitRange variables every time. Any idea why, from 512 onwards, there is also an additional allocation?
Is there some trick to get rid of these allocations?
I'm not sure whether the remaining allocations really make a difference; however, now and again I still see the GC kicking in. And the fact that I still see many thousands of allocations when I @time the whole block is annoying, and makes me worry about unnecessary pressure on the GC. Maybe that is just a side effect of coming from C, where I had to worry about freeing every allocation myself and there was no GC at all.
Remember that views also allocate some memory, even though much less than a copy. That might be what you’re seeing. Anyway, there doesn’t seem to be much to gain here, performance-wise, even if you get rid of these last allocations:
julia> @benchmark for cnt=500:520
$outblock[:,cnt] = @view $inblock[((cnt*900+1):(cnt*900+1000))]
end
BenchmarkTools.Trial:
memory estimate: 1.13 kb
allocs estimate: 30
--------------
minimum time: 10.359 μs (0.00% GC)
median time: 10.875 μs (0.00% GC)
mean time: 11.332 μs (0.00% GC)
maximum time: 49.736 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark for cnt in 500:520
for i in 1:1000
$outblock[i, cnt] = $inblock[900cnt + i]
end
end
BenchmarkTools.Trial:
memory estimate: 0.00 bytes
allocs estimate: 0
--------------
minimum time: 22.731 μs (0.00% GC)
median time: 22.917 μs (0.00% GC)
mean time: 24.535 μs (0.00% GC)
maximum time: 97.092 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark for cnt in 500:520
for i in 1:1000
@inbounds $outblock[i, cnt] = $inblock[900cnt + i]
end
end
BenchmarkTools.Trial:
memory estimate: 0.00 bytes
allocs estimate: 0
--------------
minimum time: 9.642 μs (0.00% GC)
median time: 10.041 μs (0.00% GC)
mean time: 10.433 μs (0.00% GC)
maximum time: 31.683 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
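For what it's worth, a rough way to see the per-view cost mentioned above is a sketch like the following (the exact byte count varies by Julia version, and the first call should be discarded so compilation is not counted):
makeview(b) = @view b[5000000:5999999, :]
makeview(b)                   # warm-up so compilation is excluded
@allocated makeview(b)        # small but nonzero: the SubArray object itself is allocated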
That was an attempt at the smallest code snippet showing the issue (plus the weirdness that at 512 an extra allocation is suddenly needed).
In reality, my real-world figures look like this:
1.292455 seconds (74.61 k allocations: 4.600 MB)
1.358962 seconds (76.88 k allocations: 4.859 MB)
1.334156 seconds (74.64 k allocations: 4.622 MB)
1.450134 seconds (79.16 k allocations: 5.131 MB, 4.82% gc time)
1.350128 seconds (77.63 k allocations: 4.954 MB)
1.309084 seconds (76.58 k allocations: 4.848 MB)
1.312494 seconds (78.97 k allocations: 5.113 MB)
1.347740 seconds (74.50 k allocations: 4.615 MB)
About 36k of the allocations per iteration I have traced to this; another 33k per iteration I believe are related, but are a slightly more complex case. In total, that is about 10x more allocations than what I really want/use.
The GC effect reported as below 5% seems a bit of an underestimate if I compare that iteration to the others, but averaged over the longer term the GC cost is now small.
At least I have managed to get rid of the temporary variables that were KBs or even hundreds of MBs at a time, resulting in about a factor-2 speedup overall.
If the only option to get this down further is coding the loops manually, I think I will live with some GC. I just wanted to check whether there was something else I had missed.