Some feedback:

- The outer `test` appears to be trying to free all garbage (including the last run's result) between runs for fresh starts. The MATLAB version's `clear y` wouldn't trigger the GC to immediately free the last result, so the Julia translation is doing something extra. It's better to use a benchmarking library like BenchmarkTools.jl than trying to make your own; `@benchmark test2(10, 10) samples=100` more or less does your benchmark with extra statistics, and the returned object holds the raw timings in the `.times` field.
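A minimal sketch of that, assuming `test2` is already defined in the session:

```julia
using BenchmarkTools

# run the benchmark with a fixed number of samples, like the manual loop did
b = @benchmark test2(10, 10) samples=100

# the returned Trial object keeps the raw per-sample timings in nanoseconds
raw_ns = b.times
println(minimum(raw_ns), " ns minimum over ", length(raw_ns), " samples")
```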
`test1` seems mostly fine. I'm pretty sure `stack` is doing the best case I mentioned, though I can't rule out some more efficient way of doing it. The allocated memory (we don't call it workspace, and that's not what it means in MATLAB either) is roughly twice the array size because you allocate the separate row vectors before allocating the result matrix and copying into it. It's possible to avoid allocating those row vectors separately by using `StaticArrays.SVector`s instead, but its current implementation is generally slower than `Vector` at larger sizes.
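To make the "roughly twice the array size" point concrete, here's a rough guess at the shape of `test1` based on the description; the original isn't shown, so the name and body are assumptions:

```julia
# hypothetical sketch, not the original test1
function test1_guess(r, c)
    v = collect(2:2:2c)
    rows = [i .+ v for i in 1:r]  # r separate row vectors: ~1x the data
    stack(rows; dims=1)           # the result matrix: another ~1x, copied from rows
end
```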
`test2` allocates a separate vector per iteration at `col = i .+ v`, which doubles the allocated memory like `test1` does. It also essentially wastes your preallocation because you don't write the results of the computation directly into the matrix's space. `x[i,:] .= i .+ v` as mentioned above does that, and calling that version `test3` (hedged sketches of both follow the timings below), we see a massive improvement:
```julia
julia> @time test2(1_000_000, 100);
  0.659842 seconds (2.00 M allocations: 1.609 GiB, 27.25% gc time)

julia> @time test3(1_000_000, 100);
  0.247804 seconds (5 allocations: 762.940 MiB, 0.84% gc time)

julia> all(test2(1_000_000, 100) .== test3(1_000_000, 100))
true
```
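For reference, here are hedged reconstructions of `test2` as described and the proposed `test3`; your original code isn't shown here, so treat these as assumptions that match the description rather than the exact implementations:

```julia
# hypothetical sketch of test2: preallocates x, but still allocates col every iteration
function test2_guess(r, c)
    v = collect(2:2:2c)
    x = Matrix{Int64}(undef, r, c)
    for i in 1:r
        col = i .+ v   # fresh temporary vector each iteration
        x[i, :] = col  # then copied into the matrix
    end
    return x
end

# test3: the fused broadcast writes straight into the matrix row, no temporary
function test3(r, c)
    v = collect(2:2:2c)
    x = Matrix{Int64}(undef, r, c)
    for i in 1:r
        x[i, :] .= i .+ v
    end
    return x
end
```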
Note that writing to a manual preallocation applies just as much to MATLAB; it could just look different because MATLAB does in-place writes and JIT optimizations in its own way. I'd hazard a guess, roughly based on the timings, that your MATLAB `test2` is actually doing the equivalent of `test3`, but I don't really have a clue.
Improving `test1` is trickier than it looks. We can reduce the number of allocations by appending elements directly to one flat vector instead of pushing vectors to a vector of vectors, and we can avoid the final matrix allocation with lazy wrappers (the cost is that indexing needs more work).
```julia
julia> function test4(r, c)
           x = Int64[]
           v = collect(Int64(2):2:Int64(2c))
           lengthv = length(v)
           # note that a column-major Vector is treated as a column vector
           for i in Int64(1):Int64(r)
               resize!(x, length(x) + lengthv)
               x[end-(lengthv-1):end] .= i .+ v
           end
           # view needed for laziness (no allocation)
           # reshape to a column-major horizontal stack of columns
           # transpose to a vertical stack of rows
           transpose(reshape(view(x, :), c, r))
       end
test4 (generic function with 1 method)

julia> all(test1(1_000_000, 100) .== test4(1_000_000, 100))
true

julia> @time test1(1_000_000, 100);
  1.068569 seconds (2.00 M allocations: 1.626 GiB, 42.75% gc time)

julia> @time test4(1_000_000, 100);
  0.810941 seconds (44 allocations: 2.965 GiB, 40.79% gc time)
```
While it saved a bit of time, it actually allocated more because the massive flat vector had to be repeatedly resized (reallocated and copied) to hold more elements. `test1` resized its vector of vectors too, but that's a much shorter vector of small pointers, with the actual data scattered elsewhere. Preallocation really helps; there's no getting around that (see the sketch below).
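As a hedged sketch of what that could look like here (`test5` is a name I'm making up), allocating the full flat vector up front removes the repeated `resize!` copies while keeping `test4`'s lazy reshape/transpose:

```julia
# sketch: preallocate the flat vector once, then write each row into its contiguous chunk
function test5(r, c)
    v = collect(Int64(2):2:Int64(2c))
    x = Vector{Int64}(undef, r * c)       # single allocation for all the data
    for i in Int64(1):Int64(r)
        x[(i-1)*c+1 : i*c] .= i .+ v      # column-major chunk that becomes row i
    end
    transpose(reshape(view(x, :), c, r))  # same lazy wrappers as test4
end
```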
Speaking more broadly, a single 7.45 GiB output matrix is really pushing the limits of 16 GiB of RAM. Reconsider how much of this matrix you actually need in RAM at a time, and note that you have much more disk storage available to hold the rest.
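If you do need the whole thing materialized, one option is a disk-backed array via the `Mmap` standard library, so the OS pages chunks in and out as needed. This is only a sketch: the file name is made up, the 10,000,000 × 100 shape is a guess from the 7.45 GiB figure, and storing it transposed keeps each write contiguous on disk:

```julia
using Mmap

# disk-backed storage; store as c-by-r so each step writes one contiguous column
io = open("bigmatrix.bin", "w+")
xt = Mmap.mmap(io, Matrix{Int64}, (100, 10_000_000))
v = collect(Int64(2):2:Int64(200))
for i in 1:size(xt, 2)
    xt[:, i] .= i .+ v        # contiguous write, good page locality
end
Mmap.sync!(xt)                # flush dirty pages to disk
x = transpose(xt)             # lazy view: row i of x is i .+ v
close(io)
```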