Some feedback
- The outer test appears to be trying to free all garbage (including the last run’s result) between runs for fresh starts. The MATLAB version’s clear y wouldn’t trigger the GC to immediately free the last result, so the Julia translation is doing something extra. It’s better to use a benchmarking library like BenchmarkTools.jl than to roll your own; @benchmark test2(10, 10) samples=100 more or less does your benchmark with extra statistics, and the returned object holds the raw timings in its .times field (a quick sketch below).
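As a rough sketch (assuming the test2(10, 10) call from your post), the BenchmarkTools version would look something like this:

using BenchmarkTools

b = @benchmark test2(10, 10) samples=100   # times the call repeatedly, up to 100 samples
b.times                                    # raw per-sample times in nanoseconds
minimum(b)                                 # fastest sample, usually the most stable estimate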
test1 seems mostly fine. I’m pretty sure stack is doing the best-case approach I mentioned, though I can’t rule out a more efficient way of doing it. The allocated memory (we don’t call it workspace, and that’s not what it means in MATLAB either) is roughly twice the array size because you allocate the separate row vectors before allocating the result matrix and copying them into it. You could avoid allocating those row vectors separately by using StaticArrays.SVectors instead, but SVectors are generally slower than Vectors at larger sizes.
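For reference, here is roughly what I’m assuming test1 looks like from your description; the names and details are my guess rather than your exact code:

function test1(r, c)
    v = collect(2:2:2c)
    rows = Vector{Int64}[]
    for i in 1:r
        push!(rows, i .+ v)     # each row vector is allocated separately
    end
    stack(rows; dims=1)         # then everything is copied into one r×c matrix
end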
test2 allocates a separate vector per iteration at col = i .+ v, which doubles the allocated memory just like test1. It also largely wastes your preallocation because the results of the computation aren’t written directly into the matrix’s space. Writing x[i,:] .= i .+ v, as mentioned above, fixes that; calling that version test3 (defined after the timings below), we see a massive improvement:
julia> @time test2(1_000_000, 100);
0.659842 seconds (2.00 M allocations: 1.609 GiB, 27.25% gc time)
julia> @time test3(1_000_000, 100);
0.247804 seconds (5 allocations: 762.940 MiB, 0.84% gc time)
julia> all(test2(1_000_000, 100) .== test3(1_000_000, 100))
true
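For concreteness, here’s the minimal test3 I have in mind; I’m assuming test2’s overall structure, and only the in-place broadcast into the preallocated row changes:

function test3(r, c)
    v = collect(2:2:2c)                # assumed to match test2's increments
    x = Matrix{Int64}(undef, r, c)     # preallocate the result once
    for i in 1:r
        x[i, :] .= i .+ v              # fused broadcast writes into the row, no temporary
    end
    x
end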
Note that writing into a manual preallocation applies just as much to MATLAB; it may just look different because MATLAB handles in-place writes and JIT optimizations in its own way. Based roughly on the timings, I’d hazard a guess that your MATLAB test2 is actually doing the equivalent of test3, but I don’t really have a clue.
Improving test1 is trickier than it looks. We can reduce the number of allocations by appending elements directly to one flat vector instead of pushing vectors onto a vector of vectors, and then avoid the final matrix allocation with lazy wrappers (the cost is that indexing takes more work).
julia> function test4(r, c)
           x = Int64[]
           v = collect(Int64(2):2:Int64(2c))
           lengthv = length(v)
           # note that column-major Vector is treated as a column vector
           for i in Int64(1):Int64(r)
               resize!(x, length(x) + lengthv)
               x[end-(lengthv-1):end] .= i .+ v
           end
           # view needed for laziness (no allocation)
           # reshape to column-major horizontal stack of columns
           # transpose to vertical stack of rows
           transpose(reshape(view(x, :), c, r))
       end
test4 (generic function with 1 method)
julia> all(test1(1_000_000, 100) .== test4(1_000_000, 100))
true
julia> @time test1(1_000_000, 100);
1.068569 seconds (2.00 M allocations: 1.626 GiB, 42.75% gc time)
julia> @time test4(1_000_000, 100);
0.810941 seconds (44 allocations: 2.965 GiB, 40.79% gc time)
While it saved a bit of time, it actually allocated more because the massive flat vector had to be repeatedly reallocated and copied as it was resized to hold more elements. test1 resized its vector of vectors too, but that’s a much shorter vector of pointers, with the actual data scattered elsewhere. Preallocation really helps; there’s no getting around that.
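To make that concrete for test4’s approach, here’s a hedged sketch (call it test5; I haven’t benchmarked it, but the flat vector is allocated exactly once and the lazy wrappers stay the same):

function test5(r, c)
    v = collect(Int64(2):2:Int64(2c))
    x = Vector{Int64}(undef, r*c)       # one allocation for all the data
    for i in 1:r
        x[(i-1)*c+1:i*c] .= i .+ v      # write each row's data in place
    end
    transpose(reshape(x, c, r))         # same lazy wrappers as test4
end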
Speaking more broadly, 7.45 GiB for a single output matrix is really pushing 16 GiB of RAM. Reconsider how much of this matrix you need in RAM at a time, and note that you have much more storage to hold the rest.
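If you do end up needing the whole matrix at once, one option (a sketch I haven’t profiled at your sizes) is to back it with a memory-mapped file via the Mmap standard library, so the OS pages it between RAM and disk for you:

using Mmap

r, c = 1_000_000, 100                      # the example sizes from above; substitute your real ones
io = open("result.bin", "w+")
x = Mmap.mmap(io, Matrix{Int64}, (r, c))   # file-backed matrix, not held fully in RAM
# ... fill x row by row as in test3 ...
Mmap.sync!(x)                              # flush written pages to the file
close(io)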