Why is this Julia code considerably slower than Matlab

I also tested v0.6.0-pre.alpha.34 and found something interesting:
Shouldn’t this be the fastest version, thanks to loop fusion?

function test_perf4()
    range = 1:2000000
    range_transp = collect(range)'
    steering_vectors = complex.(ones(4,11), ones(4,11))

    sum_signal = zeros(Complex{Float64}, 4, length(range))
    for i = 1:11
        sum_signal .+= steering_vectors[:,i] .* cis.((2 * pi * 1.023e6 / 4e6) .* range_transp .+ (40 * pi / 180));
    end
    return sum_signal
end

However this gives me:

BenchmarkTools.Trial: 
  memory estimate:  137.33 MiB
  allocs estimate:  41
  --------------
  minimum time:     14.918 s (0.07% GC)
  median time:      14.918 s (0.07% GC)
  mean time:        14.918 s (0.07% GC)
  maximum time:     14.918 s (0.07% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

However, when I remove the dot after cis I get the following:

BenchmarkTools.Trial: 
  memory estimate:  641.00 MiB
  allocs estimate:  899
  --------------
  minimum time:     3.818 s (2.12% GC)
  median time:      3.852 s (3.95% GC)
  mean time:        3.852 s (3.95% GC)
  maximum time:     3.887 s (5.75% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This gives me the following warning, though:

WARNING: cis{T <: Number}(x::AbstractArray{T}) is deprecated, use cis.(x) instead.

So it seems that loop fusion still needs some adjustments.
By the way, these tests were done with -O3.
EDIT: The same goes for exp() and exp.()
EDIT: I filed an issue for that: https://github.com/JuliaLang/julia/issues/20875

Loop fusion will actually be slower here. This is clearer if we simplify the code a bit. Suppose we are doing X .= colvec .* cis.(rowvec) (i.e. combining a column vector and a row vector to make a matrix). This is lowered to broadcast!((x,y) -> x * cis(y), X, colvec, rowvec), which is essentially equivalent to:

for j=1:length(rowvec), i=1:length(colvec)
    X[i,j] = colvec[i] * cis(rowvec[j])
end

The problem here is that if X is m×n, then we end up calling the cis function mn times.

If, instead, we use

tmp = cis.(rowvec)
X .= colvec .* tmp

it only calls the cis function n times, at the expense of requiring more storage.
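The difference in call counts is easy to verify with a small counter. Here is a sketch (`counted_cis` and `CALLS` are names I made up for illustration, wrapping cis so we can count evaluations):

```julia
# Count how many times cis is actually evaluated.
const CALLS = Ref(0)
counted_cis(x) = (CALLS[] += 1; cis(x))

colvec = rand(4)
rowvec = rand(1000)'   # row vector

CALLS[] = 0
X1 = colvec .* counted_cis.(rowvec)   # fully fused: cis runs once per matrix element
n_fused = CALLS[]                     # 4 * 1000 = 4000

CALLS[] = 0
tmp = counted_cis.(rowvec)            # hoisted: cis runs once per column
X2 = colvec .* tmp
n_hoisted = CALLS[]                   # 1000

@show n_fused n_hoisted
```

Both versions compute exactly the same matrix; the fused one just evaluates cis 4× more often here (and 4 million× more in the original 4×2000000 case).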

This is the sort of space/time tradeoff that it is hard to automate. As a rule of thumb, however, if you are doing a broadcast operation combining a row and a column vector, it will be faster if you do any expensive operations on the row and column vector before doing the broadcast combination.

In this particular case, I would suggest doing something like:

function test_perf5()
    rangeᵀ = (1:2000000)'
    rowtemp = similar(rangeᵀ, Complex128)
    steering_vectors = complex.(ones(4,11), ones(4,11)) # placeholder for actual vectors?

    sum_signal = zeros(Complex{Float64}, 4, length(rangeᵀ))
    for i = 1:11
        # constants are 2π·1.023e6/4e6 and 40π/180 from the original code
        rowtemp .= cis.(1.6069246423111792 .* rangeᵀ .+ 0.6981317007977318)
        sum_signal .+= steering_vectors[:,i] .* rowtemp
    end
    return sum_signal
end

You could also maybe try @views, or simply allocate steering_vectors as an array of vectors, to avoid allocating a copy in steering_vectors[:,i]. You can also use @. before the for and then omit all of the dots.
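For concreteness, here is a sketch combining those suggestions (`test_perf6` is my name; the constants are the same 2π·1.023e6/4e6 and 40π/180 as above):

```julia
# Sketch: @view avoids the copy that steering_vectors[:, i] would allocate,
# and @. adds the dots to every call and operator in the expression.
function test_perf6()
    rangeᵀ = (1:2000000)'
    rowtemp = similar(rangeᵀ, Complex{Float64})
    steering_vectors = complex.(ones(4,11), ones(4,11))
    sum_signal = zeros(Complex{Float64}, 4, length(rangeᵀ))
    for i = 1:11
        sv = @view steering_vectors[:, i]   # no copy
        @. rowtemp = cis(1.6069246423111792 * rangeᵀ + 0.6981317007977318)
        @. sum_signal += sv * rowtemp
    end
    return sum_signal
end
```

The `@.` lines expand to exactly the dotted broadcasts written out above, so the result is unchanged; only the per-iteration allocation for the column slice goes away.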


Could I ask the original poster: how did you determine the memory usage (62.7M) of your Matlab function? I don’t know how to do this with the Matlab profiler, and I couldn’t find anything about it in the Matlab docs.

Thanks,
Steve Vavasis

62.7M wasn’t correct. I had to put the code into a function.

I used:

profile -memory on
...
profile report

Thanks for the info about how to get memory usage data from the Matlab profiler! Apparently this is an undocumented feature of Matlab. I just played with it a bit and seemed to get inconsistent results, so perhaps it’s unreliable. Or perhaps I’m using it incorrectly. Hard to say, since it’s undocumented…