I’m trying to get an understanding on how to (most efficiently) handle elementwise array operations. If I construct an image like

function myImg(pts::Integer)::Array{Float64,2}
# variables
# rotation of ellipse
aEll = 35.0/180.0*pi
axes = [1/6,1/25]
values = collect(linspace(-0.5,0.5,pts))
# meshes
gridX = [i for i in values, j in values]
gridY = [j for i in values, j in values]
# generate ellipse
# rotate by alpha
Xr = cos(aEll).*gridX - sin(aEll).*gridY
Yr = cos(aEll).*gridY + sin(aEll).*gridX
img = ((1/axes[1]*Xr.^2 + 1/axes[2]*Yr.^2).<=1).*( 10*pi*Yr);
return mod(img-pi,2*pi)+pi
end
@time myImg(1024);

then I admit, the code is mainly taken from MatLab (despite the meshgrid function is replaced by the lines for Xr and Yr, that’s what I worked with the last years).

I would like to do this – merely to lean how Julia works – as efficient as possible: Take the equidistant grid of 1024 points on [0,1] in a rotated manner in order to build an ellipse with main axes axes and a linear increase inside (along the rotated y direction).

For now this code takes on my machine 0.552937 seconds (45.96 k allocations: 163.213 MB, 41.35% gc time) (while MatLab takes for the same roughly 0.15 seconds). This increases in both allocations and and time if I would add two more such elements on rotated grids (circles, squares,…); my approach to for-loops was indeed even slower…

TL;DR: How can I get such an element wise array construction as efficient as possible (mainly w.r.t. time, memory would be nice but not necessary)?

Thank you for four reply, your small example seems to work perfectly, however, i don’t seem able to generalize that to the case where I construct 3 those geometric objects. When I try to do the longer one (well two further Xr/Yr constructions and two further v/k in your code) I even get a time worse than my old one.

So: What am I missing here? Can you elaborate a little bit on why your code is so fast?

I’ve annotated your method below with where the array allocations occur:

function myImg(pts::Integer)::Array{Float64,2}
# variables
# rotation of ellipse
aEll = 35.0/180.0*pi
axes = [1/6,1/25] # allocates a 2-element array
values = collect(linspace(-0.5,0.5,pts)) # allocates a `pts`-element array
# meshes
gridX = [i for i in values, j in values] # allocates a `pts`-by-`pts` matrix
gridY = [j for i in values, j in values] # allocates a `pts`-by-`pts` matrix
# generate ellipse
# rotate by alpha
Xr = cos(aEll).*gridX - sin(aEll).*gridY # allocates 3 `pts`-by-`pts` matrices
Yr = cos(aEll).*gridY + sin(aEll).*gridX # allocates 3 `pts`-by-`pts` matrices
img = ((1/axes[1]*Xr.^2 + 1/axes[2]*Yr.^2).<=1).*( 10*pi*Yr); # allocates 5 `pts`-by-`pts` matrices
return mod(img-pi,2*pi)+pi # allocates 3 `pts`-by-`pts` matrices
end

Now #17623 will help a little with the Xr, Yr and img lines, reducing them each to one array allocation, but @kristoffer.carlsson’s can eliminate these directly.

A few other minor points (not really performance related, but just good practice):

Use sind and cosd instead of sin(x/180.0*pi): these handle cases like sind(180) correctly (though it won’t make a huge difference in this case).

Similarly there is a mod2pi function, though in this case you could move the multiplication by pi outside, so the last few lines of the above solution would look like:

k = v <= 1 ? 10*Yr : 0.0
img[i,j] = (mod(k-1,2)+1)*pi

It only does one (significant) allocation which is the output image. Your original code does an allocation for every vectorized call to store the result of that operation.

It doesn’t load and reload memory over and over which reduces cache misses.

Regarding why it doesn’t extend to more objects. Maybe you are just making a mistake. If you post your code, it can probably be fixed.

Thanks for all the tipps and explanations, I narrowed it down to the following

function artificialInSARImage(pts::Integer)::Array{Float64,2}
# variables
values = linspace(-0.5,0.5,pts)
aSteps = 60.0; cosS = cosd(aSteps); sinS = sind(aSteps)
l = 0.075
midP = [.125, .55]
img = zeros(Float64,pts,pts)
@inbounds @fastmath for j in eachindex(values), i in eachindex(values)
Xr = cosS*values[i] - sinS*values[j]
Yr = cosS*values[j] + sinS*values[i]
k3=0
@inbounds @fastmath for m=1:8
k3 = (abs(Xr+midP[1]-m*l)+abs(Yr+midP[2]-m*l)) <= l ? 2*pi*(m/8) : 0.0
end
img[i,j] = mod(k3-pi,2*pi)+pi
end
return img
end

While all others are quite similar to the idea above this loop (even including my idea of the break) seems to break allocations (they get up to 2.19 M and stays there) while all others seem to just keep it down to 9-12 allocations. And somehow I can’t see why. Combining this k3 with k1 even yields 15 M allocations…hm. Still trying to get used to looking for allocations, I think.

In particular, note k3 is Any, not a concrete type (in the actual output it is highlighted in red).

Change the first line of k3 to k3 = 0.0 (though I think your inner loop might be incorrect, as you only use the result of k3 on the last iteration), changes:

Another tip is that @inbounds will apply to everything inside the loop, also new loops. Same with the @fastmath macro. So there is no need to add @inbounds & co to the inner loop. And just as a warning, if you mess up with your indexing while using the @inbounds macro then julia can segfault or give weird results.

If you start julia with julia --check-bounds=yes then all the @inbounds macros will be turned off if you want to do a run with bounds checking activated again.

Thanks for the clarification, that was part of what I meant with the elaboration, so I removed the inner @inbounds and yes I am awayre of the consequences and just to use that if the inner parts are debugged long enough