# Elementwise array-operations and performance

I’m trying to get an understanding on how to (most efficiently) handle elementwise array operations. If I construct an image like

``````function myImg(pts::Integer)::Array{Float64,2}
# variables
# rotation of ellipse
aEll = 35.0/180.0*pi
axes = [1/6,1/25]
values = collect(linspace(-0.5,0.5,pts))
# meshes
gridX = [i for i in values, j in values]
gridY = [j for i in values, j in values]
# generate ellipse
# rotate by alpha
Xr = cos(aEll).*gridX - sin(aEll).*gridY
Yr = cos(aEll).*gridY + sin(aEll).*gridX
img = ((1/axes[1]*Xr.^2 + 1/axes[2]*Yr.^2).<=1).*( 10*pi*Yr);
return mod(img-pi,2*pi)+pi
end

@time myImg(1024);
``````

then I admit, the code is mainly taken from MatLab (despite the `meshgrid` function is replaced by the lines for `Xr` and `Yr`, that’s what I worked with the last years).

I would like to do this – merely to lean how Julia works – as efficient as possible: Take the equidistant grid of `1024` points on `[0,1]` in a rotated manner in order to build an ellipse with main axes `axes` and a linear increase inside (along the rotated `y` direction).

For now this code takes on my machine `0.552937 seconds (45.96 k allocations: 163.213 MB, 41.35% gc time)` (while MatLab takes for the same roughly `0.15 seconds`). This increases in both allocations and and time if I would add two more such elements on rotated grids (circles, squares,…); my approach to for-loops was indeed even slower…

TL;DR: How can I get such an element wise array construction as efficient as possible (mainly w.r.t. time, memory would be nice but not necessary)?

1 Like

Here is an attempt:

``````function myImg2(pts::Integer)
# variables
# rotation of ellipse
aEll = 35.0/180.0*pi
axes_inv = [6.0, 25.0]
values = linspace(-0.5,0.5,pts)
# meshes

img = zeros(Float64, pts, pts)
cosa = cos(aEll)
sina = sin(aEll)
@inbounds @fastmath for j in eachindex(values), i in eachindex(values)
Xr = cosa*values[i] - sina*values[j]
Yr = cosa*values[j] + sina*values[i]
v = (axes_inv[1]*Xr^2 + axes_inv[2]*Yr^2)
k = v <= 1 ? 10*pi*Yr : 0.0
img[i,j] = mod(k-pi,2*pi)+pi
end
return img
end
``````

You can leave out the fast math macro depending on if you care about proper NaN handling.

``````# New version
julia> @time myImg2(1024);
0.012618 seconds (7 allocations: 8.000 MB)

# Original version
julia> @time myImg(1024);
0.153331 seconds (839 allocations: 144.179 MB, 63.93% gc time)
``````
7 Likes

Thank you for four reply, your small example seems to work perfectly, however, i don’t seem able to generalize that to the case where I construct 3 those geometric objects. When I try to do the longer one (well two further Xr/Yr constructions and two further v/k in your code) I even get a time worse than my old one.

So: What am I missing here? Can you elaborate a little bit on why your code is so fast?

I’ve annotated your method below with where the array allocations occur:

``````function myImg(pts::Integer)::Array{Float64,2}
# variables
# rotation of ellipse
aEll = 35.0/180.0*pi
axes = [1/6,1/25]     # allocates a 2-element array
values = collect(linspace(-0.5,0.5,pts))  # allocates a `pts`-element array
# meshes
gridX = [i for i in values, j in values] # allocates a `pts`-by-`pts` matrix
gridY = [j for i in values, j in values] # allocates a `pts`-by-`pts` matrix
# generate ellipse
# rotate by alpha
Xr = cos(aEll).*gridX - sin(aEll).*gridY  # allocates 3 `pts`-by-`pts` matrices
Yr = cos(aEll).*gridY + sin(aEll).*gridX # allocates 3 `pts`-by-`pts` matrices
img = ((1/axes[1]*Xr.^2 + 1/axes[2]*Yr.^2).<=1).*( 10*pi*Yr); # allocates 5 `pts`-by-`pts` matrices
return mod(img-pi,2*pi)+pi # allocates 3 `pts`-by-`pts` matrices
end
``````

Now #17623 will help a little with the `Xr`, `Yr` and `img` lines, reducing them each to one array allocation, but @kristoffer.carlsson’s can eliminate these directly.

A few other minor points (not really performance related, but just good practice):

• Use `sind` and `cosd` instead of `sin(x/180.0*pi)`: these handle cases like `sind(180)` correctly (though it won’t make a huge difference in this case).
• Similarly there is a `mod2pi` function, though in this case you could move the multiplication by `pi` outside, so the last few lines of the above solution would look like:
``````k = v <= 1 ? 10*Yr : 0.0
img[i,j] = (mod(k-1,2)+1)*pi
``````
4 Likes

The code is fast because:

• It only does one (significant) allocation which is the output image. Your original code does an allocation for every vectorized call to store the result of that operation.
• It doesn’t load and reload memory over and over which reduces cache misses.

Regarding why it doesn’t extend to more objects. Maybe you are just making a mistake. If you post your code, it can probably be fixed.

1 Like

Thanks for all the tipps and explanations, I narrowed it down to the following

``````function artificialInSARImage(pts::Integer)::Array{Float64,2}
# variables
values = linspace(-0.5,0.5,pts)
aSteps = 60.0; cosS = cosd(aSteps); sinS = sind(aSteps)
l = 0.075
midP = [.125, .55]
img = zeros(Float64,pts,pts)
@inbounds @fastmath for j in eachindex(values), i in eachindex(values)
Xr = cosS*values[i] - sinS*values[j]
Yr = cosS*values[j] + sinS*values[i]
k3=0
@inbounds @fastmath for m=1:8
k3 = (abs(Xr+midP[1]-m*l)+abs(Yr+midP[2]-m*l)) <= l ? 2*pi*(m/8) : 0.0
end
img[i,j] = mod(k3-pi,2*pi)+pi
end
return img
end
``````

While all others are quite similar to the idea above this loop (even including my idea of the break) seems to break allocations (they get up to 2.19 M and stays there) while all others seem to just keep it down to 9-12 allocations. And somehow I can’t see why. Combining this k3 with k1 even yields 15 M allocations…hm. Still trying to get used to looking for allocations, I think.

Change this to `k3=0.0` and things will be ok.

I suggest reading through the performance tips: http://docs.julialang.org/en/release-0.5/manual/performance-tips/. Especially how to use the `@code_warntype` macro which highlights the problem immidiately:

You have a small type instability: try `@code_warntype`. It will spit out a lot of output, but the top will look like:

``````julia> @code_warntype artificialInSARImage(1024)
Variables:
#self#::#artificialInSARImage
pts::Int64
values::LinSpace{Float64}
aSteps::Float64
cosS::Float64
sinS::Float64
l::Float64
midP::Array{Float64,1}
img::Array{Float64,2}
#temp#@_10::Int64
j::Int64
#temp#@_12::Int64
i::Int64
Xr::Float64
Yr::Float64
k3::Any
#temp#@_17::Int64
m::Int64
#temp#@_19::Float64
#temp#@_20::LambdaInfo
#temp#@_21::Float64
``````

In particular, note `k3` is `Any`, not a concrete type (in the actual output it is highlighted in red).

Change the first line of `k3` to `k3 = 0.0` (though I think your inner loop might be incorrect, as you only use the result of `k3` on the last iteration), changes:

``````julia> @time artificialInSARImage(1024);
0.112607 seconds (9.44 M allocations: 152.000 MB, 11.51% gc time)
``````

to

``````julia> @time artificialInSARImage(1024);
0.023449 seconds (7 allocations: 8.000 MB, 15.50% gc time)
``````

(after warmup).

1 Like

Oh, the `@code_warntype` is very helpful, thank you very much and excuse my beginner error of using `0` instead of `0.0` I already know that (usually).

@simonbyrne yes its meant to be a `+=` but that also works fine if the type is correct. Very nice!

Another tip is that `@inbounds` will apply to everything inside the loop, also new loops. Same with the `@fastmath` macro. So there is no need to add `@inbounds` & co to the inner loop. And just as a warning, if you mess up with your indexing while using the `@inbounds` macro then julia can segfault or give weird results.

If you start julia with `julia --check-bounds=yes` then all the `@inbounds` macros will be turned off if you want to do a run with bounds checking activated again.

2 Likes

Thanks for the clarification, that was part of what I meant with the elaboration, so I removed the inner `@inbounds` and yes I am awayre of the consequences and just to use that if the inner parts are debugged long enough