Elementwise array-operations and performance

question

#1

I’m trying to get an understanding on how to (most efficiently) handle elementwise array operations. If I construct an image like

function myImg(pts::Integer)::Array{Float64,2}
  # variables
  # rotation of ellipse
  aEll = 35.0/180.0*pi
  axes = [1/6,1/25]
  values = collect(linspace(-0.5,0.5,pts))
  # meshes
  gridX = [i for i in values, j in values]
  gridY = [j for i in values, j in values]
  # generate ellipse
  # rotate by alpha
  Xr = cos(aEll).*gridX - sin(aEll).*gridY
  Yr = cos(aEll).*gridY + sin(aEll).*gridX
  img = ((1/axes[1]*Xr.^2 + 1/axes[2]*Yr.^2).<=1).*( 10*pi*Yr);
  return mod(img-pi,2*pi)+pi
end

@time myImg(1024);

then I admit, the code is mainly taken from MatLab (despite the meshgrid function is replaced by the lines for Xr and Yr, that’s what I worked with the last years).

I would like to do this – merely to lean how Julia works – as efficient as possible: Take the equidistant grid of 1024 points on [0,1] in a rotated manner in order to build an ellipse with main axes axes and a linear increase inside (along the rotated y direction).

For now this code takes on my machine 0.552937 seconds (45.96 k allocations: 163.213 MB, 41.35% gc time) (while MatLab takes for the same roughly 0.15 seconds). This increases in both allocations and and time if I would add two more such elements on rotated grids (circles, squares,…); my approach to for-loops was indeed even slower…

TL;DR: How can I get such an element wise array construction as efficient as possible (mainly w.r.t. time, memory would be nice but not necessary)?


#2

Here is an attempt:

function myImg2(pts::Integer)
  # variables
  # rotation of ellipse
  aEll = 35.0/180.0*pi
  axes_inv = [6.0, 25.0]
  values = linspace(-0.5,0.5,pts)
  # meshes

  img = zeros(Float64, pts, pts)
  cosa = cos(aEll)
  sina = sin(aEll)
  @inbounds @fastmath for j in eachindex(values), i in eachindex(values)
      Xr = cosa*values[i] - sina*values[j]
      Yr = cosa*values[j] + sina*values[i]
      v = (axes_inv[1]*Xr^2 + axes_inv[2]*Yr^2)
      k = v <= 1 ? 10*pi*Yr : 0.0
      img[i,j] = mod(k-pi,2*pi)+pi
    end
  return img
end

You can leave out the fast math macro depending on if you care about proper NaN handling.

# New version
julia> @time myImg2(1024);
  0.012618 seconds (7 allocations: 8.000 MB)

# Original version
julia> @time myImg(1024);
  0.153331 seconds (839 allocations: 144.179 MB, 63.93% gc time)

#3

Thank you for four reply, your small example seems to work perfectly, however, i don’t seem able to generalize that to the case where I construct 3 those geometric objects. When I try to do the longer one (well two further Xr/Yr constructions and two further v/k in your code) I even get a time worse than my old one.

So: What am I missing here? Can you elaborate a little bit on why your code is so fast?


#4

I’ve annotated your method below with where the array allocations occur:

function myImg(pts::Integer)::Array{Float64,2}
  # variables
  # rotation of ellipse
  aEll = 35.0/180.0*pi
  axes = [1/6,1/25]     # allocates a 2-element array
  values = collect(linspace(-0.5,0.5,pts))  # allocates a `pts`-element array
  # meshes
  gridX = [i for i in values, j in values] # allocates a `pts`-by-`pts` matrix
  gridY = [j for i in values, j in values] # allocates a `pts`-by-`pts` matrix
  # generate ellipse
  # rotate by alpha
  Xr = cos(aEll).*gridX - sin(aEll).*gridY  # allocates 3 `pts`-by-`pts` matrices
  Yr = cos(aEll).*gridY + sin(aEll).*gridX # allocates 3 `pts`-by-`pts` matrices
  img = ((1/axes[1]*Xr.^2 + 1/axes[2]*Yr.^2).<=1).*( 10*pi*Yr); # allocates 5 `pts`-by-`pts` matrices
  return mod(img-pi,2*pi)+pi # allocates 3 `pts`-by-`pts` matrices
end

Now #17623 will help a little with the Xr, Yr and img lines, reducing them each to one array allocation, but @kristoffer.carlsson’s can eliminate these directly.

A few other minor points (not really performance related, but just good practice):

  • Use sind and cosd instead of sin(x/180.0*pi): these handle cases like sind(180) correctly (though it won’t make a huge difference in this case).
  • Similarly there is a mod2pi function, though in this case you could move the multiplication by pi outside, so the last few lines of the above solution would look like:
k = v <= 1 ? 10*Yr : 0.0
img[i,j] = (mod(k-1,2)+1)*pi

#5

The code is fast because:

  • It only does one (significant) allocation which is the output image. Your original code does an allocation for every vectorized call to store the result of that operation.
  • It doesn’t load and reload memory over and over which reduces cache misses.

Regarding why it doesn’t extend to more objects. Maybe you are just making a mistake. If you post your code, it can probably be fixed.


#6

Thanks for all the tipps and explanations, I narrowed it down to the following

function artificialInSARImage(pts::Integer)::Array{Float64,2}
  # variables
  values = linspace(-0.5,0.5,pts)
  aSteps = 60.0; cosS = cosd(aSteps); sinS = sind(aSteps)
  l = 0.075
  midP = [.125, .55]
  img = zeros(Float64,pts,pts)
  @inbounds @fastmath for j in eachindex(values), i in eachindex(values)
    Xr = cosS*values[i] - sinS*values[j]
    Yr = cosS*values[j] + sinS*values[i]
    k3=0
    @inbounds @fastmath for m=1:8
      k3 = (abs(Xr+midP[1]-m*l)+abs(Yr+midP[2]-m*l)) <= l ? 2*pi*(m/8) : 0.0
    end
    img[i,j] = mod(k3-pi,2*pi)+pi
  end
  return img
end

While all others are quite similar to the idea above this loop (even including my idea of the break) seems to break allocations (they get up to 2.19 M and stays there) while all others seem to just keep it down to 9-12 allocations. And somehow I can’t see why. Combining this k3 with k1 even yields 15 M allocations…hm. Still trying to get used to looking for allocations, I think.


#7

Change this to k3=0.0 and things will be ok.

I suggest reading through the performance tips: http://docs.julialang.org/en/release-0.5/manual/performance-tips/. Especially how to use the @code_warntype macro which highlights the problem immidiately:


#8

You have a small type instability: try @code_warntype. It will spit out a lot of output, but the top will look like:

julia> @code_warntype artificialInSARImage(1024)
Variables:
  #self#::#artificialInSARImage
  pts::Int64
  values::LinSpace{Float64}
  aSteps::Float64
  cosS::Float64
  sinS::Float64
  l::Float64
  midP::Array{Float64,1}
  img::Array{Float64,2}
  #temp#@_10::Int64
  j::Int64
  #temp#@_12::Int64
  i::Int64
  Xr::Float64
  Yr::Float64
  k3::Any
  #temp#@_17::Int64
  m::Int64
  #temp#@_19::Float64
  #temp#@_20::LambdaInfo
  #temp#@_21::Float64

In particular, note k3 is Any, not a concrete type (in the actual output it is highlighted in red).

Change the first line of k3 to k3 = 0.0 (though I think your inner loop might be incorrect, as you only use the result of k3 on the last iteration), changes:

julia> @time artificialInSARImage(1024);
  0.112607 seconds (9.44 M allocations: 152.000 MB, 11.51% gc time)

to

julia> @time artificialInSARImage(1024);
  0.023449 seconds (7 allocations: 8.000 MB, 15.50% gc time)

(after warmup).


#9

Oh, the @code_warntype is very helpful, thank you very much and excuse my beginner error of using 0 instead of 0.0 I already know that (usually).

@simonbyrne yes its meant to be a += :slight_smile: but that also works fine if the type is correct. Very nice!


#10

Another tip is that @inbounds will apply to everything inside the loop, also new loops. Same with the @fastmath macro. So there is no need to add @inbounds & co to the inner loop. And just as a warning, if you mess up with your indexing while using the @inbounds macro then julia can segfault or give weird results.

If you start julia with julia --check-bounds=yes then all the @inbounds macros will be turned off if you want to do a run with bounds checking activated again.


#11

Thanks for the clarification, that was part of what I meant with the elaboration, so I removed the inner @inbounds and yes I am awayre of the consequences and just to use that if the inner parts are debugged long enough :slight_smile: