Looping over subset of array indices: Julia vs Python

I am new to Julia and currently trying to emulate some practices I’m used to in Python. One problem I am struggling with is how to best iterate over some of the indices of an array. For example, consider an array with size (nx, ny, nz). I want to apply a function that works on 1D arrays, so I do a loop over nx and ny to work on a subset of the array, say [i, j, :]. What is the best way to achieve this?

Here is an example for some code with a cumulative sum function, where I find my best Julia code is about twice as slow as Python. I can’t quite figure out why exactly Python is doing better here. Here’s the Python code and timings, for reference:

import numpy as np
from scipy.integrate.quadrature import cumtrapz 

(NX, NY, NZ) = (256, 256, 75)
z = np.linspace(0., 1.e3, NZ)
M = np.ones((NX, NY, NZ), dtype='f8')
out = np.empty_like(M)
%timeit cumtrapz(M[100, 100], x=z)

12.4 µs ± 249 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit cumtrapz(M, x=z, axis=2)

44.3 ms ± 2.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

With Julia, I’m using NumericalIntegration.jl which provides a function called cumul_integrate that works similarly to cumtrapz. Here is my first try with Julia:

import NumericalIntegration: integrate, cumul_integrate, TrapezoidalFast
using BenchmarkTools

(NX, NY, NZ) = (256, 256, 75)
z = LinRange(0., 1.e3, NZ)
M = ones(Float64, (NX, NY, NZ))
out = similar(M);
@btime cumul_integrate(z, M[100, 100, :], TrapezoidalFast());

871.545 ns (6 allocations: 3.63 KiB)

@btime begin
    for j=1:256, i=1:256
        out[i, j, :] = cumul_integrate(z, M[i, j, :], TrapezoidalFast())
    end
end

127.190 ms (458752 allocations: 234.00 MiB)

You can see that in the case of only one column, the ~871 ns time is quite a bit faster than Python (12.4 µs), which clearly has more overheads. However, the Julia loop time does not scale as the number of runs of the loop. In this case, 256*256 * 871 ns should be around 57 ms, but the result is about twice as bad (and almost three times slower than Python).

Experimenting a bit, I found a more efficient way in Julia by transposing the original array and using selectdim. This is the fastest I was able to do:

M = ones(Float64, (75, 256, 256))
out = similar(M);
@btime begin
    for j=1:256, i=1:256
        @inbounds out[:, i, j] = cumul_integrate(z, selectdim(selectdim(M, 2, i), 2, j), TrapezoidalFast())
    end
end

85.986 ms (786432 allocations: 201.00 MiB)

Unfortunately, it is still about twice as slow as Python. Am I doing something obviously wrong here?

Any tips on how to better run loops in Julia calling functions with lower dimensionality would be very welcome. In this case, the cumulative sum function is easy enough to code from scratch, but in the general case this is not always possible/desirable.

Thanks!

The Julia performance tips from the manual have a lot of really useful suggestions. In fact, they already cover all three suggestions I would have made:

  1. Avoid global variables: https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables-1
  2. Use views for your slices: https://docs.julialang.org/en/v1/manual/performance-tips/#Consider-using-views-for-slices-1 (this is what numpy does by default, which is one important difference between your Julia and Python code.
  3. Access arrays in memory order: https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-column-major-1 (you’ve already discovered this when you found out that accessing out[:, i, j] was faster than out[i, j, :]).
5 Likes

What is the purpose of using selectdim instead of ordinary indexing?

I now realise selectdim was a silly way to get a view. Changing that bit to instead view(M, :, i, j) does a little better - about 18% faster.

Did you try wrapping the loop in a function and then passing your input and output arrays as arguments to that function?

1 Like

Thanks for the tip! Indeed putting it into a function makes it about 3x faster, or almost 5x faster than my first naive attempt. So it appears the global variables was the key factor here.

By the way, are there more elegant ways to iterate over sub-indices of an array instead of the explicit double loop I did in the example? eachindex() is the obvious candidate, but it reduces to single elements in the array, while I need a slice (1D or otherwise). In this case, I replaced the following:

for j=1:256, i=1:256
    out[:, i, j] = do_something(view(M, :, i, j))
end

using CartesianIndices in this way:

for J in CartesianIndices(M[1, :, :])
    out[:, J] = do_something(view(M, :, J))
end

It seems to have a slight performance penalty over the first loop. Is there a better way to do this?

Remember that M[1, :, :] creates a copy of that part of the array, so you’re creating a big copy of an array just to construct a set of indices (which should be trivial to do just from the dimensions of the array).

Instead, you can use the fact that CartesianIndices also happily accepts a tuple of sizes (see ?CartesianIndices for more):

julia> M = rand(2, 2, 2)
2×2×2 Array{Float64,3}:
[:, :, 1] =
 0.0337042  0.245066
 0.348989   0.587323

[:, :, 2] =
 0.514845  0.772325
 0.297874  0.709676

julia> CartesianIndices(M[1, :, :])
2×2 CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}:
 CartesianIndex(1, 1)  CartesianIndex(1, 2)
 CartesianIndex(2, 1)  CartesianIndex(2, 2)

julia> CartesianIndices((size(M, 2), size(M, 3)))
2×2 CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}:
 CartesianIndex(1, 1)  CartesianIndex(1, 2)
 CartesianIndex(2, 1)  CartesianIndex(2, 2)
1 Like

Not at my computer, but doesn’t size(M, (2,3)) work?

Edit: Apparently not. Odd, since Matlab has that, and it’s normally the other way around :thinking:

That sounds like a job for broadcasting!

julia> size.(Ref(M), (2, 3))
(2, 2)
2 Likes

Thank you for the suggestions! I do realise this can be done from the size of the array, just trying to find out if there was a simpler way to do it (doesn’t look like it).

@rdeits Yes, I understand that M[1, :, :1] creates a copy. In my example the performance penalty was minimal, but it is a sloppy way to go around it. Seems like CartesianIndices(view(M, 1, :, :)) is the way to go, at least the way with fewer characters.