I am trying to in-place map some vector functions across slices of multidimensional arrays
For instance, given three matrices: A, B, C = rand(4, 3), rand(4, 3), rand(4, 3)
I wish to efficiently compute the cross product across the rows of B and C, and store the results in A.

So I was wondering, what is the correct/best practice for doing it?
(Sorry if this is a duplicated question, since the operation need is elementary, but I couldn’t find a way for it…)

The best I could get was something like:

# I didn't find a way to cast a 2-D Array into nested Array{Array{Float64}},
# so the line below is just for initializing a storage
tmp = [rand(3) for _ in range(1, 4)]
# 155.633 ns (6 allocations: 768 bytes)
@btime broadcast!(cross, tmp, collect(eachrow($B)), collect(eachrow($C)))
# 154.115 ns (6 allocations: 768 bytes)
@btime map!((b, c) -> cross(b, c), tmp, collect(eachrow($B)), collect(eachrow($C)))

If you are working with 3-component arrays representing position vectors in space (so that you have lots of operations on small arrays of length 3), then I would strongly consider using StaticArrays.

In your example you would just have 1d arrays of SVectors and could use ordinary broadcasting:

julia> using StaticArrays, LinearAlgebra
julia> A, B, C = rand(SVector{3,Float64}, 4), rand(SVector{3,Float64}, 4), rand(SVector{3,Float64}, 4);
julia> A .= B .× C
4-element Vector{SVector{3, Float64}}:
[0.10928103138317774, -0.01855287736466356, -0.023819251068657343]
[0.16095133726726152, 0.34506454774262896, -0.2178141205786981]
[-0.453417349873811, 0.34516082042036367, 0.2587571993077106]
[0.5297917349791741, -0.2985033529890534, -0.3202434018984674]

In general, StaticArrays (or similar) are by far the most efficient and convenient way to work with physical coordinate vectors.

The point is that if you know ahead of time (at compile time) that you have lots of vectors of length 3 (or any other small fixed size), then StaticArrays lets you tell this information to the compiler to get efficient storage (e.g. inlined into arrays, stored in registers, …) and fast operations (e.g. unrolled loops, specialized implementations for particular sizes…).

Note also if you are coming from Matlab or Python or R and have been trained to rearrange all computations into “vectorized” form, one at a time (e.g. first a vectorized cross product, then a vectorized dot product, …), you might want to re-think your habits. Writing your own loop is perfectly fine in Julia too, and can avoid lots of inefficient temporary arrays or suboptimal repeated passes over arrays doing cheap operations (like cross products). Broadcasted operations can still be convenient, but then you should take advantage of Julia’s fusion for nested “dot calls.” See also this blog post.

I guess this is going to limit the capability for applications like sequentially broadcasting multiple vector functions along different dimensions, is that right?
If so, I guess we still don’t have a general solution…?

(Your solution solves my application, very much appreciated!)

Just saw your edit on encouraging implementing for-loop instead of trying to vectorize.
That makes a lot of sense, thanks!

Also note that, in the event that you are using matrices instead of vectors of StaticArrays, you should try to work along columns, not across rows, since Julia’s Array type is column major, like Matlab, and unlike numpy.

If you are doing more general operations along different dimensions of multidimensional arrays, I think looping is a lot more powerful than trying to shoehorn it into broadcasting — especially using Julia’s CartesianIndex machinery to easily express loops over N-dimensional arrays in code that is generic for any N. For example, see this code for wave propagation in N dimensions.