Using a function on each row of a matrix and getting a similar matrix back

Suppose we have a function acting on a 3-element vector,

function foo(xyz)
    a = xyz[1] + 1
    b = xyz[2] * 5
    c = xyz[3] / 2
    return [a, b, c]
end

It works fine if I use a vector as input, like foo([1,2,3]), but what if I’d like to perform the same operation on each row of an “n by 3” matrix and get a matrix of the same size back?
Is there a recommended way to do this?
Rewriting the function is also an option, if there’s a more elegant way to do so.

I tried using map, but map(foo,[1,2,3]) gives me errors for some reason I don’t understand. Any insights into this?

stack(foo.(eachrow(rand(10,3))))' seems to do the job, but feels a bit clunky, and there must be a better way to do it.

Multiple dispatch: define a method for xyz::Vector and another for xyz::Matrix. See Methods · The Julia Language.
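For instance, a minimal sketch of that idea (assuming the matrix method should apply the row-wise operation and return an n-by-3 result):

function foo(xyz::AbstractVector)
    return [xyz[1] + 1, xyz[2] * 5, xyz[3] / 2]
end

function foo(M::AbstractMatrix)
    # dispatch to the vector method on each row, then stack the results back into rows
    return stack(foo.(eachrow(M)); dims=1)
end

Then foo([1, 2, 3]) and foo(rand(10, 3)) both work.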

1 Like

See mapslices. For example mapslices(foo, rand(10,3); dims=2).

Although if you’re actually using vectors of length 3 (or <50 or so), you might do better to use the StaticArrays.jl package to make a Vector{SVector{3,Float64}} rather than Matrix{Float64}. SVectors are super nice (and fast!) for operations on small vectors, and this format comes pre-sliced so you can simply map foo on it.
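For instance, a small sketch of that layout (the data here is made up):

using StaticArrays

pts = [rand(SVector{3,Float64}) for _ in 1:10]   # "pre-sliced": a Vector of 3-element SVectors
out = map(foo, pts)                              # apply foo to each SVector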

2 Likes

This is equivalent to [foo(1), foo(2), foo(3)], i.e. calling foo for each element of [1,2,3]. This errors because your foo function expects a vector, not a scalar.
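If you do want map here, you would hand it the rows rather than the matrix itself, e.g. (a sketch; note this returns a vector of vectors, not a matrix):

A = rand(10, 3)
rows = map(foo, eachrow(A))   # 10-element Vector, each element a 3-element Vector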

You could simply write a loop:

B = similar(A) # assumes result of `foo` is the same type
@views for i in axes(A,1)
    B[i,:] = foo(A[i,:])
end

Loops give a lot more flexibility. For example, you could even work in-place in A if you want, and maybe have an in-place foo! variant acting on each row so that the whole thing is allocation-free.

PS. Of course, if you are really working with lots of 3-component vectors as in your example, you might want to completely reconsider your data structures. If you know ahead of time that all your vectors have 3 components, e.g. if they are spatial positions in 3d, then it is likely to be more efficient and natural to work with them as SVectors from StaticArrays.jl. And if you have a collection of 3-component vectors, don’t express them as rows of a matrix, express them as 1d arrays of SVectors.
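For instance, a sketch of converting an existing n-by-3 matrix into that layout (assuming Float64 data):

using StaticArrays

M = rand(10, 3)
vs = [SVector{3,Float64}(r) for r in eachrow(M)]   # 1d array of SVectors instead of an n-by-3 matrix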

PPS. You know that Julia arrays are column-major, right? (Unlike numpy.) So it is more efficient to work by column than by row. But arrays of SVector are even better for collections of small fixed-length vectors.

4 Likes

I believe mapslices is the closest to what I was looking for, thanks!

Also, I was not aware of SVectors; I’ll consider them.
Would that also imply that vectors of vectors are faster than matrices for applications like this one?

Thank you very much for the detailed reply, I now understand map a bit better.

As for the foo! variant, would I have to define that as a separate function, or is it akin to using a dot, as in foo.?

I’m used to matrices due to my Matlab background (I’m sure that’s a classic case in this forum). I just find them easier to index for some reason. For example, in an array of 3-dimensional vectors, I’m not sure how to get an array with the 1st element of each of the vectors (with a matrix I’d just do M[:,1]).

Finally, I’m aware of the column-major thing, but I don’t have a solid grasp of its implications. For some reason I thought it might be better to have the “longer” dimension on the columns in general, but probably that’s just nonsense.

If I haven’t made a mistake in measuring, I don’t find any big difference between the column version and the row version.
Even your version doesn’t look out of place next to the (certainly more elegant) one with mapslices:

row vs col
julia> using StaticArrays, BenchmarkTools 

julia> function foo(xyz)
           a = xyz[1]+1
           b = xyz[2]*5
           c = xyz[3]/2
           return[a,b,c]
           end
foo (generic function with 1 method)

julia> function foo!(xyz)
           xyz[1]+=1
           xyz[2]*=5
           xyz[3]/=2
           return xyz
           end
foo! (generic function with 1 method)

julia> A=rand(1:3.,10^5*3);

julia> Mrow=reshape(A,:,3);

julia> Mcol=reshape(A,3,:);

julia> function fooMrow(A)
           B = similar(A) # assumes result of `foo` is the same type
           @views for i in axes(A,1)
               B[i,:] = foo(A[i,:])
           end
           B
       end
fooMrow (generic function with 1 method)

julia> function fooMcol(A)
           B = similar(A) # assumes result of `foo` is the same type
           @views for i in axes(A,2)
               B[:,i] = foo(A[:,i])
           end
           B
       end
fooMcol (generic function with 1 method)

julia> @btime fooMrow($Mrow);
  3.472 ms (100002 allocations: 9.92 MiB)

julia> @btime fooMcol($Mcol);
  3.139 ms (100002 allocations: 9.92 MiB)

julia> function fooMrow!(A)
            # assumes result of `foo` is the same type
           @views for i in axes(A,1)
               foo!(A[i,:])
           end
       end
fooMrow! (generic function with 1 method)

julia> function fooMcol!(A)
            # assumes result of `foo` is the same type
           @views for i in axes(A,2)
               foo!(A[:,i])
           end
       end
fooMcol! (generic function with 1 method)

julia> @btime fooMrow!(cmr) setup=(cmr=copy($Mrow))
  182.700 μs (0 allocations: 0 bytes)

julia> @btime fooMcol!(cmr) setup=(cmr=copy($Mcol))
  200.500 μs (0 allocations: 0 bytes)

julia> @btime mapslices(foo, $Mrow; dims=2);
  4.384 ms (100003 allocations: 9.92 MiB)

julia> @btime stack(foo.(eachrow($Mrow)))';
  3.691 ms (100004 allocations: 10.68 MiB)
1 Like

first.(yourvector)
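For instance, a small sketch with a plain vector of vectors:

vs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
first.(vs)         # [1.0, 4.0]: the 1st component of each vector, like M[:,1]
getindex.(vs, 2)   # [2.0, 5.0]: the 2nd component, like M[:,2]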

1 Like

In this example I don’t find a big improvement from using StaticArrays.
The mutating version works better, and that can’t be done with StaticArrays:

julia> v=[1,2,3.]; V=repeat([v],10^5);

julia> sv=SA[v...]; SV=repeat([sv],10^5);

julia> @btime map(foo,$V);
  2.560 ms (100002 allocations: 8.39 MiB)

julia> @btime map(foo,$SV);
  2.345 ms (100002 allocations: 8.39 MiB)

julia> @btime map(foo!,$V);
  249.900 μs (2 allocations: 781.30 KiB)

The problem is that your foo function returns an ordinary heap-allocated vector, so you aren’t getting the full benefit of StaticArrays.jl. If you instead define foo to return an SVector:

function foo(xyz::SVector{3}) # specialized for 3-component SVectors
    a = xyz[1]+1
    b = xyz[2]*5
    c = xyz[3]/2
    return @SVector[a,b,c]
end

then it becomes about 30x faster on my machine.

PS. SV=repeat([sv],10^5); is clearer and more efficient as SV = fill(sv, 10^5).
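For instance, a usage sketch putting those two pieces together (arbitrary values):

using StaticArrays

SV  = fill(SA[1.0, 2.0, 3.0], 10^5)   # Vector of SVector{3,Float64}
out = map(foo, SV)                    # each call returns a small stack-allocated SVector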

4 Likes

The eachrow version should be faster as it is easier to SIMD.
You’d have to double check that it actually SIMDs, which I wouldn’t trust mapslices to do.

The cost wouldn’t be major; the eachcol version would just need a few extra CPU instructions to transpose the loaded vectors.

To get SIMD, I had to @inline foo!, as Julia’s inlining heuristics aren’t great:

julia> function fooMcol!(A)
            # assumes result of `foo` is the same type
           @views for i in axes(A,2)
               @inline foo!(A[:,i])
           end
       end
fooMcol! (generic function with 1 method)

julia> function fooMrow!(A)
            # assumes result of `foo` is the same type
           @views for i in axes(A,1)
               @inline foo!(A[i,:])
           end
       end
fooMrow! (generic function with 1 method)

julia> @btime fooMrow!(cmr) setup=(cmr=copy($Mrow))
  87.693 μs (0 allocations: 0 bytes)

julia> @btime fooMcol!(cmr) setup=(cmr=copy($Mcol))
  108.776 μs (0 allocations: 0 bytes)

This is mostly bottlenecked by memory bandwidth. Making the arrays much smaller helps see the difference:

julia> A=rand(1:3.,10^3*3);

julia> Mrow=reshape(A,:,3);

julia> Mcol=reshape(A,3,:);

julia> Mrs = similar(Mrow); Mcs = similar(Mcol);

julia> @btime fooMrow!(copyto!($Mrs, $Mrow))
  630.899 ns (0 allocations: 0 bytes)

julia> @btime fooMcol!(copyto!($Mcs, $Mcol))
  1.266 μs (0 allocations: 0 bytes)

and now we get around a 2x performance difference, but that’s in part because fooMcol! is failing to SIMD in the way I thought it would.

1 Like

Thanks @Elrod for delving deeper into this topic. It interests me a lot and I might finally be able to understand more than the surface of it.
In general, from what I understand, it is (would be) more efficient to “walk” the matrices by columns.
Below I illustrate what I understood and some doubts that remain.
Contrary to that general rule, if I understand correctly, you are saying that in this case you would expect it to be more efficient to “walk” the matrix by row, thanks to the parallelism mechanisms (which essentially means SIMD, right?).
Specifically, to help the compiler enable SIMD you use the @inline macro (of which I have only a vague idea of how it works: I believe it copies the body of the function in place rather than emitting a call to it).
At the same time, you expected SIMD to kick in even when going by columns, but that does not happen. How do you figure that out, other than by the fact that it takes longer?

Is SIMD possible here only because the foo function acts individually on the various components of the row vector? That is, the compiler applies the first part of foo (+1) to the first column, the second part (*5) to the second column, and the third (/2) to the third, and all of these independent operations are carried out in parallel by SIMD. Is that so?

Would it be possible (and maybe even useful) to define a fooM!(v) function that uses the mutable version of StaticArrays? If so, how?

That is the conventional wisdom, but I think it’s wrong as often as it is right.
It is more efficient to read memory by column, but often the best way to do that is to walk the matrix by row (like in this example).

Yes. Note that SIMD instructions that load or store memory in contiguous blocks are much faster than those that do random access. When separated by small strides, you can do okay via loading contiguous blocks and then shuffling contents around.

Ideally, Julia’s compiler heuristics would be good enough so that this is unnecessary.

Inlining is necessary because the backend compiler that does SIMD (LLVM) requires it.

@code_llvm is the easiest way.
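For instance, something along these lines (debuginfo=:none just hides the line-number comments):

using InteractiveUtils   # provides @code_llvm (already loaded in the REPL)
@code_llvm debuginfo=:none fooMrow!(copy(Mrow))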
By row:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = or i64 %index, 1
  %27 = getelementptr inbounds double, double* %21, i64 %index
  %28 = bitcast double* %27 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %28, align 8
  %29 = fadd <8 x double> %wide.load, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  store <8 x double> %29, <8 x double>* %28, align 8
  %30 = xor i64 %exit.mainloop.at, -1
  %31 = add nsw i64 %offset.idx, %30
  %32 = add i64 %31, %23
  %33 = getelementptr inbounds double, double* %21, i64 %32
  %34 = bitcast double* %33 to <8 x double>*
  %wide.load451 = load <8 x double>, <8 x double>* %34, align 8
  %35 = fmul <8 x double> %wide.load451, <double 5.000000e+00, double 5.000000e+00, double 5.000000e+00, double 5.000000e+00, double 5.000000e+00, double 5.000000e+00, double 5.000000e+00, double 5.000000e+00>
  store <8 x double> %35, <8 x double>* %34, align 8
  %36 = add i64 %31, %25
  %37 = getelementptr inbounds double, double* %21, i64 %36
  %38 = bitcast double* %37 to <8 x double>*
  %wide.load452 = load <8 x double>, <8 x double>* %38, align 8
  %39 = fmul <8 x double> %wide.load452, <double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01>
  store <8 x double> %39, <8 x double>* %38, align 8
  %index.next = add nuw i64 %index, 8
  %40 = icmp eq i64 %index.next, %n.vec
  br i1 %40, label %middle.block, label %vector.body

No such loop in the by-col version.

Note all the operations on <8 x double>. This is because my CPU (Intel 1165g7) has AVX512 instructions, and I started Julia with appropriate flags (-Cnative,-prefer-256-bit).
Otherwise, you’re likely to see <4 x double> instead if using Intel or AMD.
Apple only has <2 x double>, and both versions are likely to be about equally fast, as the by-column code is using <2 x double> for me anyway. Apple silicon has much faster memory (but slower compute); the faster memory is probably more important here, unless you shrink the arrays by a lot.

2 Likes