Using a function on each row of a matrix and getting a similar matrix back

Suppose we have a function acting on a 3-element vector,

function foo(xyz)
    a = xyz[1] + 1
    b = xyz[2] * 5
    c = xyz[3] / 2
    return [a, b, c]
end

It works fine if I use a vector as input, like foo([1,2,3]), but what if I’d like to perform the same operation on each row of an “n by 3” matrix and get a matrix of the same size back?
Is there a recommended way to do this?
Rewriting the function is also an option, if there’s a more elegant way to do so.

I tried using map, but map(foo,[1,2,3]) gives me errors for some reason I don’t understand. Any insights into this?

stack(foo.(eachrow(rand(10,3))))' seems to do the job, but feels a bit clunky, and there must be a better way to do it.

Multiple dispatch: define a method for xyz::Vector and another for xyz::Matrix. See Methods · The Julia Language.
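For instance, a minimal sketch of that idea (assuming the matrix method should apply the row-wise operation and return an n-by-3 result):

function foo(xyz::AbstractVector)
    return [xyz[1] + 1, xyz[2] * 5, xyz[3] / 2]
end

function foo(M::AbstractMatrix)
    # dispatch to the vector method on each row, then stack the results back into rows
    return stack(foo.(eachrow(M)); dims=1)
end

Then foo([1, 2, 3]) and foo(rand(10, 3)) both work.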

1 Like

See mapslices. For example mapslices(foo, rand(10,3); dims=2).

Although if you’re actually using vectors of length 3 (or <50 or so), you might do better to use the StaticArrays.jl package to make a Vector{SVector{3,Float64}} rather than Matrix{Float64}. SVectors are super nice (and fast!) for operations on small vectors, and this format comes pre-sliced so you can simply map foo on it.
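For instance, a small sketch of that layout (the data here is made up):

using StaticArrays

pts = [rand(SVector{3,Float64}) for _ in 1:10]   # "pre-sliced": a Vector of 3-element SVectors
out = map(foo, pts)                              # apply foo to each SVector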

2 Likes

This is equivalent to [foo(1), foo(2), foo(3)], i.e. calling foo for each element of [1,2,3]. This errors because your foo function expects a vector, not a scalar.
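If you do want map here, you would hand it the rows rather than the matrix itself, e.g. (a sketch; note this returns a vector of vectors, not a matrix):

A = rand(10, 3)
rows = map(foo, eachrow(A))   # 10-element Vector, each element a 3-element Vector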

You could simply write a loop:

B = similar(A) # assumes result of `foo` is the same type
@views for i in axes(A,1)
    B[i,:] = foo(A[i,:])
end

Loops give a lot more flexibility. For example, you could even work in-place in A if you want, and maybe have an in-place foo! variant acting on each row so that the whole thing is allocation-free.

PS. Of course, if you are really working with lots of 3-component vectors as in your example, you might want to completely reconsider your data structures. If you know ahead of time that all your vectors have 3 components, e.g. if they are spatial positions in 3d, then it is likely to be more efficient and natural to work with them as SVectors from StaticArrays.jl. And if you have a collection of 3-component vectors, don’t express them as rows of a matrix, express them as 1d arrays of SVectors.
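For instance, a sketch of converting an existing n-by-3 matrix into that layout (assuming Float64 data):

using StaticArrays

M = rand(10, 3)
vs = [SVector{3,Float64}(r) for r in eachrow(M)]   # 1d array of SVectors instead of an n-by-3 matrix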

PPS. You know that Julia arrays are column-major, right? (Unlike numpy.) So it is more efficient to work by column than by row. But arrays of SVector are even better for collections of small fixed-length vectors.

4 Likes

I believe mapslices is the closest to what I was looking for, thanks!

Also, I was not aware of SVectors; I’ll consider them.
Would that also imply that vectors of vectors are faster than matrices for applications like this one?

Thank you very much for the detailed reply, I now understand map a bit better.

As for the foo! variant, would I have to define that as a separate function, or is it akin to using a dot, as in foo.?

I’m used to matrices due to my Matlab background (I’m sure that’s a classic case in this forum). I just find them easier to index for some reason. For example, in an array of 3-dimensional vectors, I’m not sure how to get an array with the 1st element of each of the vectors (with a matrix I’d just do M[:,1]).

Finally, I’m aware of the column-major thing, but I don’t have a solid grasp of its implications. For some reason I thought it might be better to have the “longer” dimension on the columns in general, but probably that’s just nonsense.

If I haven’t made a mistake in measuring, I don’t find any big difference between the column version and the row version.
Even your version doesn’t look out of place next to the (certainly more elegant) one with mapslices:

row vs col
julia> using StaticArrays, BenchmarkTools 

julia> function foo(xyz)
           a = xyz[1]+1
           b = xyz[2]*5
           c = xyz[3]/2
           return[a,b,c]
           end
foo (generic function with 1 method)

julia> function foo!(xyz)
           xyz[1]+=1
           xyz[2]*=5
           xyz[3]/=2
           return xyz
           end
foo! (generic function with 1 method)

julia> A=rand(1:3.,10^5*3);

julia> Mrow=reshape(A,:,3);

julia> Mcol=reshape(A,3,:);

julia> function fooMrow(A)
           B = similar(A) # assumes result of `foo` is the same type
           @views for i in axes(A,1)
               B[i,:] = foo(A[i,:])
           end
           B
       end
fooMrow (generic function with 1 method)

julia> function fooMcol(A)
           B = similar(A) # assumes result of `foo` is the same type
           @views for i in axes(A,2)
               B[:,i] = foo(A[:,i])
           end
           B
       end
fooMcol (generic function with 1 method)

julia> @btime fooMrow($Mrow);
  3.472 ms (100002 allocations: 9.92 MiB)

julia> @btime fooMcol($Mcol);
  3.139 ms (100002 allocations: 9.92 MiB)

julia> function fooMrow!(A)
            # assumes result of `foo` is the same type
           @views for i in axes(A,1)
               foo!(A[i,:])
           end
       end
fooMrow! (generic function with 1 method)

julia> function fooMcol!(A)
            # assumes result of `foo` is the same type
           @views for i in axes(A,2)
               foo!(A[:,i])
           end
       end
fooMcol! (generic function with 1 method)

julia> @btime fooMrow!(cmr) setup=(cmr=copy($Mrow))
  182.700 μs (0 allocations: 0 bytes)

julia> @btime fooMcol!(cmr) setup=(cmr=copy($Mcol))
  200.500 μs (0 allocations: 0 bytes)

julia> @btime mapslices(foo, $Mrow; dims=2);
  4.384 ms (100003 allocations: 9.92 MiB)

julia> @btime stack(foo.(eachrow($Mrow)))';
  3.691 ms (100004 allocations: 10.68 MiB)
1 Like

first.(yourvector)
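For instance, a small sketch with a plain vector of vectors:

vs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
first.(vs)         # [1.0, 4.0]: the 1st component of each vector, like M[:,1]
getindex.(vs, 2)   # [2.0, 5.0]: the 2nd component, like M[:,2]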

1 Like

In this example I don’t find a big improvement from using StaticArrays.
The mutating version works better, and that can’t be done with StaticArrays:

julia> v=[1,2,3.]; V=repeat([v],10^5);

julia> sv=SA[v...]; SV=repeat([sv],10^5);

julia> @btime map(foo,$V);
  2.560 ms (100002 allocations: 8.39 MiB)

julia> @btime map(foo,$SV);
  2.345 ms (100002 allocations: 8.39 MiB)

julia> @btime map(foo!,$V);
  249.900 μs (2 allocations: 781.30 KiB)

The problem is that your foo function returns an ordinary heap-allocated vector, so you aren’t getting the full benefit of StaticArrays.jl. If you instead define foo to return an SVector:

function foo(xyz::SVector{3}) # specialized for 3-component SVectors
    a = xyz[1]+1
    b = xyz[2]*5
    c = xyz[3]/2
    return @SVector[a,b,c]
end

then it becomes about 30x faster on my machine.

PS. SV=repeat([sv],10^5); is clearer and more efficient as SV = fill(sv, 10^5).
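For instance, a usage sketch putting those two pieces together (arbitrary values):

using StaticArrays

SV  = fill(SA[1.0, 2.0, 3.0], 10^5)   # Vector of SVector{3,Float64}
out = map(foo, SV)                    # each call returns a small stack-allocated SVector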

4 Likes

The eachrow version should be faster as it is easier to SIMD.
You’d have to double check that it actually SIMDs, which I wouldn’t trust mapslices to do.

The cost wouldn’t be major; the eachcol version would just need a few extra CPU instructions to transpose the loaded vectors.

To get SIMD, I had to @inline foo!, as Julia’s inlining heuristics aren’t great:

julia> function fooMcol!(A)
            # assumes result of `foo` is the same type
           @views for i in axes(A,2)
               @inline foo!(A[:,i])
           end
       end
fooMcol! (generic function with 1 method)

julia> function fooMrow!(A)
            # assumes result of `foo` is the same type
           @views for i in axes(A,1)
               @inline foo!(A[i,:])
           end
       end
fooMrow! (generic function with 1 method)

julia> @btime fooMrow!(cmr) setup=(cmr=copy($Mrow))
  87.693 μs (0 allocations: 0 bytes)

julia> @btime fooMcol!(cmr) setup=(cmr=copy($Mcol))
  108.776 μs (0 allocations: 0 bytes)

This is mostly bottlenecked by memory bandwidth. Making the arrays much smaller helps see the difference:

julia> A=rand(1:3.,10^3*3);

julia> Mrow=reshape(A,:,3);

julia> Mcol=reshape(A,3,:);

julia> Mrs = similar(Mrow); Mcs = similar(Mcol);

julia> @btime fooMrow!(copyto!($Mrs, $Mrow))
  630.899 ns (0 allocations: 0 bytes)

julia> @btime fooMcol!(copyto!($Mcs, $Mcol))
  1.266 μs (0 allocations: 0 bytes)

and now we get around a 2x performance difference, but that’s in part because fooMcol! is failing to SIMD in the way I thought it would.

1 Like

Thanks @Elrod for delving deeper into this topic. It interests me a lot and I might finally be able to understand more than the surface of it.
In general, from what I understand, it is (would be) more efficient to “walk” the matrices by columns.
Below I illustrate what I understood and some doubts that remain.
Contrary to that general rule, if I understand correctly, you are saying that in this case you would expect it to be more efficient to “walk” the matrix by row, thanks to the parallelism mechanisms (which essentially means SIMD, right?).
Specifically, to help the compiler enable SIMD you use the @inline macro (of which I have only a vague idea of how it works: I believe it copies the body of the function in place rather than emitting a call to it).
At the same time, you expected SIMD to kick in even when going by columns, but that does not happen. How do you figure that out, other than by the fact that it takes longer?

Is SIMD possible here only because the foo function acts individually on the various components of the row vector? That is, the compiler applies the first part of foo (+1) to the first column, the second part (*5) to the second column, and the third (/2) to the third, and all of these independent operations are carried out in parallel by SIMD. Is that so?

Would it be possible (and maybe even useful) to define a fooM!(v) function that uses the mutable version of StaticArrays? If so, how?

That is the conventional wisdom, but I think it’s wrong as often as it is right.
It is more efficient to read memory by column, but often the best way to do that is to walk the matrix by row (like in this example).

Yes. Note that SIMD instructions that load or store memory in contiguous blocks are much faster than those that do random access. When separated by small strides, you can do okay via loading contiguous blocks and then shuffling contents around.

Ideally, Julia’s compiler heuristics would be good enough so that this is unnecessary.

Inlining is necessary because the backend compiler that does SIMD (LLVM) requires it.

@code_llvm is the easiest way.
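For instance, something along these lines (debuginfo=:none just hides the line-number comments):

using InteractiveUtils   # provides @code_llvm (already loaded in the REPL)
@code_llvm debuginfo=:none fooMrow!(copy(Mrow))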
By row:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = or i64 %index, 1
  %27 = getelementptr inbounds double, double* %21, i64 %index
  %28 = bitcast double* %27 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %28, align 8
  %29 = fadd <8 x double> %wide.load, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  store <8 x double> %29, <8 x double>* %28, align 8
  %30 = xor i64 %exit.mainloop.at, -1
  %31 = add nsw i64 %offset.idx, %30
  %32 = add i64 %31, %23
  %33 = getelementptr inbounds double, double* %21, i64 %32
  %34 = bitcast double* %33 to <8 x double>*
  %wide.load451 = load <8 x double>, <8 x double>* %34, align 8
  %35 = fmul <8 x double> %wide.load451, <double 5.000000e+00, double 5.000000e+00, double 5.000000e+00, double 5.000000e+00, double 5.000000e+00, double 5.000000e+00, double 5.000000e+00, double 5.000000e+00>
  store <8 x double> %35, <8 x double>* %34, align 8
  %36 = add i64 %31, %25
  %37 = getelementptr inbounds double, double* %21, i64 %36
  %38 = bitcast double* %37 to <8 x double>*
  %wide.load452 = load <8 x double>, <8 x double>* %38, align 8
  %39 = fmul <8 x double> %wide.load452, <double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01>
  store <8 x double> %39, <8 x double>* %38, align 8
  %index.next = add nuw i64 %index, 8
  %40 = icmp eq i64 %index.next, %n.vec
  br i1 %40, label %middle.block, label %vector.body

No such loop in the by-col version.

Note all the operations on <8 x double>. This is because my CPU (Intel 1165g7) has AVX512 instructions, and I started Julia with appropriate flags (-Cnative,-prefer-256-bit).
Otherwise, you’re likely to see <4 x double> instead if using Intel or AMD.
Apple only has <2 x double>, and both versions are likely to be about equally fast, as the by-column code is using <2 x double> for me anyway. Apple silicon has much faster memory (but slower compute); the faster memory is probably more important here, unless you shrink the arrays by a lot.

2 Likes