Is there anything like vmap to vectorize a computation?

I want to evaluate Julia for a project, and so far it has been straightforward. The setting is the following: I have a bunch of functions (moment-generating functions over different parameters) for which I want to compute all the Taylor coefficients up to a certain order. My first try looks roughly like this:

using TaylorSeries
using LinearAlgebra
using CUDA

function moment_generating_function(u, vs)
    # more complicated in reality
    return simplex_volume(vs) * norm(u)
end

"""
    simplex_volume(vs)

Calculate the volume of a simplex given its vertices.

`vs` is a matrix of size (n+1, n) whose rows are the vertices of the simplex,
where n is the dimension of the simplex. Returns the volume as a Float64.
"""
function simplex_volume(vs)
    n = size(vs, 2)
    # subtract the first vertex (as a 1×n row) from the remaining rows
    return abs(det(vs[2:end, :] .- vs[1:1, :])) / factorial(n)
end


"""
    moment_generating_taylor_coefficients(vs, order)

Calculate the Taylor series expansion of the moment generating function of a simplex around 0.
"""
function moment_generating_taylor_coefficients(vs, order)
    d = size(vs, 2)
    u = set_variables("u", numvars=d, order=order)
    println("Taylor series variables: ", u)
    F = moment_generating_function(u, vs)
    c = F.coeffs
    println("Taylor series coefficient size: ", size(c))
    return c
end

# Main script
df = [1.0 1.0; 2.0 5.0; 3.0 2.0]
println(size(df))
order = 7
taylor_coeffs = moment_generating_taylor_coefficients(df, order)

So far, this works great! But I have up to a few thousand of these simplices, and therefore I want to compute the Taylor coefficients on the GPU. In PyTorch, I would now call vmap(lambda df: moment_generating_taylor_coefficients(df, order), in_dims=0)(simplices) to map moment_generating_taylor_coefficients along the batch dimension. How can I achieve something similar in Julia? I am a bit lost with all the packages.
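On the CPU I could of course just map over a plain list of simplices (a sketch; the simplices vector here is a made-up stand-in holding one (n+1)×n matrix per simplex):

# hypothetical CPU-only batch: a Vector with one (n+1)×n vertex matrix per simplex
simplices = [df, df]
all_coeffs = map(s -> moment_generating_taylor_coefficients(s, order), simplices)

but with thousands of simplices I want the whole thing to run batched on the GPU.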

I guess I also need to change the data type of set_variables.

I looked into GPU kernels, but that looked extremely low-level.

Tbh, I even struggle to compute the volumes (determinants) of a batch of simplices. Afaik there is no batch dimension in the determinant function, and as far as I know plain for loops are not vectorized on the GPU (otherwise I could just loop here).

Looking at the TaylorSeries package, there is no extension for CUDA or other GPU packages and nothing in the docs about it; the same goes for TaylorDiff. So, for now, I guess you can't? Note that you can make your code a lot faster using

abs(det(@view(vs[2:end, :]) .- @view(vs[1:1, :]))) / factorial(n)

or, even better, use a loop (if you stay on the CPU). AcceleratedKernels.jl can help on the GPU, but not for TaylorSeries.
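For instance, a plain loop over the batch (a sketch, assuming the batch is stored as an (n+1, n, b) array with the batch dimension last, and batch_volumes is just an illustrative name):

# CPU sketch: volumes of a batch of b simplices stored as an (n+1, n, b) array
function batch_volumes(simplices)
    b = size(simplices, 3)
    vols = Vector{Float64}(undef, b)
    for k in 1:b
        vols[k] = simplex_volume(@view simplices[:, :, k])
    end
    return vols
end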

About your code: note that norm is not compatible with TaylorSeries (F ends up a plain float); you should use sqrt(sum(abs2, u)) instead, and do not expand around 0, since sqrt is not differentiable there.
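Something like this (a sketch of the replacement, shifting the expansion point to 1):

using TaylorSeries

u = set_variables("u", numvars=2, order=7)
# sum(abs2, ...) keeps the result a TaylorN; expand around 1 because sqrt is not analytic at 0
F = sqrt(sum(abs2, 1 .+ u))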

Thanks for the tip. Actually, judging from the issues, TaylorSeries is compatible? Looks like it broke recently though: Fix from #44 to work with CUDA arrays Does Not Work in Latest Version · Issue #93 · JuliaDiff/TaylorDiff.jl · GitHub

It works fine on the GPU, I just tested it. I have a little issue with indexing inside det for some reason, but if you replace det with something else, it works.

How did you try it? I tried it like in the issue and got the same error. Maybe I need to switch to another package, like the AcceleratedKernels.jl you mentioned.

This works:

using TaylorSeries
using LinearAlgebra
using CUDA

function moment_generating_function(u, vs)
    # more complicated in reality
    return simplex_volume(vs) * sum(abs2,u)
end

"""
    simplex_volume(vs)

Calculate the volume of a simplex given its vertices.

`vs` is a matrix of size (n+1, n) whose rows are the vertices of the simplex,
where n is the dimension of the simplex. Returns the volume as a Float64.
"""
function simplex_volume(vs)
    n = size(vs, 2)
    # det replaced by sum here, just to check that the rest runs on the GPU
    return abs(sum(@views(vs[2:end, :] .- vs[1:1, :]))) / factorial(n)
end


"""
    moment_generating_taylor_coefficients(vs, order)

Calculate the Taylor series expansion of the moment generating function of a simplex around 0.
"""
function moment_generating_taylor_coefficients(vs, order)
    d = size(vs, 2)
    u = set_variables("u", numvars=d, order=order)
    println("Taylor series variables: ", u)
    F = moment_generating_function(u, vs)
    # println("Taylor series coefficient size: ", size(c))
    return F
end

# Main script
df = cu([1.0 1.0; 2.0 5.0; 3.0 2.0])
println(size(df))
order = 7
taylor_coeffs = moment_generating_taylor_coefficients(df, order)

but it's not what you wanted to do, since I removed the det and norm that broke the code. This is closer, but I don't know about the performance (note that I expand around 1):

using TaylorSeries
using LinearAlgebra
using CUDA

function moment_generating_function(u, vs)
    # more complicated in reality
    return simplex_volume(vs) * sqrt(sum(abs2,u))
end

"""
    simplex_volume(vs)

Calculate the volume of a simplex given its vertices.

`vs` is a matrix of size (n+1, n) whose rows are the vertices of the simplex,
where n is the dimension of the simplex. Returns the volume as a Float64.
"""
function simplex_volume(vs)
    n = size(vs, 2)
    # subtract the first vertex (as a 1×n row) from the remaining rows
    return abs(det(@views(vs[2:end, :] .- vs[1:1, :]))) / factorial(n)
end


"""
    moment_generating_taylor_coefficients(vs, order)

Calculate the Taylor series expansion of the moment generating function of a simplex around 1.
"""
function moment_generating_taylor_coefficients(vs, order)
    d = size(vs, 2)
    u = set_variables("u", numvars=d, order=order)
    println("Taylor series variables: ", u)
    # scalar indexing happens inside det; allow it for now
    CUDA.@allowscalar F = moment_generating_function(1 .+ u, vs)
    return F
end

# Main script
df = cu([1.0 1.0; 2.0 5.0; 3.0 2.0])
println(size(df))
order = 7
CUDA.@time taylor_coeffs = moment_generating_taylor_coefficients(df, order)
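To get the coefficients back out of the result, the field access from your original script should still work (a sketch):

c = taylor_coeffs.coeffs  # one HomogeneousPolynomial per order, 0 through 7
println(size(c))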

Oh wow, that's progress! Thank you, really cool.

I guess now I just have to figure out how to go from a single simplex (of shape (n+1, n)) to a batch of simplices of shape (b, n+1, n).

GPU arrays implement generic Julia functions as much as possible, including the broadcasting and mapreduce machinery, reducing the need for dedicated constructs like vmap. I haven't tried it, but in this case, my first attempt would be the following:

mapslices(df -> moment_generating_taylor_coefficients(df, 7), simplices; dims=[1, 2])

where simplices has shape (n+1, n, b), with the batch dimension last so that the slices are contiguous in memory. (Julia arrays are column major, i.e., dimensions are ordered from "inner" to "outer", opposite to Python.) Note that mapslices takes the dimensions that make up each slice, dims=[1, 2], rather than the batch dimension.
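For concreteness, a sketch with a toy batch built from two copies of your example simplex (the cat call and the batch size are just for illustration):

# toy batch of b = 2 simplices, shape (3, 2, 2), batch dimension last
simplices = cat(df, df; dims=3)
# each slice simplices[:, :, k] is one (n+1)×n simplex
coeffs = mapslices(s -> moment_generating_taylor_coefficients(s, 7), simplices; dims=[1, 2])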

According to this thread:

this is not really working, afaik.

I see, that makes sense: mapping and broadcasting can't replace, as @maleadt said in that thread,

a tensor compiler that knows how to analyze the nested array ops and generate a single kernel

This sounds similar to what Reactant.jl aims to become. I haven’t tried it and don’t know whether it could solve your problem today.

If your n is fairly small (< 10, maybe), then you can try a CuVector of SMatrix, using StaticArrays.jl. That's the same memory layout as a size (n+1, n, b) CuArray, but these static arrays are like scalars (or complex numbers) as far as CUDA.jl is concerned, which allows some mapslices-like things to be done.
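A minimal sketch of that idea for n = 2 (simplex_volume_static, the example vertices, and the batch of 1000 copies are all made up for illustration):

using CUDA, LinearAlgebra, StaticArrays

# volume of a single 2-simplex stored as a static 3×2 matrix (rows are vertices)
function simplex_volume_static(vs)
    edges = vs[SVector(2, 3), :] .- vs[SVector(1), :]  # 2×2 edge matrix, statically sized
    return abs(det(edges)) / 2                         # divide by factorial(n) with n = 2
end

vs = SMatrix{3,2}(1.0, 2.0, 3.0, 1.0, 5.0, 2.0)  # same vertices as df, column-major
simplices = CUDA.fill(vs, 1000)                  # CuVector of 1000 static matrices
vols = map(simplex_volume_static, simplices)     # runs as a single GPU kernel

The det here is StaticArrays' unrolled determinant for small fixed sizes, so no scalar indexing into GPU memory is needed.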