How to use ForwardDiff.GradientConfig with a function of several inputs while calculating gradients in parallel

Hello,

I am parallelizing the gradient calculation in a mini-batch stochastic gradient optimization routine. I am wondering what the best way is to configure cfg = ForwardDiff.GradientConfig() when my real-valued function takes several inputs but I need to calculate the gradient with respect to only one of them.

Please see the code posted below for what I am trying to achieve. I want to calculate the gradient with respect to beta (a 14-dimensional parameter) and would like to pass cfg as an argument, if possible (though this is by no means critical).

@everywhere function diff_distributed(beta::Array{T}, data::V, z::R, cfg::W) where {T<:Real, V<:Array{IndexedTable}, R<:DataFrame, W<:ForwardDiff.GradientConfig}
    # one 1x14 gradient per element of data, stacked row-wise with vcat
    # (grads must not reuse the name z, since the argument z is captured inside the loop)
    grads = @distributed (vcat) for k in 1:size(data, 1)
        ForwardDiff.gradient!(Array{Float64}(undef, 1, 14), x -> loglike_not_para(x, data[k], z), beta, cfg, Val{false}())
    end
    # sum the per-chunk gradients over the data chunks
    return sum(grads, dims = 1)
end

I am not sure I understood your question, but ForwardDiff.GradientConfig is not used to select dimensions for differentiation; it is for configuring things like the chunk size:

http://www.juliadiff.org/ForwardDiff.jl/stable/user/advanced.html#Configuring-Chunk-Size-1

If you want derivatives along a subset of the input, use a closure.
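For example, to differentiate only with respect to the first argument of a (throwaway) two-argument function, fix the other argument in the closure:

    using ForwardDiff

    g(a, b) = sum(abs2, a) + b                             # real-valued, two inputs
    b = 2.0
    ForwardDiff.gradient(a -> g(a, b), [1.0, 2.0, 3.0])    # derivative with respect to a only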

If you need more help, please post a self-contained MWE.

This is the underlying code of ForwardDiff.gradient!(result, f, x, cfg) in chunk mode (I modified the source a bit to make it callable on its own):

using ForwardDiff: GradientConfig

function chunk_mode_gradient!(result, f::F, x, cfg::GradientConfig{T,V,N}) where {F,T,V,N}
        @assert length(x) >= N "chunk size cannot be greater than length(x)"    
        # precalculate loop bounds
        xlen = length(x)
        remainder = xlen % N
        lastchunksize = ifelse(remainder == 0, N, remainder)
        lastchunkindex = xlen - lastchunksize + 1
        middlechunks = 2:div(xlen - lastchunksize, N)

        # seed work vectors
        xdual = cfg.duals
        seeds = cfg.seeds
        ForwardDiff.seed!(xdual, x)

        # do first chunk manually to calculate output type
        ForwardDiff.seed!(xdual, x, 1, seeds)
        ydual = f(xdual)
        ForwardDiff.extract_gradient_chunk!(T, result, ydual, 1, N)
        ForwardDiff.seed!(xdual, x, 1)

        # do middle chunks
        for c in middlechunks
            i = ((c - 1) * N + 1)
            ForwardDiff.seed!(xdual, x, i, seeds)
            ydual = f(xdual)
            ForwardDiff.extract_gradient_chunk!(T, result, ydual, i, N)
            ForwardDiff.seed!(xdual, x, i)
        end

        # do final chunk
        ForwardDiff.seed!(xdual, x, lastchunkindex, seeds, lastchunksize)
        ydual = f(xdual)
        ForwardDiff.extract_gradient_chunk!(T, result, ydual, lastchunkindex, lastchunksize)

        # get the value, this is a no-op unless result is a DiffResult
        ForwardDiff.extract_value!(T, result, ydual)
    
        return result
    end
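As a quick sanity check (with a throwaway f and x), calling it should match ForwardDiff.gradient:

    using ForwardDiff

    f = x -> sum(abs2, x)
    x = rand(5)
    cfg = ForwardDiff.GradientConfig(f, x, ForwardDiff.Chunk{2}())
    chunk_mode_gradient!(similar(x), f, x, cfg)    # should be approximately 2 .* x
    ForwardDiff.gradient(f, x, cfg)                # same result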

The loop part is the important one; this is where you would distribute:

        for c in middlechunks
            i = ((c - 1) * N + 1)
            ForwardDiff.seed!(xdual, x, i, seeds)
            ydual = f(xdual)
            ForwardDiff.extract_gradient_chunk!(T, result, ydual, i, N)
            ForwardDiff.seed!(xdual, x, i)
        end

I don’t know if I am right, but if you add the @distributed macro at this level, you basically have a distributed forward-mode gradient, right? (It’s important to let the workers know about every function used inside that distributed loop, with the @everywhere macro, if I am correct.)
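One caveat I am not 100% sure about: a bare @distributed for around that loop mutates result and xdual on the workers, and without a shared array or a reducer those writes are not propagated back to the master process. A rough, untested sketch of one way around it, where each call computes its chunk into its own buffer and the pieces are summed (gradient_chunk and distributed_gradient are just names I made up, and I use pmap here for simplicity):

    using Distributed
    @everywhere using ForwardDiff

    # compute the slice of the gradient belonging to the chunk that starts at index i
    @everywhere function gradient_chunk(f, x, i, chunksize, cfg::ForwardDiff.GradientConfig{T}) where {T}
        xdual = copy(cfg.duals)                               # worker-local dual work vector
        ForwardDiff.seed!(xdual, x)                           # zero out all partials
        ForwardDiff.seed!(xdual, x, i, cfg.seeds, chunksize)  # seed only this chunk
        ydual = f(xdual)
        out = zeros(eltype(x), length(x))                     # stays zero outside this chunk
        ForwardDiff.extract_gradient_chunk!(T, out, ydual, i, chunksize)
        return out
    end

    function distributed_gradient(f, x, cfg::ForwardDiff.GradientConfig{T,V,N}) where {T,V,N}
        starts = 1:N:length(x)
        pieces = pmap(starts) do i
            gradient_chunk(f, x, i, min(N, length(x) - i + 1), cfg)
        end
        return reduce(+, pieces)   # each piece is zero outside its own chunk
    end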

@Tamas_Papp, yes you misunderstood the question.

@longemen3000, thanks for the helpful reply. Yes, a distributed gradient, with the gradient calculated on each chunk of the data; the data is an array of IndexedTables, so each chunk is just one IndexedTable. The final gradient is the sum of the individual gradients, where the sum runs over the elements of the array (we could also use pmap instead of @distributed to achieve what I posted in the code above, as sketched below). I am parallelizing over the elements of the array so that the order of the gradient calculations does not matter.
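For reference, the pmap variant I have in mind would look roughly like this (untested sketch; diff_pmap is just a name, and loglike_not_para is assumed to be defined with @everywhere as before):

    using Distributed
    @everywhere using ForwardDiff

    function diff_pmap(beta, data, z, cfg)
        grads = pmap(1:length(data)) do k
            # gradient (not gradient!) so each worker returns a fresh 14-element vector
            ForwardDiff.gradient(x -> loglike_not_para(x, data[k], z), beta, cfg, Val{false}())
        end
        return sum(grads)   # element-wise sum of the per-chunk gradients
    end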

Based on my understanding of http://www.juliadiff.org/ForwardDiff.jl/stable/user/advanced.html#Configuring-Chunk-Size-1 it seems like I could do

cfg = ForwardDiff.GradientConfig(x -> loglike_not_para(x, y, z), beta, ForwardDiff.Chunk{14}())

where beta is the 14-element variable I am taking the derivative with respect to.

My confusion lies in how to deal with y, which is data[k] in my code and changes with each iteration k.

Does it make sense to define

cfg = ForwardDiff.GradientConfig(x -> loglike_not_para(x, data[k], z), beta, ForwardDiff.Chunk{14}())

in each iteration k before calling ForwardDiff.gradient! in my code? Or, does it make sense to set

const cfg = ForwardDiff.GradientConfig(x -> loglike_not_para(x, y, z), beta, ForwardDiff.Chunk{14}())

which can remain unchanged for the lifetime of the code and can be passed as an argument to the function diff_distributed() in my original code above?
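To make the two options concrete, here is roughly what each would look like in my code (untested sketches; I use a (+) reducer here so the per-chunk gradients are summed directly):

    # Option 1: build a fresh config inside each iteration, so the tag always matches the closure
    grads = @distributed (+) for k in 1:size(data, 1)
        f_k   = x -> loglike_not_para(x, data[k], z)
        cfg_k = ForwardDiff.GradientConfig(f_k, beta, ForwardDiff.Chunk{14}())
        ForwardDiff.gradient(f_k, beta, cfg_k)
    end

    # Option 2: one config built once and reused; the tag check is disabled with Val{false}()
    # (as in my gradient! call above) because the closure passed here is not the same
    # function object the config was built from
    const cfg = ForwardDiff.GradientConfig(x -> loglike_not_para(x, y, z), beta, ForwardDiff.Chunk{14}())
    grads = @distributed (+) for k in 1:size(data, 1)
        ForwardDiff.gradient(x -> loglike_not_para(x, data[k], z), beta, cfg, Val{false}())
    end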