Implementation of composite quantile loss for Flux

The composite quantile loss for a set of quantiles can be defined as

\sum_i \sum_ j (p_j - 1\{y_i < q_{ji}\}) (y_i - q_{ji})

where p_j \in(0,1) denote the j-th quantile level/probability, y_i the i-th target/observation, and q_{ji} the p_j-quantile for the i-th sample. For a year or two I have been using the following implementations

#  qt size: level, case
function qtloss(qt::AbstractMatrix, y::AbstractVector, prob::AbstractVector; agg = mean)
    err = y .- qt'                         # case x level
    agg( (prob' .- (err .< 0)) .* err )    
end

function qtlossMax(qt::AbstractMatrix, y::AbstractVector, prob::AbstractVector; agg = mean)
    err = y .- qt'
    agg( max.(prob' .* err, (prob' .- 1) .* err) )
end

where the former is slightly faster in my experience. Btw, are there better alternatives with respect to speed and memory usage?

Now I would like to extend this implementation to Flux “image” models with data dimensions WHCN. C would here be the level/probability dimension and the dimensions WHN the “i-summation”. Any suggestions would be appreciated, in particular for GPUs. Reducing to the 2D-case with permutedims/reshape could be one option, but …

Instead of reshaping the inputs down, you could pad prob and y with size-1 dims. That would also save a transpose on qt and prob down the line.

GitHub - mcabbott/Tullio.jl: ⅀ is worth a try here.

@ToucheSir Not sure I fully understand, but I made a test using permutedims only (without Tullio) as follows

using Flux, Statistics, BenchmarkTools

# size qt: WHCN where C is no. of quantiles
function qtloss(qt::AbstractArray{T,4}, y::AbstractArray{T,4}, prob::AbstractVector{T}; agg = mean) where T<:Real
    err = permutedims(y, (3,1,2,4)) .- permutedims(qt, (3,1,2,4))
    agg( (prob .- (err .< 0)) .* err )
end

function create_data(; d = 256, vars = 25, n = 32, nprob = 20, device = gpu)
    x    = rand(Float32, d, d, vars, n) |> device
    y    = rand(Float32, d, d, 1, n) |> device
    prob = Float32.(range(0, 1, length=nprob+2)[2:end-1]) |> device
    return x, y, prob
end

device     = gpu
x, y, prob = create_data(device = device);
opt        = ADAM()
model      = Conv((3,3), size(x,3) => length(prob), relu, pad=SamePad()) |> device
loss(x, y) = qtloss(model(x), y, prob)
@btime Flux.train!(loss, Flux.params($model), [($x,$y)], $opt)
#  RTX2070:  24.496 ms (1124 allocations: 101.52 KiB)
#  cpu:       4.632 s (540 allocations: 2.58 GiB)

Training times seem reasonable, but I do not understand the huge difference in memory allocated. Maybe the permutedims could also have been applied before the loss function, but not sure it would make any difference.

This is what I meant by pre-padding:

function qtloss2(qt::AbstractArray{T,4}, y::AbstractVector{T}, prob::AbstractVector{T}; agg = mean) where T<:Real
    err = reshape(y, 1, 1, 1, :) .- qt
    prob = reshape(prob, 1, 1, :)
    agg((prob .- (err .< 0)) .* err)
end

Allocations and runtime are still higher than I’d expect, but lower than the version with permutedims:

julia> @btime gradient($qtloss, $x, $y, $prob);
  969.884 ms (83 allocations: 1.59 GiB)

julia> @btime gradient($qtloss2, $x, $y2, $prob);
  580.226 ms (78 allocations: 1.25 GiB)

Hmm. What is y2 here? In the WHCN case, the targets y has dimension W,H,1,N.

I didn’t glean that from the spec, so I just made y2 a vector of targets like the original 1-D case. Since y is already the correct shape for broadcasting, you can remove the reshape on the first line and everything should still work just fine.

1 Like

Thanks @ToucheSir . Is there any reason for using reshape(prob, 1, 1, :) instead of reshape(prob, 1, 1, :, 1)?

No reason in particular, it happened to save a couple of characters for this example. I think it might also be easier to write a helper function that pads this way, but I haven’t actually tried to do so.