Implementation of composite quantile loss for Flux

johnbb · January 3, 2022, 5:49pm

The composite quantile loss for a set of quantiles can be defined as

\sum_i \sum_ j (p_j - 1\{y_i < q_{ji}\}) (y_i - q_{ji})

where p_j \in(0,1) denote the j-th quantile level/probability, y_i the i-th target/observation, and q_{ji} the p_j-quantile for the i-th sample. For a year or two I have been using the following implementations

#  qt size: level, case
function qtloss(qt::AbstractMatrix, y::AbstractVector, prob::AbstractVector; agg = mean)
    err = y .- qt'                         # case x level
    agg( (prob' .- (err .< 0)) .* err )    
end

function qtlossMax(qt::AbstractMatrix, y::AbstractVector, prob::AbstractVector; agg = mean)
    err = y .- qt'
    agg( max.(prob' .* err, (prob' .- 1) .* err) )
end

where the former is slightly faster in my experience. Btw, are there better alternatives with respect to speed and memory usage?

Now I would like to extend this implementation to Flux “image” models with data dimensions WHCN. C would here be the level/probability dimension and the dimensions WHN the “i-summation”. Any suggestions would be appreciated, in particular for GPUs. Reducing to the 2D-case with permutedims/reshape could be one option, but …

ToucheSir · January 3, 2022, 9:45pm

Instead of reshaping the inputs down, you could pad prob and y with size-1 dims. That would also save a transpose on qt and prob down the line.

GitHub - mcabbott/Tullio.jl: ⅀ is worth a try here.

johnbb · January 4, 2022, 10:33am

@ToucheSir Not sure I fully understand, but I made a test using permutedims only (without Tullio) as follows

using Flux, Statistics, BenchmarkTools

# size qt: WHCN where C is no. of quantiles
function qtloss(qt::AbstractArray{T,4}, y::AbstractArray{T,4}, prob::AbstractVector{T}; agg = mean) where T<:Real
    err = permutedims(y, (3,1,2,4)) .- permutedims(qt, (3,1,2,4))
    agg( (prob .- (err .< 0)) .* err )
end

function create_data(; d = 256, vars = 25, n = 32, nprob = 20, device = gpu)
    x    = rand(Float32, d, d, vars, n) |> device
    y    = rand(Float32, d, d, 1, n) |> device
    prob = Float32.(range(0, 1, length=nprob+2)[2:end-1]) |> device
    return x, y, prob
end

device     = gpu
x, y, prob = create_data(device = device);
opt        = ADAM()
model      = Conv((3,3), size(x,3) => length(prob), relu, pad=SamePad()) |> device
loss(x, y) = qtloss(model(x), y, prob)
@btime Flux.train!(loss, Flux.params($model), [($x,$y)], $opt)
#  RTX2070:  24.496 ms (1124 allocations: 101.52 KiB)
#  cpu:       4.632 s (540 allocations: 2.58 GiB)

Training times seem reasonable, but I do not understand the huge difference in memory allocated. Maybe the permutedims could also have been applied before the loss function, but not sure it would make any difference.

ToucheSir · January 8, 2022, 6:58pm

This is what I meant by pre-padding:

function qtloss2(qt::AbstractArray{T,4}, y::AbstractVector{T}, prob::AbstractVector{T}; agg = mean) where T<:Real
    err = reshape(y, 1, 1, 1, :) .- qt
    prob = reshape(prob, 1, 1, :)
    agg((prob .- (err .< 0)) .* err)
end

Allocations and runtime are still higher than I’d expect, but lower than the version with permutedims:

julia> @btime gradient($qtloss, $x, $y, $prob);
  969.884 ms (83 allocations: 1.59 GiB)

julia> @btime gradient($qtloss2, $x, $y2, $prob);
  580.226 ms (78 allocations: 1.25 GiB)

johnbb · January 9, 2022, 7:40pm

Hmm. What is y2 here? In the WHCN case, the targets y has dimension W,H,1,N.

ToucheSir · January 9, 2022, 7:45pm

I didn’t glean that from the spec, so I just made y2 a vector of targets like the original 1-D case. Since y is already the correct shape for broadcasting, you can remove the reshape on the first line and everything should still work just fine.

johnbb · January 9, 2022, 10:34pm

Thanks @ToucheSir . Is there any reason for using reshape(prob, 1, 1, :) instead of reshape(prob, 1, 1, :, 1)?

ToucheSir · January 9, 2022, 10:55pm

No reason in particular, it happened to save a couple of characters for this example. I think it might also be easier to write a helper function that pads this way, but I haven’t actually tried to do so.

Topic		Replies	Views
Code using Flux slow on GPU GPU flux	9	3071	November 6, 2019
Quantile regression in Flux v0.10 Machine Learning flux	10	1381	January 7, 2020
Negative binomial layer Machine Learning flux	5	1111	March 10, 2021
Flux: Locally connected layer performance Machine Learning gpu , performance , flux , tullio , loopvectorization	2	540	February 21, 2021
Flux.jl, Network size reduction New to Julia	1	509	April 12, 2020

Implementation of composite quantile loss for Flux

Related topics