Mixed-precision training of large language models in Julia

I am toying with large language models and the amazing Transformers.jl. Although I am forced to call PyTorch because some models are not supported by Transformers.jl (Falcon), and due to some external forces, I got interested in what kind of secret sauce HuggingFace has. One of them, according to this paper https://arxiv.org/pdf/1910.02054.pdf, is mixed-precision training, where gradients, activations, and weights are stored in Float16, but the optimization of the weights is carried out in Float32. So I was naturally curious whether Transformers.jl would work off the shelf with Float16.
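In pseudocode, one step of that recipe looks roughly like this (my own sketch of the idea with illustrative names; compute_gradient_fp16 is a hypothetical helper, not code from the paper):

# w32 : Float32 "master" weights owned by the optimiser
# w16 : Float16 copy used for the forward and backward pass
w16 = Float16.(w32)                      # cast the weights down
g16 = compute_gradient_fp16(w16, batch)  # forward and backward pass in Float16
w32 .-= lr .* Float32.(g16)              # optimiser step carried out in Float32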

I started by writing a custom adaptor, which was a piece of cake: I just copy-pasted the adaptor from Flux, which led to the following.

using Flux, CUDA, Zygote, Random
using Adapt: adapt
import Adapt: adapt_storage   # we add methods to adapt_storage below
using Functors: fmap

struct FluxFloatAdaptor{T} end

function FluxFloatAdaptor(T::DataType)
	!(T <: AbstractFloat) && error("FluxFloatAdaptor is reserved for floats only")
	FluxFloatAdaptor{T}()
end

# rules for handling structured arrays (adapted from Flux's own cpu/gpu adaptors)
adapt_storage(to::FluxFloatAdaptor{T}, x::AbstractArray{S,N}) where {T,S,N} = adapt(Array{T,N}, x)
adapt_storage(to::FluxFloatAdaptor{T}, x::AbstractRange) where {T} = x
adapt_storage(to::FluxFloatAdaptor{T}, x::Zygote.FillArrays.AbstractFill) where {T} = x
adapt_storage(to::FluxFloatAdaptor{T}, x::CUDA.CUSPARSE.AbstractCuSparseMatrix) where {T} = adapt(Array, x)
adapt_storage(to::FluxFloatAdaptor{T}, x::Zygote.OneElement) where {T} = x
# adapt_storage(to::FluxFloatAdaptor{T}, x::AbstractSparseArray) where {T} = x
adapt_storage(to::FluxFloatAdaptor{T}, x::CUDA.RNG) where {T} = Random.default_rng()
adapt_storage(to::FluxFloatAdaptor{T}, x::AbstractRNG) where {T} = x

const _isleaf = Flux._isleaf   # reuse Flux's internal leaf predicate

cpu_float16(x) = fmap(x -> adapt(FluxFloatAdaptor(Float16), x), x, exclude = _isleaf)
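As a quick sanity check that the adaptor does what I want (a small Dense layer, just for illustration):

m16 = cpu_float16(Dense(3 => 2))
eltype(m16.weight)    # Float16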

Equipped with that, I decided to benchmark inference and gradient computation on GPT2.

textenc = hgf"gpt2:tokenizer"
model_32 = hgf"gpt2:lmheadmodel"
model_16 = cpu_float16(model_32)
gpu_model_32 = gpu(model_32)
gpu_model_16 = gpu(model_16)

tokens = encode(textenc, "Lorem ipsum dolor sit amet, consectetur adipiscing elit. In tristique iaculis arcu. Nullam non purus facilisis, dignissim lorem ut, consectetur nunc. Pellentesque sit amet tortor suscipit odio ultrices egestas at vel tellus. Donec molestie, mauris sed blandit gravida, turpis risus lacinia nulla, ut eleifend nulla justo non justo. Donec finibus dolor non turpis imperdiet dapibus. Integer venenatis ex ut ex cursus venenatis. Interdum et malesuada fames ac ante ipsum primis in faucibus. Aenean varius sapien vel enim molestie aliquet. Maecenas mi leo, dignissim a gravida eget, vestibulum et lorem. Morbi malesuada in metus vel lobortis.")
tokens = gpu(tokens)

julia> @benchmark CUDA.@sync gpu_model_16(tokens)
BenchmarkTools.Trial: 1008 samples with 1 evaluation.
 Range (min … max):  4.244 ms … 52.717 ms  β”Š GC (min … max): 0.00% … 34.93%
 Time  (median):     4.402 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.955 ms Β±  4.738 ms  β”Š GC (mean Β± Οƒ):  3.50% Β±  3.33%

  β–‡β–ˆβ–„        ▁
  β–ˆβ–ˆβ–ˆβ–‡β–β–…β–„β–β–„β–β–…β–ˆβ–ˆβ–„β–…β–„β–β–„β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–„ β–‡
  4.24 ms      Histogram: log(frequency) by time     11.8 ms <

 Memory estimate: 377.58 KiB, allocs estimate: 6684.

julia> @benchmark CUDA.@sync gpu_model_32(tokens)
BenchmarkTools.Trial: 542 samples with 1 evaluation.
 Range (min … max):  7.961 ms … 44.150 ms  β”Š GC (min … max): 0.00% … 44.35%
 Time  (median):     8.526 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.231 ms Β±  4.299 ms  β”Š GC (mean Β± Οƒ):  3.08% Β±  5.12%

  β–†β–ˆ β–‚
  β–ˆβ–ˆβ–…β–ˆβ–‡β–…β–„β–β–β–β–β–β–β–β–β–β–β–„β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–„β–† β–†
  7.96 ms      Histogram: log(frequency) by time     37.9 ms <

 Memory estimate: 397.30 KiB, allocs estimate: 6901.

According to the benchmarks (I ran each benchmark twice to account for compilation), inference with Float16 is almost twice as fast as with Float32, which pretty much matches the paper.

Let’s now try the gradient:

julia> @benchmark CUDA.@sync gradient(m -> sum(m(tokens).hidden_state), gpu_model_16)
BenchmarkTools.Trial: 277 samples with 1 evaluation.
 Range (min … max):  14.271 ms … 53.894 ms  β”Š GC (min … max): 0.00% … 28.71%
 Time  (median):     15.378 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   18.039 ms Β±  9.122 ms  β”Š GC (mean Β± Οƒ):  5.36% Β±  6.97%

  β–ƒβ–ˆβ–†
  β–ˆβ–ˆβ–ˆβ–ˆβ–…β–†β–β–…β–β–‡β–β–„β–β–„β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–…β–ˆβ–‡ β–…
  14.3 ms      Histogram: log(frequency) by time      51.8 ms <

 Memory estimate: 1.92 MiB, allocs estimate: 23417.


julia> @benchmark CUDA.@sync gradient(m -> sum(m(tokens).hidden_state), gpu_model_32)
BenchmarkTools.Trial: 157 samples with 1 evaluation.
 Range (min … max):  26.911 ms … 61.454 ms  β”Š GC (min … max): 0.00% … 18.83%
 Time  (median):     28.012 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   31.922 ms Β±  8.795 ms  β”Š GC (mean Β± Οƒ):  5.24% Β±  7.89%

  β–β–…β–ˆ                                              ▁
  β–ˆβ–ˆβ–ˆβ–†β–„β–β–β–β–β–β–β–β–„β–†β–„β–„β–…β–β–†β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–…β–‡β–ˆβ–„β–„β–„β–β–„β–„β–β–β–„ β–„
  26.9 ms      Histogram: log(frequency) by time      56.5 ms <

 Memory estimate: 1.95 MiB, allocs estimate: 23769.

where we again see the same story. Nice. So far so good. With a few lines of code, we get Float16 on the GPU.

Now, the remaining part to work out is the update of the model. The weights used for the update are kept in Float32 and then converted to Float16. Let’s now try to update the model with Float32 weights using a gradient stored in Float16.

opt_state = Flux.setup(ADAM(), gpu_model_32);
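The gradients gs_16 and gs_32 used below are the outputs of the gradient calls above, i.e. something like:

gs_16 = gradient(m -> sum(m(tokens).hidden_state), gpu_model_16)[1];
gs_32 = gradient(m -> sum(m(tokens).hidden_state), gpu_model_32)[1];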

julia> @benchmark CUDA.@sync Flux.update!(opt_state, gpu_model_32, gs_16)
BenchmarkTools.Trial: 134 samples with 1 evaluation.
 Range (min … max):  36.914 ms … 52.927 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     36.974 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   37.485 ms Β±  2.385 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ
  β–ˆβ–‡β–β–„β–β–β–β–β–β–„β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–… β–„
  36.9 ms      Histogram: log(frequency) by time        51 ms <

 Memory estimate: 2.63 MiB, allocs estimate: 32047.

julia> @benchmark CUDA.@sync Flux.update!(opt_state, gpu_model_32, gs_32)
BenchmarkTools.Trial: 134 samples with 1 evaluation.
 Range (min … max):  36.881 ms … 53.180 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     36.986 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   37.461 ms Β±  2.623 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ
  β–ˆβ–†β–β–„β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–β–… β–„
  36.9 ms      Histogram: log(frequency) by time      52.5 ms <

 Memory estimate: 2.69 MiB, allocs estimate: 32893.

That seems to work as well (it does not give errors). There is no performance gain, but no penalty either, so I consider this fine. I am surprised that the update of the model is more expensive than computing the gradient. This does not seem right, since the operation should be effectively parallel. It is possible that I have some inefficiency I am not aware of (help wanted).

The last remaining piece is to copy the Float32 weights stored in gpu_model_32 into gpu_model_16 to finish the cycle. This I do not know how to do efficiently. The adaptor approach I used above, adjusted for the GPU and written as

struct CUDAFloatAdaptor{T} end

function CUDAFloatAdaptor(T::DataType)
	!(T <: AbstractFloat) && error("CUDAFloatAdaptor is reserved for floats only")
	CUDAFloatAdaptor{T}()
end

# same idea as above, but converting to CuArray{T} (adapted from Flux's CUDA adaptor)
adapt_storage(to::CUDAFloatAdaptor{T}, x::AbstractArray{S,N}) where {T,S,N} = adapt(CUDA.CuArray{T}, x)
adapt_storage(to::CUDAFloatAdaptor{T}, x::AbstractRange) where {T} = x
adapt_storage(to::CUDAFloatAdaptor{T}, x::Zygote.FillArrays.AbstractFill) where {T} = adapt(CUDA.CuArray{T}, collect(x))
adapt_storage(to::CUDAFloatAdaptor{T}, x::CUDA.CUSPARSE.AbstractCuSparseMatrix) where {T} = adapt(Array, x)
adapt_storage(to::CUDAFloatAdaptor{T}, x::Zygote.OneElement) where {T} = CUDA.CuArray{T}(collect(x))
# adapt_storage(to::CUDAFloatAdaptor{T}, x::AbstractSparseArray) where {T} = x
adapt_storage(to::CUDAFloatAdaptor, x::Random.TaskLocalRNG) = CUDA.default_rng()
adapt_storage(to::CUDAFloatAdaptor{T}, x::CUDA.RNG) where {T} = x
adapt_storage(to::CUDAFloatAdaptor, x::AbstractRNG) =
  error("Cannot map RNG of type $(typeof(x)) to GPU. GPU execution only supports Random.default_rng().")

gpu_float16(x) = fmap(x -> adapt(CUDAFloatAdaptor(Float16), x), x, exclude = _isleaf)

is horribly slow

julia> @benchmark  gpu_float16(gpu_model_32)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
 Range (min … max):  152.372 ms … 180.391 ms  β”Š GC (min … max): 2.60% … 15.52%
 Time  (median):     162.814 ms               β”Š GC (median):    7.14%
 Time  (mean Β± Οƒ):   161.825 ms Β±   7.902 ms  β”Š GC (mean Β± Οƒ):  7.90% Β±  4.25%

  β–†          β–‚           β–ˆ                    β–„
  β–ˆβ–„β–β–β–β–β–β–β–β–„β–†β–ˆβ–„β–β–β–β–β–β–β–β–β–β–„β–ˆβ–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„ ▁
  152 ms           Histogram: frequency by time          180 ms <

 Memory estimate: 712.20 MiB, allocs estimate: 2814.

Here, I am effectively stuck. Suggestions and criticisms welcomed.


My guess is that accidental Float64 is the problem. The broadcasts inside the update! will end up promoting, even if the array they write into is a smaller type (similar to this issue).

julia> ADAM()  # old-style "implicit" optimiser, with IdDict, all coeff. Float64
Adam(0.001, (0.9, 0.999), 1.0e-8, IdDict{Any, Any}())

julia> Flux.setup(ans, Float32[1,2])  # converted to new-style Optimisers.jl rule
Leaf(Adam{Float64}(0.001, (0.9, 0.999), 1.0e-8), (Float32[0.0, 0.0], Float32[0.0, 0.0], (0.9, 0.999)))
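For what it’s worth, depending on the Optimisers.jl version, one way around that might be to construct the rule with Float32 hyperparameters explicitly, something like:

rule = Optimisers.Adam(1f-3, (0.9f0, 0.999f0), 1f-8)  # all coefficients Float32
opt_state = Flux.setup(rule, Float32[1, 2])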

I think you want loadmodel!(gpu_model_16, gpu_model_32), which should call copyto! on pairs of arrays.

Can’t you just use the built-in f16, instead of writing these adapter definitions?
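i.e. something like:

model_16     = f16(model_32)   # Flux's built-in eltype converter
gpu_model_16 = gpu(model_16)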

So the f16 works nicely. Thanks for that.
Flux.loadmodel!(gpu_model_16, gpu_model_32) gives me this error

julia> Flux.loadmodel!(gpu_model_16, gpu_model_32)
ERROR: Encountered tied destination parameters with untied and mismatched sources.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] _tie_check(dst::CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}, src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ Flux ~/.julia/packages/Flux/n3cOc/src/loading.jl:29
  [3] loadmodel!(dst::Transformers.Layers.Embed{Nothing, CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}}, src::Transformers.Layers.Embed{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}; filter::Function, cache::Base.IdSet{Any})
    @ Flux ~/.julia/packages/Flux/n3cOc/src/loading.jl:100
  [4] loadmodel!(dst::Transformers.Layers.WithArg{(:token,), Transformers.Layers.Embed{Nothing, CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}}}, src::Transformers.Layers.WithArg{(:token,), Transformers.Layers.Embed{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}; filter::Function, cache::Base.IdSet{Any})
    @ Flux ~/.julia/packages/Flux/n3cOc/src/loading.jl:105

I have not copied the whole error stack, as I think only the top should be important. Any ideas?

Thanks for the help. The code looks cleaner with f16 and it works on GPU arrays as well.

I’m trying to wrap up the mixed-precision recipe into an Optimisers rule.

Feedback welcome.
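The rough idea, in a simplified sketch of my own (not the actual PR code): keep a Float32 master copy of each parameter inside the optimiser state, run the wrapped rule on that copy, and write the result back into the (possibly Float16) model parameters.

using Optimisers

struct MixedPrecisionSketch{R<:Optimisers.AbstractRule} <: Optimisers.AbstractRule
    rule::R
end

function Optimisers.init(o::MixedPrecisionSketch, x::AbstractArray)
    x32 = Float32.(x)                         # Float32 master copy of the weights
    return (x32, Optimisers.init(o.rule, x32))
end

function Optimisers.apply!(o::MixedPrecisionSketch, state, x, dx)
    x32, st = state
    st, dx32 = Optimisers.apply!(o.rule, st, x32, Float32.(dx))  # step in Float32
    x32 .-= dx32                              # update the master copy
    # return a delta such that update! leaves x == eltype(x).(x32)
    return (x32, st), x .- eltype(x).(x32)
end

The actual rule surely handles more details (non-float leaves, etc.), but this is the gist.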

Thanks Carlo. I am installing it, and when I finish, I will try it. I have looked at the implementation and it is similar to what I had been imagining in my head, but your code is way neater. I will keep posting my updates here.

Carlo,

how do you expect the update to be called? With the Float32 model and a Float16 gradient, as

opt_state, u = Flux.update!(opt_state, gpu_model_32, gs_16)

or with the Float16 model, as

opt_state, u = Flux.update!(opt_state, gpu_model_16, gs_16)

The latter is nicer, as the output is in Float16, but then I expect the weights are kept in Float16, right? So I expect the former to be correct, but I want to be sure.

The latter! So one can have a single model in the code and the whole pipeline becomes

model = build_model() |> f16 |> gpu
opt_state = Optimisers.setup(MixedPrecision(Adam(1e-3)), model) 
# opt_state contains the Float32 version of the weights

for batch in train_dataloader
    batch = batch |> f16 |> gpu
    grad = gradient(model -> loss(model, batch), model) 
    # grad should be Float16 if everything ok

    Flux.update!(opt_state, model, grad[1])
end

So the final test with timings is as follows:

using Transformers
using Transformers.HuggingFace
using Flux
using TextEncodeBase
using Flux.Optimisers
using BenchmarkTools
	
textenc = hgf"gpt2:tokenizer"
tokens = encode(textenc, "Lorem ipsum dolor sit amet, consectetur adipiscing elit. In tristique iaculis arcu. Nullam non purus facilisis, dignissim lorem ut, consectetur nunc. Pellentesque sit amet tortor suscipit odio ultrices egestas at vel tellus. Donec molestie, mauris sed blandit gravida, turpis risus lacinia nulla, ut eleifend nulla justo non justo. Donec finibus dolor non turpis imperdiet dapibus. Integer venenatis ex ut ex cursus venenatis. Interdum et malesuada fames ac ante ipsum primis in faucibus. Aenean varius sapien vel enim molestie aliquet. Maecenas mi leo, dignissim a gravida eget, vestibulum et lorem. Morbi malesuada in metus vel lobortis.")
tokens = gpu(tokens)

model = hgf"gpt2:lmheadmodel" |> f16 |> gpu
opt_state = Optimisers.setup(MixedPrecision(Optimisers.Adam(1e-3)), model) 
grad = gradient(m -> sum(m(tokens).hidden_state), model)
Flux.update!(opt_state, model, grad[1])

which I think is pretty neat.
Benchmarking the Float16 version as

model = hgf"gpt2:lmheadmodel" |> f16 |> gpu
opt_state = Optimisers.setup(MixedPrecision(Optimisers.Adam(1e-3)), model) 
@benchmark begin 
    grad = gradient(m -> sum(m(tokens).hidden_state), model)
    Flux.update!(opt_state, model, grad[1])
end

BenchmarkTools.Trial: 87 samples with 1 evaluation.
 Range (min … max):  27.972 ms … 90.758 ms  β”Š GC (min … max): 0.00% … 22.00%
 Time  (median):     54.875 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   57.723 ms Β± 10.192 ms  β”Š GC (mean Β± Οƒ):  2.30% Β±  5.14%

and the Float32 version as

model = hgf"gpt2:lmheadmodel" |> f32 |> gpu
opt_state = Optimisers.setup(Optimisers.Adam(1e-3), model)
julia> @benchmark begin
           grad = gradient(m -> sum(m(tokens).hidden_state), model)
           Flux.update!(opt_state, model, grad[1])
       end

BenchmarkTools.Trial: 74 samples with 1 evaluation.
 Range (min … max):  42.342 ms … 88.827 ms  β”Š GC (min … max): 0.00% … 15.86%
 Time  (median):     63.606 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   68.347 ms Β±  9.911 ms  β”Š GC (mean Β± Οƒ):  3.47% Β±  5.95%

On average, the Float16 version is about 11 ms faster, which is roughly 16%. Nice.

Thanks all for the help. It was interesting.
