Running a pre-trained BERT on Twitter data using Flux.jl and Transformers.jl

Hi everybody!

I have 8k tweets, each labelled 0 or 1 (binary classification).

Classical machine learning with MLJ.jl does not perform very well on this problem, so I am trying to fit a pre-trained BERT transformer model to see if I can improve on it.

I know how to do this with classical Flux.jl model building, but I am having difficulty using Transformers.jl. I’ve followed the tutorials, but I still don’t understand what’s going on.

So I’ve loaded the pretrained BERT multilingual model:

_bert_model, wordpiece, tokenizer = pretrain"Bert-multilingual_L-12_H-768_A-12"

Then I constructed my classifier by appending an output layer:

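hidden_size = size(_bert_model.classifier.pooler.W, 1)

const clf = gpu(Chain(
    Dense(hidden_size, 1), # binary classification
))

# The wrapping call was lost from the post; set_classifier is assumed here,
# as used in the Transformers.jl BERT example.
const bert_model = gpu(set_classifier(_bert_model, (
    pooler = _bert_model.classifier.pooler,
    clf = clf
)))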

const ps = params(bert_model)
const opt = ADAM(1e-6)

How do I train this model? I already have a train/test DataLoader:

# Train/Test Split (nrow comes from DataFrames, shuffle from Random)
function partitionTrainTest(data; at = 0.8)
    n = nrow(data)
    idx = shuffle(1:n)
    train_idx = view(idx, 1:floor(Int, at*n))
    test_idx = view(idx, (floor(Int, at*n)+1):n)
    return data[train_idx, :], data[test_idx, :]
end

train_df, test_df = partitionTrainTest(df; at=0.8)

# Train/Test DataLoader
train_loader = DataLoader((train_df[:, :tweet], train_df[:, :label]), batchsize=32, shuffle=true)
test_loader = DataLoader((test_df[:, :tweet], test_df[:, :label]), batchsize=32, shuffle=true)

I would appreciate any help.



Just as with a regular layer in Flux, you need to write a loss function for it, like this.

Thanks! I was using this script. How do I feed the data? BERT needs to do some masking. I would probably need something like these functions.

Yes, BERT assumes that the input text is surrounded by a "[CLS]" and a "[SEP]" token to indicate the start and end of the sentence. Besides that, BERT uses 3 kinds of embedding: token embedding, position embedding, and segment embedding. Position embedding is built into bert_model.embed, so you don’t need to worry about it. Segment embedding depends on the pre-training and fine-tuning task, but usually it’s just an array of 1s with the same length as the token input. So, as you can see in the preprocess function, wordpiece + tokenizer break the sentence into tokens, markline adds the special tokens for start and end, vocab converts the tokens into a one-hot-like representation ready for token embedding, and segment is just an array similar to tok but filled with 1. Everything is wrapped in a NamedTuple with the name tok for token embedding and segment for segment embedding.
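To make those steps concrete, here is a minimal sketch in plain Julia. Everything prefixed with toy_ and the tiny TOY_VOCAB are hypothetical stand-ins for wordpiece, tokenizer, markline and vocab, not the Transformers.jl implementations; padding and the mask are built by hand here purely for illustration.

```julia
# Hypothetical stand-ins for the Transformers.jl pieces named above.
toy_tokenize(s) = String.(split(lowercase(s)))     # plays the role of wordpiece + tokenizer
toy_markline(toks) = ["[CLS]"; toks; "[SEP]"]      # wrap with start/end tokens

const TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "stay", "home", "wash", "hands"]
toy_index(tok) = something(findfirst(==(tok), TOY_VOCAB), 2)  # unknown tokens map to "[UNK]"

function toy_preprocess(sentences)
    lines = toy_markline.(toy_tokenize.(sentences))
    maxlen = maximum(length, lines)
    tok = fill(1, maxlen, length(lines))           # start fully padded with "[PAD]" (index 1)
    mask = zeros(Float32, maxlen, length(lines))   # 1 marks a real token, 0 marks padding
    for (j, line) in enumerate(lines)
        for (i, t) in enumerate(line)
            tok[i, j] = toy_index(t)
        end
        mask[1:length(line), j] .= 1
    end
    segment = fill!(similar(tok), 1)               # same shape as tok, filled with 1
    return (tok = tok, segment = segment), mask
end

inputs, mask = toy_preprocess(["Stay home", "wash hands now please"])
```

The returned NamedTuple mirrors the (tok, segment) layout described above; the real preprocess function from the example should still be preferred for actual training.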

Thanks, I will try to implement this. I’ve managed to do it using PyTorch but I really would like to use Flux and Transformers.

I don’t think you need to rewrite this part of the code, since you can just copy that preprocess function. That’s one of the purposes of an example.

Of course I would need to add a binary loss to Flux and all. Do you mind if I send you the code that I’ve made following your example and a small sample of the input data (tweets)?

Flux has a number of built-in loss functions, so that very likely may not be required.

There is a binary cross-entropy loss in Flux, or you can just use cross-entropy with 2 different labels.
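For a single-output head like Dense(hidden_size, 1), the binary cross-entropy on raw logits can be sketched as follows. Flux's built-in Flux.Losses.logitbinarycrossentropy covers the same case; the hypothetical logit_bce below is written out in plain Julia only to show the arithmetic, and the sample logits and labels are made up.

```julia
using Statistics: mean

# Numerically stable binary cross-entropy on raw logits z with targets y in {0, 1}:
#   max(z, 0) - z*y + log(1 + exp(-|z|))
logit_bce(z, y) = mean(@. max(z, 0) - z * y + log1p(exp(-abs(z))))

logits = Float32[2.0 -1.0 0.0]   # 1 × batch, shaped like the output of a Dense(hidden_size, 1) layer
labels = Float32[1 0 1]          # Float32 targets keep the loss in Float32

logit_bce(logits, labels)
```

Keeping both the logits and the labels as Float32 means the loss value stays Float32 as well.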

You can do that, but unless it is something that can’t be shown in public, I would prefer doing it here so we can have more publicly available resources.

I can show both the Julia and Python versions of the model.

Here is the PyTorch: COVID-Classifier/ at main · LabCidades/COVID-Classifier · GitHub
Here is the Flux + Transformers: COVID-Classifier/tweet_classifier_BERT.jl at main · LabCidades/COVID-Classifier · GitHub

What am I doing wrong? PyTorch takes 3 min to train on my NVIDIA 3070 Ti GPU with 9k/1k train/test tweets. (Note: I cannot use a batch size larger than 8, otherwise I run out of GPU RAM, which is 8 GB.) With Julia, I left it training and it did not complete one epoch in 10 min, so I cancelled training.

The code is public but the data is not (it will be as soon as we publish the paper), since it contains the blood, sweat and tears of 6 undergraduate volunteers. The only preprocessing I did on the text was removing tweet handles (with the regex replace in the code).

Thank you for your attention :)

I will take some time to investigate the performance issue. But since Julia + GPU + AD needs some time to compile, the total running time might be longer even if all kernels are optimized for your case. In my personal experience, a single forward + backward run takes about 3 to 5 min to compile on my computer.
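One way to separate compilation time from steady-state cost is to time the first call apart from later ones. In this sketch toy_step is a hypothetical stand-in for one forward + backward pass; the same pattern would apply to a real batch before timing whole epochs.

```julia
# Hypothetical stand-in for one forward + backward pass on a batch.
toy_step(x) = sum(abs2, x .* 2f0)

x = rand(Float32, 16)

t_first  = @elapsed toy_step(x)   # includes JIT compilation for Vector{Float32}
t_second = @elapsed toy_step(x)   # reuses the already compiled code

@info "timings in seconds" t_first t_second
```

If the second timing is much smaller than the first, most of the initial delay was compilation rather than the computation itself.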

I just found that I need to update the BERT example. The training part is totally outdated. Things like l, p = loss(data, label, train_loader.batchsize; mask=mask); grad = gradient(() -> l, ps) won’t work anymore because they depend on a really old version of Flux. With newer Flux, you need to do:

(l, p), back = Flux.pullback(ps) do
    loss(data, label, train_loader.batchsize; mask=mask)
end
grad = back((Flux.Zygote.sensitivity(l), nothing))

Thanks, I will update my code.

In newer versions of Zygote, withgradient can remove a bit of the boilerplate there.

That is still not enough here because we also want to get the prediction vector, not just the loss value.
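A minimal sketch of that pattern, assuming Zygote is installed: the function returns a (loss, prediction) tuple, and the pullback is seeded only for the loss. The toy weight matrix W, the sample x, the target y and the function loss_and_pred are hypothetical and exist only to keep the example small.

```julia
using Zygote

# Hypothetical toy model: a 1×2 weight matrix and a single sample.
W = Float32[0.5 -0.25]
x = Float32[1, 2]
y = 1f0

# Return the loss and the prediction together, like `loss` in the thread.
function loss_and_pred(W, x, y)
    p = sum(W * x)
    return (p - y)^2, p
end

(l, p), back = Zygote.pullback(w -> loss_and_pred(w, x, y), W)

# Seed the loss with one(l) and pass `nothing` for the prediction,
# so only the loss contributes to the gradient.
(gW,) = back((one(l), nothing))
```

Using one(l) as the seed gives it the same number type as the loss, so a Float32 loss produces Float32 gradients.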


Here is my train! function:

# Train
function train!(epoch, train_loader, test_loader)
    @info "start training"
    for e in 1:epoch
        @info "epoch: $e"
        i = 1
        al::Float64 = 0.0
        for batch in train_loader
            data, label, mask = todevice(preprocess(batch[1], batch[2]))
            (l, p), back = Flux.pullback(ps) do
                loss(data, label, train_loader.batchsize; mask=mask)
            end
            #@show l
            a = acc(p, label)
            al += a
            grad = back((Flux.Zygote.sensitivity(l), nothing))
            i += 1
            update!(opt, ps, grad)
            #@show al / i
        end
    end
end

But there is an error in the execution:

julia> train!(2, train_loader, test_loader)
[ Info: start training
[ Info: epoch: 1
ERROR: MethodError: no method matching batchedmul(::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}; transB=true)
Closest candidates are:
  batchedmul(::AbstractArray{T, 3}, ::AbstractArray{T, 3}; transA, transB) where T at /home/storopoli/.julia/packages/Transformers/V363g/src/fix/batchedmul.jl:5
  batchedmul(::AbstractArray{T, N}, ::AbstractArray{T, N}; transA, transB) where {T, N} at /home/storopoli/.julia/packages/Transformers/V363g/src/fix/batchedmul.jl:13
  [1] (::Transformers.var"#8#12"{CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}})(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Transformers ~/.julia/packages/Transformers/V363g/src/fix/batchedmul.jl:45
  [2] (::Transformers.var"#11#back#13"{Transformers.var"#8#12"{CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}}})(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Transformers ~/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:59
  [3] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/basic/mh_atten.jl:207 [inlined]
  [4] (::typeof(∂(attention)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
  [5] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/basic/mh_atten.jl:102 [inlined]
  [6] (::typeof(∂(#_#54)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
  [7] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/basic/mh_atten.jl:80 [inlined]
  [8] (::typeof(∂(Any##kw)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
  [9] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/basic/transformer.jl:69 [inlined]
 [10] (::typeof(∂(λ)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [11] macro expansion
    @ ~/.julia/packages/Transformers/V363g/src/stacks/stack.jl:0 [inlined]
 [12] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/stacks/stack.jl:17 [inlined]
 [13] (::typeof(∂(λ)))(Δ::Tuple{CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, Nothing})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [14] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/bert/bert.jl:55 [inlined]
 [15] (::typeof(∂(#_#9)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [16] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/bert/bert.jl:50 [inlined]
 [17] (::typeof(∂(λ)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [18] Pullback
    @ ./REPL[55]:3 [inlined]
 [19] (::typeof(∂(#loss#4)))(Δ::Tuple{Float64, Nothing})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [20] Pullback
    @ ./REPL[55]:2 [inlined]
 [21] (::typeof(∂(loss##kw)))(Δ::Tuple{Float64, Nothing})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [22] Pullback
    @ ./REPL[62]:10 [inlined]
 [23] (::typeof(∂(λ)))(Δ::Tuple{Float64, Nothing})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [24] (::Zygote.var"#94#95"{Zygote.Params, typeof(∂(λ)), Zygote.Context})(Δ::Tuple{Float64, Nothing})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface.jl:348
 [25] train!(epoch::Int64, train_loader::DataLoader{Tuple{Vector{String}, Vector{Int64}}, Random._GLOBAL_RNG}, test_loader::DataLoader{Tuple{Vector{String}, Vector{Int64}}, Random._GLOBAL_RNG})
    @ Main ./REPL[62]:15
 [26] top-level scope
    @ REPL[65]:1
 [27] top-level scope
    @ ~/.julia/packages/CUDA/9T5Sq/src/initialization.jl:66

The full code can be found here: COVID-Classifier/tweet_classifier_BERT.jl at main · LabCidades/COVID-Classifier · GitHub

The error occurs because some output is being promoted to Float64 at some point, but we need everything to be Float32.
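Promotion of this kind typically comes from mixing Float32 arrays with Float64 scalars or literals. In the trace above, the cotangent entering the loss pullback is already a Tuple{Float64, Nothing}, which suggests the loss value itself is Float64. A minimal sketch in plain Julia (the variable names are illustrative, not taken from the linked script) shows how a single literal changes the type, which is one thing to look for inside the loss function:

```julia
x = rand(Float32, 4)

promoted = x .* 0.5      # 0.5 is a Float64 literal, so the result is Float64
kept     = x .* 0.5f0    # 0.5f0 is a Float32 literal, so the result stays Float32

@show eltype(promoted) eltype(kept)

# A scalar loss is promoted in the same way.
loss64 = sum(x) * 1e-3   # Float64
loss32 = sum(x) * 1f-3   # Float32

@show typeof(loss64) typeof(loss32)
```

Checking typeof(l) right after the forward pass is a quick way to see whether the loss has already left Float32.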

Probably this is the culprit:

al::Float64 = 0.0