Running a pre-trained BERT on Twitter data using Flux.jl and Transformers.jl

Storopoli · September 5, 2021, 4:37pm

Hi everybody!

I have 8k tweets, each labelled 0 or 1 (binary classification).

Classical machine learning with MLJ.jl does not perform very well on this problem. I am trying to fine-tune a pre-trained BERT transformer model to see if I can improve on it.

I know how to do this with classical Flux.jl model building, but I am having difficulty using Transformers.jl. I’ve followed the tutorials, but I still don’t understand what’s going on.

So I’ve loaded the pretrained BERT multilingual model:

_bert_model, wordpiece, tokenizer = pretrain"Bert-multilingual_L-12_H-768_A-12"

Then I constructed my classifier by appending an output layer:

hidden_size = size(_bert_model.classifier.pooler.W ,1)

const clf = gpu(Chain(
    Dropout(0.1),
    Dense(hidden_size, 1), # binary classification
    logsoftmax
))

const bert_model = gpu(
    set_classifier(_bert_model,
                   (
                       pooler = _bert_model.classifier.pooler,
                       clf = clf
                   )
                  )
)

const ps = params(bert_model)
const opt = ADAM(1e-6)

How do I train this model? I already have a Train/Test DataLoader:

# Train/Test Split
function partitionTrainTest(data; at = 0.8)
    n = nrow(data)
    idx = shuffle(1:n)
    train_idx = view(idx, 1:floor(Int, at*n))
    test_idx = view(idx, (floor(Int, at*n)+1):n)
    data[train_idx, :], data[test_idx, :]
end
train_df, test_df = partitionTrainTest(df; at=0.8)

# Train/Test DataLoader
train_loader = DataLoader((train_df[:, :tweet], train_df[:, :label]), batchsize=32, shuffle=true)
test_loader = DataLoader((test_df[:, :tweet], test_df[:, :label]), batchsize=32, shuffle=true)

I would appreciate any help.

@chengchingwen

chengchingwen · September 6, 2021, 12:06am

Just like with a regular layer in Flux, you need to write a loss function for it, like this.

Storopoli · September 6, 2021, 8:53am

Thanks! I was using this script. How do I feed the data? BERT needs to do some masking. I would probably need something like these functions.

chengchingwen · September 7, 2021, 9:17pm

Yes, BERT assumes that the input text is surrounded by a "[CLS]" and a "[SEP]" token to indicate the start and end of the sentence. Besides, BERT uses three kinds of embedding: token embedding, position embedding, and segment embedding. The position embedding is built inside bert_model.embed, so you don’t need to worry about it. The segment embedding depends on the pre-training and fine-tuning task, but usually it’s just a list of 1s with the same length as the token sequence. So, as you can see in the preprocess function, wordpiece + tokenizer break the sentence into tokens, markline adds the special tokens for start and end, vocab converts the tokens into a one-hot-like representation ready for the token embedding, and segment is just an array similar to tok but filled with 1. Everything is wrapped in a NamedTuple with the name tok for the token embedding and segment for the segment embedding.
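
The steps described above can be sketched in plain Julia. This only illustrates the data layout and is not the Transformers.jl API: toy_tokenize, wrap_special and toy_preprocess are hypothetical helpers, whitespace splitting stands in for the wordpiece tokenizer, the vocab (one-hot) step is left out, and the padding mask is added on the assumption that sentences in a batch have different lengths.

```julia
# Hypothetical stand-in for `tokenizer` + `wordpiece`: lowercase and split on whitespace.
toy_tokenize(sentence::AbstractString) = String.(split(lowercase(sentence)))

# Wrap a token list with the special start/end tokens BERT expects.
wrap_special(tokens) = ["[CLS]", tokens..., "[SEP]"]

function toy_preprocess(sentences; pad = "[PAD]")
    tokens = [wrap_special(toy_tokenize(s)) for s in sentences]
    maxlen = maximum(length, tokens)
    # Token matrix of size (maxlen, batch), padded at the end of each column.
    tok = [i <= length(t) ? t[i] : pad for i in 1:maxlen, t in tokens]
    # Segment ids: all ones for single-sentence input, same shape as `tok`.
    segment = ones(Int, size(tok))
    # Mask: 1 for real tokens, 0 for padding.
    mask = Float32.(tok .!= pad)
    return (tok = tok, segment = segment), mask
end

data, mask = toy_preprocess(["Stay home today", "Vaccines work"])
```

Here data.tok is a 5×2 matrix of strings, data.segment has the same shape and is filled with 1, and mask marks the single padded position in the second column with 0.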

Storopoli · September 7, 2021, 9:25pm

Thanks, I will try to implement this. I’ve managed to do it using PyTorch but I really would like to use Flux and Transformers.

chengchingwen · September 7, 2021, 9:30pm

I don’t think you need to rewrite this part of the code, since you can just copy that preprocess function. That’s one of the purposes of an example.

Storopoli · September 7, 2021, 10:52pm

Of course I would need to add a binary loss to Flux and all. Do you mind if I send you the code that I’ve made following your example and a small sample of the input data (tweets)?

ToucheSir · September 8, 2021, 6:44pm

Flux has a number of built-in loss functions, so that very likely may not be required.

chengchingwen · September 8, 2021, 10:42pm

There is a binary cross entropy loss in Flux, or you can just use cross entropy with 2 different labels.
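
To make the two options concrete, here is a minimal sketch in plain Julia, written out by hand rather than calling Flux so the arithmetic is visible. The helper names (toy_logsoftmax, toy_logit_bce, toy_logit_ce) are illustrative and are not Flux functions. It also shows a pitfall worth checking, which is an observation and not something raised in this thread: a log-softmax over a single output is always 0, so a one-unit head needs a sigmoid-style loss, while a two-unit head pairs with log-softmax cross entropy.

```julia
# Numerically stable log-softmax over the first (class) dimension.
function toy_logsoftmax(z::AbstractMatrix)
    m = maximum(z; dims = 1)
    return z .- m .- log.(sum(exp.(z .- m); dims = 1))
end

# Option 1: one logit per example, labels y in {0, 1}.
# Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1 - sigmoid(z))], averaged over the batch.
toy_logit_bce(z, y) = sum(@. max(z, 0) - z * y + log1p(exp(-abs(z)))) / length(y)

# Option 2: two logits per example, labels turned into one-hot columns.
function toy_logit_ce(z::AbstractMatrix, y::AbstractVector{<:Integer})
    onehot = Float32.([c == label for c in 0:1, label in y])
    return -sum(onehot .* toy_logsoftmax(z)) / length(y)
end

y  = [1, 0, 1]
z1 = Float32[2.0, -1.0, 0.5]             # one logit per tweet
z2 = vcat(zeros(Float32, 1, 3), z1')     # logits for class 0 (fixed at 0) and class 1

# Log-softmax over a single row: every entry is 0, so it carries no information.
single = toy_logsoftmax(reshape(z1, 1, :))

bce = toy_logit_bce(z1, y)
ce  = toy_logit_ce(z2, y)                # same value as `bce` up to rounding
```

With the class-0 logit fixed at 0, the two-class softmax reduces to a sigmoid, which is why the two losses agree here. In Flux itself the corresponding built-ins are Flux.logitbinarycrossentropy and Flux.logitcrossentropy.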

Do you mind if I send you the code that I’ve made following your example and a small sample of the input data (tweets)?

You can do that, but unless it is something that can’t be shown in public, I would prefer doing it here so we have more publicly available resources.

Storopoli · September 9, 2021, 1:08am

I can show both the Julia and Python versions of the model.

Here is the PyTorch: https://github.com/LabCidades/COVID-Classifier/blob/main/src/tweet_classifier_BERT.py
Here is the Flux + Transformers: https://github.com/LabCidades/COVID-Classifier/blob/main/src/tweet_classifier_BERT.jl

What am I doing wrong? PyTorch takes 3 min to train on my NVIDIA 3070 Ti GPU with 9k/1k train/test tweets. (Note: I cannot use a batch size larger than 8, otherwise I run out of GPU RAM, which is 8GB.) In Julia, I left it training and it did not complete one epoch in 10 min, so I cancelled the training.

The code is public but the data is not (it will be as soon as we publish the paper), since it contains blood, sweat and tears from 6 undergraduate volunteers. I only preprocessed the text by removing tweet handles (with the regex replace in the code).

Thank you for your attention.

chengchingwen · September 9, 2021, 3:09am

I will take some time to investigate the performance issue. But since Julia + GPU + AD actually requires some time to compile, the total running time might be longer even if all kernels are optimized for your case. In my personal experience, a single forward + backward run takes about 3~5 min of compilation on my computer.
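
One way to separate compilation from steady-state speed is to time the same call twice: the first call includes compilation and the second does not. A minimal sketch in plain Julia, where toy_step is a hypothetical stand-in for a real forward + backward pass:

```julia
# Stand-in for one training step; any function shows the same effect.
toy_step(x) = sum(abs2, x .* 2f0)

x = rand(Float32, 1_000)

first_call  = @elapsed toy_step(x)   # includes compiling toy_step for Vector{Float32}
second_call = @elapsed toy_step(x)   # already compiled: run time only

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

For a comparison against PyTorch, it is probably fairer to run one warm-up batch first and time the epochs after that.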

chengchingwen · September 9, 2021, 3:39am

I just found that I need to update the BERT example. The training part is totally outdated. Things like l, p = loss(data, label, train_loader.batchsize; mask=mask); grad = gradient(() -> l, ps) won’t work anymore because they depend on a really old version of Flux. With newer Flux, you need to do:

(l, p), back = Flux.pullback(ps) do
  loss(data, label, train_loader.batchsize; mask=mask)
end
grad = back((Flux.Zygote.sensitivity(l), nothing))

Storopoli · September 9, 2021, 7:59am

Thanks, I will update my code.

ToucheSir · September 9, 2021, 4:42pm

In newer versions of Zygote, withgradient can remove a bit of the boilerplate there.

chengchingwen · September 9, 2021, 4:48pm

That is still not enough here because we also want to get the prediction vector, not just the loss value.
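
A toy version of the pullback pattern shows why the tuple cotangent is needed. The sketch assumes Zygote is installed and differentiates with respect to an explicit argument instead of params; toy_loss is a hypothetical stand-in for the real loss, which likewise returns both the loss value and the predictions.

```julia
using Zygote

# Returns (loss, predictions), like the `loss` function in the BERT example.
function toy_loss(w, x, y)
    p = w .* x                 # "predictions"
    l = sum(abs2, p .- y)      # scalar loss
    return l, p
end

w = Float32[1, 2]
x = Float32[3, 4]
y = Float32[2, 9]

# `pullback` returns the forward outputs plus a function mapping cotangents to gradients.
(l, p), back = Zygote.pullback(w -> toy_loss(w, x, y), w)

# Seed the loss with 1 and pass `nothing` for the predictions, which are not differentiated.
(grad,) = back((one(l), nothing))
```

The forward pass runs once and yields both l and p, so the predictions can be used for an accuracy metric without a second model evaluation.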

Storopoli · September 16, 2021, 8:52am

Here is my train! function:

# Train
function train!(epoch, train_loader, test_loader)
    @info "start training"
    for e in 1:epoch
        @info "epoch: $e"
        i = 1
        al::Float64 = 0.0
        for batch in train_loader
            data, label, mask = todevice(preprocess(batch[1], batch[2]))
            (l, p), back = Flux.pullback(ps) do
                loss(data, label, train_loader.batchsize; mask=mask)
            end
            #@show l
            a = acc(p, label)
            al += a
            grad = back((Flux.Zygote.sensitivity(l), nothing))
            i += 1
            update!(opt, ps, grad)
            #@show al / i
        end
        test()
    end
end

But there is an error in the execution:

julia> train!(2, train_loader, test_loader)
[ Info: start training
[ Info: epoch: 1
ERROR: MethodError: no method matching batchedmul(::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}; transB=true)
Closest candidates are:
  batchedmul(::AbstractArray{T, 3}, ::AbstractArray{T, 3}; transA, transB) where T at /home/storopoli/.julia/packages/Transformers/V363g/src/fix/batchedmul.jl:5
  batchedmul(::AbstractArray{T, N}, ::AbstractArray{T, N}; transA, transB) where {T, N} at /home/storopoli/.julia/packages/Transformers/V363g/src/fix/batchedmul.jl:13
Stacktrace:
  [1] (::Transformers.var"#8#12"{CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}})(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Transformers ~/.julia/packages/Transformers/V363g/src/fix/batchedmul.jl:45
  [2] (::Transformers.var"#11#back#13"{Transformers.var"#8#12"{CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}}})(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Transformers ~/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:59
  [3] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/basic/mh_atten.jl:207 [inlined]
  [4] (::typeof(∂(attention)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
  [5] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/basic/mh_atten.jl:102 [inlined]
  [6] (::typeof(∂(#_#54)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
  [7] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/basic/mh_atten.jl:80 [inlined]
  [8] (::typeof(∂(Any##kw)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
  [9] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/basic/transformer.jl:69 [inlined]
 [10] (::typeof(∂(λ)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [11] macro expansion
    @ ~/.julia/packages/Transformers/V363g/src/stacks/stack.jl:0 [inlined]
 [12] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/stacks/stack.jl:17 [inlined]
 [13] (::typeof(∂(λ)))(Δ::Tuple{CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, Nothing})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [14] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/bert/bert.jl:55 [inlined]
 [15] (::typeof(∂(#_#9)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [16] Pullback
    @ ~/.julia/packages/Transformers/V363g/src/bert/bert.jl:50 [inlined]
 [17] (::typeof(∂(λ)))(Δ::CuArray{Float64, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [18] Pullback
    @ ./REPL[55]:3 [inlined]
 [19] (::typeof(∂(#loss#4)))(Δ::Tuple{Float64, Nothing})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [20] Pullback
    @ ./REPL[55]:2 [inlined]
 [21] (::typeof(∂(loss##kw)))(Δ::Tuple{Float64, Nothing})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [22] Pullback
    @ ./REPL[62]:10 [inlined]
 [23] (::typeof(∂(λ)))(Δ::Tuple{Float64, Nothing})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface2.jl:0
 [24] (::Zygote.var"#94#95"{Zygote.Params, typeof(∂(λ)), Zygote.Context})(Δ::Tuple{Float64, Nothing})
    @ Zygote ~/.julia/packages/Zygote/ajuwN/src/compiler/interface.jl:348
 [25] train!(epoch::Int64, train_loader::DataLoader{Tuple{Vector{String}, Vector{Int64}}, Random._GLOBAL_RNG}, test_loader::DataLoader{Tuple{Vector{String}, Vector{Int64}}, Random._GLOBAL_RNG})
    @ Main ./REPL[62]:15
 [26] top-level scope
    @ REPL[65]:1
 [27] top-level scope
    @ ~/.julia/packages/CUDA/9T5Sq/src/initialization.jl:66

The full code can be found here: https://github.com/LabCidades/COVID-Classifier/blob/main/src/tweet_classifier_BERT.jl

chengchingwen · September 16, 2021, 9:06am

The error occurs because some outputs are being promoted to Float64 at some point, but we need them to be Float32.
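
A quick way to see how this kind of promotion happens, in plain Julia. The snippet is a generic illustration rather than a diagnosis of the script above: a Float64 literal or array mixed into Float32 arrays silently widens the result, and converting inputs up front keeps everything in Float32. The stack trace shows the cotangent of the loss as Float64, so the inputs to the loss (labels, mask, constants) may be a reasonable first place to look; that is an assumption to verify.

```julia
h = rand(Float32, 4, 3)          # stand-in for Float32 activations

# Broadcasting with a Float64 literal widens the whole result to Float64.
widened = h .* 0.5
# A Float32 literal keeps the element type.
kept = h .* 0.5f0

# A Float64 array (for example a mask built with `ones`) also widens the result.
mask64 = ones(size(h))           # `ones` defaults to Float64
masked = h .* mask64

# Converting inputs once, up front, keeps everything in Float32.
mask32 = Float32.(mask64)
masked32 = h .* mask32

@show eltype(widened) eltype(kept) eltype(masked) eltype(masked32)
```

Checking eltype on each intermediate value of the loss computation is a simple way to find where the first Float64 appears.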

Storopoli · September 16, 2021, 9:24am

Probably this is the culprit:

al::Float64 = 0.0
