Transformers for NER classification

Hello. I am building a model for NER classification. The model is based on a pre-trained BERT loaded with Transformers.jl. I have a problem with the loss function: the loss is calculated only per sequence (represented by the [CLS] token), not per token.
My classifier is:
const clf = cpu(Chain(Dropout(0.1), Dense(hidden_size, length(labels)), logsoftmax))

and I create the classifier on top of BERT:
const bert_model = cpu(set_classifier(_bert_model, (pooler=_bert_model.classifier.pooler, clf=clf)))

Here labels is the list of NER tags, so length(labels) is the number of tags.

@chengchingwen

What do you mean?

The number of predicted labels equals the number of sequences in the batch, but I want predicted labels for the tokens in each sequence.
For the token classification, I use a classifier layer on top of the pre-trained BERT model.
My classifier is:
Chain(Dropout(0.1), Dense(hidden_size, length(labels2)), logsoftmax)
I calculate predictions as follows:
p = bert_model.classifier.clf(bert_model.classifier.pooler(t[:, 1, :]))

If you are doing sequence labeling, then just apply the cross-entropy loss to each token.

For example:

function loss(model, batch)
    data = batch.input                                  # named tuple (tok = ..., segment = ...)
    E = model.embed(data)                               # token + segment embeddings
    H = model.transformers(E, batch.atten_mask)         # hidden states for every token
    ner = @view model.classifier.ner(H)[:, 2:end-1, :]  # remove the [CLS] and [SEP] token
    ner_loss = Basic.logcrossentropy(batch.ner, ner, batch.mask)  # masked per-token loss
    return ner_loss
end

where model.classifier.ner is your bert_model.classifier.clf. The pooler is not needed.
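
For inference, the same idea applies: run the classifier over all hidden states instead of the pooled [CLS] vector. Here is a minimal sketch assuming the batch layout from the loss example above; the field and label names follow the earlier posts and may need adjusting to your model:

function predict(model, batch, labels)
    Flux.testmode!(model)                            # disable Dropout at inference time
    E = model.embed(batch.input)
    H = model.transformers(E, batch.atten_mask)      # (hidden_size, seq_len, batch)
    p = model.classifier.clf(H)[:, 2:end-1, :]       # per-token log-probabilities, [CLS]/[SEP] dropped
    # argmax over the label dimension gives one tag per token and per sequence
    return [labels[argmax(view(p, :, i, b))] for i in axes(p, 2), b in axes(p, 3)]
end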

Thanks for the help.
When calculating ner_loss, an error arises, since ner is a 3D array with dimensions
(number_of_named_entities, size(batch.atten_mask)[2] - 2, number_of_sequences).
So I use Flux.onehotbatch to encode the labels; however, that gives a 2D array with dimensions
(number_of_named_entities, number_of_tokens_in_batch).
Is it possible in Transformers to produce the labels as a 3D array that matches batch.atten_mask?

I do use a 3D array for the NER labels. Just wrap them with Basic.Vocabulary.
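
For example, a small hedged sketch of that idea (the tag set here is illustrative; the full script below does the same for the conll2003 labels):

using Flux
using Transformers.Basic               # Vocabulary (and the onehot method used below)

ner_labels = ["O", "B-PER", "I-PER"]   # illustrative tag set
ner_vocab  = Vocabulary(ner_labels, "O")   # "O" doubles as the unknown/pad tag

batch_tags = [["B-PER", "I-PER", "O"],     # tags for sequence 1
              ["O", "B-PER", "O"]]         # tags for sequence 2
ner = Flux.onehot(ner_vocab, batch_tags)   # 3D one-hot: labels × tokens × sequences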

I can show you the script that I used to process the conll2003 dataset:

using JSON3
using Arrow
using Flux
using Transformers
using Transformers.Basic   # Vocabulary, getmask, and the onehot method for Vocabulary

const datainfo = open(JSON3.read, "./datasets/conll2003/dataset_info.json")

const pos_labels = collect(datainfo.features.pos_tags.feature.names)
const chunk_labels = collect(datainfo.features.chunk_tags.feature.names)
const ner_labels = collect(datainfo.features.ner_tags.feature.names)

const pos_vocab = Vocabulary(pos_labels, ".")
const chunk_vocab = Vocabulary(chunk_labels, chunk_labels[1])
const ner_vocab = Vocabulary(ner_labels, ner_labels[1])

const trainset = Arrow.Table("./datasets/conll2003/conll2003-train.arrow")
const devset = Arrow.Table("./datasets/conll2003/conll2003-validation.arrow")
const testset = Arrow.Table("./datasets/conll2003/conll2003-test.arrow")

const train_num = length(trainset.id)
const dev_num = length(devset.id)
const test_num = length(testset.id)

function retoken(wp, tk, tokens)
    retokens = Array{String}(undef, 0)
    wordbounds = Array{Int}(undef, 0)
    _len = length(tokens)
    sizehint!(retokens, _len)
    sizehint!(wordbounds, _len)

    for (i, token) in enumerate(tokens)
        ntokens = wp(tk(token))
        append!(retokens, ntokens)
        foreach(_->push!(wordbounds, i), 1:length(ntokens))
    end

    sizehint!(retokens, length(retokens))
    sizehint!(wordbounds, length(wordbounds))

    # @assert wp(tk(join(tokens, ' '))) == retokens
    return retokens, wordbounds
end
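
# Hedged usage sketch (wordpiece and tokenizer are assumed to come from the
# pretrained BERT; the model name is illustrative):
#   using Transformers.Pretrain
#   _bert_model, wordpiece, tokenizer = pretrain"bert-uncased_L-12_H-768_A-12"
#   retokens, wordbounds = retoken(wordpiece, tokenizer, ["EU", "rejects", "German", "call"])
# A word split into several wordpieces repeats its original word index in
# wordbounds, which is what relabel below uses to expand the per-word tags.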

function getbatch(dataset, ids)
    tks = dataset.tokens[ids]
    chks = dataset.chunk_tags[ids]
    poss = dataset.pos_tags[ids]
    ners = dataset.ner_tags[ids]
    return (token=tks, chunk=chks, pos=poss, ner=ners)
end

function relabel(wb, label, labels)
    relabels = Vector{String}(undef, 0)
    sizehint!(relabels, length(labels))
    base = 1
    @assert first(wb) == base
    for i in wb
        l = labels[i] + 1
        if base == i
            push!(relabels, label[l])
            base += 1
        else
            push!(relabels, replace(label[l], r"^B"=>'I'))
        end
    end

    return relabels
end
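
# Hedged illustration of the relabelling rule: only the first wordpiece of a
# word keeps a possible "B-" prefix; continuation pieces are rewritten as "I-".
# Tag ids are 0-based in the Arrow dataset, hence the `+ 1` inside relabel.
example_labels = ["O", "B-PER", "I-PER"]   # illustrative subset of ner_labels
example_wb     = [1, 1, 2]                 # word 1 was split into two wordpieces
example_ids    = [1, 0]                    # per-word 0-based tags: "B-PER", "O"
relabel(example_wb, example_labels, example_ids)   # => ["B-PER", "I-PER", "O"]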

function preprocess(wordpiece, tokenizer, sample)
    token, wb = retoken(wordpiece, tokenizer, sample.token)
    chunk = relabel(wb, chunk_labels, sample.chunk)
    pos = relabel(wb, pos_labels, sample.pos)
    ner = relabel(wb, ner_labels, sample.ner)
    return (token = token, chunk = chunk, pos = pos, ner = ner, bounds = wb)
end

function preprocess_batch(wordpiece, tokenizer, sample)
    batch = length(sample.token)
    token = Vector{Vector{String}}(undef, batch)
    wb = Vector{Vector{Int}}(undef, batch)
    chunk = similar(token)
    pos = similar(token)
    ner = similar(token)

    for i = 1:batch
        token[i], wb[i] = retoken(wordpiece, tokenizer, sample.token[i])
        chunk[i] = relabel(wb[i], chunk_labels, sample.chunk[i])
        pos[i] = relabel(wb[i], pos_labels, sample.pos[i])
        ner[i] = relabel(wb[i], ner_labels, sample.ner[i])
    end

    return (token = token, chunk = chunk, pos = pos, ner = ner, bounds = wb)
end

addsstok(x, start_token = "[CLS]", sep_token = "[SEP]") = [start_token; x; sep_token]

function process(wordpiece, tokenizer, sample)
    batch = preprocess_batch(wordpiece, tokenizer, sample)
    token = batch.token
    tok = map(addsstok, token)

    mask = Basic.getmask(batch.token)
    atten_mask = Basic.getmask(tok)
    tok_id = vocab(tok)   # vocab is the wordpiece Vocabulary built from the pretrained BERT
    segment = ones(Int, size(tok_id))

    pos = Flux.onehot(pos_vocab, batch.pos)
    chunk = Flux.onehot(chunk_vocab, batch.chunk)
    ner = Flux.onehot(ner_vocab, batch.ner)

    bounds = Tuple(batch.bounds)

    return (input = (tok = tok_id, segment = segment), mask = mask, atten_mask = atten_mask,
            pos = pos, chunk = chunk, ner = ner, bounds = bounds)
end
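
A hedged end-to-end sketch of how these pieces fit together (the pretrained model name, the batch indices, and the vocab construction are illustrative, not part of the original script):

using Flux, Transformers, Transformers.Basic, Transformers.Pretrain

_bert_model, wordpiece, tokenizer = pretrain"bert-uncased_L-12_H-768_A-12"
const vocab = Vocabulary(wordpiece)          # wordpiece vocabulary used by process

sample = getbatch(trainset, 1:4)             # a small batch of sentences
batch  = process(wordpiece, tokenizer, sample)
l = loss(bert_model, batch)                  # per-token NER loss from the earlier post
                                             # (adjust the classifier field name to yours)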


Thanks for the help!

@chengchingwen
I trained my model and saved it:
@save joinpath(pwd(), "bert_model_ADAM_1.e-5.bson") bert_model

However, when I try to load the BSON file, the following error occurs:
LoadError: UndefVarError: Transformers not defined
Stacktrace:
 [1] (::BSON.var"#31#32")(m::Module, f::String)
   @ BSON C:\Users\User1\.julia\packages\BSON\N216E\src\extensions.jl:21

Also, does my saved model include wordpiece and tokenizer?

BSON.jl only saves the data. You still need to load all the required packages with using (e.g. Transformers.jl, Flux.jl, etc.). Besides, it would be problematic if you save the model with GPU data. I would recommend doing cpu_model = cpu(bert_model) and then BSON.@save the cpu_model.

No, since we only save the bert_model, but you can do @save joinpath(pwd(), "bert_model_ADAM_1.e-5.bson") bert_model tokenizer wordpiece to save them all in one file. (Remember that you still need to load all the required packages, as mentioned above.)
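
A minimal sketch of that save/load round trip (which submodules you need to load depends on your training project; the ones below are illustrative):

using Flux, Transformers, Transformers.Basic
using BSON: @save, @load

cpu_model = cpu(bert_model)                  # move to CPU before saving
@save joinpath(pwd(), "bert_model_ADAM_1.e-5.bson") cpu_model wordpiece tokenizer

# later, in a fresh session, after the same using statements as in training:
@load joinpath(pwd(), "bert_model_ADAM_1.e-5.bson") cpu_model wordpiece tokenizer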

@chengchingwen
Thanks again for the help.
I have a strange error when using the saved bert_model. I load the trained model in another Julia project. Before running the model to make predictions, I do some preprocessing, and when I vectorize the tokens:
E = model.embed(tok, segment)
the following error occurs:
MethodError: no method matching (::Transformers.Basic.CompositeEmbedding

I have the same using statements as in the training project.