Using Transformers.jl for "is next sentence"

Currently the bert_model from pretrain"" and the model from hgf"" exist in parallel, and their APIs differ slightly.

First of all, both pretrain"bert-uncased_L-12_H-768_A-12" and hgf"bert-base-uncased:fornextsentenceprediction" load the entire model, so loading both would give us duplicate model weights. This is avoidable: if you want to use the model from hgf"", load only the wordpiece and tokenizer from pretrain"", like this:

model = hgf"bert-base-uncased:fornextsentenceprediction"
wordpiece = pretrain"bert-uncased_L-12_H-768_A-12:wordpiece"
tokenizer = pretrain"bert-uncased_L-12_H-768_A-12:tokenizer"

Second, the model from hgf"" only supports batched input, so we need to reshape the input to size (sequence length, batch size) (in Julia the batch dimension is the last axis), like this:

token_indices = reshape(token_indices, length(token_indices), 1)
segment_indices = reshape(segment_indices, length(segment_indices), 1)

where the 1 means the minibatch contains only one sequence.
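To see what the reshape does, here is a toy sketch in pure Julia (no model needed, the token ids below are made up): a length-5 vector becomes a 5×1 matrix, i.e. (sequence length, batch size) = (5, 1).

```julia
# Hypothetical token ids for a 5-token sequence.
toy_indices = [101, 7592, 2088, 102, 102]

# Reshape into (sequence length, batch size); batch size is 1 here.
batched = reshape(toy_indices, length(toy_indices), 1)

size(batched)  # (5, 1)
```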

Finally, call the model with token_indices; the segment_indices must be passed as a keyword argument (matching the behavior of huggingface/transformers).

result = model(token_indices; token_type_ids=segment_indices)

and result.logits is the prediction score you want.
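If you want probabilities rather than raw scores, apply a softmax over the class dimension of the logits. A minimal sketch with a hand-written softmax on dummy 2×1 logits standing in for result.logits (the actual values come from the model; double-check which row corresponds to "is next" for your model version, this is an assumption below):

```julia
# Dummy 2×1 logits in place of result.logits; assuming row 1 ≈ "is next
# sentence" and row 2 ≈ "not next" (verify the class order for your model).
logits = reshape([2.5, -1.0], 2, 1)

# Numerically stable softmax over one column.
softmax_col(x) = exp.(x .- maximum(x)) ./ sum(exp.(x .- maximum(x)))

probs = softmax_col(logits[:, 1])  # probabilities summing to 1
</imports>
```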

The full code:

using Transformers
using Transformers.Basic
using Transformers.Pretrain
using Transformers.HuggingFace

ENV["DATADEPS_ALWAYS_ACCEPT"] = true

model = hgf"bert-base-uncased:fornextsentenceprediction"
wordpiece = pretrain"bert-uncased_L-12_H-768_A-12:wordpiece"
tokenizer = pretrain"bert-uncased_L-12_H-768_A-12:tokenizer"

vocab = Vocabulary(wordpiece)

text1 = "Aesthetic Appreciation and Spanish Art:" |> tokenizer |> wordpiece
text2 = "Insights from Eye-Tracking" |> tokenizer |> wordpiece
formatted_text = ["[CLS]"; text1; "[SEP]"; text2; "[SEP]"]

token_indices = vocab(formatted_text)
segment_indices = [fill(1, length(text1)+2); fill(2, length(text2)+1)]
token_indices = reshape(token_indices, length(token_indices), 1)
segment_indices = reshape(segment_indices, length(segment_indices), 1)

result = model(token_indices; token_type_ids=segment_indices)
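For intuition on the segment_indices line above: every token belonging to the first sentence, plus [CLS] and the first [SEP], gets segment 1, and the rest get segment 2. A quick sanity check in pure Julia with hypothetical wordpiece counts (8 and 6 are made-up lengths for text1 and text2):

```julia
# Hypothetical wordpiece counts for the two sentences.
len1, len2 = 8, 6

# +2 covers [CLS] and the first [SEP]; +1 covers the final [SEP].
seg = [fill(1, len1 + 2); fill(2, len2 + 1)]

# Total length matches the formatted sequence: len1 + len2 + 3 special tokens.
length(seg) == len1 + len2 + 3
```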