Currently, the bert_model from pretrain"" and the model from hgf"" exist side by side as independent implementations, and their APIs differ slightly.
First of all, both pretrain"bert-uncased_L-12_H-768_A-12" and hgf"bert-base-uncased:fornextsentenceprediction" load the entire model, so loading both keeps duplicate model weights in memory. This is avoidable: if you want to use the model from hgf, load only the wordpiece and tokenizer from pretrain"", like this:
model = hgf"bert-base-uncased:fornextsentenceprediction"
wordpiece = pretrain"bert-uncased_L-12_H-768_A-12:wordpiece"
tokenizer = pretrain"bert-uncased_L-12_H-768_A-12:tokenizer"
Second, the model from hgf"" only supports batched input, so we need to reshape the input to size (sequence length, batch size) (in Julia the batch dimension is the last axis), like this:
token_indices = reshape(token_indices, length(token_indices), 1)
segment_indices = reshape(segment_indices, length(segment_indices), 1)
where the 1 means the minibatch contains only one input sequence.
Finally, call the model with token_indices, but pass segment_indices as a keyword argument (which matches the behavior of huggingface/transformers).
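As a rough illustration of the shape convention described above, here is a small sketch that uses only Base Julia. The token ids are made-up values, not real vocabulary indices, and the names `other_sequence` and `batch_of_two` are hypothetical.

```julia
# Made-up token ids for a single 5-token input (illustrative values only).
token_indices = [101, 2023, 2003, 2307, 102]

# Add a trailing batch axis: (sequence length,) becomes (sequence length, 1).
batched = reshape(token_indices, length(token_indices), 1)
println(size(batched))  # (5, 1)

# Several inputs of equal length would sit in separate columns,
# because the batch dimension is the last axis.
other_sequence = [101, 2008, 2001, 4569, 102]  # hypothetical second input
batch_of_two = hcat(token_indices, other_sequence)
println(size(batch_of_two))  # (5, 2)
```

Reshaping does not change the order of the values; `vec(batched)` gives back the original vector.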
result = model(token_indices; token_type_ids=segment_indices)
and result.logits is then the prediction score you want.
The full code:
using Transformers
using Transformers.Basic
using Transformers.Pretrain
using Transformers.HuggingFace
# Accept DataDeps download prompts automatically.
ENV["DATADEPS_ALWAYS_ACCEPT"] = true
# Load the model from hgf, and only the wordpiece and tokenizer from pretrain.
model = hgf"bert-base-uncased:fornextsentenceprediction"
wordpiece = pretrain"bert-uncased_L-12_H-768_A-12:wordpiece"
tokenizer = pretrain"bert-uncased_L-12_H-768_A-12:tokenizer"
vocab = Vocabulary(wordpiece)
# Tokenize both sentences into wordpieces.
text1 = "Aesthetic Appreciation and Spanish Art:" |> tokenizer |> wordpiece
text2 = "Insights from Eye-Tracking" |> tokenizer |> wordpiece
formatted_text = ["[CLS]"; text1; "[SEP]"; text2; "[SEP]"]
token_indices = vocab(formatted_text)
# Segment 1 covers [CLS], text1, and the first [SEP]; segment 2 covers text2 and the final [SEP].
segment_indices = [fill(1, length(text1)+2); fill(2, length(text2)+1)]
# Add the batch dimension as the last axis: (sequence length, 1).
token_indices = reshape(token_indices, length(token_indices), 1)
segment_indices = reshape(segment_indices, length(segment_indices), 1)
result = model(token_indices; token_type_ids=segment_indices)
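If you want probabilities rather than raw scores, one option is to apply a softmax over the class dimension of result.logits. The sketch below is not part of Transformers.jl. `softmax_columns` is a hypothetical helper written in Base Julia, and `example_logits` is made-up data standing in for result.logits. It assumes the logits have size (2, batch size), and that the row order follows huggingface/transformers, where the first entry scores "sentence B follows sentence A". Check both assumptions against your installed version.

```julia
# Hypothetical helper: numerically stable softmax over the first axis,
# computed independently for each column (one column per batch item).
function softmax_columns(logits::AbstractMatrix{<:Real})
    shifted = logits .- maximum(logits; dims=1)  # subtract the column max for stability
    exps = exp.(shifted)
    return exps ./ sum(exps; dims=1)
end

# Made-up scores of assumed size (2, 1), standing in for result.logits.
example_logits = reshape(Float32[4.0, -2.0], 2, 1)

probs = softmax_columns(example_logits)

# Assumption: row 1 is the "B follows A" class, row 2 the "B is unrelated" class.
p_is_next = probs[1, 1]
p_not_next = probs[2, 1]
```

With real output, you would call `softmax_columns(result.logits)` in place of the made-up matrix.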