Hello, has anyone tried using Transformers.jl for time series classification?

Is it possible?

@chengchingwen

I think it’s possible. You can treat the time series classification task as a sentence classification task, but I have never tried it, so there is no performance guarantee.

Some papers do apply the Transformer model to time series data, such as this one: https://arxiv.org/pdf/2001.08317.pdf

Hello @chengchingwen,

So in the tutorial, could I just replace `length(vocab)` with the number of classes I want?

Not really. It depends on what kind of time series data you’re working with. If it is a real-valued time series, like stock prices, you probably don’t need the embedding layer, which also means you can’t use a cross-entropy loss on the output.
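As a rough sketch of what "no embedding layer" could look like for real-valued input, a plain linear projection can map each time step into the model dimension instead of looking up a token vector. This uses only standard Flux layers, not any Transformers.jl API, and `n_features`, `seq_len`, and `batch` are hypothetical values chosen for illustration:

```julia
using Flux

n_features = 1           # hypothetical: one real value per time step (e.g. a price)
d_model    = 512         # hidden size used in the tutorial
seq_len, batch = 20, 4   # hypothetical sequence length and batch size

# Linear projection standing in for the `Embed` lookup of the tutorial
project = Dense(n_features => d_model)

# Real-valued series laid out as (features, time, batch)
x = rand(Float32, n_features, seq_len, batch)

# Dense is applied along the first dimension, so each time step is projected independently
h = project(x)
@show size(h)
```

The result has the same (hidden, time, batch) layout that the tutorial's `embedding` function returns, so under that assumption it could be passed on to the encoder layers, with the position embedding added in the same way.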

Hello @chengchingwen, I’ll see if I need one-hot encoding. Anyway, I’m just trying the tutorial for now, but the loss is not getting lower:

```
using Flux, Transformers, CUDA
using Transformers.Basic
using Transformers.Datasets: batched
using Flux: gradient, onecold, onehot
using Flux.Optimise: update!
enable_gpu(true)
labels = collect(1:10)
startsym = 11
endsym = 12
unksym = 0
labels = [unksym, startsym, endsym, labels...]
vocab = Vocabulary(labels, unksym)
#function for generating training data
sample_data() = (d = rand(1:10, 10); (d,d))
#function for adding start & end symbol
preprocess(x) = [startsym, x..., endsym]
sample = preprocess.(sample_data()) # = ([11, 5, 4, 2, 5, 2, 5, 5, 5, 7, 8, 12], [11, 5, 4, 2, 5, 2, 5, 5, 5, 7, 8, 12])
encoded_sample = vocab(sample[1])# = [2, 8, 7, 5, 8, 5, 8, 8, 8, 10, 11, 3]
#define a word embedding layer which turns a word index into a word vector
embed = Embed(512, length(vocab)) |> gpu
#define a position embedding layer
pe = PositionEmbedding(512) |> gpu
#wrapper for getting the embedding
function embedding(x)
we = embed(x, inv(sqrt(512)))
e = we .+ pe(we)
return e
end
#define 2 transformer encoder layers
encode_t1 = Transformer(512, 8, 64, 2048) |> gpu
encode_t2 = Transformer(512, 8, 64, 2048) |> gpu
#define 2 transformer decoder layers
decode_t1 = TransformerDecoder(512, 8, 64, 2048) |> gpu
decode_t2 = TransformerDecoder(512, 8, 64, 2048) |> gpu
#define the layer to get the final output probabilities
linear = Positionwise(Dense(512, length(vocab)), logsoftmax) |> gpu
function encoder_forward(x)
e = embedding(x)
t1 = encode_t1(e)
t2 = encode_t2(t1)
return t2
end
function decoder_forward(x, m)
e = embedding(x)
t1 = decode_t1(e, m)
t2 = decode_t2(t1, m)
p = linear(t2)
return p
end
enc = encoder_forward(encoded_sample)
probs = decoder_forward(encoded_sample, enc)
################################################################################
#train:
function smooth(et)
sm = fill!(similar(et, Float32), 1e-6/size(embed, 2))
p = sm .* (1 .+ -et)
label = p .+ et .* (1 - convert(Float32, 1e-6))
label
end
#define loss function
function loss(x, y)
label = onehot(vocab, y) #turn the index to one-hot encoding
label = smooth(label) #perform label smoothing
enc = encoder_forward(x)
probs = decoder_forward(y, enc)
l = logkldivergence(label[:, 2:end, :], probs[:, 1:end-1, :])
return l
end
#collect all the parameters
ps = params(embed, pe, encode_t1, encode_t2, decode_t1, decode_t2, linear)
opt = ADAM(1e-4)
#function for creating batched data
using Transformers.Datasets: batched
#flux function for update parameters
using Flux: gradient
using Flux.Optimise: update!
#define training loop
function train!()
@info "start training"
for i = 1:2000
data = batched([sample_data() for i = 1:32]) #create 32 random samples and batch them
x, y = preprocess.(data[1]), preprocess.(data[2])
x, y = vocab(x), vocab(y)#encode the data
x, y = todevice(x, y) #move to gpu
l = loss(x, y)
grad = gradient(()->l, ps)
if i % 8 == 0
println("loss = $l")
end
update!(opt, ps, grad)
end
end
train!()
################################################################################
#test:
function translate(x)
ix = todevice(vocab(preprocess(x)))
seq = [startsym]
enc = encoder_forward(ix)
len = length(ix)
for i = 1:2len
trg = todevice(vocab(seq))
dec = decoder_forward(trg, enc)
#move back to cpu (via collect) because argmax gives wrong results on CuArrays
ntok = onecold(collect(dec), labels)
push!(seq, ntok[end])
ntok[end] == endsym && break
end
seq[2:end-1]
end
translate([5,5,6,6,1,2,3,4,7, 10])
```

Any idea?

Thanks

@lorrp1 Ah, sorry about that. The tutorial in the docs is somewhat outdated, since it was written in the pre-Zygote.jl era. The loss didn’t go down because no gradients were actually being computed. The problem is that `l = loss(x,y); grad = gradient(()->l, ps)` no longer produces gradients now that the AD is no longer tracker-based: the loss has to be evaluated inside the closure passed to `gradient`. You can find the newest example in the example folder. I’ll also update the tutorial later.

The fastest fix would be adding `Flux.@nograd smooth` and changing the `grad` line to `grad = gradient(()->loss(x, y), ps)`.
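To illustrate the difference on a toy problem, here is a minimal sketch that is independent of Transformers.jl. It assumes a Flux version that still supports implicit parameters (`Flux.params`), as used in this thread; the `model`, `x`, and `y` below are made up for the example:

```julia
using Flux

# Toy stand-ins for the tutorial's model and data (hypothetical)
model = Dense(3 => 2)
ps = Flux.params(model)
x = rand(Float32, 3, 4)
y = rand(Float32, 2, 4)

loss(x, y) = Flux.mse(model(x), y)

# Old tracker-style pattern: the loss is computed outside the closure,
# so Zygote only sees a constant and records no gradient for the parameters
l = loss(x, y)
stale = gradient(() -> l, ps)

# Zygote-style pattern: the forward pass runs inside the closure
fresh = gradient(() -> loss(x, y), ps)

@show stale[model.weight]        # no gradient was recorded
@show size(fresh[model.weight])  # same shape as the weight matrix
```

With the first pattern, `update!` has nothing to apply, which is consistent with a loss that never moves.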