Using Transformers.jl for time series classification?

Hello, has anyone tried using Transformers.jl for time series classification?
Is it possible?
@chengchingwen

I think it’s possible. You can treat the time series classification task as a sentence classification task, but I’ve never tried it before, so no performance guarantee.

There are some papers using the Transformer model for time series data, like this one: https://arxiv.org/pdf/2001.08317.pdf
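For classification, the rough idea would be: keep the encoder, drop the decoder, pool the encoder states over time, and put a classifier head on top. A minimal untested sketch, reusing the embedding and encoder layers from the tutorial (classify_series, n_classes, and the mean pooling are my own choices here, not an established recipe):

n_classes = 10 # e.g. 10 classes; set to whatever your task needs
classifier = Dense(512, n_classes) |> gpu

function classify_series(x)
  e = embedding(x)                             # embed the tokenized series, as in the tutorial
  h = encode_t2(encode_t1(e))                  # (512, seq_len) hidden states
  pooled = vec(sum(h, dims = 2)) ./ size(h, 2) # mean-pool over the time axis
  return logsoftmax(classifier(pooled))        # log-probabilities over the classes
end

Training would then use a cross entropy loss between those log-probabilities and the class label, instead of the sequence-to-sequence loss below.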

Hello @chengchingwen,
so in the tutorial could I just try replacing length(vocab)
with the number of classes I want?

Not really. It depends on what kind of time series data you’re working on. If it is a real-valued time series, like stock prices, you probably don’t need the embedding layer, which also means you can’t use a cross entropy loss on the output.
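For example, one untested option would be to replace Embed with a plain Dense projection into the model dimension and keep the position embedding (input_proj and embedding_real are names I made up for this sketch):

input_proj = Dense(1, 512) |> gpu # each scalar observation -> 512-dim vector

function embedding_real(x) # x is a Float32 vector of length seq_len
  we = input_proj(reshape(x, 1, :)) # (512, seq_len)
  return we .+ pe(we)               # reuse the position embedding pe from the tutorial
end

The output side would then need a regression head and something like Flux.mse instead of the cross entropy / KL divergence loss.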

Hello @chengchingwen, I’ll see if I need one-hot encoding anyway. I’m just trying the tutorial now, but the loss is not getting lower:


using Flux, CUDA
using Transformers, Transformers.Basic
using Flux: onecold # onehot for the Vocabulary comes from Transformers.Basic

enable_gpu(true)

labels = collect(1:10)
startsym = 11
endsym = 12
unksym = 0
labels = [unksym, startsym, endsym, labels...]
vocab = Vocabulary(labels, unksym)

# function to generate training data
sample_data() = (d = rand(1:10, 10); (d,d))
# function to add the start & end symbols
preprocess(x) = [startsym, x..., endsym]

sample = preprocess.(sample_data()) # = ([11, 5, 4, 2, 5, 2, 5, 5, 5, 7, 8, 12], [11, 5, 4, 2, 5, 2, 5, 5, 5, 7, 8, 12])
encoded_sample = vocab(sample[1]) # = [2, 8, 7, 5, 8, 5, 8, 8, 8, 10, 11, 3]

# define a word embedding layer which turns word indices into word vectors
embed = Embed(512, length(vocab)) |> gpu
# define the position embedding layer (mentioned in the tutorial)
pe = PositionEmbedding(512) |> gpu

# wrapper that combines the word and position embeddings
function embedding(x)
  we = embed(x, inv(sqrt(512))) # scale the word embeddings by 1/sqrt(512)
  e = we .+ pe(we)
  return e
end

# define 2 layers of transformer encoder
encode_t1 = Transformer(512, 8, 64, 2048) |> gpu
encode_t2 = Transformer(512, 8, 64, 2048) |> gpu

# define 2 layers of transformer decoder
decode_t1 = TransformerDecoder(512, 8, 64, 2048) |> gpu
decode_t2 = TransformerDecoder(512, 8, 64, 2048) |> gpu

# define the layer that produces the final output log-probabilities
linear = Positionwise(Dense(512, length(vocab)), logsoftmax) |> gpu

function encoder_forward(x)
  e = embedding(x)
  t1 = encode_t1(e)
  t2 = encode_t2(t1)
  return t2
end

function decoder_forward(x, m)
  e = embedding(x)
  t1 = decode_t1(e, m)
  t2 = decode_t2(t1, m)
  p = linear(t2)
  return p
end


enc = encoder_forward(encoded_sample)
probs = decoder_forward(encoded_sample, enc)
################################################################################
# train:
# label smoothing: replace the hard one-hot target with a slightly softened one
function smooth(et)
    sm = fill!(similar(et, Float32), 1e-6/size(embed, 2)) # ε/V, with V the vocabulary size
    p = sm .* (1 .+ -et)                                  # ε/V on the non-target entries
    label = p .+ et .* (1 - convert(Float32, 1e-6))       # 1-ε on the target entry
    label
end

# define the loss function
function loss(x, y)
  label = onehot(vocab, y) # turn the indices into a one-hot encoding
  label = smooth(label) # apply label smoothing
  enc = encoder_forward(x)
  probs = decoder_forward(y, enc)
  l = logkldivergence(label[:, 2:end, :], probs[:, 1:end-1, :])
  return l
end

#collect all the parameters
ps = params(embed, pe, encode_t1, encode_t2, decode_t1, decode_t2, linear)
opt = ADAM(1e-4)

# function for creating batched data
using Transformers.Datasets: batched

# flux functions for updating the parameters
using Flux: gradient
using Flux.Optimise: update!

# define the training loop
function train!()
  @info "start training"
  for i = 1:2000
    data = batched([sample_data() for i = 1:32]) # create 32 random samples and batch them
    x, y = preprocess.(data[1]), preprocess.(data[2])
    x, y = vocab(x), vocab(y) # encode the data
    x, y = todevice(x, y) # move to gpu
    l = loss(x, y)
    grad = gradient(()->l, ps)
    if i % 8 == 0
      println("loss = $l")
    end
    update!(opt, ps, grad)
  end
end

train!()
################################################################################
#test:
function translate(x)
    ix = todevice(vocab(preprocess(x)))
    seq = [startsym]

    enc = encoder_forward(ix)

    len = length(ix)
    for i = 1:2len
        trg = todevice(vocab(seq))
        dec = decoder_forward(trg, enc)
        # collect moves the data back to the cpu, since argmax gives wrong results on CuArrays
        ntok = onecold(collect(dec), labels)
        push!(seq, ntok[end])
        ntok[end] == endsym && break
    end
    seq[2:end-1]
end

translate([5,5,6,6,1,2,3,4,7, 10])

Any idea?
Thanks

@lorrp1 Ah, sorry about that. The tutorial in the docs is kind of outdated, since it was written in the pre-Zygote.jl era. The loss didn’t get lower because no gradient values are actually computed: l = loss(x, y); grad = gradient(()->l, ps) no longer produces gradients, since the AD isn’t tracker-based anymore. You can find the newest example in the example folder. I’ll also update the tutorial later.

The fastest fix would be adding Flux.@nograd smooth and changing the gradient call to grad = gradient(()->loss(x, y), ps).
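With those two changes the training loop would look something like this (a sketch of the fix, not tested end-to-end; the local l trick just lets you print the loss computed inside the closure):

Flux.@nograd smooth # tell Zygote not to differentiate through the label-smoothing helper

function train!()
  @info "start training"
  local l
  for i = 1:2000
    data = batched([sample_data() for i = 1:32])
    x, y = preprocess.(data[1]), preprocess.(data[2])
    x, y = vocab(x), vocab(y)
    x, y = todevice(x, y)
    grad = gradient(() -> (l = loss(x, y)), ps) # the loss must be computed inside the closure
    i % 8 == 0 && println("loss = $l")
    update!(opt, ps, grad)
  end
end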