Using Transformers.jl for time series classification?

Hello, has anyone tried using Transformers.jl for time series classification?
Is it possible?
@chengchingwen

I think it's possible. You can treat a time series classification task as a sentence classification task, but I have never tried it before, so no performance guarantee.

There are some papers using Transformer models for time series data, like this one: https://arxiv.org/pdf/2001.08317.pdf

Hello @chengchingwen,
so in the tutorial could I just try replacing length(vocab)
with the number of classes I want?

Not really. It depends on what kind of time series data you're working on. If it is a real-valued time series, like stock prices, you probably don't need the embedding layer, which also means you can't do a cross-entropy loss on the output.
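
For the input side, a rough, untested sketch of what I mean could look like the following (proj, classifier, and nclasses are just placeholder names, and you would pair the class scores with something like Flux.logitcrossentropy):

using Flux, Transformers, Transformers.Basic

dmodel   = 512
nclasses = 2                        # placeholder: number of target classes

proj = Dense(1, dmodel)             # replaces Embed: project each real value into the model dimension
pe   = PositionEmbedding(dmodel)
encoder    = Transformer(dmodel, 8, 64, 2048)
classifier = Dense(dmodel, nclasses)

# x is a (1, seq_len) matrix of real-valued observations
function model(x)
  e = proj(x)                                  # (dmodel, seq_len)
  h = encoder(e .+ pe(e))                      # add position information and encode
  pooled = vec(sum(h, dims=2)) ./ size(h, 2)   # mean-pool over the sequence
  return classifier(pooled)                    # class scores for the whole series
end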

Hello @chengchingwen, I'll see if I need one-hot encoding. Anyway, I'm just trying the tutorial now, but the loss is not getting lower:


using Flux, Transformers, CUDA
using Transformers.Basic
using Transformers.Datasets: batched
using Flux: gradient, onecold, onehot
using Flux.Optimise: update!

enable_gpu(true)

labels = collect(1:10)
startsym = 11
endsym = 12
unksym = 0
labels = [unksym, startsym, endsym, labels...]
vocab = Vocabulary(labels, unksym)

#function for generating training data
sample_data() = (d = rand(1:10, 10); (d,d))
#function for adding the start & end symbols
preprocess(x) = [startsym, x..., endsym]

sample = preprocess.(sample_data()) # = ([11, 5, 4, 2, 5, 2, 5, 5, 5, 7, 8, 12], [11, 5, 4, 2, 5, 2, 5, 5, 5, 7, 8, 12])
encoded_sample = vocab(sample[1])# = [2, 8, 7, 5, 8, 5, 8, 8, 8, 10, 11, 3]

#define a word embedding layer which turns word indices into word vectors
embed = Embed(512, length(vocab)) |> gpu
#define the position embedding layer mentioned above
pe = PositionEmbedding(512) |> gpu

#wrapper for getting the embedding
function embedding(x)
  we = embed(x, inv(sqrt(512)))
  e = we .+ pe(we)
  return e
end

#define 2 layers of transformer encoder
encode_t1 = Transformer(512, 8, 64, 2048) |> gpu
encode_t2 = Transformer(512, 8, 64, 2048) |> gpu

#define 2 layers of transformer decoder
decode_t1 = TransformerDecoder(512, 8, 64, 2048) |> gpu
decode_t2 = TransformerDecoder(512, 8, 64, 2048) |> gpu

#define the layer to get the final output probabilities
linear = Positionwise(Dense(512, length(vocab)), logsoftmax) |> gpu

function encoder_forward(x)
  e = embedding(x)
  t1 = encode_t1(e)
  t2 = encode_t2(t1)
  return t2
end

function decoder_forward(x, m)
  e = embedding(x)
  t1 = decode_t1(e, m)
  t2 = decode_t2(t1, m)
  p = linear(t2)
  return p
end


enc = encoder_forward(encoded_sample)
probs = decoder_forward(encoded_sample, enc)
################################################################################
#train:
function smooth(et)
    sm = fill!(similar(et, Float32), 1e-6/size(embed, 2))
    p = sm .* (1 .+ -et)
    label = p .+ et .* (1 - convert(Float32, 1e-6))
    label
end

#define loss function
function loss(x, y)
  label = onehot(vocab, y) #turn the index to one-hot encoding
  label = smooth(label) #perform label smoothing
  enc = encoder_forward(x)
  probs = decoder_forward(y, enc)
  l = logkldivergence(label[:, 2:end, :], probs[:, 1:end-1, :])
  return l
end

#collect all the parameters
ps = params(embed, pe, encode_t1, encode_t2, decode_t1, decode_t2, linear)
opt = ADAM(1e-4)

#function for creating batched data
using Transformers.Datasets: batched

#Flux functions for updating the parameters
using Flux: gradient
using Flux.Optimise: update!

#define training loop
function train!()
  @info "start training"
  for i = 1:2000
    data = batched([sample_data() for i = 1:32]) #create 32 random samples and batch them
    x, y = preprocess.(data[1]), preprocess.(data[2])
    x, y = vocab(x), vocab(y)#encode the data
    x, y = todevice(x, y) #move to gpu
    l = loss(x, y)
    grad = gradient(()->l, ps)
    if i % 8 == 0
      println("loss = $l")
    end
    update!(opt, ps, grad)
  end
end

train!()
################################################################################
#test:
function translate(x)
    ix = todevice(vocab(preprocess(x)))
    seq = [startsym]

    enc = encoder_forward(ix)

    len = length(ix)
    for i = 1:2len
        trg = todevice(vocab(seq))
        dec = decoder_forward(trg, enc)
        #move back to the cpu because argmax gives wrong results on CuArrays
        ntok = onecold(collect(dec), labels)
        push!(seq, ntok[end])
        ntok[end] == endsym && break
    end
    seq[2:end-1]
end

translate([5,5,6,6,1,2,3,4,7, 10])

Any ideas?
Thanks

@lorrp1 Ah, sorry about that. The tutorial in the docs is somewhat outdated, since it was written in the pre-Zygote.jl era. The loss doesn't get lower because no gradients are actually being computed: l = loss(x, y); grad = gradient(()->l, ps) no longer produces gradients, since the AD is not tracker-based anymore. You can find the newest example in the example folder. I'll also update the tutorial later.

The fastest fix would be adding Flux.@nograd smooth and changing grad to grad = gradient(()->loss(x, y), ps).
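
With those two changes applied, the training loop would look roughly like this (untested sketch):

Flux.@nograd smooth   # don't try to differentiate through the label-smoothing helper

function train!()
  @info "start training"
  for i = 1:2000
    data = batched([sample_data() for i = 1:32]) #create 32 random samples and batch them
    x, y = preprocess.(data[1]), preprocess.(data[2])
    x, y = vocab(x), vocab(y) #encode the data
    x, y = todevice(x, y) #move to gpu
    grad = gradient(() -> loss(x, y), ps) #compute the loss inside the closure so the AD sees it
    if i % 8 == 0
      println("loss = $(loss(x, y))")
    end
    update!(opt, ps, grad)
  end
end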

@chengchingwen Hello, it works fine now, like you said. Thank you.

I'm trying to use a label like 0/1 (up/down) on a time series to classify whether v[x] > v[x-1] or not, after seeing both v[x] and v[x-1].

I need the positional embedding to "sort" the input array, but not the word embedding layer, so:

I'm not sure if it's correct to make the embedding like this:

> pe = PositionEmbedding(512) |> gpu
> function embedding(x)
>   return pe(x)
> end

I don't quite understand what you intend to do. Maybe a small example would help.

I mean a simple classification of a given array based on pre-defined labels, for example as in Getting Started — MLDataUtils.jl v0.1 documentation,

using a dataset like this:

julia> getobs((X, Y), 30)
([4.7,3.2,1.6,0.2], "setosa")

array → label

I'd need a PositionEmbedding:

pe = PositionEmbedding(512) |> gpu
function embedding(x)
  return pe(x)
end

then

ps = params(pe, encode_t1, encode_t2, decode_t1, decode_t2, linear)

Would this work?

Because PositionEmbedding's API makes some assumptions, you need to be careful about the input type.

Say we have sequence data of length n. The easiest way to get the position embeddings for it would be to just pass the length n to it:

pe = PositionEmbedding(emb_size)
p = pe(n) # the position embeddings of size (emb_size, n)

Otherwise, you would need to check whether your input type matches the regular usage on word inputs.
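
For example, a rough, untested sketch for the array → label case above (proj is a made-up name), treating a length-n real vector as a sequence of n scalars:

proj = Dense(1, 512)                  # project each scalar to the model dimension
pe   = PositionEmbedding(512)

# v is a plain real-valued vector, e.g. [4.7, 3.2, 1.6, 0.2]
function embedding(v)
  x = reshape(Float32.(v), 1, :)      # (1, n)
  e = proj(x)                         # (512, n)
  return e .+ pe(size(e, 2))          # pe(n) gives the (512, n) position embeddings
end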

BTW, I don't check Discourse regularly. Ping me if I don't show up for a long time.