I’m fairly new to Knet and still finding my way around. The project I’m working on right now is a simple “speech emotion recognizer”. Unfortunately, the RNN tutorial makes for a better introduction to NLP problems than to signal processing, so I wanted to ask a few questions after receiving
AssertionError: vec(value(x)) isa WTYPE
one too many times.
Basic Idea
Input + A Bidirectional Layer + Dense Layer => Output
A nice example in Flux can be found here:
Input Format
The dataset I’m using is https://smartlaboratory.org/ravdess/.
I’m using only 13 mel-frequency cepstral coefficients (MFCCs) for each sample I take from a given recording (usually between 300 and 600 samples per recording). I’m training on the Neutral (01) and Happy (03) emotions.
So that’s 13 features, over sequences of 300 to 600 in length, to learn 2 classes.
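From what I understand of the Knet docs, RNN ultimately wants a dense numeric array of size (X,B,T), i.e. (features, batchsize, seqlength), rather than a sequence of vectors. So the minibatch layout I think I should be aiming for is roughly this (sizes are just my own configuration):
# The (X,B,T) layout I believe Knet's RNN expects
x = randn(Float64, 13, 32, 16)   # 13 features, 32 sequences, 16 timesteps
y = rand(1:2, 32, 16)            # integer class labels per timestep, for nll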
Approach So Far
#imports and config
using BSON
ENV["COLUMNS"] = 72
using Pkg; for p in ("Knet","IterTools","Plots"); haskey(Pkg.installed(),p) || Pkg.add(p); end
using Random: shuffle!
using Base.Iterators: flatten
using IterTools: ncycle, takenth
using Knet: Knet, AutoGrad, param, param0, mat, RNN, relu, Data, adam, progress, nll, zeroone
# Usual Chain Definition
struct Chain
    layers
    Chain(layers...) = new(layers)
end
(c::Chain)(x) = (for l in c.layers; x = l(x); end; x)
(c::Chain)(x,y) = nll(c(x),y)
# Usual Dense Layer Definition
struct Dense; w; b; f; end
Dense(i::Int,o::Int,f=identity) = Dense(param(o,i), param0(o), f)
(d::Dense)(x) = d.f.(d.w * mat(x,dims=1) .+ d.b)
...
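These pieces seem fine in isolation; a quick throwaway shape check on the Dense layer behaves as I’d expect:
# Shape check with random data (not my real features)
dense = Dense(13, 2)
summary(dense(randn(Float64, 13, 32)))   # something like "2×32 Array{Float64,2}"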
#After loading in my dataset from another file where it has been preprocessed
println.(summary.((Xs, Ys)));
> 288-element Array{Any,1}
> 288-element Array{Any,1}
#I get the feeling there's something wrong here; below is what the first entries in Xs and Ys look like:
#Features
Xs[1]
> 328-element Array{Array{Float64,1},1}:
> [-158.44758646562016, -14.786369432867609, ... ]
> [-504.61429557613394, 16.563930341805474, ... ] ...
#Labels
Ys[1]
> 328-element Array{Array{Float64,1},1}:
> [1.0]
> [1.0]
> [1.0] ...
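One thing I’m already suspicious of: as far as I can tell nll wants plain integer class indices, not one-element Float vectors, so maybe I should be flattening these first. Something like this conversion is what I had in mind (assuming my classes end up encoded as 1.0 and 2.0):
# Guessed label conversion: Vector of [1.0]/[2.0] vectors -> Vector{Int}
ys1 = Int.(first.(Ys[1]))   # 328-element integer vector with values in {1,2}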
#For the sequence-batching, I tried following the tutorial, which admittedly was not a very smart move
#Arbitrary, should probably be changed
BATCHSIZE = 32
SEQLENGTH = 16
HIDDENSIZE = 64;   # hidden units for the RNN, also arbitrary (EMREC() below needs it)
function seqbatch(x,y,B,T)
    N = length(x) ÷ B
    x = permutedims(reshape(x[1:N*B],N,B))
    y = permutedims(reshape(y[1:N*B],N,B))
    d = []
    for i in 0:T:N-T
        push!(d, (x[:,i+1:i+T], y[:,i+1:i+T]))
    end
    return d
end
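On toy data the windowing itself seems to behave (just a shape check with made-up integers):
# Toy check: 100 fake timesteps, B=4, T=5
toy = seqbatch(collect(1:100), collect(101:200), 4, 5)
length(toy)          # 5 windows (N = 100÷4 = 25, so i = 0,5,10,15,20)
summary(toy[1][1])   # a 4×5 integer matrix, i.e. one B×T window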
allX = vcat((x->x[:,1]).(Xs)...)
allY = vcat((x->x[:,1]).(Ys)...);
d = seqbatch(allX, allY, BATCHSIZE, SEQLENGTH);
shuffle!(d)
dtst = d[1:10]
dtrn = d[11:end];
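Inspecting one of these batches makes me suspicious about my types: each entry comes out as a matrix of vectors rather than a dense 3-D array, which may well be what trips the WTYPE assertion:
# Each batch element is a 32×16 matrix whose entries are 13-element vectors,
# i.e. Array{Array{Float64,1},2}, not a dense Array{Float64,3}
typeof(dtrn[1][1])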
#Training Method
function trainresults(file,maker,savemodel)
    model = maker()
    results = ((nll(model,dtst), zeroone(model,dtst))
               for x in takenth(progress(adam(model,ncycle(dtrn,5))),100))
    results = reshape(collect(Float32,flatten(results)),(2,:))
    Knet.save(file,"model",(savemodel ? model : nothing),"results",results)
    Knet.gc() # to free GPU memory
    println(minimum(results,dims=2))
    return model,results
end
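(Following the tutorials, I read a saved run back in later with Knet.load:)
# model is `nothing` when savemodel=false
model, results = Knet.load("emrec.jld2", "model", "results")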
BIRNN(input,hidden,output)= # biRNN Tagger, Float64 instead of the default 32
Chain(RNN(input,hidden,rnnType=:relu,bidirectional=true,dataType=Float64),Dense(2hidden,output));
#I assume input corresponds to my feature count of 13, since I'm not trying
#anything funny with strides/dilations etc., and my output to 1 (though I'm
#worried I'm accidentally doing regression instead of classification here)
EMREC() = BIRNN(13,HIDDENSIZE,1)
(tEm,rEm) = trainresults("emrec.jld2",EMREC,true);
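For what it’s worth, here is the kind of smoke test I’ve been poking at the shapes with (fake data only, assuming the (X,B,T) layout from above is right):
# Smoke test with fake (X,B,T) = (13,32,16) input
m = EMREC()
xfake = randn(Float64, 13, 32, 16)
summary(m(xfake))   # hoping for "1×512 Array{Float64,2}" (output × B*T)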
I can clearly see I have more than one issue going on here, but my real concern is what dimensionality my input should have. The tutorial uses an embedding layer for its words, but in my case that seems to be out of the question.
Am I feeding in the input properly? If so, what’s the matter? Or is the issue strictly related to the typing of my minibatch?
Thanks in advance.