Is it correct (and do you have references I can use, in order to "sit on the giants' shoulders") to use the output of a 1-D autoencoder to analyse the heterogeneity of my data? It is a bit like cluster analysis, except that clustering maps the data to a discrete space while the autoencoder maps it to a continuous one, so I can "better" understand the structure of my data.
This is, for example, the output for the Iris dataset. In this case the colours reflect the true (known) classes, but in general I could run a cluster analysis and then see how the assigned classes distribute over the autoencoder output (a sketch of this follows the code below).
Can I interpret the autoencoder's output dimension as a distance? So, for example, can I say that the variance of setosa is smaller? (See the per-class spread sketch after the code.)
Code
using DelimitedFiles, StatsPlots, BetaML
iris = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris_shuffled.csv"),',',skipstart=1)
x = convert(Array{Float64,2}, iris[:,1:4])
y = convert(Array{String,1}, iris[:,5])
int_map = Dict("setosa"=>1, "virginica"=>2,"versicolor"=>3)
intclasses = [int_map[i] for i in y]
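# Helper: colour each histogram bin by the mix of classes it contains, using
# the per-class shares within the bin as RGB components (one channel per class)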
function generate_bin_colours(data, classes, bins)
    # Creation of the per-class shares for each bin
    nbins  = length(bins)
    ndata  = length(data)
    shares = [(0.0, 0.0, 0.0) for i in 1:nbins-1]
    for ib in 1:nbins-1
        ndata_per_bin = 0
        sums_rgb      = [0, 0, 0]
        bin_l         = bins[ib]
        bin_u         = bins[ib+1]
        for id in 1:ndata
            if data[id] >= bin_l && data[id] < bin_u
                sums_rgb[classes[id]] += 1
                ndata_per_bin += 1
            end
        end
        # An empty bin gets a neutral grey rather than a division by zero
        share = ndata_per_bin > 0 ? sums_rgb ./ ndata_per_bin : [1/3, 1/3, 1/3]
        shares[ib] = (share...,)
    end
    return [RGB(s...) for s in shares]
end
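# Train a 1-D autoencoder. In BetaML fit! both trains the model and returns the
# encoded representation (predict(ae, x) returns the same encoding afterwards,
# and inverse_predict would decode it back)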
ae        = AutoEncoder(encoded_size=1, epochs=300)
x_enc     = fit!(ae, x)
idx       = sortperm(x_enc[:,1])
xe_sorted = x_enc[idx]
cl_sorted = intclasses[idx]
xe_scaled = fit!(Scaler(), xe_sorted)   # standardise the encoded axis for plotting
bins      = -4:0.1:2
histogram(xe_scaled, bins=bins,
          color=generate_bin_colours(xe_scaled, cl_sorted, bins),
          label="r: setosa\ng: virginica\nb: versicolor",
          title="1-D autoencoded data density")
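For the cluster-analysis comparison, here is a minimal sketch of what I have in mind, assuming BetaML's KMeansClusterer (the n_classes hyperparameter and fit! returning one hard assignment per record are my reading of the BetaML API; the cluster numbering is arbitrary, so the colours may come out permuted with respect to the true classes):

km         = KMeansClusterer(n_classes=3)
km_classes = fit!(km, x)   # one integer cluster assignment per row of x
# Same histogram as above, but coloured by the unsupervised assignments; since
# cluster labels are arbitrary, the colour/species correspondence may be permuted
histogram(xe_scaled, bins=bins,
          color=generate_bin_colours(xe_scaled, km_classes[idx], bins),
          title="1-D autoencoded data density (k-means colours)")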
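And for the variance question: the encoded axis has no intrinsic unit, so only relative comparisons between classes are meaningful, but with that caveat the per-class spreads can be compared directly. A minimal sketch reusing xe_scaled and cl_sorted from above (Statistics is from the standard library):

using Statistics

# Per-class location and spread along the scaled encoded dimension; the absolute
# numbers are arbitrary, but ratios between the class spreads are comparable
for (name, k) in int_map
    vals = xe_scaled[cl_sorted .== k]
    println(rpad(name, 12), "mean = ", round(mean(vals), digits=2),
            "   std = ", round(std(vals), digits=2))
end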