One dimensional autoencoder as a way to refine (or evaluate) cluster analysis

Is it correct (and do you have references I could use, in order to "sit on the giants' shoulders") to use the output of a 1-D autoencoder to analyse the heterogeneity of my data? It would work a bit like a cluster analysis, except that clustering maps the data to a discrete space while the autoencoder maps it to a continuous one, so I could "better" understand the structure of my data.

This is, for example, the output for the Iris dataset. Here the colours reflect the true (known) classes, but in general I could run a cluster analysis first and then see how the assigned classes distribute over the autoencoder output…

Can I interpret the autoencoder's output dimension as a distance? For example, can I say that the variance of setosa is smaller?
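To make the "smaller variance" question concrete, the spread of each class along the latent axis could be computed like this (a minimal sketch with made-up latent values `z` and labels `cls`, not the actual Iris output):

```julia
using Statistics

# Hypothetical 1-D latent values `z` and integer class labels `cls`
# (stand-ins for the autoencoder output and the known classes)
z   = [0.10, 0.15, 0.12, 0.90, 1.10, 0.95]
cls = [1, 1, 1, 2, 2, 2]

# Spread of each class along the latent axis; note the latent scale is
# arbitrary, so only comparisons within the same fitted model are meaningful
per_class_var = Dict(c => var(z[cls .== c]) for c in unique(cls))
```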


using DelimitedFiles, StatsPlots, BetaML

# Load the Iris dataset shipped with BetaML and map the labels to integers
iris       = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris_shuffled.csv"),',',skipstart=1)
x          = convert(Array{Float64,2}, iris[:,1:4])
y          = convert(Array{String,1},  iris[:,5])
int_map    = Dict("setosa"=>1, "virginica"=>2, "versicolor"=>3)
intclasses = [int_map[i] for i in y]

function generate_bin_colours(data, classes, bins)
    # For each bin, the colour is the share of each of the three classes
    # among the data points falling in that bin (classes map to R, G, B)
    nbins  = length(bins)
    ndata  = length(data)
    shares = [(0.0, 0.0, 0.0) for i in 1:nbins-1]
    for ib in 1:nbins-1
        ndata_per_bin = 0
        sums_rgb = [0, 0, 0]
        bin_l = bins[ib]
        bin_u = bins[ib+1]
        for id in 1:ndata
            if data[id] >= bin_l && data[id] < bin_u
                sums_rgb[classes[id]] += 1
                ndata_per_bin += 1
            end
        end
        # Empty bins get a neutral grey
        share = ndata_per_bin > 0 ? sums_rgb ./ ndata_per_bin : [1/3, 1/3, 1/3]
        shares[ib] = (share...,)
    end
    return [RGB(x...) for x in shares]
end

# Fit a 1-D autoencoder; `fit!` returns the encoded (latent) representation.
# A new name avoids overwriting the label vector `y` defined above.
ae       = AutoEncoder(encoded_size=1, epochs=300)
ylatent  = fit!(ae, x)
idx      = sortperm(ylatent[:,1])
ysorted  = ylatent[idx]
clsorted = intclasses[idx]

bins    = -4:0.1:2
yscaled = fit!(Scaler(), ysorted)   # standardise the latent values once before binning
histogram(yscaled, bins=bins, color=generate_bin_colours(yscaled, clsorted, bins),
          label="r: setosa\ng: virginica\nb: versicolor", title="1-D autoencoded data density")

The latent variable (the bottleneck of an autoencoder) can be regarded as a non-linear dimensionality reduction of the input.
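As a point of comparison, a linear 1-D reduction (projection on the first principal component) takes only a few lines; if the autoencoder's 1-D embedding separates the classes better than this baseline, the gain comes from the non-linearity. A sketch on synthetic data (two well-separated blobs, standing in for the Iris features):

```julia
using LinearAlgebra, Statistics, Random
Random.seed!(1)

# Two well-separated 4-D blobs (synthetic stand-in for the Iris features)
X  = vcat(randn(50, 4), randn(50, 4) .+ 3)
Xc = X .- mean(X, dims=1)       # centre the data
U, S, V = svd(Xc)
z_pca = Xc * V[:, 1]            # 1-D linear embedding (first principal component)
```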

A simple way to evaluate a dimensionality-reduction operator is to reduce the data to 1-D / 2-D / 3-D and inspect it visually. Usually the next step is to apply clustering in the reduced space.
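The "cluster in the reduced space" step can be sketched with a minimal 1-D k-means on the latent coordinate (toy values below; in practice a library clusterer, e.g. one of BetaML's, would be used, and the resulting assignments compared against the known classes):

```julia
using Statistics

# Minimal 1-D k-means: assign points to the nearest centre, then move
# each centre to the mean of its members, and repeat
function kmeans1d(z::Vector{Float64}, k::Int; iters=50)
    centers = collect(range(minimum(z), maximum(z), length=k))
    assign  = ones(Int, length(z))
    for _ in 1:iters
        for i in eachindex(z)
            assign[i] = argmin(abs.(z[i] .- centers))
        end
        for c in 1:k
            members = z[assign .== c]
            isempty(members) || (centers[c] = mean(members))
        end
    end
    return assign, centers
end

z = [0.10, 0.12, 0.09, 0.95, 1.00, 1.05]   # toy 1-D "latent" values
assign, centers = kmeans1d(z, 2)
```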

You may have a look at A. S. Kovalenko and Y. M. Demyanenko, "Image Clustering by Autoencoders".

Remark: This is usually the "Hello World" of non-linear dimensionality reduction.