Is it correct (and do you have references I can use, in order to "sit on the giants' shoulders") to use the output of a 1-D autoencoder to analyse the heterogeneity of my data? It is a bit like cluster analysis, except that clustering maps the data to a discrete space while the autoencoder maps it to a continuous one, so I can "better" understand the structure of my data.
This is, for example, the output for the Iris dataset. In this case the colours reflect the true (known) classes, but in general I could run a cluster analysis and then see how the assigned classes distribute over the autoencoder output (a sketch of this follows the code below).
Can I interpret the autoencoder's output dimension as a distance? So, for example, can I say that the variance of setosa is smaller? (See the per-class spread sketch after the code.)
Code
using DelimitedFiles, StatsPlots, BetaML
iris = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris_shuffled.csv"),',',skipstart=1)
x = convert(Array{Float64,2}, iris[:,1:4])
y = convert(Array{String,1}, iris[:,5])
int_map = Dict("setosa"=>1, "virginica"=>2,"versicolor"=>3)
intclasses = [int_map[i] for i in y]
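# Helper: colour each histogram bin by the mix of classes it contains, using
# the per-class shares within the bin as RGB components (one channel per class)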
function generate_bin_colours(data, classes, bins)
    # Creation of the per-class shares for each bin
    nbins  = length(bins)
    ndata  = length(data)
    shares = [(0.0, 0.0, 0.0) for i in 1:nbins-1]
    for ib in 1:nbins-1
        ndata_per_bin = 0
        sums_rgb      = [0, 0, 0]
        bin_l         = bins[ib]
        bin_u         = bins[ib+1]
        for id in 1:ndata
            if data[id] >= bin_l && data[id] < bin_u
                sums_rgb[classes[id]] += 1
                ndata_per_bin += 1
            end
        end
        # An empty bin gets a neutral grey rather than a division by zero
        share = ndata_per_bin > 0 ? sums_rgb ./ ndata_per_bin : [1/3, 1/3, 1/3]
        shares[ib] = (share...,)
    end
    return [RGB(s...) for s in shares]
end
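# Train a 1-D autoencoder. In BetaML fit! both trains the model and returns the
# encoded representation (predict(ae, x) returns the same encoding afterwards,
# and inverse_predict would decode it back)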
ae        = AutoEncoder(encoded_size=1, epochs=300)
x_enc     = fit!(ae, x)
idx       = sortperm(x_enc[:,1])
xe_sorted = x_enc[idx]
cl_sorted = intclasses[idx]
xe_scaled = fit!(Scaler(), xe_sorted)   # standardise the encoded axis for plotting
bins      = -4:0.1:2
histogram(xe_scaled, bins=bins,
          color=generate_bin_colours(xe_scaled, cl_sorted, bins),
          label="r: setosa\ng: virginica\nb: versicolor",
          title="1-D autoencoded data density")
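For the cluster-analysis comparison, here is a minimal sketch of what I have in mind, assuming BetaML's KMeansClusterer (the n_classes hyperparameter and fit! returning one hard assignment per record are my reading of the BetaML API; the cluster numbering is arbitrary, so the colours may come out permuted with respect to the true classes):

km         = KMeansClusterer(n_classes=3)
km_classes = fit!(km, x)   # one integer cluster assignment per row of x
# Same histogram as above, but coloured by the unsupervised assignments; since
# cluster labels are arbitrary, the colour/species correspondence may be permuted
histogram(xe_scaled, bins=bins,
          color=generate_bin_colours(xe_scaled, km_classes[idx], bins),
          title="1-D autoencoded data density (k-means colours)")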
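And for the variance question: the encoded axis has no intrinsic unit, so only relative comparisons between classes are meaningful, but with that caveat the per-class spreads can be compared directly. A minimal sketch reusing xe_scaled and cl_sorted from above (Statistics is from the standard library):

using Statistics

# Per-class location and spread along the scaled encoded dimension; the absolute
# numbers are arbitrary, but ratios between the class spreads are comparable
for (name, k) in int_map
    vals = xe_scaled[cl_sorted .== k]
    println(rpad(name, 12), "mean = ", round(mean(vals), digits=2),
            "   std = ", round(std(vals), digits=2))
end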