Where to find toy examples (with data) to check my (also toy) implementation of ML algorithms?

I am developing a toy ML package, and I would like to test it against some simple datasets and the implementations in established packages. Is there a repository of simple data/algorithm pairs (not necessarily in Julia) that I could use to see how my library is performing?

I am thinking of a collection of datasets (ideally no more than a few KB each) paired with simple algorithms and a metric of “fitness”.

For example, I have trouble getting my feed-forward neural network to produce anything useful except in very basic cases, so at this stage I don’t know whether (1) there are conceptual errors in my code, (2) the training algorithm performs poorly, or (3) I lack the experience to tune my network structure and optimisation algorithm well (e.g. I am over-fitting the network with too many layers, or choosing too high a learning rate…)

MLJBase.jl has functions for generating synthetic datasets in different settings (regression, classification, …) that you could use for sanity checking (e.g. make_blobs).
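To make the sanity-check idea concrete, here is a rough sketch in plain Python (standard library only, since you said it need not be Julia). It is not MLJBase's make_blobs, only a stand-in for the same idea: the names make_toy_blobs, nearest_center_predict and accuracy are made up for this illustration, and the centers, spread and seed are arbitrary assumptions. Because the clusters are well separated, a trivial baseline should classify almost everything correctly, so a model that scores far below it on this data probably has a bug rather than a tuning problem.

```python
import random


def make_toy_blobs(n_per_class=50, centers=((0.0, 0.0), (5.0, 5.0)), std=0.5, seed=0):
    """Return (points, labels): Gaussian clusters around the given 2-D centers.

    Hypothetical helper for illustration; not the MLJBase API.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
            labels.append(label)
    return points, labels


def nearest_center_predict(points, centers):
    """Baseline classifier: assign each point to the closest known center."""
    def closest(p):
        return min(
            range(len(centers)),
            key=lambda k: (p[0] - centers[k][0]) ** 2 + (p[1] - centers[k][1]) ** 2,
        )
    return [closest(p) for p in points]


def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


CENTERS = ((0.0, 0.0), (5.0, 5.0))  # assumed, well-separated cluster centers
X, y = make_toy_blobs(centers=CENTERS)
baseline = accuracy(nearest_center_predict(X, CENTERS), y)
print(f"baseline accuracy: {baseline:.3f}")
```

You would then feed the same X and y to your own classifier and compare its accuracy with the baseline.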

Otherwise there’s RDatasets, which gives you access to a ton of datasets. If you want to compare performance, have a look at the tutorials in DataScienceTutorials.jl, where we have some end-to-end examples on datasets that you could try to compare against.

If you have more specific questions (e.g. how efficient is my X against an established package), then you’ll need to give some info on the “X”.

Edit: from a look at your package, it seems you mostly have clustering algorithms and a feed-forward NN, right? For the clustering, you could compare with what’s in Clustering.jl, at least in terms of speed; for the FFNN, you can just compare with Flux or Knet on MNIST or FashionMNIST.
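Beyond speed, you may also want to check that your clustering agrees with the reference one. Cluster numbering is arbitrary, so comparing labels element by element does not work. One common option is the Rand index, which counts the pairs of points on which two clusterings agree (both put the pair together, or both put it apart). Below is a small sketch in plain Python; the function name rand_index and the label lists are made up for illustration and do not come from Clustering.jl or your package.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree (1.0 = identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    # A pair counts as agreement when both clusterings make the same
    # together/apart decision for it.
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical assignments: same partition, different cluster numbering.
mine = [0, 0, 1, 1, 2, 2]
reference = [2, 2, 0, 0, 1, 1]
print(rand_index(mine, reference))
```

The pair loop is quadratic in the number of points, which is fine for toy datasets of a few hundred rows.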


Google “UCI Machine Learning Repository”; it hosts a large collection of datasets.