Why Julia machine learning is so unfriendly? Very "unsmooth" experience from foolish guy

kirtsar · February 25, 2019, 2:05pm

OK, so let’s imagine that I’m a newcomer to the field of machine learning, and the very first thing I want to do in Julia is something like a kNN classifier. Nice! So let’s try to do it step by step, using all those cool libraries!

using RDatasets
data = dataset("datasets", "iris")
X = data[:, 1:4]
y = data[:, 5]

So far so good! Now I want to plot my data: something like a scatter plot, where the points are colored by species.

Of course I can make a scatter plot with the Plots package… Can I?
My first attempt is something like:
plot(X[:, 1:2], color = y)
Oh no: my X is a DataFrame, which the plot function does not accept. OK then, maybe I should convert it to a Matrix first (why should I have to do that? Is it absolutely necessary?)

plot(convert(Matrix, X[:, 1 : 2]), color = y)
That does not work either, because for some reason a categorical variable is not allowed as the color.
OK then, you try to make numeric labels from y, and you end up with something like:

using MLDataUtils
ylabs = convertlabel(1 : nlabel(y),y)
Xmat = Matrix(X[:, 1 : 2])
scatter(Xmat, color = ylabs)

It works!.. But not as expected. One last thing!

scatter(Xmat[:, 1], Xmat[:, 2], color = ylabs)

Now it works as expected!

OK, now you can say: well, you should use recipes for these things, and all those nice macros!
Fine! Can I produce pair plots like in R with one simple command? No! Only corrplot comes close to the desired output, but you cannot use different colors for different groups. So go read all of the docs and manuals, man, master those scary macros, and eventually write your own recipe.
But right now you just want to see the data; is that really so hard? So you maybe end up with something like this:

function pairplot(X, y)
    colnames = String.(names(X))
    classes = nlabel(y)
    n = size(X, 2)
    ylab = convertlabel(1 : classes, y)
    plotter = Matrix{Any}(undef, n,n)
    # with labels
    plotter[1, 1] = histogram(X[:, 1], 
                    ylabel = colnames[1],
                    title = colnames[1])
    for j in 2 : n
        Xi = X[:, 1]
        Xj = X[:, j]
        ylabel = colnames[j]
        plotter[1, j] = scatter(Xi, Xj, 
                        markercolor = ylab,
                        ylabel = ylabel)
        plotter[j, 1] = plot(title = colnames[j])
    end
            
    # diagonal
    for i in 2 : n
        plotter[i, i] = histogram(X[:, i])
    end

    # upper diagonal 
    for i in 1 : n
        for j in 2 : (i - 1)
            plotter[i, j] = plot()  # empty placeholder panel
        end
    end

    # lower diagonal 
    for i in 2 : n
        for j in (i + 1) : n
            Xi = X[:, i]
            Xj = X[:, j]
            plotter[i, j] = scatter(Xi, Xj, 
                        markercolor = ylab)
        end
    end
    plot(plotter..., 
        layout=grid(n,n),
        legend = false)
end

which produces something like this:

I know that this is an ugly ad-hoc solution to the problem. But it works as expected and produces the desired output.

Now you are just trying to implement a kNN classifier, using for example NearestNeighbors as a basis. First you should split your data into two parts… Wait, is there a package for a train-test split? Such a basic thing to do, I’m 100% sure that there should be one… MLDataUtils looks nice! Let’s try it out!

using MLDataUtils
splitobs((X, y), at = 0.8)

This is the obvious use case: X is a DataFrame, y is the labels. But… it’s not working! It only works in this form:

splitobs((Matrix(X)', y), at = 0.8)
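For comparison, here is a minimal hand-rolled split that uses only the Random standard library. The helper name `split_rows` and its `at` keyword are made up for this illustration and are not part of MLDataUtils; the sketch assumes observations are stored in rows. (MLDataUtils appears to treat the last array dimension as the observation dimension by default, which is presumably why the matrix above has to be transposed.)

```julia
using Random

# Hypothetical helper (not an MLDataUtils function): shuffle the row indices
# 1:n and split them into a training part and a test part.
function split_rows(n::Integer; at::Real = 0.8, rng = Random.default_rng())
    idx = shuffle(rng, 1:n)        # random permutation of the row indices
    ntrain = floor(Int, at * n)    # number of rows in the training part
    return idx[1:ntrain], idx[ntrain+1:end]
end

# The iris data has 150 rows; the seed is fixed only to make the run repeatable.
train_idx, test_idx = split_rows(150; at = 0.8, rng = MersenneTwister(1))

# With observations in rows, the parts would then be selected as
# X[train_idx, :], y[train_idx] and X[test_idx, :], y[test_idx].
```

Working with indices rather than with the data itself means the same split can be applied to a DataFrame, a Matrix, and a label vector alike.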

And so on… each function has its own distracting “properties”, and 99% of the workflow consists of endless conversion between DataFrames, Matrices, and transposed matrices. Support for categorical variables is very weak, and for some reason every single machine learning package reimplements some form of one-hot and other encodings. Even the basic “describe” function is not meaningful for categorical variables: for instance, it returns min and max for a categorical column. Does that make any sense? The number of observations in each category, for example, would be more useful information.
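To make the last point concrete, here is a rough sketch of the kind of per-category summary meant here, written in plain Base Julia. The function name `category_counts` and the sample labels are invented for the example; StatsBase also provides `countmap`, which does the same job.

```julia
# Hypothetical helper: count how many observations fall into each category.
function category_counts(labels)
    counts = Dict{eltype(labels), Int}()
    for label in labels
        counts[label] = get(counts, label, 0) + 1
    end
    return counts
end

# Made-up stand-in for a categorical column such as the iris Species column.
species = ["setosa", "versicolor", "setosa", "virginica", "versicolor", "setosa"]
counts = category_counts(species)
```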

Visualization tools for trained models are also somewhat raw. For example, how can one inspect the decision regions of a classifier? You have to implement it yourself, using other packages.
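A sketch of what “implement it yourself” can look like: evaluate the classifier on a regular grid and plot the predicted labels. The toy 1-nearest-neighbour classifier, the function names, and the four sample points below are all assumptions made for this illustration, not code from any package.

```julia
# Hypothetical 1-nearest-neighbour classifier: rows of `X` are training points.
function predict_1nn(X::AbstractMatrix, y::AbstractVector, point)
    best, best_dist = 1, Inf
    for i in axes(X, 1)
        d = sum(abs2, X[i, :] .- point)   # squared Euclidean distance
        if d < best_dist
            best, best_dist = i, d
        end
    end
    return y[best]
end

# Predicted label on a grid; element [i, j] is the label at (xs[j], ys[i]).
function decision_grid(X, y, xs, ys)
    return [predict_1nn(X, y, [x, yv]) for yv in ys, x in xs]
end

# Made-up training data: class 1 on the left, class 2 on the right.
X = [0.0 0.0; 0.0 1.0; 3.0 0.0; 3.0 1.0]
y = [1, 1, 2, 2]
xs = range(-1, 4; length = 6)
ys = range(-1, 2; length = 4)
regions = decision_grid(X, y, xs, ys)

# With Plots loaded, something like `heatmap(xs, ys, regions)` would draw
# the regions, and the training points could be overlaid with `scatter!`.
```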

Evizero · February 25, 2019, 2:23pm

Actually, that’s supposed to work. If not, it’s a bug, probably because the code is out of sync with developments in DataFrames.jl.

kirtsar · February 25, 2019, 2:38pm

My problem is:

using RDatasets
using MLDataUtils
data = dataset("datasets", "iris")
X = data[:, 1:4]
y = data[:, 5]
splitobs(X, at = 0.8) # not working
# the error message is somewhat cryptic: 
# ERROR: BoundsError: attempt to access "invalid columns 1:120 selected"

Should I open an issue?

Tamas_Papp · February 25, 2019, 2:39pm

I understand your frustration, but I am not sure this kind of tone is constructive.

All Julia libraries that you are talking about are free software, written by people who have volunteered their time. Julia is a relatively new language (the 1.0 release is just half a year old), and underwent some major transitions recently. So, naturally, expect bugs and inconsistencies.

“Is it really so hard?” Yes, it possibly is. The kind of interface you seem to be expecting may take years to achieve, especially if you want the same level of polish as R, which is decades older.

If you have specific problems, you should

ask for advice,
open an issue,
ideally, make a PR to a package.

Every contribution helps bring the Julia ecosystem closer to the ideal you are expecting.

piever · February 25, 2019, 2:43pm

Support for dataframes (and tables in general) in Plots is via the @df macro in StatsPlots, for example:

using StatsPlots
@df data corrplot(cols(1:4))

see the README for more details. This should also take care of categorical variables. It seems like you are having difficulties because grouping doesn’t play well with the corrplot recipe. Feel free to open an issue on StatsPlots about that. Please try to be concise and respectful when describing your issue: some inconsistencies are to be expected in a quickly growing ecosystem but we are all working to smooth things out.

kirtsar · February 25, 2019, 3:24pm

You are 100% right, maybe I’m not so clear about what I’m talking about (my English is not brilliant, and I didn’t want to make any “sharp corners” or blame anyone). I just find it strange that, for something so basic, everyone keeps re-implementing things like one-hot encoding or different splits and shuffles of data instead of having one unified package for them.

It is even funnier because some really complicated things like Flux, Distributions, or IJulia are really mature and work perfectly, but basic things like splitting are not.

A PR is something really scary for me: I sometimes look at the code of the libraries, see all this “macro-magic” and complicated expressions, and realize that I will never be able to code like that.

Tamas_Papp · February 25, 2019, 3:31pm

Julia programs can be extremely fast, but realizing the potential of the language occasionally requires a coding style and API which make straightforward 1:1 ports from other languages difficult. So yes, a lot of programmers experiment with new ways of doing the same thing; some of these experiments pan out, some don’t. I agree that it can be confusing to a newcomer, but some of these experiments do yield approaches which are then incorporated into Base or major libraries, so it is a net win.

I think that as you learn the language it may become easier to contribute. Also, some packages and the Julia maintainers are very nice and helpful to newcomers making PRs, so you get a lot of code review and help.

Ajaychat3 · February 25, 2019, 3:46pm

I have been trying to fit a DecisionTree classifier using a DataFrame, and it does not work. I am supposed to convert the data into a matrix for this purpose. In Python I am able to do this using a dataframe while defining some columns as categorical. However, as soon as I convert the data into a matrix in Julia, I presume the information on categorical variables is lost, and I need to convert the data into a one-hot encoding to get consistent results. I fully accept that this could be a gap in my understanding of Julia’s functionality, as I am relatively new to Julia.
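For readers unfamiliar with the term, a minimal sketch of one-hot encoding in plain Base Julia follows. The helper name `onehot` and the sample labels are made up for the illustration; this is not the encoder of any particular package.

```julia
# Hypothetical helper: turn a vector of labels into a 0/1 matrix with one
# column per distinct label (columns follow the sorted order of the labels).
function onehot(labels)
    levels = sort(unique(labels))
    encoded = zeros(Int, length(labels), length(levels))
    for (row, label) in enumerate(labels)
        col = findfirst(==(label), levels)   # column belonging to this label
        encoded[row, col] = 1
    end
    return encoded, levels
end

encoded, levels = onehot(["b", "a", "c", "a"])
```

Returning `levels` alongside the matrix keeps the mapping from columns back to the original categories, which is exactly the information that is lost in a bare matrix conversion.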

pdeffebach · February 25, 2019, 4:13pm

Other people will be able to chime in on the specifics of your problem, but we recognize that it’s a frustrating situation.

There is some discussion of tabular formats for ML models here, and there are a number of works in progress to make the construction of ML models from a variety of table types seamless.

datnamer · February 25, 2019, 5:30pm

Still WIP, but this should solve your issues: https://github.com/alan-turing-institute/MLJ.jl

00vareladavid · February 25, 2019, 7:55pm

Even if you can’t contribute code, filing clear and thoughtful issues in the relevant package repositories (perhaps after some discussion on Discourse) is immensely helpful for package authors to understand the needs of their users.

Zach_Christensen · February 26, 2019, 11:58am

It may be worth noting that if you dig far enough into any language you find this. I stopped using R because when I needed speed the only “friendly” solution was reimplementing with Rcpp. I stopped using Python (quickly) because if you’re not doing machine learning or a GLM you’re responsible for reinventing a performant solution or using someone else’s hackish one.

There is some comfort in knowing that what you’re doing right now is paying dues in learning a new language, and not necessarily representative of how difficult it will always be for you.

Side note: I never could get completely used to how R handles categorical variables. I’d be happily developing something and then try to use my fallback baseline for ML code, glmnet. Now I have to make my own design matrices again. A lot of highly regarded packages don’t support R’s categorical variables.

bjarthur · March 2, 2019, 12:55pm

@kirtsar the Gadfly plotting package supports DataFrames quite well. See this example for how to plot one of your subplots above.

kirtsar · March 2, 2019, 6:12pm

Thanks. Notice that it is also easy to produce only one subplot with Plots:

using Plots
using RDatasets
data = dataset("datasets", "iris")
scatter(data.SepalLength, data.SepalWidth, group = data.Species)

phelipe · March 3, 2019, 3:30pm

You can also see DataVoyager.jl

mkborregaard · March 3, 2019, 4:19pm

It should be relatively easy to modify corrplot to do what you requested (see “Pairplot with different colors for each group”, Issue #217 in JuliaPlots/StatsPlots.jl).
I’ve simply not had the time to follow up, as I’ve got a little too much at work these days.
