I have checked the documentation and it seems very functional and nice to use. I will try the package on my current problem and send you my feedback.
Hi,
I started looking into LearningHorse.jl and it looks great.
I'm working with DataFrames.jl, and I noticed that DataSplitter() supports DataFrames, but OneHotEncoder() only works with integers.
I wrote a small multiple-dispatch extension of the OHE() function and I'm posting it here.
If you think it can contribute to your package, feel free to take it.
P.S. I'm not sure it covers all the cases, but it works for mine, and it can be a good start.
using DataFrames

# Encode a vector of strings as 0/1 indicator columns, one per unique value.
function (OHE::OneHotEncoder)(data::AbstractVector{T}; prefix="") where {T<:String}
    prefix = isempty(prefix) ? "OHE_" : prefix * "_"
    unqs = unique(data)
    out = DataFrame(Dict(string(prefix, unq) => zeros(length(data)) for unq in unqs))
    for k in unqs
        # Index by column name: the Dict constructor above does not preserve
        # insertion order, so a positional index could hit the wrong column.
        out[findall(isequal(k), data), string(prefix, k)] .= 1
    end
    return out
end
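For example, on a plain vector (a hypothetical call; it assumes OneHotEncoder() can be constructed with no arguments, as suggested above):

OHE = OneHotEncoder()
OHE(["Sun", "Sat", "Sun"], prefix="day")
# -> 3×2 DataFrame with 0/1 indicator columns day_Sat and day_Sun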
# Encode selected columns of a DataFrame, given as names, Symbols, or indices.
function (OHE::OneHotEncoder)(df::DataFrame, cols::Vector{T}) where T
    out = deepcopy(df)
    # Resolve integer indices and Symbols to column names up front, so that
    # dropping columns inside the loop does not shift later integer indices.
    colnames = [col isa Int ? names(df)[col] : string(col) for col in cols]
    for col in colnames
        data = out[!, col]
        out = select(out, Not(col))
        out = hcat(out, OHE(data, prefix=col))
    end
    return out
end
Using my DataFrame, whose first row is:
DataFrameRow
 Row │ total_bill  tip      sex     smoker  day     time    size
     │ Float64     Float64  String  String  String  String  Int64
─────┼────────────────────────────────────────────────────────────
   1 │      16.99     1.01  Female  No      Sun     Dinner      2
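With the methods above, the categorical columns of this frame can be encoded in one call (hypothetical usage; df stands for the full tips DataFrame whose first row is shown):

OHE = OneHotEncoder()
encoded = OHE(df, [:sex, :smoker, :day, :time])
# sex, smoker, day and time are each replaced by 0/1 columns
# such as sex_Female, smoker_No, day_Sun, time_Dinner, ...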
The docs are nice, and the “getting started” section does indeed make it look very straightforward to get started!
Can you elaborate on how LearningHorse differs (in philosophy, goals, strengths, etc.) from other machine learning packages/ecosystems such as Flux and MLJ?
Thank you for the great suggestion! I'll take this code.
I found some bugs in LearningHorse.Preprocessing, so I'll fix those as well and release a new version.
LearningHorse doesn't yet guarantee correct operation with DataFrames, but I would like to support them soon.
You can easily use various models in Julia
I aim for LearningHorse to be a powerful library that can build various models.
Fewer dependencies
When I use a library, I often get errors from its dependencies (only me?), so I don't think a library should have too many dependencies.
For those learning machine learning with Julia
Python has not only very advanced libraries dedicated to neural networks, but also libraries for building various models that are easy for beginners to use. I want LearningHorse to become such a library in Julia.
In other words, I want to make LearningHorse a simple library that is powerful enough to build various models and easy to use even for beginners.
I didn't know about this library, and I'm not sure about its specifications. From the name and the code it looks like it focuses on decision tree analysis, but looking at the documentation, it seems that regression is also supported.
However, this library seems to differ in that it allows fine-grained settings, such as the loss function used to train a decision tree model.
I'm developing LearningHorse by myself, so many of the models are still incomplete, but I want to fix that in the future.
Such efforts are great (especially for the author(s)) as learning tools for elementary ML: having pure Julia code implementing simple models that can be read fairly easily.
Often, though, authors tend to abandon these efforts after a while, because they move on to learning other stuff and because it quickly becomes (very) hard to maintain a reasonable number of models. If you look at sklearn, they managed to do this because there is institutional backing behind it and, as a result, a lot of contributors (+ now a ton of users).
That shouldn't mean the effort is not worthwhile and interesting! But it helps explain the difference with MLJ: MLJ had the ambition of providing a backbone for ML, not quite the models themselves, but the chaining of ML steps within a full workflow, from data ingestion to prediction, including hyperparameter tuning etc. The models are provided by dedicated libraries in Julia (such as EvoTrees or DecisionTree.jl) or in other languages (ScikitLearn.jl, LightGBM, etc.). These dedicated packages are easier to maintain, and easier to bring to full performance, than one sprawling code base.
This is both a blessing and a curse, though: while the specific model packages are lacking in Julia, people might think that MLJ is just some wrapper around ScikitLearn, since you would use ScikitLearn models if there is no dedicated or properly working package in Julia. However, this is the vision with which MLJ was built, and in the long term we can be hopeful that there will be more and better pure-Julia model packages (or packages in other languages; it doesn't really matter, and in some cases it makes more sense to just interface with an existing high-performance lib such as XGB or LGBM) which can be added to the ecosystem.
This is just for context, and it answers the question above about the distinction with MLJ: the aims are simply different. MLJ by itself does not really aim to provide the basic ML models; rather, it helps you work with packages that implement such models, within a unified workflow.
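To make this concrete, here is a minimal sketch of such an MLJ workflow (my own illustration, not from the thread; it assumes MLJ.jl and DecisionTree.jl are installed, and the exact API may vary between MLJ versions):

using MLJ

# Toy dataset shipped with MLJ (assumption: @load_iris is available in this version)
X, y = @load_iris

# The model implementation lives in DecisionTree.jl; MLJ only supplies the glue
Tree = @load DecisionTreeClassifier pkg=DecisionTree
mach = machine(Tree(max_depth=3), X, y)

# Cross-validated evaluation and measures come from the MLJ workflow layer
evaluate!(mach, resampling=CV(nfolds=5), measure=log_loss)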
Thank you, @tlienart, for the information and the comment (I knew about BetaML but not the other one).
Yes, the philosophy of MLJ is very different from the others, giving a common interface to other packages that implement different models. I have used it successfully in my ML testing, but other approaches are always welcome, especially when the code is so clear and well-documented (I teach ML in academia using R and Python, and I would like to be able to recommend Julia for that area as well; in my opinion DataFrames is great, but the machine learning packages need a bit more time).
Both approaches have their advantages and disadvantages. For instance, comparing BetaML with MLJ/DecisionTree.jl, the latter is a lot faster (EvoTrees seems nice, too). Also, a more direct package can sometimes be easier to learn.
It is true that any package maintained by a single person is difficult to keep up, so it would be nice to have more people working on them (feedback, features, ...). I think the first step is for people working on or using these packages to know about the others.