ANN: MLDataPattern.jl



Hello everyone!

I am really happy to finally announce the next JuliaML package reaching a stable state: MLDataPattern.jl, and with it the long overdue update of MLDataUtils.jl, which now uses MLDataPattern as one of its back-ends and thus serves as a meta package.

It took us months to get here. The last tag of MLDataUtils was right around the 0.5 release. Since then we have completely redesigned the data munging functionality and, just recently, moved it into its own package, MLDataPattern, because of code complexity. With this change, the original package MLDataUtils now serves as a convenient end-user-facing package that reexports all data-related functionality of JuliaML.



MLDataPattern is a long-running effort from a few of us to design and implement a package for common ML data access patterns in a Julian manner. As such, you may find it a bit unintuitive at first if you are used to frameworks from other languages. Yet we think the benefits are worth it. Most notably, the package provides a number of patterns for lazily shuffling, partitioning, and resampling data sets of various types and origins. At its core, the package was designed around the key requirement of allowing any user-defined type to serve as a custom data source and/or access pattern in a first-class manner. We tried to accomplish this by designing the package to be as data-container agnostic as we could.
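To give a rough idea of how the patterns described above compose, here is a minimal sketch. It assumes the package is installed and uses a random matrix as stand-in data; the sizes and split ratio are arbitrary choices for illustration only.

```julia
using MLDataPattern  # assumes the package is installed

# Toy data (an assumption for this sketch): 4 features, 150 observations as columns
X = rand(4, 150)

# Lazily shuffle the observations, then hold out 30 % of them for testing
train, test = splitobs(shuffleobs(X), at = 0.7)

# Partition the training portion using a 5-fold scheme
for (train_fold, val_fold) in kfolds(train, k = 5)
    # Iterate over the current fold in mini-batches of 7 observations each
    for batch in eachbatch(train_fold, size = 7)
        # `batch` should be a view into X; the data itself is not copied here
    end
end
```

Each step only wraps the previous one with a new set of indices, which is why the operations can be nested freely.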

Check it out! The documentation is very comprehensive.


Closing Words

Let me know what you think. Any kind of feedback or criticism is very welcome!

Big thanks to @tbreloff and @oxinabox for design and code contributions to the data access patterns!


This looks really useful. From looking at it, much of the functionality seems as if it could extend well beyond machine learning, as general data handling functionality. Perhaps by defining it on new user types?
Yet you write that it explicitly belongs under the framework of machine learning. Can you expand a bit on why that is? That would be useful to know before I consider applying it to another purpose.


Well, really MLDataPattern is about nesting data sub-setting operations. This is a common theme in machine learning, which is one of the very few interesting areas that I know something about. I emphasize in the documentation that it is “machine learning specific” to make clear that the MLDataPattern package has nothing to do with any select/groupby/summarize kind of operation. It doesn’t care about the data itself, or even what it represents (other than potential prediction targets). It only cares about sub-setting.
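As a small illustration of what nesting sub-setting operations means, here is a hedged sketch using a plain matrix as stand-in data. The matrix and the index ranges are made up for the example.

```julia
using MLDataPattern  # assumes the package is installed

# Toy data (an assumption for this sketch): 2 features, 100 observations as columns
X = rand(2, 100)

outer = datasubset(X, 11:20)    # observations 11 to 20, without copying
inner = datasubset(outer, 1:3)  # the first three of those, i.e. 11 to 13

# getobs materializes the nested subset into a regular array
subset = getobs(inner)
```

Note that nothing here looks at the values in `X`; only the observation indices are tracked.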

Does that answer your question? If there is some specific use case you are uncertain of, please also feel free to message me personally on gitter.


Thanks, I’ll catch you on gitter :slight_smile:


(I can’t find a link to the gitter room. Where is it? Maybe worth linking to it too.)



We do have a section on “Getting help” where it is listed. Maybe I should add it to the readme as well.


This looks like an elegant approach to easing some very common operations. Nice work!