So, for somebody starting to use Julia: which of these many options should we use for new projects if we want to work with tabular data containing real numbers, categorical variables and missing data, as fast as possible?
I think all you really need to deal with, in your case, would be DataFrames.
But many others are supposed to be faster. I need that speed, plus compatibility with the most common packages.
In R, I was using data.table and matrices.
Have you tried DataFrames? What is your specific use case? What did you find lacking? If you just need to use tabular data, DataFrames is perfect.
@Juan I need… compatibility with the most common packages.
Do you mean other packages within the Julia ecosystem? Well, Julia is highly composable, which means that once you have read your data into a DataFrame, it should be compatible with the other packages.
Look at the short section on Composable at https://www.julialang.org
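As a rough sketch of what that composability looks like in practice, assuming the DataFrames, CategoricalArrays and GLM packages are installed (GLM is only one illustrative downstream package, and the column names and values here are made up):

```julia
using DataFrames, CategoricalArrays, GLM

# Toy table with a real-valued response, a real-valued predictor
# containing one missing entry, and a categorical variable.
df = DataFrame(
    y = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8],
    x = [1.0, 2.0, 3.0, missing, 5.0, 6.0],
    group = categorical(["a", "b", "a", "b", "a", "b"]),
)

# Drop the rows that contain missing values before fitting.
clean = dropmissing(df)

# The DataFrame is passed to GLM directly; no conversion to a matrix is needed.
model = lm(@formula(y ~ x + group), clean)
```

The same `clean` table could then be handed to other table-aware packages without first being copied into a different structure.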
Yes, that is what I mean. DataFrames is compatible with other packages within the Julia ecosystem, but it’s slow. So I need something faster that is also compatible with packages such as GLM, Plot, MixedModels… That is, I want to be able to pass data in the chosen structure/framework directly to them as input.
Common interfaces such as
may eventually do this seamlessly, but currently that is work in progress.
Will it be faster than dataframes? Will its syntax be much more complex/verbose?
You haven’t explained exactly which operations are slow on DataFrames, or shown us the code you are using to perform them.
We might be able to help you improve performance if you give us something more concrete we can help you with.
Do you mean using DataFrames, i.e. loading the package? Yes, this is indeed slow at the moment. The solution will be a near-future version of the Julia compiler that will be able to pre-compile, store, and load packages.
After the one-time load cost, DataFrames seem pretty fast and efficient.
regards,
/iaw
No, I mean that working with DataFrames is slow: traversing their elements and operating on them.
I will create an MWE and post it here to try several things.
For now, we can look at others’ benchmarks. For example, StaticArrays can be much faster than dataframes:
============================================
Benchmarks for 3×3 Float64 matrices
============================================
Matrix multiplication -> 8.2x speedup
Matrix multiplication (mutating) -> 3.1x speedup
Matrix addition -> 45x speedup
Matrix addition (mutating) -> 5.1x speedup
Matrix determinant -> 170x speedup
Matrix inverse -> 125x speedup
Matrix symmetric eigendecomposition -> 82x speedup
Matrix Cholesky decomposition -> 23.6x speedup
That means that dataframes can be very slow.
It would be great to see other benchmarks comparing speeds for different sizes and types of elements, along with some recommendations about what to use in each situation.
StaticArrays and DataFrames are completely different things; one can’t really compare them.
If you want to get help on this, it is best to post an MWE.
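For the "traversing its elements" case, an MWE might look something like the sketch below. It assumes only the DataFrames package; the column names `a` and `b` and the function names are hypothetical, and the pattern shown (pulling the columns out and passing them to a separate function, often called a function barrier) is a commonly suggested approach rather than something taken from this thread.

```julia
using DataFrames

df = DataFrame(a = rand(1_000_000), b = rand(1_000_000))

# Loop that indexes into the DataFrame on every iteration.
# The column types are not known to the compiler inside this function,
# so each access to df.a and df.b is resolved at run time.
function sum_rows_slow(df)
    total = 0.0
    for i in 1:nrow(df)
        total += df.a[i] * df.b[i]
    end
    return total
end

# Inner function that receives the columns as plain vectors.
# Here the element types are known, so the loop can be compiled efficiently.
function sum_cols(a, b)
    total = 0.0
    for i in eachindex(a, b)
        total += a[i] * b[i]
    end
    return total
end

# Extract the columns once, then call the typed inner function.
sum_rows_fast(df) = sum_cols(df.a, df.b)
```

Calling each function once to compile it and then timing a second call with `@time` should show whether the per-element access is where the time goes in your own code.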