[ANN] A new lightning fast package for data manipulation in pure Julia

Have you considered adding a radix sort algorithm?
From discussions around other libraries (such as data.table and DataFrames.jl), I’ve seen that it can be faster.

In IMD’s manual, when a function shares its name with one in Base Julia, I write the Base prefix if I mean the Base function; otherwise I mean IMD’s function. In the groupby case, I listed the accepted keyword arguments right after the sentence you mention, but I think it would be better to add a link for clarification.

Unfortunately, I am not very clear about what format you are looking for here, because those two examples are very different. Would you please elaborate on this a little more?

IMD always uses Union{Missing, T} as the element type of a data set’s columns, and my expectation was that the extra exported functions you mention would be used in the context of data manipulation. However, I think it would be better to make those extra functions more general.
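
To illustrate the point about generality, here is a minimal Base Julia sketch (no IMD involved). The helper name `safe_sum` is hypothetical and only stands in for the kind of exported function being discussed; it accepts both a column that allows missing values and a plain vector.

```julia
# A column that allows missing values has element type Union{Missing, T}.
col = Union{Missing, Int}[1, missing, 3]

# `safe_sum` is a hypothetical helper used only for this sketch.
# Written against AbstractVector, it works for missing-aware columns
# and for ordinary vectors alike.
safe_sum(v::AbstractVector) = sum(skipmissing(v))

total_with_missing = safe_sum(col)
total_plain = safe_sum([1, 2, 3])
```

The design choice here is simply to avoid restricting the argument type to Union{Missing, T} vectors, so the function stays usable outside a data set.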

You are right.
For huge data sets, using the String type for a column is a dead end. I think IMD needs an efficient and flexible fixed-length string type. There is an issue about this on IMD’s GitHub.

You can track or contribute to this issue on GitHub.

Fixed on master.

Yes; however, I couldn’t find any implementation of radix sort in Julia that fits my needs.
Although the sort operations (e.g. the groupby function) in IMD are very fast, there is room for improvement.
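
For readers unfamiliar with the algorithm being discussed, a generic least-significant-digit radix sort can be sketched as below. This is a textbook version for UInt64 keys, processed one byte per pass; it is not IMD's sorting code, and the function name `radix_sort` is just a label for this sketch.

```julia
# LSD radix sort for UInt64 keys: 8 passes, one byte (256 buckets) per pass.
function radix_sort(v::Vector{UInt64})
    src = copy(v)
    dst = similar(v)
    for shift in 0:8:56
        counts = zeros(Int, 257)
        # Histogram of the current byte, stored one slot to the right.
        for x in src
            b = Int((x >> shift) & 0xff)
            counts[b + 2] += 1
        end
        # Prefix sums: counts[b + 1] becomes the number of keys with a smaller byte.
        for i in 2:257
            counts[i] += counts[i - 1]
        end
        # Stable scatter into the destination buffer.
        for x in src
            b = Int((x >> shift) & 0xff)
            counts[b + 1] += 1
            dst[counts[b + 1]] = x
        end
        src, dst = dst, src
    end
    return src  # after an even number of passes the result is in `src`
end

sorted_keys = radix_sort(UInt64[170, 45, 75, 90, 2, 802, 24, 66])
```

Each pass is a stable counting sort, which is what lets the passes compose into a full sort without any comparisons between keys.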

I was wondering whether you would write a paper or blog post about the sort operations in IMD, since reading the source is not the most efficient way for users to learn and understand this new package. (I found the introductory blog posts by Hadley (for dplyr, years ago) and by @bkamins (over the past two years) very useful, to mention just a few.)

I will do this in due course.

As for the expected format, I would like similar output when some (or all) of the elements of a table column are tables themselves.
This example came from reading several CSV files whose names were stored in a column of a table. The task was then to add a column holding the contents of the corresponding CSV files.


using DataFrames

df = DataFrame(a = 1:3)
df1 = DataFrame(b = 11:15)
df2 = DataFrame(f = ["f1", "f2", "f3"], oc = 1:3, sdf = [df, df, df1])

3×3 DataFrame
 Row │ f       oc     sdf
     │ String  Int64  DataFrame     
─────┼──────────────────────────────
   1 │ f1          1  3×1 DataFrame 
   2 │ f2          2  3×1 DataFrame 
   3 │ f3          3  5×1 DataFrame 

On the representation of the result of a groupby done on a formatted column: I observed that if a kwarg like the by of the Base.sort function were used, the resulting key column would hold the original data rather than the formatted values.
The advantage, in my opinion, would be a clearer view of the starting situation.
One effect of applying a non-invertible formatting function is that two different values can map to the same value.
For example, in the abs case, when I see a run of 4s I can’t tell which of them were originally +4 and which were -4.
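
The difference can be sketched in Base Julia alone (no IMD calls; the variable names are only for illustration). Sorting with `by = abs` orders by the transformed key while keeping the original values, whereas grouping on `abs(x)` as the key merges +4 and -4 into one group.

```julia
xs = [4, -4, 2, -2, 4]

# Ordered by abs(x), but the elements keep their original sign.
sorted_xs = sort(xs; by = abs)

# Grouping on the transformed key: +4 and -4 end up under the same key 4.
groups = Dict{Int, Vector{Int}}()
for x in xs
    push!(get!(groups, abs(x), Int[]), x)
end

# The keys alone no longer say which sign each member had.
group_keys = sort(collect(keys(groups)))
```

Here the original values are still recoverable only because the sketch stores them inside each group; a key column that shows just the formatted value does not carry that information.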

I see. Fixed it in master.

You can use removeformat! to remove unwanted formats. Setting a format on a column doesn’t change the actual values; moreover, setting and removing formats are both instantaneous.
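
One way to picture why this can be cheap is the toy model below. It is a hypothetical illustration, not IMD's actual data structure or API: the type `FormattedColumn` and the helpers `set_format!`, `remove_format!` and `formatted` are invented for this sketch. The format is stored next to the data and applied only on demand, so setting or removing it never touches the stored values.

```julia
# Hypothetical toy model; not IMD's implementation.
mutable struct FormattedColumn{T}
    data::Vector{T}     # stored values, never modified by formatting
    format::Function    # applied only when a formatted view is requested
end

# A column starts with no format, i.e. the identity function.
FormattedColumn(data::Vector{T}) where {T} = FormattedColumn{T}(data, identity)

# Setting or removing a format only swaps the stored function.
set_format!(col::FormattedColumn, f::Function) = (col.format = f; col)
remove_format!(col::FormattedColumn) = (col.format = identity; col)

# The formatted view is computed on demand.
formatted(col::FormattedColumn) = map(col.format, col.data)

col = FormattedColumn([4, -4, 2])
set_format!(col, abs)
shown = formatted(col)       # formatted view while the format is set
remove_format!(col)
restored = formatted(col)    # original values are visible again
```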

I read somewhere that Polars needs memory of four times the size of the data to work smoothly.

I find it suspicious that a one-person project can provide more features and better performance than DataFrames.jl and CSV.jl. Have you submitted your benchmarks to the Database-like ops benchmark?

I have nothing to do with either package, and I find that somewhat inelegant. All packages are free to use and test, and you are free to test whatever you find suspicious. It also seems clear that the packages have different development constraints, so it is always possible that a new package that targets a different subset of functionality, or is not constrained by some compatibility goals, can achieve better results in specific cases.

Those benchmarks are not maintained anymore. The runtime is very long.
Several users have sent them improved versions of the Julia code, but the site doesn’t update the benchmarks.

What about stack/unstack?

Yes.
