[ANN] A new lightning fast package for data manipulation in pure Julia

The following program runs amazingly fast on my data (0.3 seconds), but some rows in the output are incorrect, even though it works fine for the MWE. Any idea?

julia> using InMemoryDatasets
julia> df = Dataset(v1 = [1.0,2.1,3.0], v2 = [1,3,3], v3 = [missing,2.1,3.0])
julia> f(x,y) = ismissing(x) || ismissing(y) ? true : x == y
julia> byrow(df, isless, [:v1, :v2, :v3], with = :v1, lt = f)
3-element Vector{Bool}:
 1
 0
 1

Your program produces incorrect results for rows where :v1 is missing. Use

julia> byrow(df, isless, [:v1, :v2, :v3], with = byrow(df, coalesce, :), lt = f)
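Why the choice of reference column matters can be sketched in plain Julia, without the package. This is only an illustration under the assumption that the call above asks whether every value in a row matches the reference value under f; the helper `matches_all`, the `rows` data and the extra fourth row are hypothetical.

```julia
# Comparator from the question: a missing value matches anything.
f(x, y) = ismissing(x) || ismissing(y) ? true : x == y

# Hypothetical helper: does every value in row `r` match the reference `ref`?
matches_all(ref, r) = all(v -> f(ref, v), r)

# The three rows of the MWE plus one row whose first value is missing.
rows = [(1.0, 1, missing), (2.1, 3, 2.1), (3.0, 3, 3.0), (missing, 2, 3)]

# Reference = first column: a missing reference matches everything.
by_first = [matches_all(first(r), r) for r in rows]          # [true, false, true, true]

# Reference = first non-missing value of the row, which is what coalesce returns.
by_coalesce = [matches_all(coalesce(r...), r) for r in rows]  # [true, false, true, false]
```

The last row shows the problem: with a missing reference every comparison succeeds, so the mismatch between 2 and 3 goes unnoticed.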

Nice package and benchmarks. One thing those benchmarks don’t show is how memory-hungry Polars is. I recently benchmarked the 1e8 case on my 16 GB Mac, and DataFrames was about twice as fast as Polars! How? Simply because macOS lets processes use the disk as memory (swap), and Polars was so hungry for memory that it had to write to and read from the disk frequently.
Good job, keep at it.


Ah, you are right. It works like a charm, thanks. The running time is below 0.5 seconds, better than I had hoped.


You can squeeze out the last bit of performance by changing isless to issorted, since you then don’t need byrow(df, coalesce, :).

julia> byrow(df, issorted, [:v1, :v2, :v3], lt = !f)
3-element Vector{Bool}:
 1
 0
 1
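How `lt = !f` turns issorted into an equality check can be sketched with Base alone. The sketch assumes the row-wise version follows the same adjacent-pair rule as `Base.issorted`; the `rows` data simply repeats the MWE.

```julia
# Comparator from the question: a missing value matches anything.
f(x, y) = ismissing(x) || ismissing(y) ? true : x == y

# Base.issorted returns false as soon as lt(next, previous) is true for some
# adjacent pair. With lt = !f this means "some adjacent pair does not match".
rows = [[1.0, 1, missing], [2.1, 3, 2.1], [3.0, 3, 3.0]]
adjacent_ok = [issorted(r, lt = !f) for r in rows]   # [true, false, true]

# Caveat: only neighbours are compared, so a missing value between two
# different numbers hides the mismatch.
hidden = issorted([1, missing, 2], lt = !f)           # true, although 1 != 2
```

If missing values can sit between non-missing ones, the coalesce-based reference check and this adjacent-pair check can therefore disagree.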

@sl-solution it is really a cool package! I like the data transpose possibilities


I like the fact that unique lets you keep only the non-unique rows.

julia> data=Dataset(x=[1,1,2,3,3])
5×1 Dataset
 Row │ x
     │ identity
     │ Int64?
─────┼──────────
   1 │        1
   2 │        1
   3 │        2
   4 │        3
   5 │        3

julia> unique(data,keep=:only)
4×1 Dataset
 Row │ x
     │ identity
     │ Int64?
─────┼──────────
   1 │        1
   2 │        1
   3 │        3
   4 │        3
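For comparison, the same "keep only the duplicated values" selection can be written by hand in Base Julia. This is a sketch of the idea, not of how the package implements it; `counts` and `only_duplicated` are illustrative names.

```julia
x = [1, 1, 2, 3, 3]

# Count how often each value occurs.
counts = Dict{Int,Int}()
for v in x
    counts[v] = get(counts, v, 0) + 1
end

# Keep every occurrence of the values that appear more than once.
only_duplicated = [v for v in x if counts[v] > 1]   # [1, 1, 3, 3]
```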

Congratulations on your new package. I have been monitoring Julia for a while, and in comparison to “R dplyr” its packages suffered from a lack of features. I am glad to see that changing. Good luck.


I read the documentation of the groupby function quickly (perhaps too quickly) and tried to use the by = kwarg.

But obviously I didn’t quite understand how it works.
Nor did I get better results with the replace function.

In case this helps in analysing the replace case …

using InMemoryDatasets
g1 = repeat(1:6, inner = 4)
g2 = repeat(1:4, 6)

ds = Dataset(g1 = g1, g2 = g2)

function fortofor(x)
    x==4 ? - 4 : x
end
mds=map(ds, x->fortofor(x), :g1)

mds=map(ds, x->replace([x],4=>-4), :g1)

julia> mds=map(ds, x->replace([x],4=>-4), :g1)
24×2 Dataset
 Row │ g1        g2
     │ identity  identity
     │ Union…?   Int64?
─────┼────────────────────
   1 │ [1]              1
   2 │ [1]              2
   3 │ [1]              3
   4 │ [1]              4
   5 │ [2]              1
   6 │ [2]              2
   7 │ [2]              3
   8 │ [2]              4
   9 │ [3]              1
  10 │ [3]              2
  11 │ [3]              3
  12 │ [3]              4
  13 │ [-4]             1
  14 │ [-4]             2
  15 │ [-4]             3
  16 │ [-4]             4
  17 │ [5]              1
  18 │ [5]              2
  19 │ [5]              3
  20 │ [5]              4
  21 │ [6]              1
  22 │ [6]              2
  23 │ [6]              3
  24 │ [6]              4
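The bracketed cells in the output above come from wrapping each value in a vector before calling replace. A Base-only sketch of the difference between working per value and per column follows; the variable names are illustrative.

```julia
g1 = repeat(1:6, inner = 4)

# Per value: the function receives one number, and [x] wraps it in a vector,
# so replace returns a 1-element vector instead of a number.
cell = replace([4], 4 => -4)                 # [-4]

# Per value, returning a plain number instead.
scalars = map(x -> x == 4 ? -4 : x, g1)

# Per column: replace works on the whole collection at once.
whole = replace(g1, 4 => -4)
```

Both `scalars` and `whole` hold plain integers, with every 4 turned into -4.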

g1 = repeat(1:6, inner = 4)
g2 = repeat(1:4, 6)
ds = Dataset(g1 = g1, g2 = g2)

function fortofor(x)
    x==4 ? - 4 : x
end
mds=map(ds, x->fortofor(x), :g1) # works

modify!(ds,:g1=>x->fortofor.(x)) # works

modify!(ds,:g1=>x-> @. -(x==4) * x + x * !(x==4) ) #works

byrow(modify!(ds,:g1=>x->x==4 ? - 4 : x)) # doesn't work

The latter fails perhaps because I ran the following instructions in this order:

julia> modify!(ds,:g1=>byrow(x->x==4 ? - 4 : x))
24×2 Dataset
 Row │ g1                                 g2
     │ identity                           identity
     │ Array…?                            Int64?
─────┼─────────────────────────────────────────────
   1 │ Expr[:($(Expr(:BYROW, Union{Miss…         1
   2 │ Expr[:($(Expr(:BYROW, Union{Miss…         2

julia> byrow(modify!(ds,:g1=>x->x==4 ? - 4 : x))
1-element Vector{Expr}:
 :($(Expr(:BYROW, 24×2 Dataset
 Row │ g1                                 g2
     │ identity                           identity
     │ Array…?                            Int64?
─────┼─────────────────────────────────────────────
   1 │ Expr[:($(Expr(:BYROW, Union{Miss…         1
   2 │ Expr[:($(Expr(:BYROW, Union{Miss…         2

One of the key benefits of having multiple packages in one area is competition. This is great for the Julia ecosystem. Let’s face it: DataFrames.jl has been out there for many years, but it still lacks lots of features (compared to all competitors, including IMD). IMD may wake the DF developers up and force them to shake things up. The point is that, in the end, the whole Julia ecosystem will enjoy the benefits.


IMO, instead of working on a competitor, it would have been more beneficial to just work on the missing features. In a world where developers work for free and aren’t massive in number, competition isn’t necessarily the best strategy for getting the best result for an ecosystem.
(This isn’t a criticism of IMD. The creators decide as they wish, and I respect that; no hard feelings. My point is just that competition is not what the Julia ecosystem needs right now.)


Fair competition is good. It can bring diversity, and diversity always works and should be welcome. Looking at the surface of the new package, I get the feeling that it includes lots of cool stuff, much of which seems unique to this package, and that can attract new users to start learning Julia. Besides, I think a fresh rewrite of an old package sometimes works better than patching it with new features. Old packages were usually written when Julia wasn’t mature, to fill the needs of the time; now, with a mature language, writing from scratch may work better.


that’s quite the claim


Some remarks:

  • In IMD you should use map/map! to call a function on each value of the selected columns. The IMD documentation has more details about this.
  • By default, groupby uses the formatted values for grouping observations. Thus, to group based on abs values, set abs as the format of your selected columns.
  • Use modify! or modify to call a function on a column as a whole. These functions are for modifying the columns of a data set.
using InMemoryDatasets
g1 = repeat(1:6, inner = 4)
g2 = repeat(1:4, 6)
ds = Dataset(g1 = g1, g2 = g2)
mds = modify(ds, :g1 => x->replace(x, 4 => 0, count=3)) # you are applying replace on whole column not individual values
g1 = rand(-4:4, 24)
g2 = repeat(1:4, 6)
ds = Dataset(g1=g1, g2=g2)
groupby(ds, :g1)
setformat!(ds, :g1 => abs)
groupby(ds, :g1)  # use formatted values, i.e. abs values
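The idea of grouping on formatted values can be sketched in Base Julia by treating the format as a key function. This only illustrates the concept, not the package's implementation; `group_indices` is a hypothetical helper.

```julia
g1 = [-2, 1, 2, -1, 0]

# Hypothetical helper: map each key fmt(value) to the row indices that share it.
function group_indices(fmt, values)
    groups = Dict{Any,Vector{Int}}()
    for (i, v) in enumerate(values)
        push!(get!(groups, fmt(v), Int[]), i)
    end
    return groups
end

raw = group_indices(identity, g1)    # five groups, one per distinct value
formatted = group_indices(abs, g1)   # three groups, with keys 0, 1 and 2
```

With `abs` as the key function, rows 1 and 3 (values -2 and 2) end up in the same group, which mirrors what setting the format to abs achieves before grouping.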

The byrow function is a stand-alone function with byrow(ds, fun, cols) as its general syntax. You can use ?byrow to see its general documentation and ?byrow(fun) for documentation specific to byrow(ds, fun, cols), e.g. ?byrow(sum). In your code

byrow(modify!(ds,:g1=>x->x==4 ? - 4 : x))

you modify ds with modify! and pass the result as the first argument of byrow. However, the call is missing the second argument of byrow, so it is not a valid call. I recommend using the Chain package to give a better structure to the sequence of operations performed on a data set.

The only time you can use byrow without the ds and cols arguments is inside the modify/modify! or combine functions, since those arguments are derived from the arguments of modify/modify!/combine.

g1 = repeat(1:6, inner = 4)
g2 = repeat(1:4, 6)
ds = Dataset(g1 = g1, g2 = g2)
modify!(ds,:g1=>byrow(x->x==4 ? - 4 : x)) # here ds is modified in place; to call byrow on it afterwards, use the byrow(ds, fun, cols) syntax

The documentation of the groupby function says that you can use all the kwargs of the sort function, so I thought I could also use by, which however is a kwarg of the Base sort function.

setformat!(ds, :g1 => abs)
groupby(ds, :g1) 

sort(g1, by=abs)

I’d still rather have a result like this

m=rand(-4:4, (20,5))
ds1=Dataset(m,:auto)
n=names(ds1)
cols=eachcol(ds1)
Dataset(map(t->(;zip(Symbol.(n),t)...), sort(tuple.(cols...), by=abs ∘ first)))
julia> Dataset(map(t->(;zip(Symbol.(n),t)...), sort(tuple.(cols...), by=abs ∘ first)))
20×5 Dataset
 Row │ x1        x2        x3        x4        x5 ⋯
     │ identity  identity  identity  identity  id ⋯
     │ Int64?    Int64?    Int64?    Int64?    In ⋯
─────┼─────────────────────────────────────────────
   1 │        0         3        -1         2     ⋯
   2 │        0         0        -2        -3
   3 │        0        -3         0         1
   4 │        1         4         0        -1
   5 │       -1         3        -4         3     ⋯
   6 │       -1        -3         4        -4
   7 │       -1         1         1        -2
   8 │       -2         0        -1        -3
   9 │       -2         0        -4        -2     ⋯
  10 │       -2        -3         0         1
  11 │        2         0        -2         0
  12 │       -2         0         2        -3
  13 │       -2        -1         1         1     ⋯
  14 │       -2        -4        -3        -4
  15 │        3         0         1        -4
  16 │       -3         4         4         1
  17 │       -3        -3         4         4     ⋯
  18 │       -4         0        -1        -3
  19 │        4         2        -4         4
  20 │       -4        -1         4         4

with an output like the following (perhaps nicely formatted) 

keys=unique(abs.(dsg.x1))
sds=map(k->filter(dsg, :x1, by= x->abs(x)==k), keys)

julia> g=Dataset(k=keys,groups=sds)
5×2 Dataset
 Row │ k         groups
     │ identity  identity
     │ Int64?    Dataset?
─────┼─────────────────────────────────────────────
   1 │        0  \e[1m3×5 Dataset\e[0m\n\e[1m Row…
   2 │        1  \e[1m4×5 Dataset\e[0m\n\e[1m Row…
   3 │        2  \e[1m7×5 Dataset\e[0m\n\e[1m Row…
   4 │        3  \e[1m3×5 Dataset\e[0m\n\e[1m Row…
   5 │        4  \e[1m3×5 Dataset\e[0m\n\e[1m Row…

… rather than like this

julia> groupby(ds, :g1)
24×2 View of Grouped Dataset, Grouped by: g1
 g1      g2       
 abs     identity
 Int64?  Int64?
──────────────────
      0         2
      0         1
      1         1
      1         3
      1         4
      1         2
      1         3
      1         4
      1         1
      1         2
      1         3
      1         3
      2         3
      2         2
      2         1
      2         4
      3         1
      3         4
      3         2
      4         4
      4         2
      4         1
      4         3
      4         4
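The "sort by absolute value, then split into one block per key" result wanted in this post can be sketched on a plain matrix with Base Julia. The small matrix stands in for the random 20×5 one, and all names are illustrative; nothing here uses the package.

```julia
# Small stand-in for the 20×5 matrix in the post.
m = [-2 10; 1 20; 2 30; -1 40; 0 50]

# Row order that sorts the first column by absolute value.
p = sortperm(m[:, 1], by = abs)
sorted = m[p, :]

# One sub-matrix per absolute value of the first column, in sorted key order.
ks = unique(abs.(sorted[:, 1]))
groups = [sorted[abs.(sorted[:, 1]) .== k, :] for k in ks]
```

Using sortperm on one column and indexing the rows with the result avoids building a tuple per row, and the same permutation could be applied to each column of a tabular container.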

Thank you.
I tried using the help and was intrigued by the cummax function (which I had never seen before) and did the following test …

julia> cummax([1,3,2,1,4,3,2,-1,5,6,5])
ERROR: MethodError: no method matching cummax(::Vector{Int64})      
Closest candidates are:
  cummax(::AbstractArra

Is this expected?
The following seems to work as expected:

julia> cummax([1,3,2,1,4,3,2,missing,-1,5,6,5])
12-element Vector{Union{Missing, Int64}}:
 1
 3
 3
 3
 4
 4
 4
 4
 4
 5
 6
 6
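The output above carries the last maximum across the missing entry. A Base-only sketch of that behaviour is given below; `running_max` is a hypothetical helper that reproduces the result shown, not the package's implementation, and the cause of the MethodError for a plain integer vector is not addressed here.

```julia
# Hypothetical helper: running maximum that carries the last maximum
# across missing values.
function running_max(v)
    out = similar(v)
    best = missing
    for (i, x) in enumerate(v)
        if !ismissing(x)
            best = ismissing(best) ? x : max(best, x)
        end
        out[i] = best
    end
    return out
end

with_missing = running_max([1, 3, 2, 1, 4, 3, 2, missing, -1, 5, 6, 5])

# For a vector without missing values, Base alone is enough.
plain = accumulate(max, [1, 3, 2, 1, 4, 3, 2, -1, 5, 6, 5])
```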

Don’t you think that, with your attitude, Julia itself would never have been created in the first place?
BTW, AFAIK the Julia ecosystem is full of competition: look at plotting, for example.

It’s strange Polars uses so much memory.
Isn’t Polars internally a Rust library and isn’t Rust supposed to use memory more efficiently than Julia?