[ANN] A new lightning fast package for data manipulation in pure Julia

Honarvaghtan · March 23, 2022, 12:55pm

The following program runs amazingly fast on my data (0.3 seconds), but some rows in the output are incorrect, even though it works fine for this MWE. Any idea why?

julia> using InMemoryDatasets
julia> df = Dataset(v1 = [1.0,2.1,3.0], v2 = [1,3,3], v3 = [missing,2.1,3.0])
julia> f(x,y) = ismissing(x) || ismissing(y) ? true : x == y
julia> byrow(df, isless, [:v1, :v2, :v3], with = :v1, lt = f)
3-element Vector{Bool}:
 1
 0
 1

DataFrames · March 24, 2022, 12:21am

Your program produces incorrect results for rows where :v1 is missing. Use this instead:

julia> byrow(df, isless, [:v1, :v2, :v3], with = byrow(df, coalesce, :), lt = f)

deburko2 · March 24, 2022, 4:05am

Nice package and benchmarks. One thing those benchmarks don't show is how memory-hungry polars is. I recently benchmarked the 1e8 case on my 16 GB Mac, and DataFrames was about twice as fast as polars! How? Simply because macOS allows using the disk as memory, and polars was so hungry for it that it had to write to and read from disk frequently.
Good job, keep at it.

Honarvaghtan · March 24, 2022, 4:09am

Ah, you are right. It works like a charm, thanks. The running time is below 0.5 seconds, better than I hoped for.

sl-solution · March 24, 2022, 11:07am

You can squeeze out the last bit of performance by changing isless to issorted, since you then wouldn't need byrow(ds, coalesce, :):

julia> byrow(df, issorted, [:v1, :v2, :v3], lt = !f)
3-element Vector{Bool}:
 1
 0
 1
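For readers wondering why lt = !f gives this result, here is a minimal sketch in Base Julia that runs the same check on each row as a tuple. It assumes byrow applies the same pairwise rule as Base.issorted, which is an assumption and not something confirmed in this thread. `rows` is a hypothetical stand-in for the three rows of df.

```julia
# Same comparison function as in the original question.
f(x, y) = ismissing(x) || ismissing(y) ? true : x == y

# Hypothetical stand-in for the rows of df, one tuple per row.
rows = [(1.0, 1, missing), (2.1, 3, 2.1), (3.0, 3, 3.0)]

# Base.issorted returns false as soon as lt(current, previous) is true.
# With lt = !f, a row passes only if f holds for every neighbouring pair,
# i.e. each pair is equal or contains a missing value.
result = [issorted(row, lt = !f) for row in rows]

# Only neighbours are compared, so a missing value in the middle hides a
# mismatch between the values on either side of it.
gap_case = issorted((1, missing, 2), lt = !f)
```

In this sketch `result` is true, false, true, matching the output above, while `gap_case` is also true even though 1 and 2 differ.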

ab2z · March 25, 2022, 8:31am

@sl-solution it is really a cool package! I like the data transpose possibilities

mostafa1342004 · March 25, 2022, 2:02pm

I like the fact that unique lets you keep only the non-unique rows:

julia> data=Dataset(x=[1,1,2,3,3])
5×1 Dataset
 Row │ x        
     │ identity 
     │ Int64?   
─────┼──────────
   1 │        1
   2 │        1
   3 │        2
   4 │        3
   5 │        3

julia> unique(data,keep=:only)
4×1 Dataset
 Row │ x        
     │ identity 
     │ Int64?   
─────┼──────────
   1 │        1
   2 │        1
   3 │        3
   4 │        3

peace · March 25, 2022, 6:25pm

Congratulations on your new package. I have been following Julia for a while, and compared with R's dplyr its packages suffered from a lack of features. I am glad to see that changing. Good luck.

rocco_sprmnt21 · March 27, 2022, 12:22pm

I read the documentation of the groupby function quickly (perhaps too quickly) and tried to use the by = kwarg.

But obviously I didn't quite understand how it works.
Nor did I get better results with the replace function.

In case this helps in analysing the replace case …

using InMemoryDatasets
g1 = repeat(1:6, inner = 4)
g2 = repeat(1:4, 6)

ds = Dataset(g1 = g1, g2 = g2)

function fortofor(x)
    x==4 ? - 4 : x
end
mds=map(ds, x->fortofor(x), :g1)

mds=map(ds, x->replace([x],4=>-4), :g1)

julia> mds=map(ds, x->replace([x],4=>-4), :g1)
24×2 Dataset
 Row │ g1        g2       
     │ identity  identity
     │ Union…?   Int64?
─────┼────────────────────
   1 │ [1]              1
   2 │ [1]              2
   3 │ [1]              3
   4 │ [1]              4
   5 │ [2]              1
   6 │ [2]              2
   7 │ [2]              3
   8 │ [2]              4
   9 │ [3]              1
  10 │ [3]              2
  11 │ [3]              3
  12 │ [3]              4
  13 │ [-4]             1
  14 │ [-4]             2
  15 │ [-4]             3
  16 │ [-4]             4
  17 │ [5]              1
  18 │ [5]              2
  19 │ [5]              3
  20 │ [5]              4
  21 │ [6]              1
  22 │ [6]              2
  23 │ [6]              3
  24 │ [6]              4

rocco_sprmnt21 · March 27, 2022, 1:25pm

g1 = repeat(1:6, inner = 4)
g2 = repeat(1:4, 6)
ds = Dataset(g1 = g1, g2 = g2)

function fortofor(x)
    x==4 ? - 4 : x
end
mds=map(ds, x->fortofor(x), :g1) # works

modify!(ds,:g1=>x->fortofor.(x)) # works

modify!(ds,:g1=>x-> @. -(x==4) * x + x * !(x==4) ) #works

byrow(modify!(ds,:g1=>x->x==4 ? - 4 : x)) # doesn't work

The last one fails, perhaps because I ran the following instructions in this order:

julia> modify!(ds,:g1=>byrow(x->x==4 ? - 4 : x))
24×2 Dataset
 Row │ g1                                 g2       
     │ identity                           identity
     │ Array…?                            Int64?
─────┼─────────────────────────────────────────────
   1 │ Expr[:($(Expr(:BYROW, Union{Miss…         1
   2 │ Expr[:($(Expr(:BYROW, Union{Miss…         2

julia> byrow(modify!(ds,:g1=>x->x==4 ? - 4 : x))
1-element Vector{Expr}:
 :($(Expr(:BYROW, 24×2 Dataset
 Row │ g1                                 g2       
     │ identity                           identity
     │ Array…?                            Int64?
─────┼─────────────────────────────────────────────
   1 │ Expr[:($(Expr(:BYROW, Union{Miss…         1
   2 │ Expr[:($(Expr(:BYROW, Union{Miss…         2

deburko2 · March 27, 2022, 8:24pm

One of the key benefits of having multiple packages in one area is competition. This is great for the Julia ecosystem. Let's face it: DataFrames.jl has been out there for many years, but it still lacks lots of features (compared to all competitors, including IMD). IMD may wake up the DF developers and force them to shake things up. The point is that, in the end, the whole Julia ecosystem will enjoy the benefits.

oheil · March 27, 2022, 8:51pm

IMO, instead of working on a competitor, it would have been of more benefit to just work on the missing features. In a world where developers work for free and aren't massive in number, competition isn't necessarily the best strategy for creating the best result for an ecosystem.
(This isn't a criticism of IMD, because the creators decide as they wish. I respect that, no bad thoughts. My point is just that competition is not what the Julia ecosystem needs right now.)

deburko2 · March 27, 2022, 9:23pm

Fair competition is good: it brings diversity, and diversity always works and should be welcome. Looking at the surface of the new package, I see lots of cool features, many of which seem unique to this package, and these can attract new users to start learning Julia. Besides, I think a fresh rewrite of an old package sometimes works better than patching it with new features. The old packages were usually written when Julia wasn't mature, to fill the needs of the time, and now that the language is mature, writing from scratch may work better.

lawless-m · March 28, 2022, 9:38am

that’s quite the claim

sl-solution · March 28, 2022, 9:40am

Some remarks:

In IMD you should use map/map! to call a function on each value of selected columns. IMD documentation has more details about this.
groupby by default uses the formatted values for grouping observations. Thus, to group based on abs values, set abs as the format of your selected column.
Use modify! or modify to call a function on a column as a whole. These functions are for modifying columns of a data set.

using InMemoryDatasets
g1 = repeat(1:6, inner = 4)
g2 = repeat(1:4, 6)
ds = Dataset(g1 = g1, g2 = g2)
mds = modify(ds, :g1 => x->replace(x, 4 => 0, count=3)) # you are applying replace on whole column not individual values
g1 = rand(-4:4, 24)
g2 = repeat(1:4, 6)
ds = Dataset(g1=g1, g2=g2)
groupby(ds, :g1)
setformat!(ds, :g1 => abs)
groupby(ds, :g1)  # use formatted values, i.e. abs values
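The per-value versus whole-column distinction can also be seen in Base Julia alone, outside IMD. This is only a sketch with a plain vector. It assumes, as described in the remarks above, that map hands the function one value at a time while modify hands it the whole column.

```julia
g1 = repeat(1:6, inner = 4)

# Whole-column call: replace sees the full vector, so count limits the
# number of replacements across the entire column.
whole = replace(g1, 4 => 0, count = 3)

# Per-value call: the function sees one element at a time.
per_value = map(x -> x == 4 ? -4 : x, g1)

# Wrapping each value in a vector before calling replace produces a
# column of 1-element vectors, like the [1], [2], ... output shown
# earlier in the thread.
wrapped = map(x -> replace([x], 4 => -4), g1)
```

Here `whole` changes only the first three 4s, `per_value` changes all of them, and `wrapped` holds vectors rather than numbers.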

sl-solution · March 28, 2022, 10:01am

The byrow function is a stand-alone function with byrow(ds, fun, cols) as its general syntax. You may use ?byrow to see a general documentation about it and use ?byrow(fun) for specific documentation of byrow(ds, fun, cols), e.g. ?byrow(sum). In your code

you are modifying ds with modify! and passing the result as the first argument of byrow. However, your code is missing the second argument of byrow, so you have a syntax error there. I recommend using the Chain package to give a better structure to the operations that a user performs on a data set.

The only time that you can use byrow without ds and cols arguments is inside the modify/! or combine functions, since those arguments are derived from the modify/!/combine arguments.

g1 = repeat(1:6, inner = 4)
g2 = repeat(1:4, 6)
ds = Dataset(g1 = g1, g2 = g2)
modify!(ds,:g1=>byrow(x->x==4 ? - 4 : x)) # here ds is being modified; to call byrow on it directly, use the byrow(ds, fun, cols) syntax

rocco_sprmnt21 · March 28, 2022, 10:04pm

The documentation of the groupby function says that you can use all the kwargs of the sort function, so I thought I could also use by, although that is a kwarg of the Base sort function.

setformat!(ds, :g1 => abs)
groupby(ds, :g1) 

sort(g1, by=abs)

I’d still rather have a result like this

m=rand(-4:4, (20,5))
ds1=Dataset(m,:auto)
n=names(ds1)
cols=eachcol(ds1)
Dataset(map(t->(;zip(Symbol.(n),t)...), sort(tuple.(cols...), by=abs ∘ first)))
julia> Dataset(map(t->(;zip(Symbol.(n),t)...), sort(tuple.(cols...), by=abs ∘ first)))
20×5 Dataset
 Row │ x1        x2        x3        x4        x5 ⋯
     │ identity  identity  identity  identity  id ⋯
     │ Int64?    Int64?    Int64?    Int64?    In ⋯
─────┼─────────────────────────────────────────────
   1 │        0         3        -1         2     ⋯
   2 │        0         0        -2        -3      
   3 │        0        -3         0         1      
   4 │        1         4         0        -1      
   5 │       -1         3        -4         3     ⋯
   6 │       -1        -3         4        -4      
   7 │       -1         1         1        -2      
   8 │       -2         0        -1        -3      
   9 │       -2         0        -4        -2     ⋯
  10 │       -2        -3         0         1      
  11 │        2         0        -2         0      
  12 │       -2         0         2        -3      
  13 │       -2        -1         1         1     ⋯
  14 │       -2        -4        -3        -4      
  15 │        3         0         1        -4      
  16 │       -3         4         4         1      
  17 │       -3        -3         4         4     ⋯
  18 │       -4         0        -1        -3      
  19 │        4         2        -4         4      
  20 │       -4        -1         4         4    

with an output like the following (perhaps nicely formatted) 

keys=unique(abs.(dsg.x1))
sds=map(k->filter(dsg, :x1, by= x->abs(x)==k), keys)

julia> g=Dataset(k=keys,groups=sds)
5×2 Dataset
 Row │ k         groups
     │ identity  identity
     │ Int64?    Dataset?
─────┼─────────────────────────────────────────────
   1 │        0  \e[1m3×5 Dataset\e[0m\n\e[1m Row…
   2 │        1  \e[1m4×5 Dataset\e[0m\n\e[1m Row…
   3 │        2  \e[1m7×5 Dataset\e[0m\n\e[1m Row…
   4 │        3  \e[1m3×5 Dataset\e[0m\n\e[1m Row…
   5 │        4  \e[1m3×5 Dataset\e[0m\n\e[1m Row…

… rather than like this

julia> groupby(ds, :g1) 
24×2 View of Grouped Dataset, Grouped by: g1
 g1      g2       
 abs     identity
 Int64?  Int64?
──────────────────
      0         2
      0         1
      1         1
      1         3
      1         4
      1         2
      1         3
      1         4
      1         1
      1         2
      1         3
      1         3
      2         3
      2         2
      2         1
      2         4
      3         1
      3         4
      3         2
      4         4
      4         2
      4         1
      4         3
      4         4
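The keyed layout described above (one entry per absolute value, each pointing at its own rows) can be sketched in Base Julia without any IMD calls. The vector `x1` is a small hypothetical stand-in, not the random data used above, and the sketch stores row indices where the example above stores sub-datasets.

```julia
# Hypothetical stand-in for one column; not the random data used above.
x1 = [0, 1, -1, -2, 2, 3, -3, -4, 4, 0]

# Map each absolute value to the row indices that share it.
groups = Dict{Int, Vector{Int}}()
for (i, v) in pairs(x1)
    push!(get!(groups, abs(v), Int[]), i)
end

# Keys in ascending order, each paired with its row indices.
keyed = [k => groups[k] for k in sort!(collect(keys(groups)))]
```

Each pair in `keyed` plays the role of one row of the k / groups table, with the indices selecting the rows that belong to that key.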

rocco_sprmnt21 · March 28, 2022, 10:07pm

Thank you.
I tried using the help and was intrigued by the cummax function (which I had never seen before) and did the following test …

julia>          cummax([1,3,2,1,4,3,2,-1,5,6,5])
ERROR: MethodError: no method matching cummax(::Vector{Int64})      
Closest candidates are:
  cummax(::AbstractArra

is this expected?
The following seems to work as expected:

julia> cummax([1,3,2,1,4,3,2,missing,-1,5,6,5])
12-element Vector{Union{Missing, Int64}}:
 1
 3
 3
 3
 4
 4
 4
 4
 4
 5
 6
 6
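For comparison, a cumulative maximum over a plain integer vector can be computed in Base Julia with accumulate. This is a sketch of a workaround, not a statement about how cummax is meant to behave. The conversion to an element type that allows missing is shown only as an assumption about what the method may require, based on the fact that the second call above succeeded.

```julia
v = [1, 3, 2, 1, 4, 3, 2, -1, 5, 6, 5]

# Base-only running maximum; works on a plain integer vector.
running = accumulate(max, v)

# Widen the element type so the vector can hold missing values.
# Whether cummax accepts this is an assumption based on the call that
# worked above; cummax itself is not called here.
vm = convert(Vector{Union{Missing, Int}}, v)
```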

mostafa1342004 · March 29, 2022, 7:05am

Don't you think that, with your attitude, Julia itself wouldn't have been created in the first place?
BTW, AFAIK the Julia ecosystem is full of competition: look at plotting, for example.

Juan · March 29, 2022, 3:39pm

It’s strange Polars uses so much memory.
Isn’t Polars internally a Rust library and isn’t Rust supposed to use memory more efficiently than Julia?
