Grouped Data Frames

I find querying a JuliaDB IndexedTable about 20% faster than a DataFrame, but selecting a group from a GroupedDataFrame (GDF) is about 200 times faster, so I’d like to work more with GDFs. I notice:

  • New columns in the underlying DF flow through to the GDF, which is great 🙂

  • Using push!() on the DF breaks the GDF.
    Would it be possible to allow push!() and pop!() without breaking the GDF?

  • Would it be possible to allow a GDF to be written to and read from a folder of CSVs (one file per group)?

(I think CSV.jl used to allow reading multiple CSVs, but was the functionality removed?)

What operation exactly do you refer to? Maybe you are using an incorrect underlying column type?

This is possible as described in this section of the manual.

However, your grouped data frame will then not reflect these added rows; it will only hold the old rows.
If you add rows to a data frame, you need to re-group it manually.
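
A minimal sketch of that re-grouping pattern (hypothetical column names, not from the thread):

```julia
using DataFrames

df = DataFrame(g=[1, 1, 2], x=[10, 20, 30])
gdf = groupby(df, :g)      # two groups

push!(df, (g=2, x=40))     # df now has 4 rows, but gdf is stale
gdf = groupby(df, :g)      # re-group to pick up the new row

nrow(gdf[(g=2,)])          # group 2 now holds both of its rows
```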

You can write one file per group using CSV.jl.
If you read these multiple files back, you will need to re-group.
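
For example (a sketch with made-up filenames and columns):

```julia
using CSV, DataFrames

df = DataFrame(g=[1, 1, 2], x=[10, 20, 30])

# write one CSV per group, named after the group key
for sub in groupby(df, :g)
    CSV.write("group_$(first(sub.g)).csv", sub)
end

# read the files back, then re-group
df2 = CSV.read(["group_1.csv", "group_2.csv"], DataFrame)
gdf2 = groupby(df2, :g)
```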

Reading multiple files is supported, see Examples · CSV.jl.

Thanks Bogumił

re DataFrame vs IndexedTable

using DataFrames, DataFramesMeta, IndexedTables

N = 5_000_000
I = 20

i = repeat(1:I, N)
t = repeat(1:N, I)
x = rand(N * I)

D  = DataFrame(i=i, t=t, x=x)
DI = table((i=i, t=t, x=x), pkey=[:i, :t])


@rsubset D :i == 13 && 3_000_000 < :t < 3_001_000              # filter DataFrame

filter(r -> r.i == 13 && 3_000_000 < r.t < 3_001_000, DI)      # filter IndexedTable, ~20% faster
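
For reference, the fast GroupedDataFrame path mentioned at the top of the thread looks roughly like this (a self-contained toy version of the table above, shrunk so it runs quickly; the 200× figure is the original poster's measurement, not reproduced here):

```julia
using DataFrames

# toy version of D: 20 values of i, 1000 of t
D = DataFrame(i=repeat(1:20, 1000), t=repeat(1:1000, 20), x=rand(20_000))

G = groupby(D, :i)                  # pay the grouping cost once
g13 = G[(i=13,)]                    # near-constant-time lookup of group i == 13
filter(r -> 300 < r.t < 400, g13)   # then filter within the much smaller group
```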

re Reading and writing multiple .CSV files:
I don’t think this is supported anymore. The example you provide mentions multiple files but doesn’t show one, and I couldn’t get it to work.

The example is right there in Bogumił’s link? It says:

# I can just pass a `Vector` of inputs, in this case `IOBuffer(::String)`, but it
# could also be a `Vector` of any valid input source, like `AbstractVector{UInt8}`,
# filenames, `IO`, etc.

so with filenames it would be:

julia> using CSV, DataFrames

julia> [CSV.write("test$i.csv", (x1 = rand(3), x2 = rand(3))) for i ∈ 1:3]
3-element Vector{String}:
 "test1.csv"
 "test2.csv"
 "test3.csv"

julia> CSV.read(ans, DataFrame)
9×2 DataFrame
 Row │ x1         x2
     │ Float64    Float64
─────┼─────────────────────
   1 │ 0.382386   0.879372
   2 │ 0.933627   0.902434
   3 │ 0.640617   0.549931
   4 │ 0.720291   0.41746
   5 │ 0.0956776  0.662153
   6 │ 0.323271   0.69621
   7 │ 0.0708334  0.761502
   8 │ 0.856979   0.884265
   9 │ 0.951412   0.24919
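
When reading several per-group files back, CSV.jl can also record which file each row came from via the `source` keyword (applicable when passing a vector of inputs), which makes re-grouping straightforward. A sketch, reusing the `test1.csv`–`test3.csv` files written above:

```julia
using CSV, DataFrames

files = ["test$i.csv" for i in 1:3]
df = CSV.read(files, DataFrame; source=:file)   # adds a :file column with each row's origin
gdf = groupby(df, :file)                        # one group per input file
```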

@Lincoln_Hannah are you talking about total time (with compilation) or only execution time (without compilation)? The reason I am asking is that a lot of the time spent in your operation is compilation, as you can see here:

julia> f(D) = @rsubset   D  :i==13 && 3_000_000 < :t  < 3_001_000
f (generic function with 1 method)

julia> @time @rsubset   D  :i==13 && 3_000_000 < :t  < 3_001_000; # you need to compile
  0.170909 seconds (106.69 k allocations: 112.886 MiB, 29.29% compilation time)

julia> @time f(D); # you need to compile but only once
  0.183519 seconds (136.17 k allocations: 114.482 MiB, 32.84% compilation time)

julia> @time f(D); # now all is compiled
  0.123281 seconds (415 allocations: 107.350 MiB)

So the question is what you are trying to optimize: the cost of one run, where compilation time matters, or an operation you run thousands of times, where compilation time can most likely be ignored.
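
To separate compilation from steady-state cost, BenchmarkTools.jl is the usual tool: `@btime` runs the expression many times and reports the minimum, so the one-off compilation cost is excluded. A sketch on a small toy table (not the thread's 100M-row one):

```julia
using BenchmarkTools, DataFrames, DataFramesMeta

D = DataFrame(i=repeat(1:20, 1000), t=repeat(1:1000, 20), x=rand(20_000))
f(D) = @rsubset D :i == 13 && 300 < :t < 400

# `$` interpolates D so the untyped global isn't part of the benchmark
@btime f($D);
```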

Thanks Nils.
While I had IndexedTables installed, it interfered with CSV such that `pkg> update` installed v0.8.5.

I’ve removed IndexedTables now, the issue has gone away, and your example works perfectly 🙂

With IndexedTables installed, installing a later version of CSV gives:

(@v1.8) pkg> add CSV@0.9.0
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package WeakRefStrings [ea10d353]:
 WeakRefStrings [ea10d353] log:
 ├─possible versions are: 0.5.1-1.4.2 or uninstalled
 ├─restricted by compatibility requirements with CSV [336ed68f] to versions: 1.2.0-1.4.2
 │ └─CSV [336ed68f] log:
 │   ├─possible versions are: 0.3.0-0.10.9 or uninstalled
 │   └─restricted to versions 0.9.0 by an explicit requirement, leaving only versions 0.9.0
 └─restricted by compatibility requirements with IndexedTables [6deec6e2] to versions: 0.5.1-0.6.2 — no versions left  
   └─IndexedTables [6deec6e2] log:
     ├─possible versions are: 0.7.4-1.0.0 or uninstalled
     └─restricted to versions * by an explicit requirement, leaving only versions 0.7.4-1.0.0

Thanks Bogumił

If I’m reading correctly, once compiled, IndexedTables does the query maybe 2× faster than DataFrames, which is good but nothing like the 200× speed-up you can get with a GroupedDataFrame.

I was hoping to feed a data-stream into a GDF with push!

Perhaps the best approach is to push! the data onto a separate DF, then periodically merge it in, create a new GDF, and start a new empty DF.
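
That buffering idea could be sketched like this (hypothetical schema; `flush_buffer!` is a made-up name, not an API):

```julia
using DataFrames

main   = DataFrame(i=Int[], x=Float64[])
buffer = DataFrame(i=Int[], x=Float64[])
gdf    = groupby(main, :i)

# hot path: cheap push! into the buffer, leaving gdf untouched
push!(buffer, (i=1, x=0.5))
push!(buffer, (i=2, x=1.5))

# periodically: fold the buffer into the main table and re-group once
function flush_buffer!(main, buffer)
    append!(main, buffer)
    empty!(buffer)
    return groupby(main, :i)
end

gdf = flush_buffer!(main, buffer)
```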

You might also want to consider AcceleratedArrays.jl (github.com/andyferris/AcceleratedArrays.jl, “Arrays with acceleration indices”) as a column type if you really need fast lookup.
