Grouped Data Frames

I find querying a JuliaDB IndexedTable about 20% faster than a DataFrame, but selecting a group from a GroupedDataFrame (GDF) is about 200 times faster, so I’d like to work more with GDFs. I notice:

  • New columns in the underlying DF flow through to the GDF, which is great 🙂

  • Using push!() on the DF breaks the GDF.
    Would it be possible to allow push!() and pop!() without breaking the GDF?

  • Would it be possible to allow a GDF to be written to and read from a folder of CSVs (one file per group)?

(I think CSV.jl used to allow reading multiple CSVs, but was the functionality removed?)

What operation exactly do you refer to? Maybe you are using an incorrect underlying column type?

This is possible as described in this section of the manual.

However, your grouped data frame will then not reflect these added rows; it will only hold the old rows.
If you add rows to a data frame, you need to re-group it manually.
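
A minimal sketch of that re-grouping pattern (hypothetical column names, not from the thread):

```julia
using DataFrames

df = DataFrame(g=[1, 1, 2], x=[10, 20, 30])
gdf = groupby(df, :g)      # two groups

push!(df, (g=2, x=40))     # df now has 4 rows, but gdf is stale
gdf = groupby(df, :g)      # re-group to pick up the new row

nrow(gdf[(g=2,)])          # group 2 now holds both of its rows
```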

You can write one file per group using CSV.jl.
If you read these multiple files back, you will need to re-group.
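
For example (a sketch with made-up filenames and columns):

```julia
using CSV, DataFrames

df = DataFrame(g=[1, 1, 2], x=[10, 20, 30])

# write one CSV per group, named after the group key
for sub in groupby(df, :g)
    CSV.write("group_$(first(sub.g)).csv", sub)
end

# read the files back, then re-group
df2 = CSV.read(["group_1.csv", "group_2.csv"], DataFrame)
gdf2 = groupby(df2, :g)
```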

Reading multiple files is supported, see Examples · CSV.jl.

Thanks Bogumił

re DataFrame vs IndexedTable

using DataFrames, DataFramesMeta, IndexedTables

N = 5_000_000
I = 20

i = repeat(1:I, N)
t = repeat(1:N, I)
x = rand(N * I)

D  = DataFrame(i=i, t=t, x=x)
DI = table((i=i, t=t, x=x), pkey=[:i, :t])


@rsubset D :i == 13 && 3_000_000 < :t < 3_001_000              # filter DataFrame

filter(r -> r.i == 13 && 3_000_000 < r.t < 3_001_000, DI)      # filter IndexedTable, ~20% faster
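
For reference, the fast GroupedDataFrame path mentioned at the top of the thread looks roughly like this (a self-contained toy version of the table above, shrunk so it runs quickly; the 200× figure is the original poster's measurement, not reproduced here):

```julia
using DataFrames

# toy version of D: 20 values of i, 1000 of t
D = DataFrame(i=repeat(1:20, 1000), t=repeat(1:1000, 20), x=rand(20_000))

G = groupby(D, :i)                  # pay the grouping cost once
g13 = G[(i=13,)]                    # near-constant-time lookup of group i == 13
filter(r -> 300 < r.t < 400, g13)   # then filter within the much smaller group
```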

re Reading and writing multiple .CSV files:
I don’t think this is supported anymore. The example you provide mentions multiple files but doesn’t show one, and I couldn’t get it to work.

The example is right there in Bogumił’s link? It says:

# I can just pass a `Vector` of inputs, in this case `IOBuffer(::String)`, but it
# could also be a `Vector` of any valid input source, like `AbstractVector{UInt8}`,
# filenames, `IO`, etc.

so with filenames it would be:

julia> using CSV, DataFrames

julia> [CSV.write("test$i.csv", (x1 = rand(3), x2 = rand(3))) for i ∈ 1:3]
3-element Vector{String}:
 "test1.csv"
 "test2.csv"
 "test3.csv"

julia> CSV.read(ans, DataFrame)
9×2 DataFrame
 Row │ x1         x2
     │ Float64    Float64
─────┼─────────────────────
   1 │ 0.382386   0.879372
   2 │ 0.933627   0.902434
   3 │ 0.640617   0.549931
   4 │ 0.720291   0.41746
   5 │ 0.0956776  0.662153
   6 │ 0.323271   0.69621
   7 │ 0.0708334  0.761502
   8 │ 0.856979   0.884265
   9 │ 0.951412   0.24919
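
When reading several per-group files back, CSV.jl can also record which file each row came from via the `source` keyword (applicable when passing a vector of inputs), which makes re-grouping straightforward. A sketch, reusing the `test1.csv`–`test3.csv` files written above:

```julia
using CSV, DataFrames

files = ["test$i.csv" for i in 1:3]
df = CSV.read(files, DataFrame; source=:file)   # adds a :file column with each row's origin
gdf = groupby(df, :file)                        # one group per input file
```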

@Lincoln_Hannah are you talking about total time (with compilation) or only execution time (without compilation)? The reason I am asking is that a lot of the time spent in your operation is compilation, as you can see here:

julia> f(D) = @rsubset   D  :i==13 && 3_000_000 < :t  < 3_001_000
f (generic function with 1 method)

julia> @time @rsubset   D  :i==13 && 3_000_000 < :t  < 3_001_000; # you need to compile
  0.170909 seconds (106.69 k allocations: 112.886 MiB, 29.29% compilation time)

julia> @time f(D); # you need to compile but only once
  0.183519 seconds (136.17 k allocations: 114.482 MiB, 32.84% compilation time)

julia> @time f(D); # now all is compiled
  0.123281 seconds (415 allocations: 107.350 MiB)

So the question is what you are trying to optimize: the cost of one run, where compilation time matters, or an operation you run thousands of times, where compilation time can most likely be ignored.
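
To separate compilation from steady-state cost, BenchmarkTools.jl is the usual tool: `@btime` runs the expression many times and reports the minimum, so the one-off compilation cost is excluded. A sketch on a small toy table (not the thread's 100M-row one):

```julia
using BenchmarkTools, DataFrames, DataFramesMeta

D = DataFrame(i=repeat(1:20, 1000), t=repeat(1:1000, 20), x=rand(20_000))
f(D) = @rsubset D :i == 13 && 300 < :t < 400

# `$` interpolates D so the untyped global isn't part of the benchmark
@btime f($D);
```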

Thanks Nils.
While I had IndexedTables installed, it interfered with CSV such that `pkg> update` installed v0.8.5.

I’ve removed IndexedTables now, the issue has gone away, and your example works perfectly 🙂

With IndexedTables installed, installing a later version of CSV gives:

(@v1.8) pkg> add CSV@0.9.0
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package WeakRefStrings [ea10d353]:
 WeakRefStrings [ea10d353] log:
 ├─possible versions are: 0.5.1-1.4.2 or uninstalled
 ├─restricted by compatibility requirements with CSV [336ed68f] to versions: 1.2.0-1.4.2
 │ └─CSV [336ed68f] log:
 │   ├─possible versions are: 0.3.0-0.10.9 or uninstalled
 │   └─restricted to versions 0.9.0 by an explicit requirement, leaving only versions 0.9.0
 └─restricted by compatibility requirements with IndexedTables [6deec6e2] to versions: 0.5.1-0.6.2 — no versions left  
   └─IndexedTables [6deec6e2] log:
     ├─possible versions are: 0.7.4-1.0.0 or uninstalled
     └─restricted to versions * by an explicit requirement, leaving only versions 0.7.4-1.0.0

Thanks Bogumił

If I’m reading correctly, once compiled, IndexedTables does the query maybe 2× faster than DataFrames, which is good but nothing like the 200× speed-up you can get with a GroupedDataFrame.

I was hoping to feed a data-stream into a GDF with push!

Perhaps the best approach is to push! the data onto a separate DF, then periodically merge it in, create a new GDF, and start a new empty DF.
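
That buffering idea could be sketched like this (hypothetical schema; `flush_buffer!` is a made-up name, not an API):

```julia
using DataFrames

main   = DataFrame(i=Int[], x=Float64[])
buffer = DataFrame(i=Int[], x=Float64[])
gdf    = groupby(main, :i)

# hot path: cheap push! into the buffer, leaving gdf untouched
push!(buffer, (i=1, x=0.5))
push!(buffer, (i=2, x=1.5))

# periodically: fold the buffer into the main table and re-group once
function flush_buffer!(main, buffer)
    append!(main, buffer)
    empty!(buffer)
    return groupby(main, :i)
end

gdf = flush_buffer!(main, buffer)
```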

You might also want to consider AcceleratedArrays.jl (github.com/andyferris/AcceleratedArrays.jl, “Arrays with acceleration indices”) as a column type if you really need fast lookup.
