I find querying a JuliaDB IndexedTable is 20% faster than a DataFrame. But selecting a group from a GroupedDataFrame (GDF) is 200 times faster. So I’d like to work more with GDFs. I notice
New columns in the underlying DF will flow through to the GDF which is great
Using push!() on the DF breaks the GDF.
Would it be possible to allow push!() and pop!() without breaking the GDF?
Would it be possible to allow a GDF to be written to and from a folder of CSVs. (one file per group).
(I think CSV.jl used to allow reading multiple CSVs but the functionality was removed ? )
What exactly operation do you refer to? Maybe you use incorrect underlying column type?
This is possible as described in this section of the manual.
However, your grouped data frame then will not reflect these added rows. It will only hold the old rows.
If you add rows to a data frame you need to re-group it manually.
You can write one file per group using CSV.jl.
If you read these mutliple files back you will need to re-group.
N=5_000_000
I = 20
i = repeat(1:I, N )
t = repeat(1:N, I )
x = rand(N*I)
D = DataFrame( i=i, t=t, x=x )
DI = table( ( i=i, t=t, x=x), pkey=[:i,:t] )
@rsubset D :i==13 && 3_000_000 < :t < 3_001_000 #filter DataFrame
filter(r -> r.i ==13 && 3_000_000 < r.t < 3_001_000, DI) #filter IndexedTable ~about 20% faster
re Reading and writing multiple .CSV files.
I don’t think this is supported anymore. The example you provide mentions multiple files but doesn’t give an example and I couldn’t get one to work.
The example is right there in Bogumil’s link? It says:
# I can just pass a `Vector` of inputs, in this case `IOBuffer(::String)`, but it
# could also be a `Vector` of any valid input source, like `AbstractVector{UInt8}`,
# filenames, `IO`, etc.
@Lincoln_Hannah are you talking about total time (with compilation) or only execution time (without compilation). The reason I am asking is that a lot of the time spent in your operation is compilation as you can see here:
julia> f(D) = @rsubset D :i==13 && 3_000_000 < :t < 3_001_000
f (generic function with 1 method)
julia> @time @rsubset D :i==13 && 3_000_000 < :t < 3_001_000; # you need to compile
0.170909 seconds (106.69 k allocations: 112.886 MiB, 29.29% compilation time)
julia> @time f(D); # you need to compile but only once
0.183519 seconds (136.17 k allocations: 114.482 MiB, 32.84% compilation time)
julia> @time f(D); # now all is compiled
0.123281 seconds (415 allocations: 107.350 MiB)
So the question is what are you trying to optimize (cost of one run where compilation time matters, or you are running this operation thousands of times where compilation time can be most likely ignored)
Thanks Nils.
While I had IndexedTables installed it interfered with CSV such that Pkg> update installed V0.8.5
I’ve removed IndexedTables now and the issue has gone away, and your example works perfectly
with IndexedTables installed, installing a latter version of CSV gives:
(@v1.8) pkg> add CSV@0.9.0
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package WeakRefStrings [ea10d353]:
WeakRefStrings [ea10d353] log:
├─possible versions are: 0.5.1-1.4.2 or uninstalled
├─restricted by compatibility requirements with CSV [336ed68f] to versions: 1.2.0-1.4.2
│ └─CSV [336ed68f] log:
│ ├─possible versions are: 0.3.0-0.10.9 or uninstalled
│ └─restricted to versions 0.9.0 by an explicit requirement, leaving only versions 0.9.0
└─restricted by compatibility requirements with IndexedTables [6deec6e2] to versions: 0.5.1-0.6.2 — no versions left
└─IndexedTables [6deec6e2] log:
├─possible versions are: 0.7.4-1.0.0 or uninstalled
└─restricted to versions * by an explicit requirement, leaving only versions 0.7.4-1.0.0
If I’m reading correctly, once compiled, IndexedTables does the query maybe 2* faster than DataFrames. which is good but nothing like the 200* speed up you can get with a GroupedDataFrame.