Hello everyone,
I’m trying to migrate old code from another language to Julia, taking my first steps in Julia along the way. I have a CSV file (approx. 100 columns, 6000 rows) that comes from a database with primary keys. What’s the fastest way of selecting (and setting) cell entries based on the primary keys?
I’m currently reading the CSV file into a DataFrame
using CSV
using DataFrames
techmat_regio = CSV.read(fn_file, DataFrame, delim = ",", header = 2, types = types)
and then select the data with the primary key (Komponente, Daten) from the year column as a Float (indexing with [1] since the selection returns a 1-element array):
value = techmat_regio[
    (techmat_regio.Komponente .== komp) .& (techmat_regio.Daten .== daten),
    string(jahr),
][1]
However, selecting by an index
value = techmat_regio[index, string(jahr)]
would be much faster.
Is there a way to select the data using the primary key as fast as when I select by the index?
Maybe what you are looking for is subset
subset(techmat_regio, "Komponente" => ByRow(==(komp)), "Daten" => ByRow(==(daten)))
Unfortunately, subset is even slower. When I run my full code and use subset for accessing values, I get
10.275881 seconds (28.25 M allocations: 4.702 GiB, 6.20% gc time)
With my approach above (techmat_regio.Komponente .== komp) it takes
1.966738 seconds (1.44 M allocations: 293.769 MiB, 1.57% gc time, 0.45% compilation time)
and when I select data with an index, i.e. techmat_regio[index, string(jahr)]
it takes
0.027593 seconds (308.88 k allocations: 13.226 MiB)
which is what I am aiming for. However, I usually don’t know the row index, only the tuple (komp, daten).
So my current approach will be to create a dictionary that maps (komp, daten) --> row index and use it for selecting values (see the sketch below). Another idea would be to look at JuliaDB instead of DataFrames.jl, but I had the impression that JuliaDB isn’t really maintained anymore and DataFrames.jl is more widely used. But I’m open to any suggestions!
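Roughly what I have in mind (an untested sketch; row_of is just a placeholder name):
using CSV
using DataFrames
# build the lookup once: (Komponente, Daten) => row index
row_of = Dict(
    (r.Komponente, r.Daten) => i for (i, r) in enumerate(eachrow(techmat_regio))
)
# a lookup is then one hash access plus a fast integer index
value = techmat_regio[row_of[(komp, daten)], string(jahr)]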
You have two options:
- create a pk lookup in this data frame using groupby;
- use GitHub - andyferris/AcceleratedArrays.jl: Arrays with acceleration indices for the pk columns.
The way to use groupby to create the pk lookup is:
julia> df = DataFrame(pk1=rand(1:10, 10^6), pk2=rand('a':'z', 10^6), value=rand(10^6));
julia> gdf = groupby(df, [:pk1, :pk2]);
julia> using BenchmarkTools
julia> @btime $gdf[(1, 'a')];
3.257 μs (10 allocations: 30.56 KiB)
julia> @btime $df[($df.pk1 .== 1) .&& ($df.pk2 .== 'a'), :];
1.015 ms (29 allocations: 233.97 KiB)
So as you can see, in this case it is ~300 times faster.
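If you need a single cell: the group lookup returns a SubDataFrame, which you can index into like any data frame (a small sketch reusing the example above):
julia> sub = gdf[(1, 'a')];  # SubDataFrame holding all rows matching the key

julia> sub[1, :value]  # first matching row's value, like your [1] indexing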
Thank you for the suggestions!
With GroupedDataFrames I have two problems:
- selection with : doesn’t seem to work in case I need all elements that belong to one part of the primary key, i.e. in your example gdf[(:, 'a')]
- I need to be able to restore the original structure of the DataFrame to rewrite it to the CSV file. I couldn’t figure out how to “ungroup” the data frame.
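For the first point, the closest workaround I could think of is filtering the group keys (an untested sketch, using the pk1/pk2 example from above):
# collect all groups whose second key part matches, then concatenate them
matching = vcat((gdf[k] for k in keys(gdf) if k.pk2 == 'a')...)
For the second point, parent(gdf) might give back the original data frame, since groupby only keeps a reference to it, but I have not verified this.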
Finally I implemented the selection with a combination of AcceleratedArrays and a Dictionary. Maybe not the cleanest code, but as fast as I need it:
using DataFrames
using AcceleratedArrays

# define a struct containing the data frame plus its lookup indices
struct Techmat
    df::DataFrame
    komp::AcceleratedVector
    data::AcceleratedVector
    komp_data::Dict
end

# initialize the struct once per region;
# techmat_regio_df is the original DataFrame with key columns "Komponente" and "Daten"
komp = accelerate(techmat_regio_df.Komponente, HashIndex)
data = accelerate(techmat_regio_df.Daten, HashIndex)
rows_komp_data = unique(techmat_regio_df[:, [:Komponente, :Daten]], view = true)
komp_data = Dict(Tuple.(eachrow(rows_komp_data)) .=> parentindices(rows_komp_data)[1])
techmat_regio[i_region] = Techmat(techmat_regio_df, komp, data, komp_data)

# below, techmat_regio refers to one Techmat instance:
# data with both primary keys is queried via the dictionary (faster than the AcceleratedArray)
value = techmat_regio.df[techmat_regio.komp_data[komp, daten], string(column_index)]

# all data for just one part of the primary key is queried via the AcceleratedArray
values = techmat_regio.df[findall(isequal(komp), techmat_regio.komp), string(column_index)]
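Setting a value works the same way, via assignment through the cached row index (a sketch; new_value is a placeholder):
# set a single cell using the same (komp, daten) lookup
techmat_regio.df[techmat_regio.komp_data[komp, daten], string(column_index)] = new_value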