Hello everyone,
I’m trying to migrate old code from another language to Julia, taking my first steps in Julia along the way. I have a CSV file (approx. 100 columns, 6000 rows) that comes from a database with primary keys. What’s the fastest way of selecting (and setting) cell entries based on the primary keys?
I’m currently reading the CSV file into a DataFrame
using CSV
using DataFrames
techmat_regio = CSV.read(fn_file, DataFrame, delim = ",", header = 2, types = types)
and then select the data with the primary key (Komponente, Daten) from the year column as a Float (indexing with [1] since the selection returns a 1-element array):
value = techmat_regio[
    (techmat_regio.Komponente .== komp) .& (techmat_regio.Daten .== daten),
    string(jahr),
][1]
However, selecting by an index
value = techmat_regio[index, string(jahr)]
would be much faster.
Is there a way to select the data using the primary key as fast as when I select by the index?
Maybe what you are looking for is subset
subset(techmat_regio, "Komponente" => ByRow(==(komp)), "Daten" => ByRow(==(daten)))
Unfortunately, subset is even slower. When I run my full code and use subset for accessing values, I get
10.275881 seconds (28.25 M allocations: 4.702 GiB, 6.20% gc time)
With my approach above (techmat_regio.Komponente .== komp) it takes
1.966738 seconds (1.44 M allocations: 293.769 MiB, 1.57% gc time, 0.45% compilation time)
and when I select data with an index, i.e. techmat_regio[index, string(jahr)]
it takes
0.027593 seconds (308.88 k allocations: 13.226 MiB)
which is what I am aiming for. However, I usually don’t know the row index, only the tuple (komp, daten).
So my current approach will be to create a dictionary that maps (komp, daten) --> row index and use it for selecting values (see the sketch below). Another idea would be to look at JuliaDB instead of DataFrames.jl, but I had the impression that JuliaDB isn’t really maintained anymore and DataFrames.jl is more widely used. But I’m open to any suggestions!
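Roughly what I have in mind (an untested sketch; row_of is just a placeholder name):
using CSV
using DataFrames
# build the lookup once: (Komponente, Daten) => row index
row_of = Dict(
    (r.Komponente, r.Daten) => i for (i, r) in enumerate(eachrow(techmat_regio))
)
# a lookup is then one hash access plus a fast integer index
value = techmat_regio[row_of[(komp, daten)], string(jahr)]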
You have two options:
- create a pk lookup in this data frame using groupby;
- use GitHub - andyferris/AcceleratedArrays.jl: Arrays with acceleration indices for the pk columns.
The way to use groupby to create the pk lookup is:
julia> df = DataFrame(pk1=rand(1:10, 10^6), pk2=rand('a':'z', 10^6), value=rand(10^6));
julia> gdf = groupby(df, [:pk1, :pk2]);
julia> using BenchmarkTools
julia> @btime $gdf[(1, 'a')];
3.257 μs (10 allocations: 30.56 KiB)
julia> @btime $df[($df.pk1 .== 1) .&& ($df.pk2 .== 'a'), :];
1.015 ms (29 allocations: 233.97 KiB)
So as you can see, in this case it is ~300 times faster.
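If you need a single cell: the group lookup returns a SubDataFrame, which you can index into like any data frame (a small sketch reusing the example above):
julia> sub = gdf[(1, 'a')];  # SubDataFrame holding all rows matching the key

julia> sub[1, :value]  # first matching row's value, like your [1] indexing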
Thank you for the suggestions!
With GroupedDataFrames I have two problems:
- selection with : doesn’t seem to work in case I need all elements that belong to one part of the primary key, i.e. in your example gdf[(:, 'a')]
- I need to be able to restore the original structure of the DataFrame to rewrite it to the CSV file. I couldn’t figure out how to “ungroup” the data frame.
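For the first point, the closest workaround I could think of is filtering the group keys (an untested sketch, using the pk1/pk2 example from above):
# collect all groups whose second key part matches, then concatenate them
matching = vcat((gdf[k] for k in keys(gdf) if k.pk2 == 'a')...)
For the second point, parent(gdf) might give back the original data frame, since groupby only keeps a reference to it, but I have not verified this.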
Finally I implemented the selection with a combination of AcceleratedArrays and a Dictionary. Maybe not the cleanest code, but as fast as I need it:
using DataFrames
using AcceleratedArrays

# define a struct containing the data frame plus its lookup indices
struct Techmat
    df::DataFrame
    komp::AcceleratedVector
    data::AcceleratedVector
    komp_data::Dict
end

# initialize the struct once per region;
# techmat_regio_df is the original DataFrame with key columns "Komponente" and "Daten"
komp = accelerate(techmat_regio_df.Komponente, HashIndex)
data = accelerate(techmat_regio_df.Daten, HashIndex)
rows_komp_data = unique(techmat_regio_df[:, [:Komponente, :Daten]], view = true)
komp_data = Dict(Tuple.(eachrow(rows_komp_data)) .=> parentindices(rows_komp_data)[1])
techmat_regio[i_region] = Techmat(techmat_regio_df, komp, data, komp_data)

# below, techmat_regio refers to one Techmat instance:
# data with both primary keys is queried via the dictionary (faster than the AcceleratedArray)
value = techmat_regio.df[techmat_regio.komp_data[komp, daten], string(column_index)]

# all data for just one part of the primary key is queried via the AcceleratedArray
values = techmat_regio.df[findall(isequal(komp), techmat_regio.komp), string(column_index)]
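Setting a value works the same way, via assignment through the cached row index (a sketch; new_value is a placeholder):
# set a single cell using the same (komp, daten) lookup
techmat_regio.df[techmat_regio.komp_data[komp, daten], string(column_index)] = new_value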