Store DataFrame or filepath for CSV.read?

RicardoFR · June 28, 2022, 10:07pm

Hello all!

I’m working with some data using CSV.read and the DataFrame type. My code seldom needs the data in its raw form, which made me wonder whether I need to store it in my mutable struct at all. If I do store it, would it be more efficient to store it as a DataFrame or as a sort of “pointer” to a CSV.read instruction?

Here’s the overall idea:

mutable struct foo{D<:DataFrame}
    a::D
    otherfields
end

mutable struct bar{F<:Function}
    b::F
    otherfields
end

function baz(filepath::String)
    data = CSV.read(filepath,DataFrame)
    otherfields = ... # not real code
    foo(data,otherfields)
end

function qux(filepath::String)
    @eval data = () -> CSV.read($filepath,DataFrame)
    otherfields = ... # not real code
    bar(data,otherfields)
end

Bearing in mind that these structs would be repeatedly passed down in my code, is there anything I should consider efficiency-wise?

I tried benchmarking both routes separately from the rest of my code and saw slightly lower memory use and fewer allocations when storing the actual DataFrame, which came as quite a surprise.
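
For the function-storing route, a plain closure is likely enough, since an anonymous function captures `filepath` on its own and `@eval` is not needed. Below is a minimal sketch under stated assumptions: `read_table` is a hypothetical stand-in for `CSV.read(filepath, DataFrame)` so that the example runs with Base alone, and `Bar` and `make_bar` are illustrative names, not part of the code above.

```julia
# Hypothetical stand-in for `CSV.read(path, DataFrame)`: returns the file's lines.
read_table(path::AbstractString) = readlines(path)

# Parametric on the loader's type, so the `load` field has a concrete type.
mutable struct Bar{F<:Function}
    load::F
    otherfields::Any
end

# A plain closure captures `filepath`; no `@eval` is required.
function make_bar(filepath::AbstractString)
    loader = () -> read_table(filepath)
    return Bar(loader, nothing)
end

# Usage: write a small file, then read it on demand.
path = tempname()
write(path, "x,y\n1,2\n3,4\n")
bar = make_bar(path)
rows = bar.load()   # the file is read here, and again on every later call
```

The trade-off is that the closure rereads the file on each call, whereas storing the DataFrame reads it once and keeps the data in memory.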

awasserman · June 29, 2022, 12:20am

If you rarely need the actual data content of the CSV files, why not just store the filepath? It looks like you need to reference field names of the CSV file. You could do a partial read of the CSV file to get just the header and store that along with the filepath. For example:

using CSV, DataFrames

mutable struct CSVFileWrapper
    file::String
    header::Vector{String}
    df::DataFrame
    is_loaded::Bool
    # Inner constructor: with the default limit = 0, only the header is read.
    function CSVFileWrapper(file; limit = 0, kwargs...)
        df = CSV.read(file, DataFrame; limit = limit, kwargs...)
        header = names(df)
        return new(file, header, df, false)
    end
end

# Read the full file on the first call; later calls return the cached DataFrame.
function load!(f::CSVFileWrapper; kwargs...)
    if !(f.is_loaded)
        df = CSV.read(f.file, DataFrame; kwargs...)
        f.df = df
        f.is_loaded = true
    end
    return f.df
end
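
The same load-on-first-use idea can be sketched without any packages, which may make the pattern easier to see. This is only an illustration under stated assumptions: `LazyTable` is a hypothetical name, and the naive comma split stands in for CSV.read, so it does not handle quoted fields or type inference.

```julia
# Holds the path and header eagerly; the rows stay `nothing` until requested.
mutable struct LazyTable
    file::String
    header::Vector{String}
    rows::Union{Nothing, Vector{Vector{String}}}
end

# Read only the first line to obtain the column names (naive comma split).
function LazyTable(file::AbstractString)
    header = String.(split(readline(file), ','))
    return LazyTable(file, header, nothing)
end

# Parse the remaining lines on the first call; later calls reuse `t.rows`.
function load!(t::LazyTable)
    if t.rows === nothing
        lines = readlines(t.file)[2:end]   # skip the header line
        t.rows = Vector{String}[String.(split(line, ',')) for line in lines]
    end
    return t.rows
end

# Usage: the header is available immediately, the rows only after `load!`.
path = tempname()
write(path, "id,name\n1,ann\n2,bob\n")
t = LazyTable(path)
t.header
rows = load!(t)
```

Using `nothing` as the "not loaded yet" marker replaces the separate `is_loaded` flag, at the cost of a `Union` field type.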

RicardoFR · June 29, 2022, 4:19pm

Thank you very much for the answer! I wasn’t familiar with inner constructors, which seem like the last piece of my puzzle. Cheers!
