Store DataFrame or filepath for CSV.read?

Hello all!

I’m working with some data using CSV.read and the DataFrame type. The raw data is seldom needed in my code, which made me wonder whether I need to store it in my mutable struct at all. And if I do store it, would it be more efficient to store the DataFrame itself, or a sort of “pointer” to a CSV.read call?

Here’s the overall idea:

using CSV, DataFrames

# Option 1: store the parsed DataFrame itself
mutable struct foo{D<:DataFrame}
    a::D
    otherfields
end

# Option 2: store a function that reads the file when called
mutable struct bar{F<:Function}
    b::F
    otherfields
end

function baz(filepath::String)
    data = CSV.read(filepath,DataFrame)
    otherfields = ... # not real code
    foo(data,otherfields)
end

function qux(filepath::String)
    # A plain closure captures filepath; @eval would define a global `data` instead
    data = () -> CSV.read(filepath, DataFrame)
    otherfields = ... # not real code
    bar(data,otherfields)
end

Bearing in mind that these structs would be repeatedly passed down in my code, is there anything I should consider efficiency-wise?

I tried benchmarking both routes separately from the rest of my code, and storing the actual DataFrame used slightly less memory and fewer allocations, which came as quite a surprise.
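
To make the difference between the two routes concrete, here is a minimal sketch of when the parsing happens in each case. It uses a hypothetical parse_file function built on readlines as a stand-in for CSV.read (an assumption, so that it runs without extra packages), plus a counter to show how often the file is actually read:

```julia
# Stand-in for CSV.read (assumption): counts how many times the file is parsed.
const PARSE_COUNT = Ref(0)

function parse_file(path::AbstractString)
    PARSE_COUNT[] += 1
    return readlines(path)
end

# Write a tiny temporary file to read from.
path, io = mktemp()
write(io, "x,y\n1,2\n3,4\n")
close(io)

# Route 1: parse once and keep the result.
eager = parse_file(path)

# Route 2: keep a closure; nothing is parsed until it is called.
lazy = () -> parse_file(path)

first_rows = lazy()    # parses the file
second_rows = lazy()   # parses the file again

println(PARSE_COUNT[])
```

So the stored function is cheap to carry around, but every call pays the full parsing cost again, whereas the stored data pays it once and keeps the memory occupied.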

If you rarely need the actual data content of the CSV files, why not just store the filepath? It looks like you need to reference field names of the CSV file. You could do a partial read of the file to get just the header and store that along with the filepath. For example:

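using CSV, DataFrames

mutable struct CSVFileWrapper
    file::String
    header::Vector{String}
    df::DataFrame
    is_loaded::Bool
    # Inner constructor: parse only `limit` rows (none by default) to get the column names
    function CSVFileWrapper(file; limit = 0, kwargs...)
        df = CSV.read(file, DataFrame; limit = limit, kwargs...)
        header = names(df)
        # is_loaded stays false until load! reads the full file
        return new(file, header, df, false)
    end
end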

# Read the full file on first use; later calls return the cached DataFrame
function load!(f::CSVFileWrapper; kwargs...)
    if !(f.is_loaded)
        df = CSV.read(f.file, DataFrame; kwargs...)
        f.df = df
        f.is_loaded = true
    end
    return f.df
end
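
If only the column names are needed up front, the header can also be peeked at with Base alone. This is a rough sketch of a different technique than the limit keyword above, not a replacement for CSV.jl's parsing: peek_header is a hypothetical helper, and it assumes the header sits on the first line, the file is comma-delimited, and no column name contains the delimiter or quotes.

```julia
# Hypothetical helper: read only the first line and split it on the delimiter.
# Assumes a simple header with no quoted or embedded delimiters.
function peek_header(path::AbstractString; delim::AbstractString = ",")
    line = open(readline, path)   # reads just the first line of the file
    return String.(strip.(split(line, delim)))
end

# Write a tiny temporary file to try it on.
path, io = mktemp()
write(io, "id,name,score\n1,a,0.5\n2,b,0.7\n")
close(io)

header = peek_header(path)
println(header)
```

For anything with quoted fields or unusual delimiters, letting CSV.read handle the header as in the wrapper above is the safer choice.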

Thank you very much for the answer! I wasn’t familiar with inner constructors, which seem like the last piece of my puzzle. Cheers!