Lazily fetch/load data into a DataFrame

What’s the best way to lazily fetch/load data into a DataFrame, only if/when needed? For example, if a user calls a function use_data_set1(), then I want to fetch/load data_set1 from some remote URL and make it available for the rest of that user’s session. However, if the user never calls a function that relies on data_set1, I don’t want to fetch & load it. If the data is fetched/loaded once, I don’t want to fetch/load it again on subsequent function calls that utilize that data…

There are Julia packages which formalize this process more, that other people can mention. But here is a quick and dirty solution using a global Dict to store data sets.

julia> using DataFrames

julia> global DATA_DICT = Dict();

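julia> function load_state_data()
           url = "url_to_state_data"  # placeholder; a real loader would fetch from this URL
           DataFrame(state = 1:50, value = 100 .* rand(50))
       end
load_state_data (generic function with 1 method)

julia> function load_county_data()
           url = "url_to_county_data"  # placeholder; a real loader would fetch from this URL
           DataFrame(county = 1:10, value = rand(10))
       end
load_county_data (generic function with 1 method)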

julia> function analyze_county()
           # get! calls the loader only if the key is missing, then caches the result
           county_data = get!(load_county_data, DATA_DICT, :county_data)
       end
analyze_county (generic function with 1 method)

julia> function analyze_state()
           state_data = get!(load_state_data, DATA_DICT, :state_data)
       end
analyze_state (generic function with 1 method)

julia> analyze_county()
10×2 DataFrame
 Row │ county  value    
     │ Int64   Float64  
   1 │      1  0.759557
   2 │      2  0.383699
   3 │      3  0.851332
   4 │      4  0.928275
   5 │      5  0.433502
   6 │      6  0.691074
   7 │      7  0.619731
   8 │      8  0.475289
   9 │      9  0.347691
  10 │     10  0.163557

julia> DATA_DICT
Dict{Any, Any} with 1 entry:
  :county_data => 10×2 DataFrame…
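
To check that the loader really runs only once, here is a minimal sketch of the same get! pattern with a typed cache and a call counter. The names DATA_CACHE, LOAD_COUNT, load_county_data_once and county_data are made up for illustration, and the loader body is a stand-in for a real download.

```julia
using DataFrames

# Hypothetical typed cache; `const` plus a concrete value type avoids Dict{Any, Any}.
const DATA_CACHE = Dict{Symbol, DataFrame}()

# Counts how many times the stand-in loader actually runs.
const LOAD_COUNT = Ref(0)

# Stand-in for a real download; replace the body with the actual fetch.
function load_county_data_once()
    LOAD_COUNT[] += 1
    return DataFrame(county = 1:10, value = rand(10))
end

# Returns the cached table, calling the loader only if the key is missing.
county_data() = get!(load_county_data_once, DATA_CACHE, :county_data)

first_call = county_data()
second_call = county_data()
```

After both calls, LOAD_COUNT[] should still be 1 and the two results should be the same object, since the second call only reads from the cache.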

It’s not necessarily a Julia solution, but I use DuckDB to create a view over multiple files using a glob pattern:
CREATE VIEW users AS SELECT * FROM '/*/test.parquet';

Alternatively, for URLs you can use the httpfs extension to read JSON or Parquet files. For example:
SELECT * FROM read_parquet('s3://bucket/*.parquet');

You could then get the output of the query as a DataFrame if you need to do further processing in memory.
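
From Julia, that last step might look roughly like the sketch below, assuming the DuckDB.jl package is installed. The table and view here are small stand-ins built in memory so the example runs without any files or network access; the name users_view is made up for illustration.

```julia
using DuckDB, DataFrames

# Open an in-memory DuckDB database.
con = DBInterface.connect(DuckDB.DB, ":memory:")

# Stand-in table; in practice this would be Parquet/JSON files or remote URLs.
DBInterface.execute(con, "CREATE TABLE users AS SELECT range AS id FROM range(5)")

# A view stores only the query, so rows are read when the view is queried.
DBInterface.execute(con, "CREATE VIEW users_view AS SELECT * FROM users")

# Materialize the query result as a DataFrame for further in-memory processing.
df = DataFrame(DBInterface.execute(con, "SELECT * FROM users_view"))
```

Nothing is pulled into Julia until the final SELECT runs, which fits the "load only when needed" requirement.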

Is something like the following any better (or is it worse?) than using the global Dict?

using DataFrames

const data = Ref{Union{Nothing, DataFrame}}(nothing)

function get_data()
    # Load only on the first call; later calls reuse the cached DataFrame.
    if isnothing(data[])
        data[] = DataFrame(a=rand(10),b=rand(10))
    end
    return data[]
end

This way other functions can call get_data(), and if it has already been called, the data is returned from the Ref without being loaded again.

I’m not sure if you have already considered this, but you might want to look at a memoization package such as Memoization.jl.
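
As a rough sketch of what that could look like, assuming Memoization.jl is installed and using its @memoize macro (the counter FETCH_COUNT is only there to illustrate that the body runs once, and the DataFrame is a stand-in for the real fetch):

```julia
using Memoization, DataFrames

# Counts how many times the function body actually executes.
const FETCH_COUNT = Ref(0)

# The first call runs the body and caches the result; later calls return the cached value.
@memoize function get_data()
    FETCH_COUNT[] += 1
    return DataFrame(a = rand(10), b = rand(10))
end

df1 = get_data()
df2 = get_data()
```

This removes the hand-written Ref or Dict bookkeeping, at the cost of an extra dependency.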