Lazily fetch/load data into a DataFrame

mthelm85 · November 15, 2023, 9:48pm

What’s the best way to lazily fetch/load data into a DataFrame, only if/when needed? For example, if a user calls a function use_data_set1(), then I want to fetch/load data_set1 from some remote URL and make it available for the rest of that user’s session. However, if the user never calls a function that relies on data_set1, I don’t want to fetch & load it. If the data is fetched/loaded once, I don’t want to fetch/load it again on subsequent function calls that utilize that data…

pdeffebach · November 15, 2023, 10:33pm

There are Julia packages that formalize this process, which other people can mention. But here is a quick-and-dirty solution that uses a global Dict to store data sets.

julia> using DataFrames

julia> global DATA_DICT = Dict();

julia> function load_state_data()
           url = "url_to_state_data"
           DataFrame(state = 1:50, value = 100 .* rand(50))
       end;

julia> function load_county_data()
           url = "url_to_county_data"
           DataFrame(county = 1:10, value = rand(10))
       end;

julia> function analyze_county()
           county_data = get!(load_county_data, DATA_DICT, :county_data)
           println(county_data)
       end
analyze_county (generic function with 1 method)

julia> function analyze_state()
           state_data = get!(load_state_data, DATA_DICT, :state_data)
           println(state_data)
       end
analyze_state (generic function with 1 method)

julia> analyze_county()
10×2 DataFrame
 Row │ county  value    
     │ Int64   Float64  
─────┼──────────────────
   1 │      1  0.759557
   2 │      2  0.383699
   3 │      3  0.851332
   4 │      4  0.928275
   5 │      5  0.433502
   6 │      6  0.691074
   7 │      7  0.619731
   8 │      8  0.475289
   9 │      9  0.347691
  10 │     10  0.163557

julia> DATA_DICT
Dict{Any, Any} with 1 entry:
  :county_data => 10×2 DataFrame…
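
One way to check that the data is loaded only once is to count how often the loader actually runs. The following is a minimal sketch of the same get! pattern, not a definitive implementation: the names CACHE, LOAD_COUNT, and county_data are made up for illustration, and the loader builds a random table as a stand-in for a real download.

```julia
using DataFrames

# Typed cache: keys are Symbols, values are DataFrames.
const CACHE = Dict{Symbol, DataFrame}()

# Hypothetical counter that records how often the loader really runs.
const LOAD_COUNT = Ref(0)

function load_county_data()
    LOAD_COUNT[] += 1
    # Stand-in for fetching the data from a remote URL.
    DataFrame(county = 1:10, value = rand(10))
end

# get! calls the loader only if :county_data is not in the cache yet.
county_data() = get!(load_county_data, CACHE, :county_data)

first_call = county_data()   # runs the loader
second_call = county_data()  # served from the cache
```

After both calls, LOAD_COUNT[] is 1 and both variables refer to the same DataFrame object. Giving the Dict concrete key and value types is optional, but it lets the compiler infer the return type of county_data().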

rdavis120 · November 16, 2023, 12:50am

This isn't strictly a Julia solution, but I use DuckDB to create a view over multiple files using a glob pattern:
CREATE VIEW users AS SELECT * FROM '/*/test.parquet';

Alternatively, for URLs you can use the httpfs extension to read JSON or Parquet files. For example:
SELECT * FROM read_parquet('s3://bucket/*.parquet');

You could then get the output of the query as a DataFrame if you need to do further processing in memory.
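
From Julia, that last step might look like the sketch below, assuming the DuckDB.jl package is installed. The query here is a trivial placeholder so the example runs without any files; in practice it would be replaced by something like the read_parquet query above, after running INSTALL httpfs and LOAD httpfs on the connection.

```julia
using DuckDB, DataFrames

# Open an in-memory DuckDB database.
con = DBInterface.connect(DuckDB.DB, ":memory:")

# Placeholder query; swap in a query over your own files or URLs.
result = DBInterface.execute(con, "SELECT 42 AS answer")

# Materialize the query result as a DataFrame for in-memory processing.
df = DataFrame(result)

DBInterface.close!(con)
```

Nothing is read until the query is executed, so wrapping this in a function and calling it only when the data is needed gives the lazy behavior asked about in the original question.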

mthelm85 · November 17, 2023, 5:43pm

Is something like the following any better (or is it worse?) than using the global Dict?

using DataFrames

const data = Ref{Union{Nothing, DataFrame}}(nothing)

function get_data()
    # Load the data only on the first call.
    if isnothing(data[])
        data[] = DataFrame(a=rand(10),b=rand(10))
    end
    return data[]  # always return the cached DataFrame
end

This way, other functions can call get_data(), and if it has already been called, it won't load the data again.

rdavis120 · November 17, 2023, 10:35pm

I'm not sure if you've already considered this, but you might want to look at a memoization package such as Memoization.jl.
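
As a rough sketch of what that could look like, assuming Memoization.jl is installed: the @memoize macro caches the result of the first call, so later calls with the same arguments return the stored value. The function name county_data and the counter FETCH_COUNT are made up for illustration, and the random table stands in for a real download.

```julia
using DataFrames, Memoization

# Hypothetical counter that records how often the function body runs.
const FETCH_COUNT = Ref(0)

# The body runs on the first call only; the result is cached afterwards.
@memoize function county_data()
    FETCH_COUNT[] += 1
    # Stand-in for fetching the data from a remote URL.
    DataFrame(county = 1:10, value = rand(10))
end

a = county_data()  # runs the body and caches the result
b = county_data()  # returns the cached DataFrame
```

Compared with a hand-written Dict or Ref, this keeps the caching logic out of the function body, at the cost of an extra dependency.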
