What’s the best way to lazily fetch/load data into a DataFrame, only if/when needed? For example, if a user calls a function use_data_set1(), then I want to fetch/load data_set1 from some remote URL and make it available for the rest of that user’s session. However, if the user never calls a function that relies on data_set1, I don’t want to fetch & load it. If the data is fetched/loaded once, I don’t want to fetch/load it again on subsequent function calls that utilize that data…
There are Julia packages that formalize this process, which other people can mention. But here is a quick-and-dirty solution using a global Dict to store data sets.
julia> using DataFrames
julia> global DATA_DICT = Dict();
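julia> function load_state_data()
           url = "url_to_state_data"  # placeholder; a real loader would fetch from here
           # both columns must have the same length (50 rows)
           DataFrame(state = 1:50, value = 100 .* rand(50))
       end;
julia> function load_county_data()
           url = "url_to_county_data"  # placeholder; a real loader would fetch from here
           DataFrame(county = 1:10, value = rand(10))
       end;
julia> function analyze_county()
           # get! calls load_county_data only if :county_data is not cached yet
           county_data = get!(load_county_data, DATA_DICT, :county_data)
           println(county_data)
       end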
analyze_county (generic function with 1 method)
julia> function analyze_state()
           # get! calls load_state_data only if :state_data is not cached yet
           state_data = get!(load_state_data, DATA_DICT, :state_data)
           println(state_data)
       end
analyze_state (generic function with 1 method)
julia> analyze_county()
10×2 DataFrame
 Row │ county  value
     │ Int64   Float64
─────┼──────────────────
   1 │      1  0.759557
   2 │      2  0.383699
   3 │      3  0.851332
   4 │      4  0.928275
   5 │      5  0.433502
   6 │      6  0.691074
   7 │      7  0.619731
   8 │      8  0.475289
   9 │      9  0.347691
  10 │     10  0.163557
julia> DATA_DICT
Dict{Any, Any} with 1 entry:
:county_data => 10×2 DataFrame…
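To make the caching behaviour of get! easier to see, here is a minimal sketch that counts how often the loader really runs. The names DATA_CACHE, LOAD_COUNT, fetch_county_columns and county_data are hypothetical, and a NamedTuple of columns stands in for the DataFrame so the sketch runs without extra packages.

```julia
# Cache keyed by data set name. `Any` values keep the sketch package-free;
# with DataFrames loaded this could be Dict{Symbol, DataFrame} instead.
const DATA_CACHE = Dict{Symbol, Any}()

# Counts how often the (hypothetical) loader actually runs.
const LOAD_COUNT = Ref(0)

# Stand-in for a real download; returns a NamedTuple of columns.
function fetch_county_columns()
    LOAD_COUNT[] += 1
    return (county = collect(1:10), value = rand(10))
end

# get! runs the loader only when :county_data is missing from the cache.
county_data() = get!(fetch_county_columns, DATA_CACHE, :county_data)

first_call  = county_data()   # triggers the load
second_call = county_data()   # served from the cache

println(LOAD_COUNT[])                 # prints 1
println(first_call === second_call)   # prints true
```

Declaring the Dict as const with a concrete key type is optional, but it avoids the untyped global that a bare Dict() gives you.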
It’s not necessarily a Julia solution, but I use DuckDB to create a table view over multiple files using glob file names:
CREATE VIEW users AS SELECT * FROM '/*/test.parquet';
Alternatively, for URLs you can use the httpfs extension to read JSON or Parquet files. For example:
SELECT * FROM read_parquet('s3://bucket/*.parquet');
You could then get the output of the query as a DataFrame if you need to do further processing in memory.
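From Julia, that last step might look roughly like the sketch below. It assumes the DuckDB.jl and DataFrames.jl packages are installed, which the post does not state, and it uses a trivial placeholder query so that it runs without any remote files.

```julia
using DuckDB, DataFrames

# In-memory database; nothing is read until a query runs.
con = DBInterface.connect(DuckDB.DB, ":memory:")

# Placeholder query. For remote Parquet files one would substitute a
# read_parquet(...) query like the one above, with httpfs available.
result = DBInterface.execute(con, "SELECT 42 AS answer")

# Materialize the query result as a DataFrame for in-memory processing.
df = DataFrame(result)
println(df)
```

Since the data is only read when the query executes, wrapping the execute call in a function gives the same load-on-demand behaviour asked about in the question.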
Is something like the following any better (or is it worse?) than using the global Dict?
using DataFrames
const data = Ref{Union{Nothing, DataFrame}}(nothing)
function get_data()
    if isnothing(data[])
        data[] = DataFrame(a=rand(10), b=rand(10))
    end
end
This way I have other functions that call get_data(), but if it’s already been called, it won’t do anything.
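One possible variation on the Ref approach, sketched here with hypothetical names CACHED_DATA and cached_data, is to return the stored DataFrame from every call, so callers can use the result directly instead of reading the Ref themselves.

```julia
using DataFrames

# Typed slot: holds `nothing` until the first load.
const CACHED_DATA = Ref{Union{Nothing, DataFrame}}(nothing)

function cached_data()
    if isnothing(CACHED_DATA[])
        # Placeholder for the real fetch from a remote URL.
        CACHED_DATA[] = DataFrame(a = rand(10), b = rand(10))
    end
    # The type assertion tells the compiler the result is a DataFrame here.
    return CACHED_DATA[]::DataFrame
end

df1 = cached_data()   # loads the data
df2 = cached_data()   # reuses the stored DataFrame
println(df1 === df2)  # prints true
```

Compared with the untyped Dict(), the Ref has a concrete element type, so the return type is known to the compiler. The Dict scales more naturally to many data sets. Either seems workable for a handful of tables.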
I’m not sure if you already considered this, but you might want to look at a package for memoization, such as Memoization.jl.
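As a rough sketch of what that could look like, assuming Memoization.jl is installed and using its @memoize macro (the function name county_values and the counter FETCH_COUNT are hypothetical):

```julia
using Memoization

# Counts how often the function body actually runs.
const FETCH_COUNT = Ref(0)

# @memoize caches the return value per set of arguments; with no
# arguments, the body should run once and later calls reuse the result.
@memoize function county_values()
    FETCH_COUNT[] += 1
    return rand(10)   # placeholder for the real fetch
end

a = county_values()
b = county_values()
println(FETCH_COUNT[])
```

This removes the need to manage a Dict or Ref by hand, at the cost of an extra dependency.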