I’m working on a package that’s primarily for accessing a specific and large data set.
I want part of its functionality to involve summary information (basic stats and plots) on subsections of the data. I currently have functionality for loading parts of the data (most of which lives in many separate .txt files), but I’d also like to provide an easy way for someone to create plots of the data without having to know where specific pieces of it are stored.
Current Approach and Problem
My current approach is some sort of dictionary structure that maps each variable name to the file that contains it. No problem; I can even automate its creation with a little metaprogramming. I could do something like:
```julia
function userplot(variable_a)
    file = look_up_table[variable_a]   # which file holds this variable?
    df = data_loader(file)             # parse that file into a table
    plot(df[variable_a])
end
```
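As a rough illustration of how the look-up table itself could be generated, here is a sketch that scans the header row of each .txt file in a directory. The `build_lookup` name and the assumption of tab-delimited files with a header line are mine, not part of the actual package:

```julia
# Hypothetical sketch: build the variable → file look-up table by reading
# the first (header) line of each .txt file in a directory. Assumes
# tab-delimited files whose header row names the variables.
function build_lookup(dir::AbstractString)
    look_up_table = Dict{String,String}()
    for file in filter(f -> endswith(f, ".txt"), readdir(dir; join=true))
        for variable in split(readline(file), '\t')
            look_up_table[String(variable)] = file
        end
    end
    return look_up_table
end
```

The same table could instead be generated once at package-build time and shipped as an artifact, so users never scan the raw files themselves.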
However, every call to a plot function would then reload, or at least re-parse, the entire file that contains the variable. The only alternative I can think of is some sort of persistent structure in the background that is updated only when a new file is loaded in a given session. This seems overly involved, and I don’t like the idea of caching extra data in the background as multiple large files are loaded.
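For concreteness, the session-level cache I’m imagining would look something like the sketch below, where `data_loader` and `look_up_table` are the hypothetical helpers from above and `get!` parses a file only on first access:

```julia
# Sketch of the "persistent structure" idea: a module-level cache keyed by
# file path, so each file is parsed at most once per session.
const FILE_CACHE = Dict{String,Any}()

function cached_load(file)
    get!(FILE_CACHE, file) do
        data_loader(file)   # runs only if `file` is not yet cached
    end
end

function userplot(variable_a)
    df = cached_load(look_up_table[variable_a])
    plot(df[variable_a])
end

# Evict everything when memory pressure becomes a concern:
clear_cache!() = empty!(FILE_CACHE)
```

My worry is exactly the memory footprint of `FILE_CACHE` once several large files have been touched, which is why I’d want an eviction story rather than an unbounded dictionary.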
Just for reference, the current size of all the .txt files in memory is ~7 GB, and this will grow by roughly 7 GB per year for the next 8 years. That also doesn’t include the associated MR images or genomic data that I’d hope to connect to it.