Managing a large dataset in a package intended for beginner use

Background

I’m working on a package that’s primarily for accessing a specific and large data set.
I want part of its functionality to involve summary information (basic stats and plots) on subsections of the data. I currently have functionality for loading parts of the data (most of which is spread across many separate .txt files), but I’d also like to provide an easy way for someone to create plots of the data without having to load specific bits of data whose location they may not know.

Current Approach and Problem

My current approach is some sort of dictionary structure that maps each variable to the file that contains it. No problem there; I can even automate creating it with a little metaprogramming. I could do something like:

function userplot(variable_a)
    # find the file that contains this variable, load it, and plot the column
    file = look_up_table[variable_a]
    df = data_loader(file)
    plot(df[!, variable_a])
end

However, every time the user calls a plot function, this would reload the entire subset of the data that contains the variable, or at least parse the whole file over again. The only thing I can think of is creating some sort of persistent structure in the background that is only updated when a new file is loaded in a given session. This seems overly involved, and I don’t like the idea of caching extra data in the background as multiple large files are loaded.
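For concreteness, here is a minimal sketch of what I mean by that in-session structure, reusing the hypothetical look_up_table, data_loader, and plot helpers from the snippet above (assuming data_loader returns a DataFrame and plot is from Plots.jl):

const loaded_files = Dict{String, Any}()  # file path => parsed data, kept for the session

function cached_load(variable)
    file = look_up_table[variable]
    # parse the file only the first time it is requested this session
    get!(loaded_files, file) do
        data_loader(file)
    end
end

function userplot_cached(variable)
    df = cached_load(variable)
    plot(df[!, variable])
end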

Just for reference, the current size of all the .txt files in memory is ~7 GB, and this will grow by roughly another 7 GB every year for the next 8 years. That also doesn’t include the associated MR images or genomic data that I hope to connect to it.

Hello Zach. Forgive me if I do not understand your problem.
I would take a step back. I think you are facing the classic “meaningful directory names” problem, where the information about what is in a file is contained in the directory name and file name.
I may be very wrong here.
Have you thought about using a database to keep the metadata? Even better, keep the data in S3 buckets.

Consider also JuliaDB https://juliadb.org/

For your basic stats and plots you could perhaps look at the Queryverse

Zach, we would be interested in finding out more about your research.
Before jumping into coding, I would take a step back and think about what you are trying to achieve. Look at how others in your field are handling problems like these, perhaps using Python or other libraries.

Have you looked at iRODS? https://irods.org/

And also Galaxy: https://galaxyproject.org/

This is only for use in my lab, so I’m unlikely to receive any sort of significant funding for it. I’m only trying to optimize and simplify aggressively because I really don’t want handing out data to people to become my full-time job.

I’m familiar with the Queryverse, JuliaDB, and OnlineStats. I think I remember reading a while ago that JuliaDB has the goal of eventually supporting more than CSV, but I need to support more than that right now. The support in Python for MRI data is great if you don’t care about speed and work with less than 3 GB of data.

AWS storage has been discussed in my lab before, but it would ultimately become useless as soon as we include the imaging data, because that is already over 50 TB.

Ultimately, I would rather not do anything that involves writing a full database, because several times a year I receive an updated set of 261 text files in one folder, plus other folders of imaging data. The text files are also formatted with metadata scattered throughout (for some unknown reason), so I’ve had to use a little trickery with CSVFiles. I also have no desire to rewrite the MRI metadata to a new standard; it’s complicated, and we really shouldn’t be creating new MRI formats if we can help it.

I’d mostly like a way to overcome the latency of repeatedly reloading large text files. I have a pretty good idea of how I’ll integrate the images effectively once everything else is working well.

If you can create a summary statistic that is necessary to construct the plot, perhaps you could cache that. I see nothing wrong with that strategy.
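As a rough sketch of that idea (the names and the choice of statistics are just placeholders, reusing the hypothetical look_up_table and data_loader helpers from your snippet):

using Statistics

const summary_cache = Dict{Symbol, NamedTuple}()  # variable => cached summary

function variable_summary(variable)
    get!(summary_cache, variable) do
        file = look_up_table[variable]
        col = data_loader(file)[!, variable]
        # keep only what is needed to draw the plot, not the raw data
        (mean = mean(col), std = std(col), quantiles = quantile(col, [0.25, 0.5, 0.75]))
    end
end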

Otherwise, you could parse the data once into a format you can read with mmap, e.g. with HDF5.jl or Feather.jl.
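A one-time conversion step might look roughly like this (paths are hypothetical, Feather.jl shown; HDF5.jl would work similarly, and this glosses over whatever cleanup your text files need before parsing):

using CSV, DataFrames, Feather

# parse each slow .txt file once and write a binary copy next to it
function convert_to_feather(txtpath)
    df = CSV.read(txtpath, DataFrame)
    Feather.write(replace(txtpath, ".txt" => ".feather"), df)
end

# later sessions reload the binary copy, which is much faster than re-parsing the text
df = Feather.read("some_variables.feather")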

I said S3 storage, not specifically Amazon. There are S3-compatible storage options that you can run in your own lab, the most popular being Ceph.
I would really advise you to speak with your friendly lab computer admin and see what facilities are already in place.

Caching the summary stats sounds like the ideal solution here. I had only considered caching the entire set of loaded data, not just the summary stats.

BTW, I’m probably going to use mmap with the imaging data.
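Something along these lines, where the file name, dimensions, and element type are placeholders for whatever the raw image volumes actually use:

using Mmap, Statistics

dims = (256, 256, 180)                        # hypothetical volume size
io = open("scan.raw")
vol = Mmap.mmap(io, Array{Float32, 3}, dims)  # data is paged in lazily as it is touched
close(io)
slice_mean = mean(vol[:, :, 90])              # only the pages backing this slice get read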

@johnh, sorry if I came off sounding curt. This problem is actually born in part from problems with our HPC IT department, which is why I’m hesitant to set up a database (doing so would probably involve them). I’m sure that would be the ideal way to handle a lot of this, but given my interactions with them over the last 2 years, I’m not sure they could handle it.

FWIW, I don’t think that using mmap should matter much for summary stats (if it does, then that’s stretching the concept of “summary” :wink:)

I think that keeping it simple is the right choice here, if possible.

Late to this party, I realize, and given the discussion I feel I may be misunderstanding something, but this sentence to me screams DataDeps.jl


I’ll probably use DataDeps.jl once I work out the rest of the details just so it conforms to some sort of standard.
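In case it’s useful to anyone else reading, a registration might look roughly like this (the name, URL, and description are placeholders; in practice I’d also pass a checksum):

using DataDeps

register(DataDep(
    "LabTextData",
    "Yearly release of the lab's text-file dataset",
    "https://example.org/lab-data/release_2019.tar.gz";
    post_fetch_method = unpack  # extract the archive after download
))

# resolves to the local copy, downloading it on first use
datadir = datadep"LabTextData"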

HPC types are cuddly and friendly. I do know we can be grumpy bears. I suggest bringing along cookies.

Another interesting possibility (not exactly Julia focused) that I’ve been playing with recently is Data Version Control (DVC). It’s a git-like interface that builds on top of your existing git repositories. I haven’t used it much at scale, but it seems really promising. Especially if your lab is going to be doing lots of analysis based on changing versions of the same dataset, it might help to track which version of an analysis was run against which version of the data.

My understanding is that you can use basically any external storage type to hold/distribute the data (I’ve got mine working with both S3 buckets and local storage). And if your sysadmins are… hard to work with, then all it requires of them is installing a copy of the dvc software for your lab to use. Might be worth taking a look.
