I think this is a common problem: I have many data files spread across several folders on my machine. At some point I want to start a new ML/DS project and pick some files from those common folders. Usually I just create a new folder for the project and copy the files into it. But as the number of projects and research efforts grows, this slowly turns into a mess of duplicated files scattered all over the filesystem.
Another point: sometimes I have to mark or remember specific observations, outlier events, or problems encountered at some location within a single file. Later, once I have collected enough such observations from many files, I can start to investigate them.
So it would be nice to have some database that can attach tags or specific info to files (or group them by some criteria) without copying them, and arrange those tags into a tree, just like a folder tree but with a file allowed to belong to several groups at once. I could then select file lists from any group for further processing.
I wonder whether anybody has had similar problems and, if so, what file database they used for such tasks.
There are, of course, an infinite number of solutions to the “metadata” and “deduplication” problems that you mention. I don’t currently need to deal with these sorts of problems, but I will soon (for the exact same reasons as you pose here), so for both our benefits I will list some solutions that could work:
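- If on a Unix system (Linux/BSD/OSX/etc.), use symlinks to point to a shared, common directory containing all your files, from within each project folder that you want to include some files from. You could also use hardlinks instead of symlinks, but this carries a number of risks if you aren’t careful: an edit made through one link changes the data for every project, hardlinks only work within a single filesystem, and they can only point to files, not directories (at least on Linux). On Linux, the `fs.protected_hardlinks` setting may also restrict linking to files that you neither own nor can read and write.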
- Use HDF5 to store the data and metadata of these files, as HDF5 supports tons of features like external links to other HDF5 files, metadata, arbitrary structuring of datasets (just like a filesystem), and much, much more.
- Use an alternative DB like SQLite, stored in the shared directory, for any common metadata that you want to share amongst your projects, and have separate tables in that DB, or separate DBs in your project folders, for metadata specific to a certain project.
- Store the files as blobs within any DB that supports blobs. This is probably a terrible idea, unless you have a good reason to do it.
- Store everything on a server, serve the files/directories from a webserver like Nginx/Apache/Mux/HTTP, and just embed URLs to resources within your projects either manually or with something like DataDeps.jl.
- Use filesystem extended attributes (xattrs) to store metadata on the files themselves, which makes the metadata portable across filesystems which support xattrs (most Linux filesystems used today).
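As a rough illustration of the symlink option from the list above, here is a minimal Python sketch. The function name `link_into_project` and the `data` subfolder are assumptions made up for this example, and it assumes a Unix-like system where creating symlinks needs no special privileges.

```python
from pathlib import Path


def link_into_project(shared_dir, project_dir, names):
    """Symlink the named files from shared_dir into project_dir/data.

    Hypothetical helper: nothing is copied, only links are created.
    """
    shared_dir = Path(shared_dir)
    data_dir = Path(project_dir) / "data"  # assumed project layout
    data_dir.mkdir(parents=True, exist_ok=True)

    links = []
    for name in names:
        target = (shared_dir / name).resolve()
        if not target.is_file():
            raise FileNotFoundError(target)
        link = data_dir / Path(name).name
        # Skip links that already exist so the call can be repeated safely.
        if not link.is_symlink():
            link.symlink_to(target)
        links.append(link)
    return links
```

Calling `link_into_project("/path/to/shared", "/path/to/project", ["run1.csv"])` (placeholder paths) would leave a single copy of `run1.csv` on disk, with the project folder holding only a link to it.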
Any of the above can be mixed-and-matched as desired, of course.
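To make the SQLite idea a bit more concrete, here is a minimal sketch of a tag tree in which one file can carry several tags, with an optional note per tag for things like outlier locations. The table names, column names, and helper functions are all assumptions invented for this example, not an established schema.

```python
import sqlite3

# Hypothetical schema: files are referenced by path, tags form a tree
# through parent_id, and file_tags allows many tags per file.
SCHEMA = """
CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
CREATE TABLE tags (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    parent_id INTEGER REFERENCES tags(id)
);
CREATE TABLE file_tags (
    file_id INTEGER NOT NULL REFERENCES files(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id),
    note TEXT,
    PRIMARY KEY (file_id, tag_id)
);
"""


def add_tag(conn, name, parent_id=None):
    """Create a tag, optionally nested under another tag; return its id."""
    cur = conn.execute(
        "INSERT INTO tags (name, parent_id) VALUES (?, ?)", (name, parent_id)
    )
    return cur.lastrowid


def tag_file(conn, path, tag_id, note=None):
    """Attach a tag (and an optional free-text note) to a file path."""
    conn.execute("INSERT OR IGNORE INTO files (path) VALUES (?)", (path,))
    file_id = conn.execute(
        "SELECT id FROM files WHERE path = ?", (path,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR REPLACE INTO file_tags (file_id, tag_id, note) VALUES (?, ?, ?)",
        (file_id, tag_id, note),
    )


def files_under(conn, tag_id):
    """List paths tagged with tag_id or with any tag below it in the tree."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT ?
            UNION
            SELECT tags.id FROM tags JOIN subtree ON tags.parent_id = subtree.id
        )
        SELECT DISTINCT files.path
        FROM files
        JOIN file_tags ON file_tags.file_id = files.id
        JOIN subtree ON subtree.id = file_tags.tag_id
        ORDER BY files.path
        """,
        (tag_id,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")  # use a file in the shared directory instead
conn.executescript(SCHEMA)
outliers = add_tag(conn, "outliers")
spikes = add_tag(conn, "spikes", parent_id=outliers)
tag_file(conn, "/data/common/run1.csv", spikes, note="spike near row 1200")
tag_file(conn, "/data/common/run2.csv", outliers)
print(files_under(conn, outliers))
```

The recursive query is what makes the tags behave like a folder tree: asking for `outliers` also returns files tagged only with the nested `spikes` tag, while the files themselves stay where they are.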