Specifically - should I back it up on an external HDD?
I am asking here because I didn’t know where else to, and I know lots of people here do scientific research like me. I am currently a PhD student and I have a large knowledge base I’ve been building in the form of a repo. It also includes all the papers/conference submissions we’ve made and all my simulations (Julia code) and results.
I’m wondering if I should safeguard that work by maintaining an up-to-date clone of the repo on an external HDD. But I’m not sure I need to, because I already have the repo in several places:
My desktop
My laptop
My university HPC
Seems like I have redundancy covered, but I wanted to check and see what others thought. At least one problem would be making sure these are all up-to-date clones. It would be nice to have a system that automatically runs git fetch at midnight or something every night, but I don’t know how common that is.
Hopefully someone has some advice on this to share.
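For concreteness, this is roughly what I was imagining: a tiny script that cron runs every night on each machine. It’s only a sketch, and the paths are made up.

```julia
#!/usr/bin/env julia
# Sketch of a nightly sync job (hypothetical paths); schedule it with cron,
# e.g. `0 0 * * * julia sync_repos.jl`, separately on each machine.

repos = [
    expanduser("~/research/knowledgebase"),   # this machine's clone (made-up path)
    # the laptop and HPC clones would list their own paths in their own crontabs
]

for repo in repos
    # `git -C <dir>` runs git inside that directory without changing into it.
    run(`git -C $repo fetch --all --prune`)
    # Optionally also fast-forward the checked-out branch:
    # run(`git -C $repo pull --ff-only`)
end
```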
Any data that is not well suited to git (large data files) could be backed up in a (hopefully university-run) cloud service (Nextcloud, OneDrive, whatever the university has). For any git repository I would also recommend having a private remote repository, both as an exchange point between those computers and as a remote backup, e.g. on GitHub/GitLab/Bitbucket/… or on a university GitLab server, if they have one.
That should not only be enough, but also be easier to handle than an external HDD (you forget it at home and then you do not back up, …).
Also, if you push regularly (every evening when you stop working, for example), there is no need for an automated way.
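Concretely, the routine comes down to a one-time git remote add and an end-of-day push. Here is a sketch in Julia (since that is your stack), with the remote URL and clone path as placeholders:

```julia
# Placeholder remote URL and clone path; substitute your own.
repo = expanduser("~/research/knowledgebase")

# One-time setup: register the private remote as a second place to push to.
run(`git -C $repo remote add backup git@gitlab.example.edu:me/knowledgebase.git`)

# End-of-day routine: push all branches and tags to the private remote.
run(`git -C $repo push backup --all`)
run(`git -C $repo push backup --tags`)
```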
I’m currently using a cheap VPS from liteserver.nl (512 GB of storage for 6 euros a month, and you can pay in crypto). I have everything encrypted with gocryptfs locally on my laptop and then just periodically rsync the encrypted directory over.
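Roughly, the moving parts look like this (a sketch; the directory names and the VPS host are placeholders):

```julia
# I work in the decrypted view; only the encrypted backing directory ever
# leaves the laptop. Names below are placeholders.
cipher_dir = expanduser("~/vault.enc")   # gocryptfs' encrypted backing store
plain_dir  = expanduser("~/vault")       # decrypted mount point I actually work in

# One-time init and per-session mount (both prompt for the passphrase):
#   run(`gocryptfs -init $cipher_dir`)
#   run(`gocryptfs $cipher_dir $plain_dir`)

# Periodic off-site sync of the *encrypted* directory to the VPS.
run(`rsync -a --delete $cipher_dir/ user@vps.example.net:backups/vault.enc/`)
```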
Another thing that might be relevant: Many institutions/universities/journals/… have policies about retaining data and/or code that was used to produce published work for some minimum time. I think 10 years is the required minimum in many cases – so best check with your university if there is some way to archive the publication data somewhere centrally. Ideally such that even when you leave or can’t access the data for other reasons, it can still be retrieved.
I work entirely in a OneDrive directory. I can still have git repos and such, and I can seamlessly switch between my home and my office computer. It’s 2 € a month if your university doesn’t provide it already.
I think it makes sense to create a backup HDD monthly or quarterly and store it off-site (with family, friends, etc.). You don’t want to lose years of research, and a few hours a month is not a big price to pay to save you from that.
I want to mention the 3-2-1 backup strategy as a good practice to implement: three copies, on at least two different media (SSD, HDD), in different locations. You already have three copies in different locations, but I presume you synchronize them often, which means that if the data somehow gets corrupted without you noticing, the corruption could easily spread to the other copies. A monthly backup may protect against this.
Do you put git repositories into your OneDrive? At least with Dropbox and Nextcloud I experienced quite some issues when syncing git repositories over cloud services, so I keep two folders: (1) git repos, which are not synced (but they are on GitHub/GitLab/… anyway), and (2) a OneDrive folder for everything else. I am not the biggest fan of OneDrive, but I get it for free from my university.
Yes I do. I have experienced issues, for example if I close my computer too quickly after saving a change, so the change is not synced and therefore not available on my other workstation. I have also had occasional issues where OneDrive could not overwrite a file I saved and created a duplicate instead. Other than that, it has served me well. I have never had a major issue, and I don’t have to commit changes I don’t want just to synchronize my code.
I use pCloud as cloud storage and created SVN repositories for my work there (the free plan is 10 GB). The cloud can be mounted as a normal drive, so to SVN it looks like a local repository (but it is not). Thus, I work with one and the same repo on different machines.
One advantage compared to GitHub is that you are not limited by traffic per month, so you can also commit/check out large files if needed. There is, however, one big disadvantage: if you want to share your work with collaborators, they need a pCloud account as well. At least I did not find any workaround.
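Roughly, the setup looks like this (paths are placeholders). Since pCloud Drive is mounted as a normal folder, the repository living there can be addressed with a plain file:// URL:

```julia
# Placeholder paths; the repository itself lives on the mounted pCloud drive.
drive = expanduser("~/pCloudDrive/svnrepos")

run(`svnadmin create $drive/thesis`)   # done once: creates the repo on the drive
run(`svn checkout file://$drive/thesis $(expanduser("~/thesis-wc"))`)   # one working copy per machine
# After that it is the usual `svn add` / `svn commit` cycle from any machine.
```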
Interesting. The problem with files not being overwritten and copies being created happened to me regularly with files in .git/ of a repository, and it sometimes destroyed the git repository in the sense that it was no longer recognised as a git repository.
In addition to the recommendations above, you could look at DVC for versioning data, models, and experiments, or you could look at Git LFS for versioning binary files.
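With DVC, for example, the large file goes to whatever storage you already have while git only tracks a tiny pointer file. A minimal sketch, with the repo path, file name, and remote URL as placeholders:

```julia
# Run inside the git repo (hypothetical path); file name and remote are placeholders.
cd(expanduser("~/research/knowledgebase")) do
    run(`dvc init`)                        # once per repo
    run(`dvc add results/run_042.hdf5`)    # writes results/run_042.hdf5.dvc and a .gitignore entry
    run(`git add results/run_042.hdf5.dvc results/.gitignore`)
    run(`git commit -m "Track simulation output with DVC"`)

    # Point DVC at storage you already have (a WebDAV share here, as an example),
    # then push the actual data there.
    run(`dvc remote add -d unicloud webdavs://cloud.example.edu/remote.php/dav/files/me/dvc`)
    run(`dvc push`)
end
```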
Yes, all great points. And I had forgotten about some of my simulation results, which are indeed too large for git (.hdf5 files). As you say, my university runs a cloud service I can use for the large files.
Interesting - I was not aware of Git LFS. I will look into it. That would be more convenient, I think, than always remembering to back up my simulation results to OneDrive separately from my repo (since the HDF5 files are not tracked in my repos).
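From a quick look at the docs, the setup seems to be roughly this (untested sketch, file names made up); I should probably also check what LFS storage/bandwidth limits my remote imposes:

```julia
# Untested sketch; run inside my repo (path shortened/made up here).
cd(expanduser("~/research/knowledgebase")) do
    run(`git lfs install`)                 # once per machine
    run(`git lfs track "*.hdf5"`)          # records the pattern in .gitattributes
    run(`git add .gitattributes`)
    run(`git add results/run_042.hdf5`)    # made-up file name
    run(`git commit -m "Track HDF5 results via Git LFS"`)
    run(`git push`)                        # LFS objects are uploaded alongside the push
end
```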
Thanks for the responses everyone! My main concern was if I really needed an HDD for backup or if maintaining current clones of my repo in several places was enough redundancy, which I feel it is now after reading the responses in this thread. Thanks!
Another criterion for choosing a cloud provider (at least for me) was whether it is WebDAV-compatible, in order to sync my Zotero library. It works very well with pCloud, and as a bonus you can also choose to host your data in the EU instead of the US.
Yes. Of course. And I speak as a person whose laptop died 2 days before I had to take my PhD thesis for binding, and I had no backup whatsoever.[1]
You should always back up everything into the “cloud”, using a locally encrypted, incremental, and deduplicating backup. The best practical solution at the moment is borgbackup. Don’t use anything else.[2] You can mount a backup from any point in time as a remote directory.
Make it automatic, running in the background every 15 minutes.
The “cloud” may be a server on which you have some space, or ideally a provider like borgbase.
Don’t be tempted to wait with backups until you have sorted out the files or whatever. Even if you take up 5x the space, that is irrelevant compared to having an up-to-date backup. Back up now; pruning the repo will take care of freeing up space in the long run.
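To make it concrete, the recurring job boils down to a borg create plus a borg prune. Here is a sketch in Julia (to match your stack); the repository URL, paths, and retention numbers are placeholders, and you would run borg init --encryption=repokey once and set BORG_PASSPHRASE before scheduling it.

```julia
using Dates

repo  = "ssh://xxxx@xxxx.repo.borgbase.com/./repo"    # placeholder repository URL
src   = expanduser("~/research")                      # what gets backed up
stamp = Dates.format(now(), "yyyy-mm-dd_HH-MM-SS")    # archive name = timestamp

# Encrypted, deduplicated, incremental archive of the source directory.
run(`borg create --compression zstd $repo::$stamp $src`)

# Thin out old archives so the repository does not grow without bound.
run(`borg prune --keep-hourly 24 --keep-daily 14 --keep-monthly 12 $repo`)

# cron entry to run this every 15 minutes, as suggested above:
#   */15 * * * *  julia /path/to/borg_backup.jl
```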
University IT people extracted my HDD, which survived, and copied the contents to a USB stick, so everything worked out. ↩︎
I know the question is solved, I am replying because no one mentioned borgbackup as far as I can see. ↩︎
Just so I understand: you are saying I should back up on an external HDD and do the borgbackup thing in the cloud?
The conclusion I took away from the previous comments was that keeping up-to-date clones of my repo on my laptop, desktop, and university HPC account gives me enough redundancy. The only catch is that I have to make sure they are all kept up to date.
That’s up to you, but external HDDs are not a good backup medium (no automated background backups, they usually share a location with the primary data, etc.), so I would just back up to a cloud provider.
I am watching it with interest too, but for backups I am usually conservative and would go with a project that has had widely used, stable releases for years. Borgbackup is more than sufficient for a personal backup.