The advantage/motivation for this could be that some universities provide compute resources for “free” (to a certain extent) to their members, which would allow using the university’s cluster as a backend for package testing. Is there public documentation for such a setup, or even best practices/lessons learned that one could benefit from?
Out of curiosity, why not use GitHub Actions self-hosted runners? It isn’t that much different, is it? However, there is the problem of doing it safely, so that an attacker can’t run malicious code on your machine with a PR (but it probably isn’t much different with GitLab?).
You sometimes have special hardware requirements (GPU, FPGA, multiple HPC nodes, etc.) and might need direct hardware access (no VM business). Personal examples: ThreadPinning.jl and LIKWID.jl.
(Sorry, overlooked that you’re talking about self-hosted GitHub runners… It’s the middle of the night here in Dallas.)
I think the big issue with self-hosted GitHub Actions runners was that isolation is ill-defined and you had to do it yourself.
For JuliaGPU we moved away from GitLab (@maleadt probably remembers better than I do why) to Buildkite. For CliMA we are also using Buildkite to run things on Caltech’s HPC system.
The biggest problem was that GitLab does not support mirroring changes from forks, so we couldn’t run CI on external PRs (without something like bors as a workaround). Furthermore, the mirroring was easy to break and annoying to set up.
However, compared to Buildkite, isolation was actually better (with built-in support for running jobs in Docker), and we now have to roll our own version of a couple of fairly basic features that GitLab supported out of the box (e.g. secrets).
Thanks everyone for all the replies so far, this was already very helpful and right on the spot!
Indeed, @vchuravy identified the core issue, namely the security concerns with GitHub self-hosted runners: the GitHub docs already warn (e.g., here) that using self-hosted runners can pose a security problem, since the runners natively run as daemon-like tasks under a regular user account. There is no built-in isolation between the host system and the CI job, nor between subsequent CI jobs. Further, our university does not allow us to put credentials for its internal systems on an external service provider’s systems (although I believe that this wouldn’t actually be an issue, since the GH runners connect one-way to the repo and not the other way round).
@maleadt @simonbyrne Do you use Buildkite in its free version, do you pay for it, or did you get some open source/public research discount? As far as I can tell, the 10,000 mins/month limit on their free tier is going to be used up very fast with our current test setup (when using GH Actions hardware, this amounts to only roughly 7-10 full test cycles).
@maleadt Is there some public documentation available on what you need to do to make Buildkite work for you? Was it “just” secrets or also issues like running in a sealed environment like Docker? I’d be very interested in learning more about it.
@simonbyrne I believe you already mentioned to me once on Slack that you use Buildkite + HPC clusters, but unfortunately I didn’t document it before the 90-day limit kicked in. Could you maybe share again the setup you use for using Buildkite to run tests on an HPC cluster system?
Buildkite is free to use for open source projects. You can start out on our Free plan, and if you need additional usage or inclusions for your open source project, you can email support and we’ll be happy to help.
Thanks again for these great answers! Just so I understand correctly: the Buildkite runners do not provide any isolation either, so if you have multiple Buildkite runners under the same username, they can access each other’s files and all other files that are accessible by that user?
@simonbyrne What, if I may ask, is then the advantage over, say, using the GitHub self-hosted runners? For instance, I see that in the CliMA/slurm-buildkite repo some efforts have gone into making sure that the buildkite runner gets its job assignments, since they are usually triggered by webhooks. GH runners, on the other hand, would not need this since they are happy as long as they can reach GH, as described here.
For JuliaGPU, I understand the reason was that Buildkite runners allow you to use custom images. Did I understand this correctly, or was there another motivation?
Basically, yes: the only isolation provided by default is to run each job in a new directory.
Our requirements were a bit different:
1. We wanted to use the Slurm scheduler to create worker Slurm jobs on demand, rather than the CI scheduler (which typically keeps a pool of workers running).
2. We wanted to allow different jobs to request different resources (CPUs, GPUs, memory, etc.), which means we needed a way to
   a. specify these requirements, and
   b. make sure that each job ran on the correct runner.
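To make these two requirements a bit more concrete, here is a minimal Python sketch of how per-job resource requests could be turned into an on-demand Slurm submission. This is not the actual CliMA/slurm-buildkite code: the requirement keys (`cpus`, `gpus`, `mem`, `time`), the function name `build_sbatch_command`, and the `ci-` job-name prefix are all hypothetical and chosen only for illustration. The `sbatch` options and the `buildkite-agent start --acquire-job` call are standard, but how a real setup wires them together may well differ.

```python
import shlex

# Hypothetical mapping from CI requirement keys to standard sbatch options.
SBATCH_OPTIONS = {
    "cpus": "--cpus-per-task={}",
    "gpus": "--gres=gpu:{}",
    "mem": "--mem={}",
    "time": "--time={}",
}


def build_sbatch_command(job_id, requirements):
    """Build an sbatch argument list for a single CI job.

    The Slurm job starts one agent that only picks up the CI job with the
    given id, so the job runs on a worker with exactly the requested resources.
    """
    unknown = set(requirements) - set(SBATCH_OPTIONS)
    if unknown:
        raise ValueError(f"unsupported requirements: {sorted(unknown)}")

    cmd = ["sbatch", f"--job-name=ci-{job_id}"]
    # Iterate over the mapping (not the input) to get a stable option order.
    for key, template in SBATCH_OPTIONS.items():
        if key in requirements:
            cmd.append(template.format(requirements[key]))

    # The agent is pinned to this one CI job and exits once it is done.
    agent_cmd = f"buildkite-agent start --acquire-job {shlex.quote(job_id)}"
    cmd.append(f"--wrap={agent_cmd}")
    return cmd


if __name__ == "__main__":
    # Print the command instead of submitting it, so this runs without Slurm.
    print(build_sbatch_command("abc123", {"cpus": 4, "gpus": 1}))
```

Because the worker is created per job and pinned to that job's id, there is no need to keep a pool of idle workers, and a GPU job cannot accidentally land on a CPU-only allocation.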
@simonbyrne Thanks a lot for the detailed explanation, this certainly helps a lot!
@vchuravy Thank you as well for that hint! While it’s not clear to me whether this is the way to go for HPC workloads, it seems like an interesting option for general runners that would also be somewhat safe to run on forks.