I need to come up with a system for sharing data between different laboratories located around the world. One possibility would be to use something like JSON, but since I will be developing the data analysis code in Julia, it makes sense to store the data as a Julia code file instead. That is, what I am imagining is that a data file would have statements like:
magnet1 = quadrupole(k1 = 0.34, l = 0.6)
And these statements would fill in the data structures that would then be analyzed. Since these files would be shared among various laboratories, the concern has been raised that a bad actor could compose a data file containing malicious code. To prevent this, my idea is to construct a parser that first reads in a data file and then uses Meta.parse() to analyze the code for problems. Specifically, the following would be disallowed:
1) Loading any module (the parser will set up the appropriate data structures to be filled in).
2) Writing to a file.
3) Using ccall.
4) Using Libc or Libdl.
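A minimal sketch of the kind of check I have in mind — parse the file with Meta.parse() and walk the resulting Expr tree looking for disallowed constructs. The set of disallowed names here is illustrative, not exhaustive:

```julia
# Illustrative blacklist check: walk a parsed Expr looking for disallowed
# constructs. Note that qualified names such as Base.eval would slip past
# this naive symbol check — part of why I'm asking how safe this can be.
const DISALLOWED = Set([:ccall, :eval, :include, :open, :write, :run])

function is_safe(ex)
    ex isa Expr || return true                      # symbols/literals are fine
    ex.head in (:using, :import, :macrocall) && return false
    if ex.head === :call && ex.args[1] isa Symbol && ex.args[1] in DISALLOWED
        return false
    end
    return all(is_safe, ex.args)                    # recurse into sub-expressions
end

is_safe(Meta.parse("magnet1 = quadrupole(k1 = 0.34, l = 0.6)"))  # true
is_safe(Meta.parse("using Libdl"))                               # false
```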
My question is how safe is this? Would it be possible to construct code that could hide doing something disallowed? Is there anything else that should be disallowed?
Note: I do not want to use Docker.
I appreciate any info or suggestions. Thanks!
Not worth it; malicious actors can easily get around it. I suggest you just trust your collaborators, or use Docker/VMs…
I support jling’s answer. It is so difficult to do sandboxing right that you are almost destined to fail. So either treat data as data and do not eval it, or use proper Docker. From a security perspective, I would opt for the first solution: treat data as data and avoid evaluating it.
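To give one concrete example of "data as data": TOML ships in the Julia standard library, parses into plain dictionaries, and never executes anything. A sketch using the field names from the question:

```julia
# Treating the data as data: a TOML file parses to nested Dicts and nothing
# is ever evaluated. TOML is part of the Julia standard library.
using TOML

toml_text = """
[magnet1]
type = "quadrupole"
k1 = 0.34
l = 0.6
"""

data = TOML.parse(toml_text)
data["magnet1"]["k1"]  # 0.34
```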
I also would suggest sharing data as data, for instance in an S3 bucket using a common format like Arrow or Parquet.
Sandboxing / Virtualisation is not security. You need to think of every possible way it can go wrong. An attacker only needs to find one.
I don’t know of a system that wasn’t escaped from at least once.
As Theo says, rather colourfully:
Subject: Re: About Xen: maybe a reiterative question but …
From: Theo de Raadt <deraadt () cvs ! openbsd ! org>
Virtualization seems to have a lot of security benefits.
You’ve been smoking something really mind altering, and I think you should share it.
x86 virtualization is about basically placing another nearly full kernel, full of new bugs, on top of a nasty x86 architecture which barely has correct page protection. Then running your operating system on the other side of this brand new pile of shit.
You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can’t write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes.
You’ve seen something on the shelf, and it has all sorts of pretty colours, and you’ve bought it.
That’s all x86 virtualization is.
I’d say using Meta.parse(...) is completely fine, as long as you never use eval. It would be extremely unsafe to try to create a “blacklist” of disallowed commands and then run whatever is not on the list.
On the other hand, using a “whitelist” of commands that get called by your code can be fine, as long as you are careful about how those calls are made.
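A rough sketch of the whitelist idea: instead of eval, walk the parsed expression yourself and permit only assignments whose right-hand side is a call to a known constructor with literal numeric keyword arguments, so nothing else ever gets evaluated. The `quadrupole` constructor here is a stand-in for whatever element constructors you define:

```julia
# Whitelist interpreter sketch: only `name = allowed_call(kw = number, ...)`
# is accepted; anything else raises an error and is never evaluated.
quadrupole(; k1, l) = (type = :quadrupole, k1 = k1, l = l)  # stand-in constructor

const ALLOWED = Dict(:quadrupole => quadrupole)

function interpret!(env::Dict{Symbol,Any}, ex::Expr)
    ex.head === :(=) || error("only assignments are allowed")
    name, rhs = ex.args
    (rhs isa Expr && rhs.head === :call) || error("right-hand side must be a call")
    f = get(ALLOWED, rhs.args[1], nothing)
    f === nothing && error("call to $(rhs.args[1]) is not allowed")
    # keep only literal numeric keyword arguments; nested expressions are dropped
    kwargs = Dict(a.args[1] => a.args[2] for a in rhs.args[2:end]
                  if a isa Expr && a.head === :kw && a.args[2] isa Number)
    env[name] = f(; kwargs...)
end

env = Dict{Symbol,Any}()
interpret!(env, Meta.parse("magnet1 = quadrupole(k1 = 0.34, l = 0.6)"))
```

The key design point is that your code, not Julia's eval, performs every call, so an attacker has no way to reach anything outside the whitelist.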
It doesn’t make sense. Not only because of security concerns, but also because it makes your analysis quite inflexible, as data and code are entangled. You end up with a system where even changing the data becomes a major problem.
What makes sense is to separate data and code completely. JSON is fine, but probably problematic for coworkers who don’t work well with JSON and want to use other tools; for larger data it is also barely human-readable, and errors are very difficult to spot.
Why not just use a simple common data format like tabular data (TSV, tab-separated) and read it in using the common tools/functions?
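For instance, with the DelimitedFiles standard library (the columns name/k1/l here are just an illustration):

```julia
# Reading tab-separated data with the DelimitedFiles standard library.
# Columns are illustrative: element name, k1, l.
using DelimitedFiles

tsv = "magnet1\t0.34\t0.6\nmagnet2\t-0.21\t0.4"

m = readdlm(IOBuffer(tsv), '\t')  # 2×3 matrix; strings and numbers
m[1, 2]  # 0.34
```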
There is no absolute security anyway. I’ve opened more than one locked door in my life without a key, and I know a “professional” knows a hundred times more ways to penetrate a locked space. Still, locking the door when leaving adds some security, and I mostly do it.
Actually, there are many reasons to expect a scientific collaborator not to intentionally try to infect your computer, but virtualisation/sandboxing/data checking/whatever would probably add some peace of mind.
Still, inventing my own data format (which is what you are actually proposing) would be my very last option. Here are the questions you might ask yourself and your collaborators:
- How are the data produced?
- What kind of software is used for local data postprocessing?
- What is the data volume?
- What type of data? (e.g. complex numbers are not supported by JSON)
- What units are used (metric vs. imperial)?
- Hierarchical or flat data structures?
- Is all data sent to you, or would there be data exchange between other labs?
- To what extent can you influence the choices of your collaborators?
- I’m sure I forgot something important…
Then look for a suitable data format (JSON/BSON, CSV/TSV, Arrow, HDF5, JLD2, SQL, whatever — there have been some discussions of data exchange and storage formats on the Discourse).
I think Joanna Rutkowska thought of all the ways things can go wrong when designing Qubes OS. If I were paranoid I would use that, and that was before seeing Snowden’s tweet:
If you’re serious about security, @QubesOS is the best OS available today. It’s what I use, and free. Nobody does VM isolation better.
I read about the OS before https://meltdownattack.com/, so I guess now all bets are off, unless you use an (older, slower) in-order CPU.
What you could do is make a Julia-to-Java transpiler. In case you missed the sarcasm: Java was meant to be secure by design (its JVM sandbox), while having a long history of security issues, most recently the Log4j attack, “the single biggest, most critical vulnerability of the last decade”.
- How much data are we talking about?
- Does the data need to transfer in real-time?
- Are manual actions allowed or should the transfer go automatically?
- Depending on other answers: can all clients connect to some NAT punching service? Or are they publicly available?
- Are the client machines trustworthy, or should the data be stored encrypted on disk?
In any case, try to always use existing systems because writing secure code for things like parsing or authentication is hard.
For more info about why Docker is not secure, see for example Container Isolation: Is a Container a Security Boundary? or Container Isolation is not Safety - Container Journal.
I think you should define what exactly you are worried about. For example, making sure you prevent data loss is a good thing regardless of the cause, “bad actor” or otherwise.
Checksums and/or sharing via version management systems (like git/GitHub, Subversion) can be a solution:
Checksums provide data integrity of all files, data and code.
Version management provides easy check of differences to last versions.
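For the checksum part, Julia’s SHA standard library is enough; publishing a digest alongside each shared file lets recipients verify it arrived unmodified. A sketch (the file path is hypothetical):

```julia
# SHA-256 checksum of a shared data file using the SHA standard library.
using SHA

# Hash a file's raw bytes (path is illustrative):
checksum(path::AbstractString) = bytes2hex(sha256(read(path)))

# Works on in-memory data too:
digest = bytes2hex(sha256("magnet1 = quadrupole(k1 = 0.34, l = 0.6)"))
```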