That amount of data is definitely large enough to warrant considering Hadoop or Spark. Given that I don’t know much about Hadoop or Spark, however, the only negative thing I can say about them is secondhand: I’ve heard that they’re a pain to set up and maintain, and that they aren’t especially performant. But if you need strong fault tolerance, Hadoop is apparently something of a gold standard.
Given that you care less about fault tolerance, that the amount of data you’ll process at one time is not very large, and that your data is mostly immutable (no in-place changes by Julia), JuliaDB might be something to consider. There are of course some rough edges to it right now, namely its tendency to crash and not recover on large, distributed datasets (which I’m working on fixing in Dagger), and its large memory usage (which someone is already planning to tackle in the next few weeks).
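To give a rough idea of what that could look like, here’s a minimal sketch of loading and aggregating with JuliaDB; the `"data/"` directory, the binary cache path, and the column names `:group` and `:value` are all hypothetical stand-ins for your own setup:

```julia
using Distributed
addprocs(4)              # spin up workers so the table can be distributed
@everywhere using JuliaDB
using Statistics

# Load a directory of CSVs into a distributed table, caching a binary copy.
# "data/" and the column names below are placeholders for your data.
t = loadtable("data/"; output = "data_binstore", chunks = 8)

# Example aggregation: mean of :value within each :group.
result = groupby(mean, t, :group; select = :value)
```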
However, even if you didn’t want to use JuliaDB/Dagger, having all of your processing code already written in Julia means switching to another system and language (namely something that runs on the JVM) carries a high cost, and unless you have extensive experience with Hadoop/Spark, I don’t see a good rationale for porting your processing code over to them. In my opinion, you’re probably better off in both the short and long term making Julia’s parallel capabilities work for you.
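If you go that route, even the `Distributed` standard library gets you a long way before you need anything heavier. A sketch, where `process_file` is a hypothetical stand-in for your existing per-file logic:

```julia
using Distributed
addprocs(4)  # one worker per core; adjust to your machine

# Hypothetical stand-in for your existing per-file processing code.
@everywhere function process_file(path)
    count = 0
    for line in eachline(path)
        count += 1  # replace with your real parsing/aggregation
    end
    return count
end

paths = joinpath.("data", readdir("data"))  # "data" is a placeholder
results = pmap(process_file, paths)         # farm files out to workers
```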
I would recommend starting with a small-scale proof-of-concept of your desired pipeline using JuliaDB or similar, and benchmarking it with example queries to see if it handles the load you need. If it doesn’t handle the load adequately, then we can try to work with you to see whether we can tweak JuliaDB et al. to better handle your use case. If that doesn’t pan out, then it’s safe to consider investing time in a Hadoop/Spark proof-of-concept, and see if that’s more suitable for your use case.
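For the benchmarking piece, BenchmarkTools.jl is the usual tool. A hedged sketch, where `sample.csv` and the query itself are placeholders for your real data and workload:

```julia
using JuliaDB, BenchmarkTools

# A small but representative slice of your data (placeholder filename).
t = loadtable("sample.csv")

# A hypothetical example query: count rows where :value is positive.
example_query(t) = length(filter(r -> r.value > 0, t))

# Time the query with proper warmup and repeated samples.
@benchmark example_query($t)
```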