When will Julia compete with Spark?

My org uses Spark + EMR to query and transform static files on S3 and output to files that either get loaded to a DB or used as input for analysis.

It seems like Julia could also do this, and do it well, though of course Spark as a project has a tremendous head start.

Is this a realistic idea? What about non-production work, where I need to read and transform large distributed static files and run ML batches over the extract?

1 Like

I think you misunderstand something: Apache Spark is a platform for robust distributed computing, while Julia is a programming language. It’s not clear how they could/should “compete”.

Are you looking for ?

2 Likes

It’s a good point. Perhaps a better question might be whether there is an implementation in Julia that achieves the same purpose but better.

I think the ecosystem for doing distributed computation on very large data sets is not particularly well-developed in general. Spark can be really unintuitive and unpleasant to use, and other tools suited to large data (sort of like SAS) are even worse.

For a long time now I’ve wanted to build something to handle econometrics out-of-core for data too large to fit in memory in a distributed way. JuliaDB sort of handles out-of-core data, but I don’t really think it’s robust in the ways I might like for econometrics data. I’d have to really work with it to get it to handle my kinds of operations.

Another question is whether Julia is actually the correct language to build this stuff in. Yes, it’s fast and has good typing and does a whole bunch of stuff well, but it’s not really a good systems programming language like I imagine you would need to build a Spark competitor. Spark is a really good system, in that it’s quite robust and moderately easy to scale as long as you have good infrastructure. It would be a big undertaking to write something similar in Julia.

I would like to see something like it, where everything is intuitive and just works. We could probably get that out of something based in Julia, but it would be a huge effort. It’d need industrial buy-in and a lot of developer-hours.

I should note that we have a lot of the skeleton already:

10 Likes

@Tamas_Papp: Mea culpa, I admit the headline is somewhat tongue in cheek.

What I do wish for is a demonstration of a similar basic use case. Write some Julia to:

  • read some static files from S3 in parallel
  • transform the data using compute nodes in parallel
  • write the transformed data to S3

I think Julia can do this, and I want to do it in my work, both to build my own Julia skills and to raise my team’s awareness of Julia.
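Something along these lines might work. This is a minimal sketch, not a tested pipeline: it assumes AWSS3.jl, CSV.jl, and DataFrames.jl, and the bucket names, key prefix, column name, and transform are all hypothetical placeholders.

```julia
# Minimal sketch: parallel S3 read, transform, and S3 write with plain Distributed.
# Bucket names, the "raw/" prefix, and the :amount column are placeholders.
using Distributed
addprocs(4)  # one worker per core; size this to your machine or cluster

@everywhere using AWSS3, CSV, DataFrames

@everywhere function process_key(key)
    # read one CSV object from S3 into a DataFrame
    df = CSV.read(IOBuffer(s3_get("my-input-bucket", key)), DataFrame)
    # example transform: filter rows, add a derived column
    out = filter(:amount => >(0), df)
    out.amount_usd = out.amount .* 1.1
    # write the transformed partition back to S3
    buf = IOBuffer()
    CSV.write(buf, out)
    s3_put("my-output-bucket", "processed/" * key, take!(buf))
end

objkeys = [o["Key"] for o in s3_list_objects("my-input-bucket", "raw/")]
pmap(process_key, objkeys)  # one partition per worker at a time
```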

I can obviously implement all this at scale using Spark and the Python API, but I want to see a Julia implementation. I’m looking through the posted response now. Thank you, @cpfiffer.

3 Likes

OK, I’m pretty excited about JuliaDB; it seems like this alone would satisfy the simple use case I have in mind.
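For example, a hedged sketch assuming JuliaDB.jl and Glob.jl; the paths and the :amount column are placeholders:

```julia
# Ingest many CSV partitions out of core with JuliaDB, filter, and save.
using JuliaDB, Glob

t = loadtable(glob("data/part-*.csv"); output="data_bin", chunks=8)  # on-disk binary chunks
positive = filter(r -> r.amount > 0, t)
save(positive, "filtered_bin")
```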

The OnlineStats package can already compute some statistics on out-of-core data. For regressions, there seems to be no answer yet.
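For instance, a minimal sketch with OnlineStats.jl and CSV.jl, streaming one row at a time; the file path and the :amount column are hypothetical:

```julia
# Fit running statistics over a file too large for memory.
# CSV.Rows streams rows lazily instead of materializing the whole table.
using OnlineStats, CSV

o = Series(Mean(), Variance(), Extrema())
for row in CSV.Rows("huge_file.csv"; types=Dict(:amount => Float64))
    fit!(o, row.amount)
end
value(o)  # current estimates, without ever holding the full file in memory
```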

2 Likes

I actually started some really crappy work on this in BigStats, which was my very poor attempt at getting StatsModels to work with JuliaDB and SQLite. I haven’t worked on it in a while, but I was able to get some basic regressions working iteratively for JuliaDB. I have a local branch which handles SQLite databases, but it’s pretty terrible.

4 Likes

If you define the right methods, regressions should work with distributed arrays, no?

I don’t know about the solvers etc though

Yes, but I believe that distributed arrays are all in memory. It might be nice to have something like ArrayStream types that dynamically load distributed arrays into memory.

Those are fairly straightforward: OLS basically reduces to a bunch of vector multiplications and sums, which parallelize easily. It does come at a cost, though, since you do more operations overall.
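Concretely, here is a hedged sketch that accumulates the sufficient statistics X'X and X'y chunk by chunk; the chunk iterator is a placeholder:

```julia
# Out-of-core OLS via sufficient statistics: each (X, y) chunk fits in
# memory, and only the small p×p cross products persist between chunks.
using LinearAlgebra

function ols_streaming(chunks, p)
    XtX = zeros(p, p)
    Xty = zeros(p)
    for (X, y) in chunks   # any iterator of (Matrix, Vector) pairs
        XtX .+= X' * X
        Xty .+= X' * y
    end
    return XtX \ Xty       # solves (X'X) β = X'y
end
```

The per-chunk products are independent, so they could also be computed on separate workers and summed, e.g. with pmap plus a reduction.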

2 Likes

I don’t see why Julia wouldn’t be ideal for big data processing, although perhaps my unfamiliarity with Spark keeps me from seeing the obstacles. While systems programming may be a bit trickier in certain circumstances (such as where low latency or tight control over syscalls is required), I don’t think this is something Julia can’t be adapted to handle; I believe it should be possible to “swap out” some of Julia’s managed infrastructure (like libuv and garbage collection) for the “plain” sorts of operations you get with libc that are so amenable to systems programming.

Regarding robustness, I’d also like to see JuliaDB (and anything else using the underlying Dagger framework) become more robust to failure modes and latencies, and I’d like to improve JuliaDB’s ability to handle larger datasets at a reduced or fixed memory cost (a known pain point with JuliaDB right now). I hope that you’ll file issues on the appropriate repos where you’ve found things problematic (and if you find something lacking in Dagger’s scheduling performance, feel free to ping me directly in the issue, since I’m actively working on some improvements to the scheduler).

Finally, I think that the up-and-coming MLJ.jl will be a big game-changer for the Julia big data ecosystem once some of the known issues and missing features are addressed. There’s a lot of potential in their planned approach to machine learning, and I expect it’ll see a lot of usage together with Flux.

3 Likes

Thanks.

What you said about distributed arrays being in memory is true and would preclude out-of-core work, but Spark is generally used for multi-node distributed processing, right?

Yeah, but Spark kinda-sorta has out-of-core abilities. It mostly chunks data up into partitions that separate workers can process, much like how JuliaDB works.

Even if it’s not better, TBH, for certain tasks it would be great not to change gears and to stay working in Julia.

Can anyone point me to recommended tutorials covering parallel read from S3, filter or join, then write to S3?

Almost anything can be done; the question is at what cost, and whether someone will invest that in free software.

Spark itself is comparable (if not necessarily equal, these things are hard to quantify) to Julia in the number of contributors and complexity.

While working with native Julia libraries is nice and has a lot of advantages, it is not always the best solution, especially initially. This is why interfacing with foreign libraries has been a core feature since the early days of Julia.
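For example, calling into a C library needs no wrapper code at all (this is the classic example from the Julia manual):

```julia
# Call the C standard library's clock(3) directly via the built-in ccall.
t = ccall(:clock, Int32, ())
```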

4 Likes

If you want to be restricted to Spark’s obsession with writing to disk, or its other nuances, go ahead; but there is nothing that Julia can’t do. JuliaDB is functional, but straight Julia with distributed processes does everything. We’re using it in production, distributed across multiple servers at scale. Boom!
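In plain Distributed.jl terms, that pattern can be as simple as the following hedged sketch; the hostnames, worker counts, and work function are placeholders, and SSH access to the machines is assumed:

```julia
# Plain Julia distributed across machines, no extra framework required.
using Distributed
addprocs([("worker1.example.com", 4), ("worker2.example.com", 4)])  # SSH workers

@everywhere process(x) = x^2        # placeholder for real work

results = pmap(process, 1:10_000)   # scheduled across all remote workers
```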

3 Likes

Okay Glen, I need more information about this, in the context of:

  • I really enjoy working with Julia as a general-purpose language, but professionally I’m in a Spark/Scala ecosystem.
  • I’m new to using Julia in professional work; adopting it into my workflow requires a very simple, focused set of working solutions. I can build on that once I’m able to deliver reliably.

So here is what I need to do, which it sounds like you have achieved:

  1. Read distributed, partitioned data from S3, mostly in Parquet format but sometimes CSV (a sketch of this step follows below).
  2. Compute: usually just aggregations, filters, or shape transformations.
  3. Write back to S3 for storage, downstream processing, or analysis work.

If you could share some of your process code, that would be great!
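In the meantime, here is a hedged sketch of step 1: fetching one Parquet partition from S3 into a DataFrame. It assumes AWSS3.jl, Parquet.jl, and DataFrames.jl; the bucket and key are placeholders.

```julia
# Pull one Parquet partition from S3 and load it as a DataFrame.
# Parquet.jl reads from a local file path, so stage the bytes in a temp file.
using AWSS3, Parquet, DataFrames

key = "data/part-00000.parquet"            # hypothetical partition key
tmp = tempname() * ".parquet"
write(tmp, s3_get("my-bucket", key))       # fetch the object bytes locally
df = DataFrame(read_parquet(tmp))          # Tables.jl interface to DataFrame
```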

2 Likes

Hi Merlin,

The SparkSQL.jl package is designed for those who want to use Apache Spark from Julia. SparkSQL.jl is intended for tabular data, on-premises and in the cloud. It works with CSV, Parquet, JSON, and Delta Lake.

From Julia, SparkSQL.jl lets you read and write files, create Delta tables, and compute in Spark. You only need to know Julia and SQL; no Python, Scala, or Java knowledge is required to use the package. Your SparkSQL.jl-based Julia app shows up as a proper Spark application in Spark (it does not go through JDBC, which would require a Thrift server). You can even pass settings to your Spark session from Julia.

SparkSQL.jl also connects Julia’s wonderful data science packages to Spark data. The package has functions that move data easily between Spark and Julia as DataFrames. Tutorials and project documentation can be found here:
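A hedged sketch of the basic workflow, following my reading of the SparkSQL.jl tutorial; the master URL, data path, and query are placeholders:

```julia
# Query Spark from Julia via SparkSQL.jl and pull results back as a DataFrame.
using SparkSQL, DataFrames

initJVM()                                            # start the JVM
sprk = SparkSession("spark://master:7077", "JuliaApp")

stmt = sql(sprk, "SELECT * FROM parquet.`s3a://my-bucket/data/`")
createOrReplaceTempView(stmt, "mydata")
counts = sql(sprk, "SELECT category, COUNT(*) AS n FROM mydata GROUP BY category")

df = toJuliaDF(counts)                               # Spark result as a Julia DataFrame
```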

Tutorial page:

Project page:

8 Likes