Is Julia a good choice for Data Engineering?

My apologies for the generic question. Unfortunately after searching around I haven’t been able to find a direct answer.

I would like to know the community’s opinion on using Julia for Data Engineering tasks. You know, pipelines, streaming data into a warehouse, etc.

I’ve built many pipelines and data engineering projects with Python. More of my work is heading toward data science and taking advantage of some of what I’ve built in Julia, so I’m reaching a point where I need to decide if it is worth the time to convert a pipeline from Python to Julia. Some of the ETL work I need to do to prepare models will make the downstream work more efficient if I’m using the same toolset wherever possible.

3 Likes

I don’t have experience making an actual data pipeline in Julia (since I work on a larger team, I usually play it safe and go with languages such as Python and Scala), but from what I’ve seen of Julia right now, it kinda isn’t. The exception is when you need very customized, math-oriented logic, where Julia’s libraries and high performance will offset the weaker support you’ll get for pretty much every integration in the pipeline (for example pubsubs, databases, cloud services, creating APIs). You can still do all of that, but you might end up having to go polyglot and use Python or C libraries (which Julia supports really well).

I think Julia can be a very strong competitor in this area in the future. It already does the “hard part” very well (easy-to-use, high-performance data processing) and has a strong FFI, so it mostly just needs nice wrappers for everything. We are already seeing people start important projects in the area (such as user-friendly web frameworks, an actors library for more reliable distributed processing, and I’ve even seen discussions about an “Airflow” for Julia), but right now the ecosystem is still lacking for that particular purpose.

5 Likes

In the abstract, I think Julia is an excellent language for this use. Its integration with the shell and excellent reproducibility of environments, to say nothing of the code for actually manipulating data, make it ideal.

But unfortunately, I have to agree with @fborda that the libraries aren’t there yet. A lot of the pieces are, and as I say, the strengths of the language should make it relatively easy, but someone needs to put in the work to make it accessible. I’m hoping to spend some time on this if I can get some room to breathe in the fall, but at the moment I still have a number of Snakemake pipelines that I can’t quite get away from, despite my strong desire to do so.

2 Likes

I too have the same feeling but it’s unclear which areas exactly are lacking. It would be nice if people doing data engineering in other languages/ecosystems can write up a list here and prioritize it so the community can tackle them in a more coordinated fashion.

I can volunteer to put together an official list if everyone chimes in and comments here first.

Cc: @logankilpatrick

3 Likes

This thread aggregated a bunch of things.

2 Likes

The problem is that even when a tool is available, it’s too barebones. For example, if you want to use Kafka, there is RDKafka.jl, which technically covers the bare minimum (the ability to read from and write to Kafka), but for a proper data pipeline you need a lot more than the minimum. You need ways to guarantee exactly-once semantics (such as transactions, or some fine-grained control of consumer commits), which you can implement with those primitives, but it’s added effort. In the same way, you probably want a lot of consumers working in parallel, which you’ll have to manage yourself, while many other libraries let you simply define a callback for each message/batch and then manage all the Kafka consumer groups for you. And then I have no idea what errors/exceptions can happen and would probably have to hunt down the C library return codes.
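To make the “fine-grained control of consumer commits” point concrete, here is a minimal sketch in Python (the language the thread compares against). It is not RDKafka.jl or any real Kafka client: a plain list stands in for the topic, and `OffsetTrackingSink` and `deliver` are hypothetical names. It only illustrates one common way to get an exactly-once effect, which is to store the committed offset together with the output so that redelivered messages are skipped.

```python
from dataclasses import dataclass, field


@dataclass
class OffsetTrackingSink:
    """Hypothetical sink that keeps its output and its last committed offset together."""
    rows: list = field(default_factory=list)
    committed_offset: int = -1

    def write(self, offset, row):
        # A message at or below the committed offset was already applied,
        # so a redelivery of it is ignored instead of being written twice.
        if offset <= self.committed_offset:
            return
        # In a real system these two updates would share one transaction.
        self.rows.append(row)
        self.committed_offset = offset


def deliver(log, sink, transform, start, stop):
    """Deliver log[start:stop] to the sink (assumption: the list plays the role of a topic)."""
    for offset in range(start, stop):
        sink.write(offset, transform(log[offset]))


log = ["a", "b", "c"]
sink = OffsetTrackingSink()

# First run handles offsets 0 and 1, then the consumer "crashes".
deliver(log, sink, str.upper, start=0, stop=2)
# After the restart the broker redelivers from the beginning (at-least-once delivery).
deliver(log, sink, str.upper, start=0, stop=3)

print(sink.rows)  # ['A', 'B', 'C']
```

With a bare-minimum wrapper, this bookkeeping is the part you end up writing yourself.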

Same thing for databases: LibPQ is a very solid wrapper, but you’ll have to write most advanced features yourself, like declaring cursors to implement database streaming, starting transactions and issuing commits/rollbacks, or using raw SQL for any kind of database reflection. By contrast, it has the Tables.jl integration, which is absolutely amazing and has more features for reading results than pretty much every database library I’ve used (and you can’t even figure that out from the documentation).
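For readers who haven’t used them, this is roughly what those two conveniences look like in a library that provides them. The sketch uses Python’s standard `sqlite3` module as a stand-in (an assumption made so the example is self-contained; it is not PostgreSQL or LibPQ, and SQLite has no server-side cursors). The `events` table and the `stream_batches` helper are made up for illustration.

```python
import sqlite3


def stream_batches(conn, query, batch_size=2):
    """Yield result rows in fixed-size batches instead of loading them all at once."""
    cur = conn.cursor()
    cur.execute(query)
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

# Using the connection as a context manager commits on success...
with conn:
    conn.executemany(
        "INSERT INTO events (payload) VALUES (?)", [("a",), ("b",), ("c",)]
    )

# ...and rolls back if the block raises, so the row 'd' is never stored.
try:
    with conn:
        conn.execute("INSERT INTO events (payload) VALUES ('d')")
        raise ValueError("simulated failure")
except ValueError:
    pass

batches = list(stream_batches(conn, "SELECT id, payload FROM events ORDER BY id"))
print(batches)  # [[(1, 'a'), (2, 'b')], [(3, 'c')]]
```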

And those are the ones that have libraries. It would be a giant effort to integrate Julia with Prometheus for metrics/alerts, since it requires creating an HTTP endpoint that receives information from all over the application so Prometheus can probe it. It’s similar work if you want to integrate with the Kubernetes probes (liveness and readiness). This kind of architecture is much easier to implement with a solid actors library (you just create a server actor that receives messages from all over the application and manages the endpoint), though I’m kinda biased here because of Elixir and Scala.
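As a rough picture of the architecture being described (one shared component that collects values from all over the application and exposes them on an endpoint), here is a sketch using only the Python standard library. It is not the official Prometheus client; `Registry`, `MetricsHandler`, and `start_metrics_server` are hypothetical names, and only simple counters in the Prometheus text exposition format are covered.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Registry:
    """Hypothetical thread-safe counter store shared by the whole application."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def inc(self, name, amount=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + amount

    def render(self):
        # Prometheus text exposition format: a TYPE line, then "name value".
        with self._lock:
            items = sorted(self._counters.items())
        lines = []
        for name, value in items:
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {value}")
        return "".join(line + "\n" for line in lines)


REGISTRY = Registry()


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REGISTRY.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def start_metrics_server(port=0):
    """Serve /metrics on a background thread (port 0 lets the OS pick a free port)."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Any part of the pipeline can report into the shared registry.
REGISTRY.inc("rows_processed_total", 3)
REGISTRY.inc("rows_processed_total")
REGISTRY.inc("batches_failed_total")
```

The registry here plays the role the post assigns to a server actor: everything sends updates to one place, and that place owns the endpoint.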

If we want to be ambitious, imagine a library that abstracts Kafka behind an “array” interface and lets you compose transformations using broadcast, while the library maps the events from the source topic to a target topic and automatically manages the exactly-once semantics and distributed processing. Or imagine Kafka supported a Tables.jl interface, so you could load the events directly into a DataFrame or JuliaDB and then serialize them back to another topic. For databases as well, multiple dispatch would allow for a fantastic query builder library that is much cleaner and more extensible than SQLAlchemy.
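The “array interface” idea might look something like the toy below. This is purely a hypothetical illustration in Python, not an existing library: `TopicView` is an invented name, plain lists stand in for the source and target topics, and the hard parts the post mentions (exactly-once semantics, distributed processing) are left out entirely.

```python
class TopicView:
    """Hypothetical lazy view over a topic; a plain list stands in for the topic."""

    def __init__(self, source, fns=()):
        self._source = source
        self._fns = tuple(fns)

    def map(self, fn):
        # Composing a transformation returns a new view; nothing is consumed yet.
        return TopicView(self._source, self._fns + (fn,))

    def __iter__(self):
        for message in self._source:
            for fn in self._fns:
                message = fn(message)
            yield message

    def write_to(self, target):
        # Only here are messages read, transformed, and appended to the target.
        target.extend(self)
        return target


source_topic = [1, 2, 3]
target_topic = (
    TopicView(source_topic)
    .map(lambda x: x * 2)
    .map(lambda x: x + 1)
    .write_to([])
)
print(target_topic)  # [3, 5, 7]
```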

In my opinion, rather than trying to support everything at once (every pubsub including gcloud pubsub, for example, or databases like Oracle, BigQuery, Presto…), the ideal way to bootstrap Julia is to support one of each very well (well documented, feature complete so people can do everything, and making good use of Julia’s strengths to sell the language), even if that initially means some infrastructure lock-in. And since Julia’s main strength is composability, hopefully the high-level tools for pubsubs, databases, or endpoints can then have their backends swapped out to support all the other vendors with much less effort.

I made a prototype of a SQL builder for Julia a while ago, but I got overwhelmed with work right after. I really want to go back and make it a thing. And maybe something over the Kafka wrapper.

3 Likes

I had a use case for Kafka as well. Given that we don’t have a full-featured and stable implementation, I ended up going with PyCall + pykafka. It works beautifully but is definitely less satisfying than a 100% Julia solution.

2 Likes

@tk3369 I finally got around to exploring this more and wrote up some thoughts here: Data Engineering in Julia, everything you need to start creating Data Pipelines 🧑‍🔧🪠 | by Logan Kilpatrick | Jan, 2022 | Medium

TL;DR: Julia can do a lot and is a natural choice for DE. There are still sharp edges, but it doesn’t appear there are many massive gaps.

9 Likes

Oh @logankilpatrick !
One thing that I’d also suggest you highlight is a great project that has been a game changer for me called FunSQL.jl.
Basically you can write a query in a Julia Domain Specific Language, use all the niceties of Julia, and then convert your query to nearly any type of SQL flavor.
It has blown my mind and I use it in a variety of projects!

6 Likes

That’s a great writeup! Thanks, I really enjoyed reading it. When it comes to data engineering, I feel that Kedro (Kedro: A New Tool For Data Science | by Jo Stichbury | Towards Data Science) is really something to aspire to. Seems like a data engineer’s dream. I would love to see something like this materialize in the Julia community too. I actually believe Julia has a bright future for data engineering purposes. 💪🏻

1 Like

At my last job, I had a very successful project that built a distributed data pipeline. The entire process is written in Julia and runs on Kubernetes. I certainly think that Julia is the right tool for data engineering jobs.

6 Likes

I think that, depending on what you actually need to do, the answer can be yes or no.

If you wish to connect to everything that Apache Airflow can connect to (just to pick an example), the answer may be no. Documentation | Apache Airflow

In many other situations, where the alternative solution ends up doing things in Python and is inefficient, the answer may be yes.

It’s not clear where the boundary lies. What is certain is that Julia does not need to do everything with Julia packages to be successful in this domain, because its interoperability is good.

2 Likes