Re-writing a ruby side project into julia to learn

I have a few side projects which are running fine, but never let stability get in the way of learning a new language! I’ve been using julia for two weeks now and I figure an easy way into learning more is to port the projects. In a few hours on Friday night, I ported the code and ran some performance comparisons between the current ruby scripts and the new julia versions. This may be comparing apples to trains, and none of this is statistically valid.

The Background

[my desktop with 48 cores]⇋[gigeth lan]⇋[database server]
                               ⇲⇋{internet}⇋[HTTPS API]

Above: a poor text diagram of the physical setup.

The script looks up geographic information about IP addresses against an API and inserts the resulting JSON into a database. It uses a few functions (methods in ruby parlance) to create a todo list of IP addresses to resolve, fetch the geoip info, and insert it into the db. Repeat this millions of times.

For performance, the API server supports up to 50 connections per API key, so I use the parallel gem to spin up 50 threads in front of an each iterator:

require 'parallel'
# Resolve each address, with up to 50 requests in flight at once
Parallel.each(queue, in_threads: 50) do |ipaddress|
  get_results(ipaddress)
end

I pretty much ported it to julia last night. The parallel part is still on the todo list because the database driver for julia isn’t thread-safe. Time to write a connection pool in julia! Results in a later post. For the queue, I’m thinking of a simple vector, or actually using Deque from DataStructures.jl. Another option is to write the JSON results to a queue and have a function pull from the queue and write to the db. It seems HTTP.jl is thread-safe.
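
To make the "write the JSON results to a queue" option concrete, here is a minimal sketch of that shape in ruby, to match the script above. It is only an illustration, not the real code: fetch_geoip and insert_row are hypothetical stand-ins for the API call and the database insert, and the array stands in for the database. Several fetcher threads push JSON onto a queue and a single writer thread drains it, so only one thread ever touches the (non-thread-safe) db handle. In julia, a Channel could presumably play the role of Queue here.

```ruby
require 'json'

# Hypothetical stand-in for the real HTTPS API call; returns a JSON string.
def fetch_geoip(ipaddress)
  JSON.generate({ 'ip' => ipaddress, 'country' => 'ZZ' })
end

# Hypothetical stand-in for the real database insert.
def insert_row(store, json)
  store << JSON.parse(json)
end

def resolve_all(ip_addresses, workers: 4)
  todo = Queue.new     # addresses still to resolve
  results = Queue.new  # JSON waiting to be written
  ip_addresses.each { |ip| todo << ip }
  todo.close           # pop returns nil once a closed queue is empty

  store = []           # only the writer thread touches this

  # Single writer: drains results until the queue is closed and empty.
  writer = Thread.new do
    while (json = results.pop)
      insert_row(store, json)
    end
  end

  # Fetchers: pull addresses, push JSON; they never touch the store.
  fetchers = Array.new(workers) do
    Thread.new do
      while (ip = todo.pop)
        results << fetch_geoip(ip)
      end
    end
  end

  fetchers.each(&:join)
  results.close        # tells the writer no more results are coming
  writer.join
  store
end

rows = resolve_all(%w[192.0.2.1 192.0.2.2 198.51.100.7])
puts rows.length
```

The trade-off against a connection pool is that all writes are serialized through one connection. At around 1 ms per db call versus around 52 ms per API call, that may well be fast enough.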

The Performance Comparison

This is all running on the same setup, same hardware, same everything, except languages.

Ruby

On average, the data load from the db takes 35 minutes for 50 million ip addresses. Each call to/from the db takes around 119 ms. The actual get_results(ipaddress) averages 154 ms. With 50 parallel workers running, it takes 80 hours or so to complete the list.

Julia

On average (according to BenchmarkTools.jl), the data load from the db takes 2:30 (MM:SS) for the same 50 million ip addresses. Each call to/from the db is around 1 ms. The actual get_results(ipaddress) averages 52 ms. With one process running, ProgressMeter is estimating 24 days to resolve.

I’m amazed how much faster julia is overall. Ruby’s array/set/hash memory allocation is painfully slow, and I think that’s where the 35 minutes is spent. The db connection being 119x faster surprised me, though, as did the 3x speedup in the https api call/fetch.
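
As a rough sanity check on the totals, the per-call averages can be multiplied out. This sketch assumes one API call plus one db call per address and perfect scaling across workers, neither of which was measured. Under those assumptions ruby comes out at about 76 hours (close to the observed 80) and single-process julia at about 31 days (somewhat above ProgressMeter's 24-day estimate). The last figure, about 15 hours, is a purely hypothetical extrapolation of what julia might do with 50 concurrent workers.

```ruby
# Wall-clock seconds to resolve `addresses` items when each costs
# per_address_ms and `workers` run concurrently.
# Assumes perfect scaling across workers, which real runs won't quite reach.
def estimated_seconds(addresses, per_address_ms, workers)
  addresses * per_address_ms / 1000.0 / workers
end

addresses = 50_000_000

# Assumption: one API call plus one db call per address.
ruby_hours     = estimated_seconds(addresses, 154 + 119, 50) / 3600
julia_days     = estimated_seconds(addresses, 52 + 1, 1) / 86_400
julia_hours_50 = estimated_seconds(addresses, 52 + 1, 50) / 3600

puts format('ruby, 50 workers:  %.1f hours', ruby_hours)
puts format('julia, 1 worker:   %.1f days', julia_days)
puts format('julia, 50 workers: %.1f hours (hypothetical)', julia_hours_50)
```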

Now to write a connection pool in julia.

That’s pretty amazing. If you had just described the problem to me, I would have guessed that it’d be about equally fast in any mainstream language – there’s nothing that stands out as a case where Julia should have an advantage.