I have a few side projects which are running fine, but never let stability get in the way of learning a new language! I’ve been using Julia for two weeks now, and I figure an easy way into learning more is to port the projects. In a few hours on Friday night, I ported the code and ran some performance comparisons between the current Ruby scripts and the new Julia versions. This may be comparing apples to trains, and none of this is statistically valid.
The Background
[my desktop with 48 cores]⇋[gigeth lan]⇋[database server]
⇲⇋{internet}⇋[HTTPS API]
Above: a poor text diagram of the physical setup.
The script looks up geographic information about IP addresses against an API and inserts the resulting JSON into a database. It uses a few functions (methods in Ruby parlance) to create a todo list of IP addresses to resolve, fetch the GeoIP info, and insert it into the db. Repeat this millions of times.
The API server supports up to 50 connections per API key, so for performance I use the parallel gem to run 50 workers in front of an each iterator:
Parallel.each(queue, in_threads: 50) do |ipaddress|
  get_results(ipaddress)
end
I pretty much ported it to Julia last night. The parallel part is still on the todo list, because the database driver for Julia isn’t thread-safe. Time to write a connection pool in Julia! Results for a later post. For the work queue, I’m thinking of a simple vector, or actually using a deque from DataStructures.jl. Another option is to write the JSON results to a queue and have a function pull from the queue and write to the db. It seems HTTP.jl is thread-safe.
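Here is a minimal sketch of that last option, written in Ruby since that is what the existing script uses. Several workers fetch results and push JSON onto a queue, and a single writer is the only thing that touches the database, so the driver never sees concurrent access. The names fetch_geoip and resolve_all are hypothetical, the "database" is just an in-memory array, and the fetch is a stub rather than a real API call. A Julia version would presumably use a Channel for the same job.

```ruby
require "json"

# Hypothetical stand-in for the HTTPS API call; returns a JSON string.
def fetch_geoip(ipaddress)
  JSON.generate({ "ip" => ipaddress, "country" => "ZZ" })
end

# Resolve every address with several fetch workers and one db writer.
def resolve_all(ipaddresses, worker_count: 4)
  todo = Queue.new
  ipaddresses.each { |ip| todo << ip }
  todo.close # pop returns nil once a closed queue is empty

  results = Queue.new
  db = [] # stand-in for the real database table

  # The only thread that writes to the "database".
  writer = Thread.new do
    while (json = results.pop)
      db << JSON.parse(json)
    end
  end

  # Fetch workers only talk to the API and the results queue.
  workers = Array.new(worker_count) do
    Thread.new do
      while (ip = todo.pop)
        results << fetch_geoip(ip)
      end
    end
  end

  workers.each(&:join)
  results.close # lets the writer finish after draining the queue
  writer.join
  db
end
```

The trade-off is that all writes are serialized through one connection, which only hurts if the db write is slow relative to the API fetch.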
The Performance Comparison
This is all running on the same setup, same hardware, same everything, except the language.
Ruby
On average, the data load from the db takes 35 minutes for 50 million IP addresses. Each call to/from the db takes around 119 ms. The actual get_results(ipaddress) call averages 154 ms. With 50 workers running, it takes 80 hours or so to complete the list.
Julia
On average (according to BenchmarkTools.jl), the data load from the db takes 2:30 (MM:SS) for the same 50 million IP addresses. Each call to/from the db is around 1 ms. The actual get_results(ipaddress) call averages 52 ms. With one process running, ProgressMeter is estimating 24 days to resolve.
I’m amazed how much faster Julia is overall. Ruby’s array/set/hash memory allocation is painfully slow, and I think that’s where the 35 minutes is spent. The db calls being 119x faster surprised me, though, as did the 3x speedup in the HTTPS API call/fetch.
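For a quick sanity check, here are the numbers above side by side. The ratios come straight from the measurements in this post. The last figure assumes, optimistically, that 50 parallel workers would scale perfectly in Julia, which is untested.

```ruby
# Measurements quoted in this post.
ruby_db_ms = 119.0
julia_db_ms = 1.0
ruby_fetch_ms = 154.0
julia_fetch_ms = 52.0
ruby_load_s = 35 * 60       # 35 minutes
julia_load_s = 2 * 60 + 30  # 2:30 (MM:SS)

db_speedup = ruby_db_ms / julia_db_ms
fetch_speedup = ruby_fetch_ms / julia_fetch_ms
load_speedup = ruby_load_s.to_f / julia_load_s

# Assumption: perfect linear scaling from 1 worker to 50.
julia_serial_days = 24
julia_parallel_hours = julia_serial_days * 24 / 50.0

puts "db call speedup:   #{db_speedup.round(2)}x"    # 119.0x
puts "api fetch speedup: #{fetch_speedup.round(2)}x" # 2.96x
puts "data load speedup: #{load_speedup.round(2)}x"  # 14.0x
puts "julia, 50 workers: #{julia_parallel_hours} hours (if scaling were perfect)" # 11.52
```

That best case is about 11.5 hours against the 80 hours or so Ruby takes today. Real scaling will be worse than linear, so treat it as an upper bound on the improvement.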
Now to write a connection pool in Julia.
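The shape of such a pool is small enough to sketch in Ruby first, as a rough outline of what the Julia version needs to do. This is not a finished design. SimplePool and FakeConnection are hypothetical names, and FakeConnection stands in for a real database handle. The idea is a fixed set of connections sitting in a thread-safe queue: a worker checks one out, uses it, and always puts it back.

```ruby
# Hypothetical stand-in for a real database connection.
class FakeConnection
  attr_reader :id

  def initialize(id)
    @id = id
  end
end

# A fixed-size pool backed by a thread-safe Queue.
class SimplePool
  # The block builds one connection per slot.
  def initialize(size)
    @available = Queue.new
    size.times { |i| @available << yield(i) }
  end

  # Check a connection out, yield it, and always return it to the pool.
  def with
    conn = @available.pop # blocks until a connection is free
    yield conn
  ensure
    @available << conn if conn
  end

  def available_count
    @available.size
  end
end

# Ten workers share three connections.
pool = SimplePool.new(3) { |i| FakeConnection.new(i) }
threads = Array.new(10) { Thread.new { pool.with { |conn| conn.id } } }
ids = threads.map(&:value)
```

With a pool like this, the number of connections caps db concurrency independently of how many fetch workers are running.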