There’s been a challenge in the Java community that has generated a considerable amount of interest, I’d say: The One Billion Row Challenge, “A fun exploration of how quickly 1B rows from a text file can be aggregated with Java”. While the evaluation part of the challenge is not available to other languages, there were a few submissions in the Show and tell section using all kinds of languages (and AWK).
Long story short, it’s about aggregating temperature measurements per weather station (min, average and max), and you’re not allowed to use external dependencies.
The baseline version of the program (single-threaded, standard components) runs in about 3min 15sec on my laptop. Using standard Java components (standard hash map, standard hash code, etc.) and paying attention to the constraints, I was able to get it to around 9 seconds on my laptop and then to 7.563 seconds on the evaluation machine (so pretty comparable; the evaluation machine has 8 cores, same as my laptop). The best results are under 2 seconds.
Trying to make it fast in Julia sounds like a fun/useful exercise? Although it seems like “no external dependencies allowed” goes against the grain of the language.
Starting with a baseline though, what would it be? The code below is a version that passes all tests of the challenge when plugged in appropriately (more on that at the bottom of the post, if you make it there). For the 1B row file it takes about 6min 45sec on my laptop. Does it look like a plausible baseline version, or does it have some problems that no reasonable baseline would have anyway?
mutable struct MeasurementsStats
    min::Float32; max::Float32; sum::Float64; count::Int64
end

roundjava(it, digits) = round(it, RoundNearestTiesUp; digits = digits)
roundjava(it) = roundjava(it, 1)

function calculate_average(measurements)
    station_measurements = Dict{String, MeasurementsStats}()
    open(measurements, "r") do file
        for measurement in eachline(file)
            station, temperature = split(measurement, ";")
            temperature = parse(Float32, temperature)
            if haskey(station_measurements, station)
                station_stats = station_measurements[station]
                station_stats.min = min(station_stats.min, temperature)
                station_stats.max = max(station_stats.max, temperature)
                station_stats.sum += temperature
                station_stats.count += 1
            else
                station_measurements[station] = MeasurementsStats(temperature, temperature, temperature, 1)
            end
        end
    end
    results::Vector{Any} = collect(station_measurements)
    sort!(results, by = it -> it[1])
    map!(results, results) do (station, stats)
        average = roundjava(stats.sum/stats.count, 2)
        "$station=$(roundjava(stats.min))/$(roundjava(average))/$(roundjava(stats.max))"
    end
    println("{", join(results, ", "), "}")
end

calculate_average(isempty(ARGS) ? "./measurements.txt" : ARGS[1])
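A side note on the roundjava helper, since it is easy to overlook: as far as I can tell, it is there to reproduce the behaviour of Java's Math.round, which rounds ties towards positive infinity, whereas Julia's default rounding mode sends ties to the nearest even digit. The minimal sketch below only repeats the helper from the code above and prints both roundings for two values that are exact ties at one decimal place; the sample values 0.25 and -0.25 are my own picks for illustration.

```julia
# Same helper as in the post: round to nearest, ties towards +Inf, one decimal by default.
roundjava(it, digits) = round(it, RoundNearestTiesUp; digits = digits)
roundjava(it) = roundjava(it, 1)

# 0.25 and -0.25 are exactly representable in binary floating point,
# so they are genuine ties when rounding to one decimal place.
for x in (0.25, -0.25)
    default_rounding = round(x; digits = 1)   # Julia default: ties to even
    java_like = roundjava(x)                  # ties towards +Inf
    println(x, " => default: ", default_rounding, ", roundjava: ", java_like)
end
```

The two modes disagree only on the positive tie here, which is the kind of difference that can make an otherwise correct implementation fail the challenge's output comparison.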
And if you made it this far, perhaps you’re interested in exploring the challenge. I have a fork of it that (hopefully) makes it easy for non-Java people to explore it:
- Sourcing prepare_3j5a_julia.sh should install Java and build the project.
- ./test.sh 3j5a_julia should then run the tests.
- ./create_measurements.sh N should create a sample measurements.txt file. Note that the 1B-row file is about 13G.
- Julia code can be placed under src/main/julia.
- To plug a Julia implementation into the tests: copy prepare_3j5a_julia.sh, replacing 3j5a_julia with your handle; copy calculate_average_3j5a_julia.sh, again using your handle, and update the content of the file to use your handle instead of 3j5a after dev.morling.onebrc.CalculateAverage_Julia (the latter is just a Java class that calls Julia, using the handle to identify the file to run).