The One Billion Row Challenge

3j5a · January 31, 2024, 10:00pm

There’s been a challenge in Java community that generated a considerable amount of interest I’d say: The One Billion Row Challenge “A fun exploration of how quickly 1B rows from a text file can be aggregated with Java”. While evaluation part of the challenge is not available to other languages, there were a few submissions in Show and tell section using all kinds of languages (and AWK ).

Long story short, it’s about aggregating and calculating an average temperature for weather stations and you’re not allowed to use external dependencies.

Baseline version of the program (single threaded, standard components) runs in about 3min 15sec on my laptop, and using standard Java components (standard hash map, standard hash code etc) and paying attention to the constraints I was able to get it to around 9 seconds on my laptop and then to 7.563 seconds on the evaluation machine (so pretty comparable; evaluation machine is 8 cores, same as my laptop). Best results are under 2 seconds.

Trying to make it fast in Julia sounds like a fun/useful exercise? Although it seems like “no external dependencies allowed” goes against the language

Starting with a baseline though, what would it be? Code below is a version that passes all tests of the challenge, when plugged in appropriately (more on that at the bottom of the post if you make it there). For the 1B row file it takes about 6min 45sec on my laptop, does it look like a plausible baseline version or does it have some problems that no reasonable baseline would have anyway?

mutable struct MeasurementsStats
    min::Float32; max::Float32; sum::Float64; count::Int64
end

roundjava(it, digits) = round(it, RoundNearestTiesUp; digits = digits)
roundjava(it) = roundjava(it, 1)

function calculate_average(measurements)
    station_measurements = Dict{String, MeasurementsStats}()    
    open(measurements, "r") do file
        for measurement in eachline(file)
            station, temperature = split(measurement, ";")
            temperature = parse(Float32, temperature)
            if haskey(station_measurements, station)
                station_stats = station_measurements[station]
                station_stats.min = min(station_stats.min, temperature)
                station_stats.max = max(station_stats.max, temperature)
                station_stats.sum += temperature
                station_stats.count += 1
            else
                station_measurements[station] = MeasurementsStats(temperature, temperature, temperature, 1)
            end        
        end
    end
    results::Vector{Any} = collect(station_measurements)
    sort!(results, by = it -> it[1])
    map!(results, results) do (station, stats)
         average = roundjava(stats.sum/stats.count, 2)
        "$station=$(roundjava(stats.min))/$(roundjava(average))/$(roundjava(stats.max))"
    end        
    println("{", join(results, ", "), "}")                  
end

calculate_average(isempty(ARGS) ? "./measurements.txt" : ARGS[1])

And if you made it this far, perhaps you’re interested in exploring the challenge. I have a fork of it that (hopefully) makes it easy for non Java people to explore it:

sourcing prepare_3j5a_julia.sh should install Java and build the project
./test.sh 3j5a_julia should run the tests then
./create_measurements.sh N should create sample measurements.txt file. Note for 1B rows one it’s about 13G
Julia code can be placed under src/main/julia
to plug Julia implementation into tests: copy prepare_3j5a_julia.sh replacing 3j5a_julia with your handle; copy calculate_average_3j5a_julia.sh again using your handle and update the content of the file to use your handle instead of 3j5a after dev.morling.onebrc.CalculateAverage_Julia (latter is just a Java class calling Julia using the handle to identify the file to run)

abraemer · February 1, 2024, 9:59am

Looks reasonable to me except for this part. The type instability seems unnecessary:

3j5a:

    results::Vector{Any} = collect(station_measurements)
    sort!(results, by = it -> it[1])
    map!(results, results) do (station, stats)
         average = roundjava(stats.sum/stats.count, 2)
        "$station=$(roundjava(stats.min))/$(roundjava(average))/$(roundjava(stats.max))"
    end

You could either allocate a new vector for the strings or even use a generator to never materialize the vector of formatted strings:

results = collect(station_measurements)
sort!(results, by = it -> it[1])
# perhaps define this helper function on toplevel to keep things a bit tidier
javaformat((station, stats)) = let average = roundjava(stats.sum/stats.count, 2)
    "$station=$(roundjava(stats.min))/$(roundjava(average))/$(roundjava(stats.max))"
end
print("{")
join(stdout, (javaformat(r) for r in results), ", ") # prints directly instead of forming a giant string first
println("}")

Depending on how many stations there are this could save some seconds, but I don’t expect it to be a major factor.
Also is there a reason you roundjava the average twice?

Dan · February 1, 2024, 10:31am

Instead of:

Try:

            pos = findfirst(';', measurement)
            station = @view(measurement[1:pos-1])
            temperature = parse(Float32, @view(measurement[pos+1:end]))

Dan · February 1, 2024, 10:44am

Instead of:

3j5a:

if haskey(station_measurements, station)
                station_stats = station_measurements[station]
                station_stats.min = min(station_stats.min, temperature)
                station_stats.max = max(station_stats.max, temperature)
                station_stats.sum += temperature
                station_stats.count += 1
            else
                station_measurements[station] = MeasurementsStats(temperature, temperature, temperature, 1)
            end

Try:

stats = get!(station_measurements, station) do
    MeasurementsStats(temperature, temperature, 0, 0)
end
stats.min = min(stats.min, temperature)
stats.max = max(stats.max, temperature)
stats.sum += temperature
stats.count += 1

3j5a · February 1, 2024, 11:10am

ah, @view right, seems really reasonable even for a naive/baseline version. thnx

3j5a · February 1, 2024, 11:42am

That roundjava twice is there to let the tests pass essentially.

$ ./test.sh 3j5a_julia
.....
Validating calculate_average_3j5a_julia.sh -- src/test/resources/samples/measurements-rounding.txt
1c1
< ham;14.6;25.4;33.6
---
> ham;14.6;25.5;33.6

Bottom is what original Java implementation expects for the average - 25.5, and the top 25.4 is what the version I’ve provided ends being in Julia without that second round call. A better way would probably be just use something like

roundjava(10*stats.sum/stats.count)/10

but that’s still 2 round calls.

jules · February 1, 2024, 11:57am

Here’s one version I had made with external dependencies. It would just kind of be annoying to build your own dict just to use the token update approach. One Billion Rows Challenge Julia · GitHub

3j5a · February 1, 2024, 12:11pm

I had a feeling @view and unicode strings may be up to something

Validating calculate_average_3j5a_julia.sh -- src/test/resources/samples/measurements-20.txt
ERROR: LoadError: StringIndexError: invalid index [23], valid nearby indices [21]=>'️', [24]=>';'
Stacktrace:
  [1] string_index_err(s::String, i::Int64)
    @ Base ./strings/string.jl:12
  [2] SubString{String}(s::String, i::Int64, j::Int64)
    @ Base ./strings/substring.jl:35
  [3] SubString{String}(s::String, i::Int64, j::Int64)
    @ Base ./strings/substring.jl:41 [inlined]
  [4] SubString{String}(s::String, i::Int64, j::Int64)
    @ Base ./strings/substring.jl:43 [inlined]
  [5] view
    @ ./strings/substring.jl:53 [inlined]
  [6] (::var"#1#5"{Dict{String, MeasurementsStats}})(file::IOStream)
    @ Main /media/konstantin/work/tmp-1brc/1brc/src/main/julia/CalculateAverage_3j5a.jl:18
  [7] open(::var"#1#5"{Dict{String, MeasurementsStats}}, ::String, ::Vararg{String}; kwargs::@Kwargs{})
    @ Base ./io.jl:396
  [8] open
    @ ./io.jl:393 [inlined]
  [9] calculate_average(measurements::String)
    @ Main /media/konstantin/work/tmp-1brc/1brc/src/main/julia/CalculateAverage_3j5a.jl:15
 [10] top-level scope
    @ /media/konstantin/work/tmp-1brc/1brc/src/main/julia/CalculateAverage_3j5a.jl:36
in expression starting at /media/konstantin/work/tmp-1brc/1brc/src/main/julia/CalculateAverage_3j5a.jl:36
0a1,20
> Abéché1️⃣🐝🏎️;27.3;27.3;27.3
> Almaty1️⃣🐝🏎️;15.3;15.3;15.3
> Baghdad1️⃣🐝🏎️;26.0;26.0;26.0
> Bangkok1️⃣🐝🏎️;25.6;25.6;25.6

kristoffer.carlsson · February 1, 2024, 12:13pm

https://twitter.com/thomaswue/status/1752817903483724013 suggests they got it down to 300ms in Java.

3j5a · February 1, 2024, 12:16pm

but then I think it’s fair to say it’s more about hacking the problem hard than Java per se, although it’s still what the language gives you sort of out of the box (and GraalVM does its for sure)

jules · February 1, 2024, 12:56pm

I feel like these io-bound problems always end up boiling down to OS wizardry because defaults for page sizes or whatever are suboptimal and can be tweaked for the problem at hand.

3j5a · February 1, 2024, 1:24pm

it actually stops being an io-bound problem pretty soon as you dig into it imho

mkoculak · February 1, 2024, 1:33pm

Yeah, that’s the part confusing me 13GB in 300ms would mean ~40GB/s read speeds, which does not sound plausible. Can someone explain what is the trick here? I looked through the top two solutions yesterday, but since in Java and super hackish, I did not understand one bit.

3j5a · February 1, 2024, 2:14pm

Nice.
I did tweak your code to run it with 8 threads on my laptop… and while obviously tests don’t pass (e.g. expected formatting is off), it ran in about 29sec for me:

 Zanzibar City=26.002252093912084/-25.8/78.0, Zürich=9.305815037868426/-38.6/59.6, Ürümqi=7.394862869786727/-41.2/54.3, İzmir=17.889696814720303/-34.3/66.3}
./calculate_average_jkrumbiegel.sh  205.48s user 2.23s system 713% cpu 29.103 total

3j5a · February 1, 2024, 2:37pm

I think the main thing, besides all the bit-masking-jitsu, is direct memory access usng things like

java.lang.foreign.Arena
sun.misc.Unsafe

you can see in the top submission. The first one is part of a preview/new API that controls the “lifecycle of native memory segments, providing both flexible allocation and timely deallocation”. The second one, well it’s a kind of unsafe/older/current version of it.

I don’t know the specs for the machine where 300ms was achieved, but 1.535sec on the challenge evaluation machine (8 cores of 32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM) is

gdalle · February 1, 2024, 3:13pm

I think there was another thread on this challenge but I can’t seem to find it?

jules · February 1, 2024, 3:19pm

What does direct memory access mean in this context? The file is located on an SSD so you can’t surpass SSD read speed, even if you memory map the SSD into your memory address space as if it was RAM.

rocco_sprmnt21 · February 1, 2024, 3:54pm

Just to note that for temperatures -Inf =-273,15 [°C]

More seriously to ask how to recover billion row file

3j5a · February 1, 2024, 4:17pm

ah, well it may indeed be something Java specific, where you don’t normally go out and grab some native memory, hence names like “Unsafe” and “foreign”. And challenge evaluation rules actually state: “Programs are run from a RAM disk (i.o. the IO overhead for loading the file from disk is not relevant), using 8 cores of the machine”.

3j5a · February 1, 2024, 4:21pm

If you want to generate 1B file, take a look at 1brc/prepare_3j5a_julia.sh at main · 3j5a/1brc · GitHub it should ( ) help installing Java and building the project, then

./create_measurements.sh 1000000000

Topic		Replies	Views
How to optimise and be faster than Java? Performance	24	2043	September 4, 2018
Julia programs now shown on benchmarks game website Community announcement	144	13718	December 3, 2019
Julia needs work at "Benchmark Game": numbers seem off up to 100x slower, maybe "Julia AOT" entries needed? Community	10	1842	November 23, 2018
At a glance, why don't I have better performances between Julia and PHP? Performance	24	1759	July 30, 2020
Help to get my slow Julia code to run as fast as Rust/Java/Lisp Performance	100	4616	August 6, 2021

The One Billion Row Challenge

Related topics