The One Billion Row Challenge

jules · February 1, 2024, 4:25pm

Oh the file itself is loaded into RAM? That will definitely change my timings.

Dan · February 1, 2024, 4:51pm

yep, standard string manipulation error. @view not to blame.
Fix by using:

station = @view(measurement[1:prevind(measurement,pos)])

Using this @view approach should improve speed.

3j5a · February 1, 2024, 9:06pm

prevind I should have figure it out. thnx.
now, don’t think it can go any “baseline” than this then and it runs in about 4min 47sec on my laptop.

ndé=-24.3/23.8/76.3, Yellowknife=-54.2/-4.3/43.7, Yerevan=-43.4/12.4/61.4, Yinchuan=-41.3/9.0/61.1, Zagreb=-38.3/10.7/62.9, Zanzibar City=-21.8/26.0/75.3, Zürich=-38.6/9.3/61.5, Ürümqi=-43.0/7.4/58.1, İzmir=-34.3/17.9/70.2}
./calculate_average_3j5a_julia.sh  281.46s user 5.44s system 100% cpu 4:46.84 total

below is full updated version of it:

mutable struct MeasurementsStats
    min::Float32; max::Float32; sum::Float64; count::Int64
end

roundjava(it, digits) = round(it, RoundNearestTiesUp; digits = digits)
roundjava(it) = roundjava(it, 1)

function outformat((station, stats))
     average = roundjava(10*stats.sum/stats.count)/10
    "$station=$(roundjava(stats.min))/$(roundjava(average))/$(roundjava(stats.max))"
end

function calculate_average(measurements)
    station_measurements = Dict{String,MeasurementsStats}()
    open(measurements, "r") do file
        for measurement in eachline(file)
            pos = findfirst(';', measurement)
            station = @view(measurement[1:prevind(measurement, pos)])
            temperature = parse(Float32, @view(measurement[pos+1:end]))
            stats = get!(station_measurements, station) do
                MeasurementsStats(temperature, temperature, 0, 0)
            end
            stats.min = min(stats.min, temperature)
            stats.max = max(stats.max, temperature)
            stats.sum += temperature
            stats.count += 1
        end
    end
    results = collect(station_measurements)
    sort!(results, by = it -> it[1])
    print("{")
    join(stdout, (outformat(result) for result in results), ", ")
    println("}")
end

calculate_average(isempty(ARGS) ? "./measurements.txt" : ARGS[1])

rdavis120 · February 1, 2024, 11:15pm

From the prior thread on this, the baseline could be the duckdb entry loaded into a DataFrame. Preferably with the latest dev version of duckdb because the version on the registry had an issue previously.

Dan · February 2, 2024, 12:20am

The mutable struct might cause some allocation slowdown in the big loop. Replacing it with a Tuple might help. The following is untested, but should be fixable:

function calculate_average(measurements)
    station_measurements = Tuple{Float32, Float32, Float64, Int}[]
    # use sizehint! on station_measurements to avoid some allocations?
    station_index = Dict{String,Int}()
    station_count = 0
    open(measurements, "r") do file
        for measurement in eachline(file)
            pos = findfirst(';', measurement)
            station = @view(measurement[1:prevind(measurement, pos)])
            temperature = parse(Float32, @view(measurement[pos+1:end]))
            idx = get!(station_index, station) do
                push!(station_measurements, (temperature, temperature, 0.0, 0))
                station_count += 1
            end
            stats = station_measurements[idx]
            station_measurements[idx] = (min(stats.min, temperature),
                                         max(stats.max, temperature),
                                         stats.sum + temperature,
                                         stats.count + 1)
        end
    end
    print("{")
    for (i,p) in enumerate(sort(stations_measurements))
        print(i > 1 ? ", " : "", outformat(p))
    end
    println("}")
end

outformat becomes:

function outformat((station, stats))
    average = roundjava(10*stats[3]/stats[4])/10
    "$station=$(roundjava(stats[1]))/$(roundjava(average))/$(roundjava(stats[2]))"
end

foobar_lv2 · February 2, 2024, 9:09am

They explain that in the challenge somewhere. Official measurement is from cold-start of the program, but the actual data file is already in page-cache.

Ideally no IO should hit the bus and the data is processed zero-copy (just mapped).

For julia, you should probably modify the requirements, i.e. report the following performance numbers:

Naive (script runs, i.e. needs compilation)
The standard julia way. You run the code on a tiny sample datafile in order to get everything compiled, then you time running it on the actual data.
With precompilation (i.e. you make a package and precompile, then run the script)
With prepared sysimage.

Of these, (1) and (2) are low-effort, and (1) will have the worst timings (startup + compilation included) and (2) will have the best timings (startup + compilation excluded).

The way I understand it, the 1brc benchmark ensures that stuff is in the page-cache by running the entire thing multiple times on a linux machine with ample memory, without any flushing. But afaiu you can also just put your data file into /tmp on a machine with swap disabled.

I’d like to note that 1brc invites a complete anti-pattern: If the file is not already loaded into page-cache, then a naive mmap on SSD will be worse that read, because you can’t feed the ssd read queue unless you massively multi-thread (page-faults block!). Consider MAP_POPULATE.

An excellent example is 1brc/src/main/java/dev/morling/onebrc/CalculateAverage_shipilev.java at main · gunnarmorling/1brc · GitHub

This is by far not the fastest submission to 1brc, but it is well-documented, careful and idiomatic. On modern java, it should work around that issue (it spawns enough tasks to fill the io queue if you are in a post-loom world), and it also takes care to work in settings where the data doesn’t even fit into main memory.

That being said, I have not verified that the shipilev submission actually works well in that setting

3j5a · February 2, 2024, 9:14am

Right, I see what you mean. Mutable struct is an attempt for naive but reasonable first approach to the problem… but maybe nobody in Julia land would start with it anyway? I was looking at the baseline implementation of the challenge, where results aggregator is a mutable struct (well kind of obviously, it’s a class with mutable fields), and results get added to a kind of standard map/dictionary.
It goes beyond a naive implementation very quickly, so it’s not a real “baseline” to the problem as it’s ridiculously easy to beat it. But I’m already learning things by just starting from it. Thnx!

vnegi10 · May 12, 2024, 3:46pm

Hi, I was about to make a post on 1BRC in Julia, and then found this discussion. Have there been any benchmarks shared already?

My implementation makes use of external packages with a best time of 89.459 s. I am sure there’s a lot more room for improvement. It would therefore be great to share ideas/solutions here.

jling · October 30, 2024, 10:09am

how fast is their disk… at that point should the benchmark starts with data in RAM?

foobar_lv2 · October 30, 2024, 12:58pm

The file starts in RAM. I think they don’t force it (e.g. by disabling swap and putting it on /tmp/), but it is a natural consequence of methodology, kernel and machine sizing, and it is known to them.

Topic		Replies	Views
How to optimise and be faster than Java? Performance	24	2092	September 4, 2018
Julia needs work at "Benchmark Game": numbers seem off up to 100x slower, maybe "Julia AOT" entries needed? Community	10	1866	November 23, 2018
Julia programs now shown on benchmarks game website Community announcement	144	13963	December 3, 2019
CSV read performance vs Pandas General Usage	29	8240	May 6, 2019
CSV Reading (rewrite in C?) Internals & Design	50	5131	October 1, 2018

The One Billion Row Challenge

Related topics