Oh the file itself is loaded into RAM? That will definitely change my timings.
Yep, standard string-indexing error; @view is not to blame. Fix it by using:

station = @view(measurement[1:prevind(measurement, pos)])

Using @view this way should still improve speed.
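For example, on a station name ending in a multi-byte character, pos - 1 lands inside the character, while prevind backs up to a valid index:

julia> s = "Yaoundé;27.4";   # 'é' is two bytes in UTF-8

julia> pos = findfirst(';', s)
9

julia> s[1:pos-1]            # byte 8 is the middle of 'é'
ERROR: StringIndexError: invalid index [8] ...

julia> s[1:prevind(s, pos)]  # prevind backs up to a valid character index
"Yaoundé"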
prevind, I should have figured that out. Thnx!
Now, I don’t think it can get any more “baseline” than this, and it runs in about 4 min 47 s on my laptop.
…ndé=-24.3/23.8/76.3, Yellowknife=-54.2/-4.3/43.7, Yerevan=-43.4/12.4/61.4, Yinchuan=-41.3/9.0/61.1, Zagreb=-38.3/10.7/62.9, Zanzibar City=-21.8/26.0/75.3, Zürich=-38.6/9.3/61.5, Ürümqi=-43.0/7.4/58.1, İzmir=-34.3/17.9/70.2}
./calculate_average_3j5a_julia.sh 281.46s user 5.44s system 100% cpu 4:46.84 total
Below is the full updated version:
mutable struct MeasurementsStats
    min::Float32; max::Float32; sum::Float64; count::Int64
end

# RoundNearestTiesUp matches Java's Math.round (ties go toward +Inf);
# Julia's default round sends ties to the nearest even number.
roundjava(it, digits) = round(it, RoundNearestTiesUp; digits = digits)
roundjava(it) = roundjava(it, 1)

function outformat((station, stats))
    average = roundjava(10 * stats.sum / stats.count) / 10
    "$station=$(roundjava(stats.min))/$(roundjava(average))/$(roundjava(stats.max))"
end

function calculate_average(measurements)
    station_measurements = Dict{String,MeasurementsStats}()
    open(measurements, "r") do file
        for measurement in eachline(file)
            pos = findfirst(';', measurement)
            # prevind handles multi-byte UTF-8 station names (Zürich, Yaoundé, ...)
            station = @view(measurement[1:prevind(measurement, pos)])
            temperature = parse(Float32, @view(measurement[pos+1:end]))
            stats = get!(station_measurements, station) do
                MeasurementsStats(temperature, temperature, 0, 0)
            end
            stats.min = min(stats.min, temperature)
            stats.max = max(stats.max, temperature)
            stats.sum += temperature
            stats.count += 1
        end
    end
    results = collect(station_measurements)
    sort!(results, by = it -> it[1])
    print("{")
    join(stdout, (outformat(result) for result in results), ", ")
    println("}")
end

calculate_average(isempty(ARGS) ? "./measurements.txt" : ARGS[1])
From the prior thread on this, the baseline could be the DuckDB entry loaded into a DataFrame, preferably with the latest dev version of DuckDB, because the version in the registry previously had an issue.
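For reference, an untested sketch of what that might look like with DuckDB.jl and DataFrames.jl (the CSV options and column names here are my assumptions, not from that thread):

using DuckDB, DataFrames

con = DBInterface.connect(DuckDB.DB)   # in-memory database
df = DBInterface.execute(con, """
    SELECT station,
           MIN(temperature)           AS min,
           ROUND(AVG(temperature), 1) AS mean,
           MAX(temperature)           AS max
    FROM read_csv('measurements.txt', delim = ';', header = false,
                  columns = {'station': 'VARCHAR', 'temperature': 'DOUBLE'})
    GROUP BY station
    ORDER BY station
""") |> DataFrame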
The mutable struct might cause some allocation slowdown in the big loop. Replacing it with a Tuple might help. The following is untested, so treat it as a sketch:
function calculate_average(measurements)
    # One flat vector of (min, max, sum, count) tuples; tuples are immutable,
    # so an update rewrites the slot instead of mutating a heap object.
    station_measurements = Tuple{Float32, Float32, Float64, Int}[]
    # use sizehint! on station_measurements to avoid some allocations?
    station_index = Dict{String,Int}()
    open(measurements, "r") do file
        for measurement in eachline(file)
            pos = findfirst(';', measurement)
            station = @view(measurement[1:prevind(measurement, pos)])
            temperature = parse(Float32, @view(measurement[pos+1:end]))
            idx = get!(station_index, station) do
                push!(station_measurements, (temperature, temperature, 0.0, 0))
                length(station_measurements)   # index of the slot just pushed
            end
            stats = station_measurements[idx]
            station_measurements[idx] = (min(stats[1], temperature),
                                         max(stats[2], temperature),
                                         stats[3] + temperature,
                                         stats[4] + 1)
        end
    end
    print("{")
    # sort by station name, then look each station's stats up via its index
    for (i, (station, idx)) in enumerate(sort!(collect(station_index); by = first))
        print(i > 1 ? ", " : "", outformat((station, station_measurements[idx])))
    end
    println("}")
end
outformat becomes:
function outformat((station, stats))
    average = roundjava(10 * stats[3] / stats[4]) / 10
    "$station=$(roundjava(stats[1]))/$(roundjava(average))/$(roundjava(stats[2]))"
end
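If you'd rather keep the stats.min / stats.sum field syntax, a NamedTuple element type should behave the same, since NamedTuples are just as immutable and inline-stored. Untested sketch:

const Stats = @NamedTuple{min::Float32, max::Float32, sum::Float64, count::Int}
station_measurements = Stats[]
# the update then reads as:
# station_measurements[idx] = (min = min(stats.min, temperature),
#                              max = max(stats.max, temperature),
#                              sum = stats.sum + temperature,
#                              count = stats.count + 1)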
They explain that in the challenge somewhere. Official measurement is from cold-start of the program, but the actual data file is already in page-cache.
Ideally no IO hits the bus and the data is processed zero-copy (it's just mapped).
For Julia, you should probably modify the requirements, i.e. report the following performance numbers:
- Naive (script runs, i.e. needs compilation)
- The standard Julia way: you run the code on a tiny sample data file in order to get everything compiled, then you time running it on the actual data.
- With precompilation (i.e. you make a package and precompile, then run the script)
- With prepared sysimage.
Of these, (1) and (2) are low-effort, and (1) will have the worst timings (startup + compilation included) and (2) will have the best timings (startup + compilation excluded).
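For example, (2) can be as simple as (assuming calculate_average is already defined in the session, and sample.txt is a hypothetical tiny file in the same format):

calculate_average("sample.txt")               # warm-up: compiles everything
@time calculate_average("measurements.txt")   # timed run excludes compilation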
The way I understand it, the 1brc benchmark ensures that stuff is in the page-cache by running the entire thing multiple times on a Linux machine with ample memory, without any flushing. But afaiu you can also just put your data file into /tmp on a machine with swap disabled.
I’d like to note that 1brc invites a complete anti-pattern: if the file is not already loaded into the page-cache, then a naive mmap on an SSD will be worse than read, because you can’t feed the SSD read queue unless you massively multi-thread (page faults block!). Consider MAP_POPULATE.
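As far as I know, Julia's Mmap stdlib doesn't expose MAP_POPULATE, so a low-effort approximation is to touch each page once after mapping; sequential touches at least let the kernel's readahead kick in. Untested sketch:

using Mmap

# Touch one byte per page to pre-fault the mapping (poor man's MAP_POPULATE).
function prefault(data::Vector{UInt8}; pagesize = 4096)
    acc = zero(UInt8)
    @inbounds for i in 1:pagesize:length(data)
        acc ⊻= data[i]   # the load forces the page fault
    end
    return acc           # keep the loop from being optimized away
end

data = open(io -> Mmap.mmap(io, Vector{UInt8}), "measurements.txt")
prefault(data)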
An excellent example is CalculateAverage_shipilev.java in the gunnarmorling/1brc repo on GitHub (1brc/src/main/java/dev/morling/onebrc/CalculateAverage_shipilev.java).
This is by far not the fastest submission to 1brc, but it is well-documented, careful and idiomatic. On modern java, it should work around that issue (it spawns enough tasks to fill the io queue if you are in a post-loom world), and it also takes care to work in settings where the data doesn’t even fit into main memory.
That being said, I have not verified that the shipilev submission actually works well in that setting.
Right, I see what you mean. The mutable struct was an attempt at a naive-but-reasonable first approach to the problem… but maybe nobody in Julia land would start with it anyway? I was looking at the baseline implementation of the challenge, where the results aggregator is a mutable struct (well, kind of obviously; it’s a class with mutable fields), and results get added to a fairly standard map/dictionary.
It goes beyond a naive implementation very quickly, so it’s not a real “baseline” for the problem, as it’s ridiculously easy to beat. But I’m already learning things just by starting from it. Thnx!
Hi, I was about to make a post on 1BRC in Julia, and then found this discussion. Have there been any benchmarks shared already?
My implementation makes use of external packages with a best time of 89.459 s. I am sure there’s a lot more room for improvement. It would therefore be great to share ideas/solutions here.
How fast is their disk? At that point, shouldn’t the benchmark start with the data in RAM?
The file starts in RAM. I think they don’t force it (e.g. by disabling swap and putting it on /tmp/), but it is a natural consequence of the methodology, kernel, and machine sizing, and it is known to them.