Is my understanding of a data reading benchmark correct?

Lately I’ve spent a lot of time developing a tool that reads data from files, and I want to check whether my benchmarking procedure is correct. Currently I do something like:

  • using BenchmarkTools
  • run @benchmark on the function
  • run @benchmark on the function a second time

I then use the last result. But lately I’ve been wondering whether this is legitimate: when I read the data the first time, doesn’t it end up in some kind of cache, so the second run is artificially fast? If so, how would I go about clearing that cache?
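In code, my procedure looks roughly like this (read_my_data and the file name are just stand-ins for my actual reader and data):

```julia
using BenchmarkTools

# placeholder for my actual reading function
read_my_data(path) = read(path)

@benchmark read_my_data("data.bin")   # first run
@benchmark read_my_data("data.bin")   # second run; I keep this result
```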

Or am I overthinking?

Kind regards

Your understanding is incorrect: the API of BenchmarkTools will take care of ignoring compilation time for you, so you only need to call the relevant macros once.
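For example, a single call like this is enough (my_read and the file name are placeholders for your own function and data):

```julia
using BenchmarkTools

# One call suffices: the macro collects many samples, so compilation of the
# first run does not dominate the reported statistics.
result = @benchmark my_read("data.bin")
```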

The package is very well documented, and the manual also explains benchmarking concepts:

https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md

Thanks! What about caching of the data? I assume that if I read something from a hard disk, it gets put into some kind of RAM cache, so when I read the same data again I see an artificial improvement. Or is this also a misunderstanding?

It is not very useful to conflate benchmark results with I/O speeds from within Julia, since the latter are very specific to your hardware setup and how the OS handles caching. This effectively randomizes your benchmark results and makes them very hard to interpret.

The facilities of BenchmarkTools are for benchmarking computations. So you should separate that part: read in the data (or a subset, if the data is very large) outside the benchmark, and benchmark the computation on its own.
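A rough sketch of what I mean (read_my_data and process are placeholders for your own functions; note the $ interpolation, which is the usual way to avoid timing access to a global variable):

```julia
using BenchmarkTools

# Do the I/O once, outside the benchmark.
data = read_my_data("subset.bin")

# Interpolate the already-loaded data so only the computation is timed.
@benchmark process($data)
```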

You can also explore various alternatives for I/O, e.g. using @time. There is no general answer to whether caching matters; it depends on whether it matters for your application.
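For example, two crude timings of the same read (the file name is a placeholder) will usually show the effect of the OS cache, with the second run coming back much faster:

```julia
# The second read is typically faster because the OS page cache
# already holds the file contents.
@time bytes = read("data.bin")
@time bytes = read("data.bin")
```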

Thanks! Now I understand. I will figure out a standardized way to benchmark my application then. It makes sense that read speeds are hard to benchmark, given the differences in hardware and caching situations.