Is my understanding of a data reading benchmark correct?

Lately I’ve spent a lot of time developing a tool that reads data from files, and I want to check whether my benchmarking procedure is correct. Currently I do something like:

  • using BenchmarkTools
  • run @benchmark on the function
  • run @benchmark on the function a second time

I then use the last result. But lately I’ve been wondering whether this is legitimate: when I read the data the first time, doesn’t it end up in some kind of cache, so the second run is artificially fast? If so, how would I go about clearing that cache?
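In code, my procedure looks roughly like this (read_my_data and the file name are just stand-ins for my actual reader and data):

```julia
using BenchmarkTools

# placeholder for my actual reading function
read_my_data(path) = read(path)

@benchmark read_my_data("data.bin")   # first run
@benchmark read_my_data("data.bin")   # second run; I keep this result
```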

Or am I overthinking?

Kind regards

Your understanding is incorrect: the API of BenchmarkTools will take care of ignoring compilation time for you, so you only need to call the relevant macros once.
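For example, a single call like this is enough (my_read and the file name are placeholders for your own function and data):

```julia
using BenchmarkTools

# One call suffices: the macro collects many samples, so compilation of the
# first run does not dominate the reported statistics.
result = @benchmark my_read("data.bin")
```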

The package is very well documented, and the manual also explains benchmarking concepts:

https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md

Thanks! What about caching of the data? I assume that if I read something from a hard disk, it gets put into some kind of RAM cache, so when I read the same data again I see an artificial improvement. Or is this also a misunderstanding?

It is not very useful to conflate benchmark results with I/O speeds from within Julia, since the latter are very specific to your hardware setup and how the OS handles caching. This effectively randomizes your benchmark results and makes them very hard to interpret.

The facilities of BenchmarkTools are for benchmarking computations. So you should separate that part: read in the data (or a subset, if the data is very large) outside the benchmark, and benchmark the computation on its own.
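A rough sketch of what I mean (read_my_data and process are placeholders for your own functions; note the $ interpolation, which is the usual way to avoid timing access to a global variable):

```julia
using BenchmarkTools

# Do the I/O once, outside the benchmark.
data = read_my_data("subset.bin")

# Interpolate the already-loaded data so only the computation is timed.
@benchmark process($data)
```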

You can also explore various alternatives for I/O, e.g. using @time. There is no general answer to whether caching matters; it depends on whether it matters for your application.
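For example, two crude timings of the same read (the file name is a placeholder) will usually show the effect of the OS cache, with the second run coming back much faster:

```julia
# The second read is typically faster because the OS page cache
# already holds the file contents.
@time bytes = read("data.bin")
@time bytes = read("data.bin")
```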

Thanks! Now I understand. I will figure out a standardized way to benchmark my application then. It makes sense that read speeds are hard to benchmark, given the differences in hardware and caching situations.