How fast are Julia's binary reading capabilities compared with other languages?

Ahmed_Salih · April 22, 2019, 12:02pm

I am specifically talking about reading binary files, so nothing to do with CSV. Is Julia one of the fastest in this respect, or would I still see much better performance by writing it in C++?

Currently I am, for example, able to read 251 binary vtk files of about 201 KB each in 0.056 s, which gives a read speed of 50.702 MB / 0.056 s ≈ 900 MB/s. Is this efficient?
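A minimal sketch of how such a throughput figure could be measured and checked. The helper name read_throughput is hypothetical and not from the original post; it simply reads each file fully and divides decimal megabytes by elapsed seconds. The last line repeats the arithmetic quoted above.

```julia
# Hypothetical helper (not from the original post): read every file fully
# and report throughput in decimal MB/s.
function read_throughput(paths)
    nbytes = 0
    t0 = time_ns()
    for p in paths
        nbytes += length(read(p))   # read(path) returns the file as a Vector{UInt8}
    end
    elapsed = (time_ns() - t0) / 1e9   # seconds
    return nbytes / 1e6 / elapsed
end

# The arithmetic from the post: 50.702 MB read in 0.056 s.
quoted = 50.702 / 0.056
println("quoted read speed: ", round(quoted; digits = 1), " MB/s")
```

A single timing like this is noisy; BenchmarkTools, used later in the thread, gives more stable numbers.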

Hoping some big data guys would like to chime in

Kind regards

Tamas_Papp · April 22, 2019, 12:12pm

Most likely the speed is constrained by disk I/O, so this depends on your hardware. It is fairly fast. Efficient code in any language would very likely produce similar numbers.

tkluck · April 22, 2019, 12:13pm

To answer the question in your title: every language will ultimately hand the file reading operation off to the read(2) system call, usually through the standard C library. So you’ll get roughly the same performance in Python and bash as you’ll get in Julia or C++.

However, from the body, it seems like you have a more specific question, namely about parsing a specific file format? I’m not familiar with “vtk files”, so maybe you can give a bit more context?

In general, you seem to have the right approach: just benchmark the read performance you are seeing. A mature Julia library should typically be about as fast as a C++ implementation, so what you’ll be measuring is the maturity of the library, not the performance of Julia itself.

Tamas_Papp · April 22, 2019, 12:30pm

I guess Mmap is worth trying, too, at least for a benchmark.

Ahmed_Salih · April 22, 2019, 1:32pm

I will try making a minimal working example on one “big” file (4500 KB), share my readbytes! code with you, and explain the setup.

I have been looking at Mmap but have not been able to understand how to utilize it, so I hope we could look at that too.

The example will hopefully come later today if you want to take a look at it, thank you.

Kind regards

Ahmed_Salih · April 22, 2019, 2:50pm

@tkluck

A “vtk” file is a “Visualization Toolkit” file, which allows users to visualize simulation data (or any other kind of data), usually with ParaView. In my case I am trying to extract data from these files directly for two purposes:

Avoiding slow reading from CSV files
Lessening storage need for simulations

Opened in ParaView, one of these files would show something like this (screenshot not included):

So these vtk files only store simulation data and nothing else. I’ve also included the minimal working example in the dropbox folder. Basically when I benchmark the functionality where I use readbytes!:

@benchmark readVtkArray("parts_")
BenchmarkTools.Trial:
  memory estimate:  986.72 KiB
  allocs estimate:  564
  --------------
  minimum time:     2.731 ms (0.00% GC)
  median time:      3.428 ms (0.00% GC)
  mean time:        3.822 ms (0.00% GC)
  maximum time:     6.814 ms (0.00% GC)
  --------------
  samples:          1307
  evals/sample:     1

So the median is about 3.428 ms. When I use my own approach using read, where I read one Int32 at a time, I get 1.6 ms on the files I’ve put in the dropbox link.
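For comparison, here is a hedged sketch of the two reading styles mentioned above: one read call per Int32, and a single read! into a preallocated array. The function names, the byte offset and the element count are all assumptions for illustration; in a real vtk file they would have to come from parsing the header, and this is not the readVtkArray code from the post.

```julia
# Hypothetical helpers; the layout (n Int32 values starting at a known byte
# offset) is an assumption, not taken from the original files.

# One value per call, as in the "read an Int32 at a time" approach.
function read_int32_each(path, offset, n)
    out = Vector{Int32}(undef, n)
    open(path, "r") do io
        seek(io, offset)
        for i in 1:n
            out[i] = read(io, Int32)
        end
    end
    return out
end

# One bulk call that fills a preallocated array.
function read_int32_block(path, offset, n)
    out = Vector{Int32}(undef, n)
    open(path, "r") do io
        seek(io, offset)
        read!(io, out)
    end
    return out
end
```

Both functions return values in the machine's native byte order. If the file stores big-endian integers, as legacy binary vtk files typically do, the result would still need a conversion such as ntoh.(out) on a little-endian machine.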

The command you have to use is:

using BenchmarkTools
@benchmark readVtkArray("parts_")

If you can get it down under 1.6 ms, I would be very happy. Note that I utilize Threads.@threads and I use 4 threads on an i7.

@Tamas_Papp if you want to try memory mapping, I can tell you that “Idp” has type Int32 and that it is always nRow long with a width of 1, i.e. Array{Int32,1}.

If you need anything else, let me know guys, I tried to be as clear as possible.

Kind regards

Tamas_Papp · April 22, 2019, 2:51pm

Thanks, I have tried mmap in the past already, so I know I like it. If you want to try Mmap.mmap, its docstring has examples.
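A minimal sketch of what memory mapping an Int32 column such as “Idp” might look like with the Mmap standard library. The helper name mmap_int32 is hypothetical, and both offset (the byte position where the data starts) and nrow are assumed to be known from the file header. The sketch also assumes the offset is a multiple of 4 so that the mapped Int32 data is properly aligned.

```julia
using Mmap

# Hypothetical helper: map `nrow` Int32 values starting at byte `offset`.
# Assumes `offset` is a multiple of sizeof(Int32) and that the file is
# at least offset + 4 * nrow bytes long.
function mmap_int32(path, offset, nrow)
    open(path, "r") do io
        # The mapping remains usable after the stream is closed.
        Mmap.mmap(io, Vector{Int32}, nrow, offset)
    end
end
```

The returned array is backed by the file, so no data is copied until elements are accessed. As with a plain read, the values are in native byte order; big-endian data would need something like ntoh.(A), which does make a copy.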

Ahmed_Salih · April 22, 2019, 2:53pm

@Tamas_Papp thanks! I will try it and post in here if I am struggling somewhere.

Kind regards

johnh · April 23, 2019, 10:22am

Hello. 251 x 201 kbytes is 50 Megabytes.
I have to say - Linux and other OSes will aggressively cache data in memory.
You are probably benchmarking reading from cache - and you have to be quite careful when benchmarking IO performance to make sure caches are flushed when writing etc.
A rule of thumb when working with benchmarking tools such as FIO is to work with files bigger than the RAM in your machine. These days with big RAM sizes that means some pretty big files!
Also, in each benchmarking utility there is usually a flag which means ‘really write to disk - and only report back when the data is written’

One way to drop caches before you run benchmarks like this, on Linux specifically (as root):

echo 3 > /proc/sys/vm/drop_caches

It is also instructive to run watch cat /proc/meminfo in another terminal window.
Assuming you are writing to a local hard drive or SSD, the iotop utility is fun too.

Sorry - I have benchmarking in my head as I am travelling to Paris tomorrow to do some evaluation.

Ahmed_Salih · April 23, 2019, 11:08am

Thanks for your answer, I haven’t thought about that stuff before. I have one question though: currently I am only reading data and storing it in an array in a Julia terminal, i.e. not writing anything as far as I know?

So does your comment apply to my function currently or was it more “in case you wanted to start writing data”?

Also if you would not mind, could you explain caches and flushes?

Kind regards

johnh · April 23, 2019, 1:03pm

Certainly. I hope my answer makes some sense. If anyone wants to correct me - please do so.
I will concentrate on the Linux OS, as I know that best.
There are now many data storage technologies - from Dynamic RAM at the top, Intel Optane persistent memory, NVMe solid state drives, Solid State Disks, spinning hard drives, tape.
However, let us forget exotic technologies and consider a system with RAM and a spinning hard drive.
When you write data to the hard drive, the OS caches your write in RAM - it is not necessarily written (or flushed) to the disk before the write call returns to your application. This is done so the system feels fast, and also multiple writes can be queued up which keeps the writing to disk efficient.
This is also why it is important to shut down a server properly - if the power is shut off abruptly then data in the cache can be lost.

Similarly when data is read from disk it is given to your application. However the next time you read the same data you will get data which is read from RAM - a copy is kept in the cache.

The rather obscure command I gave will trigger a clear of the cache. If you want to benchmark the genuine read speed from disk it is generally advised to run this before the application.

Oh, and this is the command for users who have sudo privileges:
echo 3 | sudo tee /proc/sys/vm/drop_caches
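The effect of the read cache can be explored with a small sketch like the following. The helper name timed_read and the scratch file are assumptions for illustration. Note that a file that has just been written is very likely still in the page cache, so both reads here will probably be served from RAM; to see a genuinely cold read, the cache would have to be dropped (with the command above) between writing the file and the first read.

```julia
# Hypothetical helper: time one full read of `path`; returns (seconds, data).
function timed_read(path)
    t0 = time_ns()
    data = read(path)
    return (time_ns() - t0) / 1e9, data
end

path = tempname()
write(path, rand(UInt8, 10_000_000))   # roughly 10 MB scratch file

t_first, d1  = timed_read(path)
t_second, d2 = timed_read(path)

println("first read:  ", t_first, " s")
println("second read: ", t_second, " s")
```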

Ahmed_Salih · April 23, 2019, 6:18pm

Thanks for your answer and explanation, will keep it in mind for the future

Kind regards
