I have spent some time developing the same kind of function in both Python and Julia: a function which reads some binary files and imports their data into an array. I have spent a lot of time tuning the Julia function and very little time on the Python code, yet I find that the Python code is 10 times faster than the Julia one.
I am of course going to run a few more tests, but before I dig too deep, I would like to know whether, theoretically, there should be a difference in reading speed between Julia and Python.
@baggepinnen no, I am using Julia 1.2 right now, forgot to state it.
@Sukera okay, thanks - I just wanted a general understanding first, thanks for giving me that. I will ask more specific questions later; for now, to explain quickly: in Julia I am reading one value at a time using an optimized for loop (as far as I know) and inserting it into a preallocated array while converting the bytes to the relevant type (e.g. Float32). In Python I am using Numpy and reading the whole byte array, then converting/reshaping as needed.
A while back I tried the same approach in Julia as I use in Python, but it was much slower. That should give a general feel for the main difference between them.
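To make the Julia side concrete, here is a rough sketch of what I mean by reading one value at a time (the flat Float32 layout, the element count n and the function name are just illustrative, not my actual code):
function read_per_value(path, n)
    result = Vector{Float32}(undef, n)      # preallocated output array
    open(path) do io
        for i in 1:n
            result[i] = read(io, Float32)   # one small read per element
        end
    end
    return result
end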
Well that should be the difference then - the numpy routine you’re using is likely reading much more data at once and processing it after reading it into memory. Small reads are slower (independent of programming language) since you’re stopping the read and starting it again frequently.
Julia should pretty much have state of the art performance for most IO operations, but since it also exposes the low level functionality, it’s easy to run into performance traps.
In Python I am using Numpy and reading the whole byte array, then converting/reshaping as needed.
You can do the same in Julia, even without reshaping & conversions:
open(binary_file) do io
    dims = (read(io, Int64), read(io, Int64))   # or wherever you get the size from
    result = Array{Float32}(undef, dims)        # preallocate with the right shape
    read!(io, result)                           # fill it with one bulk read
    return result
end
Or, if you don’t have the size up front to preallocate:
bin_array = read(binary_file)                              # all the bytes in one read
result = reshape(reinterpret(Float32, bin_array), dims)    # view them as a Float32 array
Note that in the last case you will get a reshape / reinterpret array which does have slower performance in quite a few cases, which is a bit awkward… You can either copy it (just call collect) or unsafely reinterpret it to the correct shape & type…
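For example, a small sketch of the collect option (assuming dims is known), which copies the wrapper into a plain Array:
bytes = read(binary_file)
wrapped = reshape(reinterpret(Float32, bytes), dims)   # lazy reinterpret/reshape wrapper
result = collect(wrapped)                              # plain Array{Float32} without the wrapper overhead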
@Sukera to my understanding, loops in Julia should in general not impact performance - at least that was one of the selling points of starting to use Julia.
@sdanisch I will try rewriting my Julia version using your “skeleton” and see if I can get the same performance as in Python. In Python I am using cProfile to time the function and in Julia, BenchmarkTools - to my understanding these two timing outputs should be comparable.
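Roughly what I have in mind on the Julia side (read_binary and the file name are placeholders for my own function and data):
using BenchmarkTools
@btime read_binary("data.bin")   # placeholder call; @btime reports the minimum time over many samples
(I know @btime reports the minimum over many samples while cProfile adds its own overhead to a single run, so I take the comparison as rough.)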
Correct – what I’m talking about is the underlying loading from disk: reading byte by byte is slower than reading a bigger chunk and processing it, somewhat like @sdanisch suggested. Having a faster loop is of no use if all the time is spent loading small chunks of data, since every read has some overhead which adds up. It’s faster to load all the data at once than to load many small chunks one after another. This is a limitation of disk IO in general, not of Julia.
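If you want to keep the value-by-value loop, one possible workaround (just a sketch, not something from your code) is to do a single bulk read from disk and then parse values out of an in-memory IOBuffer, so the small reads no longer hit the disk:
function read_buffered(path, n)
    io = IOBuffer(read(path))            # one bulk read of the whole file into memory
    result = Vector{Float32}(undef, n)
    for i in 1:n
        result[i] = read(io, Float32)    # small reads now come from RAM, not disk
    end
    return result
end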
I ended up setting the Julia project aside, since I got some help to learn and use C to do it instead. I don’t think this is the answer you hoped for, but it is what I ended up doing. I think that, done properly in Julia, one could reach almost the same reading speeds.
I feel that the file reading tools in Julia are better than in Python. Perhaps you do not need to write any code at all to read your files into Julia.
I am impressed with the File IO implemented in Queryverse.jl. You can read any of the file types below and then save into any of them - the format you read does not have to be the same as the format you write, which is great. A short usage sketch follows the list.
File IO in Queryverse.jl
CSVFiles.jl can read and write CSV files. Under the hood it uses the extremely fast TextParse.jl.
FeatherFiles.jl can read and write Feather files.
ExcelFiles.jl can read and write Excel files.
StatFiles.jl can read SPSS, Stata and SAS files.
ParquetFiles.jl can read Parquet files.
VegaDatasets.jl provides some example datasets from the Vega Datasets collection.
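As a quick illustration, a sketch of the load/save interface (assuming a file called data.csv exists and that you want Feather as the output format):
using Queryverse

df = DataFrame(load("data.csv"))   # read the CSV into a DataFrame
save("data.feather", df)           # write the same data out as a Feather file
The format is picked from the file extension, so switching the input or output format is mostly a matter of changing the file name.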