Do Python and Julia have the same file reading tools?

Ahmed_Salih · September 26, 2019, 12:49pm

Hey folks

I have spent some time in both Python and Julia developing the same kind of function: one that reads some binary files and imports their data into an array. I have spent a lot of time tuning the Julia function and very little time on the Python code, but I have found that the Python code is 10 times faster than the Julia one.

I am of course going to run a few more tests, but before I dig too deep, I would like to know whether, in theory, there should be a difference in reading speed between Julia and Python.

Kind regards

Sukera · September 26, 2019, 12:57pm

That very much depends on your implementation. Purely reading a file should be just as fast, since the heavy lifting is done by the OS anyway.

If it’s slow, maybe you can show some of the code or ask some more specific questions about your code?

baggepinnen · September 26, 2019, 1:00pm

Are you perhaps using Julia v1.3 and are hitting this method?

Ahmed_Salih · September 26, 2019, 1:07pm

@baggepinnen no, I am using Julia 1.2 right now, forgot to state it.

@Sukera okay thanks - I just wanted a general understanding first, thanks for giving me that. I will ask more specific questions at a later date. For now, just to explain quickly: in Julia I am reading one value at a time using an optimized for loop (as far as I know) and inserting it into a preallocated array while converting the bytes to the relevant type (e.g. Float32). In Python I am using NumPy and reading the whole byte array, then converting/reshaping as needed.

A while back I tried the same approach in Julia as I use in Python, but it was much slower. This is just to give a general feel for the main difference between them.

Kind regards

Sukera · September 26, 2019, 1:09pm

Well that should be the difference then - the numpy routine you’re using is likely reading much more data at once and processing it after reading it into memory. Small reads are slower (independent of programming language) since you’re stopping the read and starting it again frequently.
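
For reference, the bulk-read pattern described above can be sketched in Python with only the standard library. This is a minimal sketch, assuming the file is a flat sequence of native-endian 32-bit floats with no header; the function name `read_float32_bulk` and the demo data are invented for illustration. With NumPy, `np.fromfile(path, dtype=np.float32)` plays the same role.

```python
import os
import tempfile
from array import array


def read_float32_bulk(path):
    """Read a whole file of native-endian float32 values with one read call."""
    values = array("f")
    count = os.path.getsize(path) // values.itemsize  # number of floats in the file
    with open(path, "rb") as f:
        values.fromfile(f, count)  # one read for the whole payload
    return values


# Round-trip demo on a temporary file (hypothetical data).
expected = array("f", [1.0, 2.5, -3.25, 4.0])
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    expected.tofile(tmp)
loaded = read_float32_bulk(tmp.name)
os.remove(tmp.name)
```

The conversion from bytes to floats happens once, after the data is already in memory, which is the opposite of reading and converting one value per loop iteration.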

sdanisch · September 26, 2019, 1:49pm

Julia should pretty much have state of the art performance for most IO operations, but since it also exposes the low level functionality, it’s easy to run into performance traps.

Quoting Ahmed_Salih: "In Python I am using Numpy and reading the whole byte array, then converting/reshaping as needed."

You can do the same in Julia, even without reshaping & conversions:

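open(binary_file) do io
    # Placeholder: get the size from a header or from prior knowledge.
    # `read_dims` is a hypothetical helper, not a Base function.
    dims = read_dims(io)
    result = Array{Float32}(undef, dims) # uninitialized; filled by read! below
    read!(io, result) # one bulk read straight into the array
    return result
end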

Or if you don’t have the size:

bin_array = read(binary_file)
result = reshape(reinterpret(Float32, bin_array), dims)

Note that in the last case you get a reshaped reinterpret array (a lazy wrapper around the raw bytes) rather than a plain Array. That wrapper is slower in quite a few cases, which is a bit awkward… You can either copy it into a regular Array (just call collect) or unsafely reinterpret it to the correct shape & type…

Ahmed_Salih · September 26, 2019, 9:11pm

@Sukera to my understanding loops in Julia should in general not hurt performance - at least that was one of the selling points for starting to use Julia.

@sdanisch I will try rewriting my Julia version using your “skeleton” and see if I can get the same performance as in Python. In Python I am using cProfile to time the function and in Julia, BenchmarkTools - to my understanding the timings from these two should be comparable.

Kind regards

Sukera · September 26, 2019, 9:17pm

Correct – what I’m talking about is the underlying loading from disk byte by byte, which is slower than reading a bigger chunk and processing that, somewhat like @sdanisch suggested. Having a faster loop is of no use if all the time is spent loading small chunks of data, since every read has some overhead which will add up. It’s faster to load all the data at once than to load small chunks one after another. This is a limitation of disk IO in general, not of Julia.
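
The per-read overhead can be made visible by counting read calls instead of timing them. The following is a rough sketch, not a benchmark: an in-memory `io.BytesIO` stands in for the file, the data is assumed to be little-endian float32, and `CountingReader`, `read_one_by_one` and `read_all_at_once` are names invented for this example. On a real unbuffered file (`open(path, "rb", buffering=0)`), each counted call would be a separate system call; Python's default buffered `open` hides part of that cost.

```python
import io
import struct


class CountingReader:
    """Wraps a binary stream and counts how many read calls are issued."""

    def __init__(self, stream):
        self.stream = stream
        self.calls = 0

    def read(self, size=-1):
        self.calls += 1
        return self.stream.read(size)


def read_one_by_one(reader, count):
    """One read call per float32 value (the slow pattern)."""
    return [struct.unpack("<f", reader.read(4))[0] for _ in range(count)]


def read_all_at_once(reader, count):
    """One read call for the whole payload, then a single decode step."""
    return list(struct.unpack(f"<{count}f", reader.read(4 * count)))


# Hypothetical payload: 1000 little-endian float32 values held in memory.
count = 1000
payload = struct.pack(f"<{count}f", *(float(i) for i in range(count)))

slow = CountingReader(io.BytesIO(payload))
fast = CountingReader(io.BytesIO(payload))
slow_values = read_one_by_one(slow, count)
fast_values = read_all_at_once(fast, count)
```

Both functions return the same values, but the first issues one read per value while the second issues a single read, so any fixed cost per read is paid `count` times instead of once.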

Ahmed_Salih · September 26, 2019, 9:33pm

Great, thanks for explaining it so clearly. Looking forward to testing again.

Kind regards

Lee · December 13, 2019, 9:16pm

Hi! @Ahmed_Salih,
How successful were your latest tests reading the binary file in Julia?

Thank you in advance!

Ahmed_Salih · December 14, 2019, 10:09am

Hello!

I ended up neglecting the Julia project, since I got some help to learn and use C to do it instead. I don’t think this is the answer you hoped for, but this is what I ended up doing. I think that if one did it properly in Julia, one could reach almost the same reading speeds.

Kind regards

Lee · December 18, 2019, 12:28pm

Do you call C from Julia for the binary reading, or did I get that wrong?

Ahmed_Salih · December 19, 2019, 6:29pm

No, currently I am calling the C code from Python, but I suppose it could also be called from Julia.

Kind regards

Lee · December 19, 2019, 11:05pm

Thank you.

Clark_Cruz · January 4, 2020, 6:52pm

Hi,

I feel that the file reading tools in Julia are better than in Python. Perhaps you do not need to write any code at all to read your files into Julia.

I am impressed with the File IO implemented in Queryverse.jl. You can read any of the file types below and then save into any of the file types below. The read file type does not have to be the same as the write file type which is great.

File IO in Queryverse.jl
[CSVFiles.jl] can read and write CSV files. Under the hood it uses the extremely fast TextParse.jl

[FeatherFiles.jl] can read and write Feather files.

[ExcelFiles.jl] can read and write Excel Files.

[StatFiles.jl] can read SPSS, STATA and SAS files.

[ParquetFiles.jl] can read Parquet files.

[VegaDatasets.jl] provides some example datasets from the Vega Datasets

Check out Packages | Queryverse
