Hey guys

I’ve been trying to read a binary file that I know contains n = 3186 Int32 values. The way I’ve done it is like this:

```
raw_data = zeros(UInt8,n*4)
```
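(For context: the allocation above only creates a zero-filled buffer; the bytes themselves are filled from the file with `read!`. A self-contained sketch of that step, fabricating a temporary file with the sample values since the real file is external:)

```julia
# Write three big-endian Int32 values to a temp file, then fill a byte
# buffer from it with read!, as in the snippet above.
vals = Int32[27, 28, 29]
path = tempname()
open(path, "w") do io
    foreach(v -> write(io, hton(v)), vals)  # hton: host -> big-endian byte order
end

n = length(vals)
raw_data = zeros(UInt8, n * 4)
open(path) do io
    read!(io, raw_data)   # fills the buffer with the file's raw bytes
end
```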

After filling this buffer from the file, I get:

```
raw_data
12744-element Array{UInt8,1}:
 0x00
 0x00
 0x00
 0x1b
 0x00
 0x00
 0x00
 0x1c
 0x00
 0x00
 0x00
 0x1d
 ⋮
```

Which is great! The problem arises when I try to reinterpret these bytes. I know that the slice shown above should decode to 27, 28, and 29, but something goes wrong with the endian representation when I use `reinterpret`, because I get:

```
reinterpret(Int32,raw_data)
3-element reinterpret(Int32, ::Array{UInt8,1}):
 452984832
 469762048
 486539264
```

I can get the right numbers if I use `reverse`, but I have to use it twice: first to get the right numbers, then again to get the right indices. I feel this is the wrong approach:

```
reverse(reinterpret(Int32,reverse(raw_data)))
3-element Array{Int32,1}:
 27
 28
 29
```

There must be an easier way, and I assume that reversing twice causes a big performance hit? Currently my code reading line by line is 200 ms faster than `readbytes!`, which seems a bit wrong. Hope someone can help.

Kind regards

You’re looking for the `ntoh` function:

```
map(ntoh, reinterpret(Int32, raw_data))
```
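For instance, applied to the sample bytes from the question (a quick check, reconstructing the values shown above):

```julia
# Big-endian bytes for the Int32 values 27, 28, 29 from the question
raw_data = UInt8[0x00, 0x00, 0x00, 0x1b,
                 0x00, 0x00, 0x00, 0x1c,
                 0x00, 0x00, 0x00, 0x1d]
result = map(ntoh, reinterpret(Int32, raw_data))  # ntoh: network (big-endian) -> host
```

This yields 27, 28, 29 regardless of the host's byte order, since `ntoh` is a no-op on big-endian machines and a byte swap on little-endian ones.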

Thanks, you came in clutch!

Now I am only 130 ms slower, so I will just have to fix my implementation now.

Kind regards

You may want to try doing it in place as well, using broadcasting:

```
data = reinterpret(Int32, raw_data)
data .= ntoh.(data)
```

And just in case: don’t forget to benchmark your code by wrapping it in a function; don’t benchmark at top level in the REPL. All top-level variables have type `Any`, so the compiler can’t optimize them very well.
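As a sketch of that advice (the helper name `decode` is made up here):

```julia
# Wrap the work in a function so the compiler can specialize on the
# argument types, then time the call itself rather than top-level code.
decode(raw) = ntoh.(reinterpret(Int32, raw))

raw = zeros(UInt8, 4 * 1_000_000)
decode(raw)        # warm-up call: the first call includes compilation
@time decode(raw)  # times only the second, already-compiled call
```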

I’ve put it into a function and am benchmarking now using your in-place tip. I’ve gotten it down to 330 ms, which is still about 70 ms slower than my reading line by line.

```
# Preallocate an array depending on datatype and of chosen size
arrayVal::Array{UInt8,1} = zeros(UInt8, size[1]*4)
# Close the open file
close(fd)
data = reinterpret(Int32, arrayVal)
return ntoh.(data)
end
```

Maybe reading line by line is simply superior in Julia to reading the whole array of bytes and then operating on it?

```
@benchmark k = readVtkArray("PartAll",Idp)
BenchmarkTools.Trial:
  memory estimate:  477.45 MiB
  allocs estimate:  28701
  --------------
  minimum time:     328.899 ms (0.00% GC)
  median time:      342.392 ms (0.00% GC)
  mean time:        431.651 ms (20.01% GC)
  maximum time:     1.405 s (73.79% GC)
  --------------
  samples:          12
  evals/sample:     1
```

Below: reading line by line.

```
@benchmark k = readVtkArray("PartAll",Idp)
BenchmarkTools.Trial:
  memory estimate:  261.69 MiB
  allocs estimate:  28691
  --------------
  minimum time:     276.426 ms (0.00% GC)
  median time:      289.531 ms (0.00% GC)
  mean time:        379.285 ms (21.72% GC)
  maximum time:     1.744 s (80.30% GC)
  --------------
  samples:          17
  evals/sample:     1
```

Kind regards

Two more things to try:

1. Skip initializing to zero:
```
arrayVal = Vector{UInt8}(undef, size[1]*4)
```
2. Replace the copy operation `return ntoh.(data)` by changing `data` in place:
```
data .= ntoh.(data)
return data
```

On top of that, it might be helpful if you pasted a working example for both of your benchmarks.

I’ve applied your suggestion 1 and can see a marginal improvement, so I will keep it; but in this case your second suggestion is slowing things down dramatically. I’ve implemented it as:

```
# Preallocate an array depending on datatype and of chosen size
arrayVal::Array{UInt8,1} = Vector{UInt8}(undef, size[1]*4)
# Close the open file
close(fd)
data = reinterpret(Int32, arrayVal)
data .= ntoh.(data)
return data
```

And now the results are:

```
@benchmark k = readVtkArray("PartAll",Idp)
BenchmarkTools.Trial:
  memory estimate:  477.90 MiB
  allocs estimate:  28706
  --------------
  minimum time:     534.921 ms (0.00% GC)
  median time:      556.051 ms (0.00% GC)
  mean time:        719.914 ms (22.26% GC)
  maximum time:     2.025 s (71.23% GC)
  --------------
  samples:          9
  evals/sample:     1
```

Which is a major slowdown (if I’ve done it correctly). I am trying to make a minimal working example available in my other post (How fast is binary reading capabilities in Julia compared with other languages?); I will post it as soon as possible.

Kind regards

Ah, yes, I’m seeing the same:

```
julia> function f()
           a = Vector{UInt8}(undef, 100_000)
           data = reinterpret(Int32, a)
           data .= ntoh.(data)
           data
       end
f (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime f();
  284.997 μs (3 allocations: 97.80 KiB)

julia> function g()
           a = Vector{UInt8}(undef, 100_000)
           data = reinterpret(Int32, a)
           return ntoh.(data)
       end
g (generic function with 1 method)

julia> @btime g();
  14.242 μs (5 allocations: 195.56 KiB)
```

It seems related to the `reinterpret` call. I wonder if it’s actually necessary; it seems this might work too:

```
julia> function h()
           data = Vector{Int32}(undef, 100_000)
           data .= ntoh.(data)
           return data
       end
h (generic function with 1 method)

julia> @btime h();
  19.140 μs (2 allocations: 390.70 KiB)
```

… Okay, that’s still a bit slower. In that case I see no reason to follow my second suggestion.

I’ve also arrived at the conclusion that for bigger files, it is much faster to just read every byte/element in a for loop and handle the conversion immediately, instead of reading the whole array and then reinterpreting. I still can’t make logical sense of why this is the case, but I’ve tried a lot of different approaches and have not been able to beat line-by-line reading.
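One more variant that may be worth benchmarking (a sketch, not measured here): skip the `UInt8` buffer and the `reinterpret` entirely by `read!`-ing straight into a `Vector{Int32}`, then byte-swapping in place:

```julia
# Fabricate a big-endian test file (hypothetical; stands in for the real data)
path = tempname()
open(path, "w") do io
    foreach(v -> write(io, hton(v)), Int32[27, 28, 29])
end

# Read directly into an Int32 buffer, then fix the byte order in place
data = Vector{Int32}(undef, 3)
open(path) do io
    read!(io, data)   # fills the Int32 buffer with the raw big-endian bytes
end
data .= ntoh.(data)   # in-place byte swap; no reinterpret needed
```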