I first tried a MATLAB version, which managed to read and unpack a 2080×2080 image in ~30 ms.
Here is a Julia version that reaches ~5 ms:
using Mmap, BenchmarkTools
function readraw(file,width,height)
    raw = Mmap.mmap(file,Vector{UInt8},Int(width*height*1.5),0)
    npack = Int(length(raw)/3)
    img = Vector{UInt8}(undef,4npack)
    @inbounds @simd for i in 0:npack-1
        img[1+4i] = raw[2+3i] << 4
        img[2+4i] = raw[1+3i]
        img[3+4i] = raw[2+3i]
        img[4+4i] = raw[3+3i]
    end
    img = reinterpret(UInt16,img)
    @inbounds @simd for i in 1:2npack
        img[i] >>>= 4
    end
    img = reshape(img,width,height)'
end
file = "C:\\Users\\fff00\\FullMono12Packed.Raw"
w = 2080
@btime img = readraw(file,$w,$w);
Since I need to read and unpack a huge number of raw images, I am looking for anyone who could help speed this function up further.
All tests were done in Julia 1.6.1, and here is a test raw image.
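To make the byte shuffle in readraw easier to follow, here is a small worked check on a single 3-byte group. It assumes a little-endian host (which the reinterpret step in readraw relies on) and uses made-up byte values; it is an illustration of the loop above, not a separate implementation.

```julia
# One packed group: 3 bytes hold two 12-bit pixels (illustrative values).
raw = UInt8[0xab, 0xcd, 0xef]

# Same shuffle as the first loop in readraw, written out for one group.
# Each pair of bytes becomes one UInt16 whose top 12 bits are the pixel.
bytes = UInt8[raw[2] << 4, raw[1], raw[2], raw[3]]

# Same as the second loop: shift each UInt16 right by 4 to drop the padding nibble.
px = reinterpret(UInt16, bytes) .>>> 4

# Pixel 1 = 0xab with the low nibble of 0xcd appended;
# pixel 2 = 0xef with the high nibble of 0xcd appended.
@show px   # UInt16[0x0abd, 0x0efc]
```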
Agreed - reading chunks of UInt16 and splitting them up appropriately could lead to more speedup. Since your data is aligned for every 2 pixels, this should unroll and SIMD really nicely. Also worth taking a look at LoopVectorization.jl, though I think you'll have to read the data into a buffer first (at least a small one?) to really take advantage of it.
Paging @elrod - LoopVectorization should fall back to @inbounds @simd if it isn't able to work with an expression, but it appears to be slower than @inbounds @simd in this case. It may be getting tripped up by the irregular access pattern, but you can get around that to some degree by specifying @turbo unroll=1.
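One way to read the suggestion above is to skip the intermediate UInt8 array and write both UInt16 pixels of each 3-byte group directly. The following is a minimal, unbenchmarked sketch under that assumption. The name unpack12! is hypothetical, and it is a plain per-group loop rather than literal UInt16-chunked reads, so whether it is actually faster would need to be measured.

```julia
# Hypothetical single-pass unpacker: two 12-bit pixels per 3 input bytes.
# Assumes the same byte layout as the original readraw.
function unpack12!(out::AbstractVector{UInt16}, raw::AbstractVector{UInt8})
    npack = length(raw) ÷ 3
    length(out) >= 2npack || throw(ArgumentError("output buffer too small"))
    @inbounds for i in 0:npack-1
        b1 = UInt16(raw[1+3i])
        b2 = UInt16(raw[2+3i])
        b3 = UInt16(raw[3+3i])
        out[1+2i] = (b1 << 4) | (b2 & 0x000f)  # high 8 bits from b1, low 4 from b2's low nibble
        out[2+2i] = (b3 << 4) | (b2 >> 4)      # high 8 bits from b3, low 4 from b2's high nibble
    end
    return out
end

packed = UInt8[0xab, 0xcd, 0xef, 0x12, 0x34, 0x56]
pixels = unpack12!(Vector{UInt16}(undef, 4), packed)
```

Because out is passed in, the same output buffer can be reused across many images instead of being allocated on every call.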
function readraw_turbo!(raw, file, width, height)
    read!(file, raw)
    npack = Int(length(raw)/3)
    img = Vector{UInt8}(undef,4npack)
    @turbo unroll=1 for i in 0:npack-1
        img[1+4i] = raw[2+3i] << 4
        img[2+4i] = raw[1+3i]
        img[3+4i] = raw[2+3i]
        img[4+4i] = raw[3+3i]
    end
    img = reinterpret(UInt16,img)
    @turbo unroll=1 for i in 1:2npack
        img[i] >>>= 4
    end
    reshape(img,width,height)'
end
By default, it is probably unrolling by 4 so that it can load a vector of 4 contiguous values from img and then shuffle them.
That might not work well for UInt8.
Should be fixed with LoopVectorization 0.12.51.
There's still a performance issue I need to fix, but it's already much faster than @inbounds @simd for me:
julia> @benchmark readraw!($raw, $file, $w, $w)
BenchmarkTools.Trial: 1658 samples with 1 evaluation.
 Range (min … max): 2.869 ms … 3.871 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median): 2.902 ms ┊ GC (median): 0.00%
 Time (mean ± σ): 3.008 ms ± 242.909 μs ┊ GC (mean ± σ): 3.28% ± 6.22%
 2.87 ms Histogram: frequency by time 3.57 ms <
Memory estimate: 8.25 MiB, allocs estimate: 17.
julia> @benchmark readraw_turbo!($raw, $file, $w, $w)
BenchmarkTools.Trial: 2752 samples with 1 evaluation.
 Range (min … max): 1.640 ms … 4.106 ms ┊ GC (min … max): 0.00% … 55.55%
 Time (median): 1.673 ms ┊ GC (median): 0.00%
 Time (mean ± σ): 1.809 ms ± 314.167 μs ┊ GC (mean ± σ): 7.38% ± 11.92%
 1.64 ms Histogram: frequency by time 2.52 ms <
Memory estimate: 8.25 MiB, allocs estimate: 17.
julia> @benchmark readraw_turbo_unroll1!($raw, $file, $w, $w)
BenchmarkTools.Trial: 2763 samples with 1 evaluation.
 Range (min … max): 1.628 ms … 4.122 ms ┊ GC (min … max): 0.00% … 55.91%
 Time (median): 1.663 ms ┊ GC (median): 0.00%
 Time (mean ± σ): 1.802 ms ± 318.433 μs ┊ GC (mean ± σ): 7.45% ± 11.97%
 1.63 ms Histogram: frequency by time 2.52 ms <
Memory estimate: 8.25 MiB, allocs estimate: 17.
Definitions
julia> using LoopVectorization
julia> function readraw!(raw, file, width, height)
           read!(file, raw)
           npack = Int(length(raw)/3)
           img = Vector{UInt8}(undef,4npack)
           @inbounds @simd for i in 0:npack-1
               img[1+4i] = raw[2+3i] << 4
               img[2+4i] = raw[1+3i]
               img[3+4i] = raw[2+3i]
               img[4+4i] = raw[3+3i]
           end
           img = reinterpret(UInt16,img)
           @inbounds @simd for i in 1:2npack
               img[i] >>>= 4
           end
           reshape(img,width,height)'
end
readraw! (generic function with 1 method)
julia> function readraw_turbo!(raw, file, width, height)
           read!(file, raw)
           npack = Int(length(raw)/3)
           img = Vector{UInt8}(undef,4npack)
           @turbo for i in 0:npack-1
               img[1+4i] = raw[2+3i] << 4
               img[2+4i] = raw[1+3i]
               img[3+4i] = raw[2+3i]
               img[4+4i] = raw[3+3i]
           end
           img = reinterpret(UInt16,img)
           @turbo for i in 1:2npack
               img[i] >>>= 4
           end
           reshape(img,width,height)'
end
readraw_turbo! (generic function with 1 method)
julia> function readraw_turbo_unroll1!(raw, file, width, height)
           read!(file, raw)
           npack = Int(length(raw)/3)
           img = Vector{UInt8}(undef,4npack)
           @turbo unroll=1 for i in 0:npack-1
               img[1+4i] = raw[2+3i] << 4
               img[2+4i] = raw[1+3i]
               img[3+4i] = raw[2+3i]
               img[4+4i] = raw[3+3i]
           end
           img = reinterpret(UInt16,img)
           @turbo unroll=1 for i in 1:2npack
               img[i] >>>= 4
           end
           reshape(img,width,height)'
end
readraw_turbo_unroll1! (generic function with 1 method)
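The benchmark calls above pass a preallocated raw buffer, but the thread does not show how it was created. Here is a minimal setup sketch, assuming raw only needs to hold the packed bytes of one image (1.5 bytes per pixel). The temporary file and its random contents are stand-ins for a real packed .Raw file so that the snippet runs on its own.

```julia
w = 2080
nbytes = (w * w * 3) ÷ 2            # 12 bits per pixel = 1.5 bytes per pixel
raw = Vector{UInt8}(undef, nbytes)  # reusable input buffer (assumed shape)

# Stand-in for a real packed image file (hypothetical contents).
file = tempname()
write(file, rand(UInt8, nbytes))

# Same call that each of the functions above starts with:
# fills the existing buffer instead of allocating a new one.
read!(file, raw)
```

With this setup the input buffer is reused between calls, while img is still allocated on every call. That is consistent with the reported memory estimate of 8.25 MiB, which is 2 × 2080 × 2080 bytes.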