# Could this function be faster?

Hi,

I recently had to read raw images from a camera. The pixel format is Mono 12-bit packed (Mono12Packed).

I first tried a MATLAB version, which managed to read & unpack a 2080*2080 image in ~30 ms.

Here is a Julia version that reaches ~5 ms:

```julia
using Mmap, BenchmarkTools

file = "C:\\Users\\fff00\\FullMono12Packed.Raw"
w = 2080

function readraw(file, width, height)
    # memory-map the packed data: 1.5 bytes per pixel
    raw = Mmap.mmap(file, Vector{UInt8}, Int(width * height * 1.5), 0)
    npack = length(raw) ÷ 3             # number of 3-byte groups (2 pixels each)
    img = Vector{UInt8}(undef, 4npack)  # 2 UInt16 pixels per group
    @inbounds @simd for i in 0:npack-1
        img[1+4i] = raw[2+3i] << 4
        img[2+4i] = raw[1+3i]
        img[3+4i] = raw[2+3i]
        img[4+4i] = raw[3+3i]
    end
    img = reinterpret(UInt16, img)
    @inbounds @simd for i in 1:2npack
        img[i] >>>= 4
    end
    reshape(img, width, height)'
end
```
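For reference, the byte layout the loop assumes (two 12-bit pixels per 3-byte group, GigE-Vision-style Mono12Packed, as implied by the shifts) can be written as a scalar unpack. A minimal sketch; `unpack_pair` is just a name for illustration:

```julia
# Scalar reference for one 3-byte group [b1, b2, b3] -> two 12-bit pixels,
# matching the shifts in the SIMD loop above:
#   pixel 1: b1 holds the high 8 bits, the low nibble of b2 the low 4 bits
#   pixel 2: b3 holds the high 8 bits, the high nibble of b2 the low 4 bits
function unpack_pair(b1::UInt8, b2::UInt8, b3::UInt8)
    p1 = (UInt16(b1) << 4) | (b2 & 0x0f)
    p2 = (UInt16(b3) << 4) | (b2 >> 4)
    return p1, p2
end

unpack_pair(0xab, 0xcd, 0xef)  # -> (0x0abd, 0x0efc)
```

This is handy as a ground-truth check when comparing vectorized variants.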

Since I need to read & unpack a huge number of raw images, I'm looking for anyone who could help speed this function up further.

All tests were done on Julia 1.6.1, and here is a test raw image.

Thanks
Alex

Mmap is in general slow. Since you know exactly what the layout looks like, I suggest using `read` and `seek`.

2 Likes

Agreed - reading chunks of UInt16 and splitting them up appropriately could lead to more speedup. Since your data is aligned for every 2 pixels, this should unroll and SIMD really nicely. Also worth taking a look at LoopVectorization.jl, though I think you'll have to read data into a buffer first (at least a small one?) to really take advantage of it.

I replaced the memory-mapping line with `raw = read(file)`, and it improved a little, from

5.787 ms (23 allocations: 8.25 MiB)

to

5.239 ms (20 allocations: 14.44 MiB)

I am not sure about opening the file IO and seeking bytes inside the for loop. Wouldn't that request IO on every iteration and be slow?

isnβt that time just dominated by I/O? how long does read(file) take?

I tried benchmarking the file-IO part, and it shows

1.741 ms (17 allocations: 6.19 MiB)

I just blindly replaced the `@inbounds @simd` with `@turbo`, and it got worse, from

5.454 ms (20 allocations: 14.44 MiB)

to

8.473 ms (20 allocations: 14.44 MiB)

Paging @elrod - LoopVectorization should fall back to `@inbounds @simd` if it isn't able to work with an expression, but it appears to be slower than `@inbounds @simd` in this case. It may be getting tripped up by the irregular access pattern, but you can get around that to some degree by specifying `@turbo unroll=1`:

```julia
function readraw_turbo_unroll1!(raw, file, width, height)
    npack = Int(length(raw)/3)
    img = Vector{UInt8}(undef, 4npack)
    @turbo unroll=1 for i in 0:npack-1
        img[1+4i] = raw[2+3i] << 4
        img[2+4i] = raw[1+3i]
        img[3+4i] = raw[2+3i]
        img[4+4i] = raw[3+3i]
    end
    img = reinterpret(UInt16, img)
    @turbo unroll=1 for i in 1:2npack
        img[i] >>>= 4
    end
    reshape(img, width, height)'
end
```

This gives me

```julia
julia> @btime img = readraw!($raw, $file, $w, $w);
  5.416 ms (16 allocations: 8.25 MiB)

julia> @btime img = readraw_turbo!($raw, $file, $w, $w);
  4.037 ms (16 allocations: 8.25 MiB)
```

where `readraw!` is your original version, but with `raw` passed in as a buffer to avoid allocating it on every call.
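For concreteness, the buffer reuse looks roughly like this (a sketch, not the exact code used for the timings above):

```julia
# Allocate the packed buffer once (1.5 bytes per pixel)...
raw = Vector{UInt8}(undef, (w * w * 3) ÷ 2)

# ...then refill it in place for each image instead of allocating
# a fresh array with read(file):
open(file) do io
    read!(io, raw)
end
img = readraw!(raw, file, w, w)
```

`read!` fills the existing array from the stream, so the only per-image allocation left is the unpacked output.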

2 Likes

By default, it is probably unrolling by 4 so that it can load a vector of 4 contiguous values from img and then shuffle them.
That might not work well for UInt8.

Iβm currently getting crashes:

julia: /home/chriselrod/Documents/languages/juliarelease/src/ccall.cpp:879: jl_cgval_t emit_llvmcall(jl_codectx_t&, jl_value_t**, size_t): Assertion `*it == f->getFunctionType()->getParamType(i)' failed.

So Iβll have to look into whatβs going on later.

You could also try @tturbo unroll=1/@turbo thread=true unroll=1.

@stillyslalom I like the buffer improvement; it should help a lot when loading lots of images.

Using `@turbo thread=true unroll=1` improves it further, to

3.478 ms (20 allocations: 14.44 MiB)

If the file-IO part (~1.7 ms) is excluded, the unpacking part now takes only ~1.7 ms.

This is great!

1 Like

I just found that the images from the `@turbo` version and the `@inbounds` version are slightly different, and the `@turbo` result is not the correct one.

By switching the for loops between the two versions, I narrowed the incorrect result down to this loop:

```julia
@turbo for i in 0:npack-1
    img[1+4i] = raw[2+3i] << 4
    img[2+4i] = raw[1+3i]
    img[3+4i] = raw[2+3i]
    img[4+4i] = raw[3+3i]
end
```
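For anyone reproducing this: a quick way to surface the mismatch is an element-wise comparison of the two results (a sketch, assuming the `readraw!`/`readraw_turbo!` helpers and the `raw`/`file`/`w` setup from earlier in the thread):

```julia
img_simd  = readraw!(raw, file, w, w)
img_turbo = readraw_turbo!(raw, file, w, w)

# Indices where the two unpacked images disagree; nonzero indicates the bug
bad = findall(vec(img_simd) .!= vec(img_turbo))
@show length(bad)
```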

Thanks. I filed an issue.

Should be fixed with LoopVectorization 0.12.51.
There's still a performance issue I need to fix, but it's already much faster than `@inbounds @simd` for me:

```julia
julia> @benchmark readraw!($raw, $file, $w, $w)
BenchmarkTools.Trial: 1658 samples with 1 evaluation.
 Range (min … max):  2.869 ms …   3.871 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.902 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.008 ms ± 242.909 μs  ┊ GC (mean ± σ):  3.28% ± 6.22%

 [histogram omitted]  2.87 ms .. 3.57 ms

 Memory estimate: 8.25 MiB, allocs estimate: 17.

julia> @benchmark readraw_turbo!($raw, $file, $w, $w)
BenchmarkTools.Trial: 2752 samples with 1 evaluation.
 Range (min … max):  1.640 ms …   4.106 ms  ┊ GC (min … max): 0.00% … 55.55%
 Time  (median):     1.673 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.809 ms ± 314.167 μs  ┊ GC (mean ± σ):  7.38% ± 11.92%

 [histogram omitted]  1.64 ms .. 2.52 ms

 Memory estimate: 8.25 MiB, allocs estimate: 17.

julia> @benchmark readraw_turbo_unroll1!($raw, $file, $w, $w)
BenchmarkTools.Trial: 2763 samples with 1 evaluation.
 Range (min … max):  1.628 ms …   4.122 ms  ┊ GC (min … max): 0.00% … 55.91%
 Time  (median):     1.663 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.802 ms ± 318.433 μs  ┊ GC (mean ± σ):  7.45% ± 11.97%

 [histogram omitted]  1.63 ms .. 2.52 ms

 Memory estimate: 8.25 MiB, allocs estimate: 17.
```
Definitions:

```julia
julia> using LoopVectorization

julia> function readraw!(raw, file, width, height)
           npack = Int(length(raw)/3)
           img = Vector{UInt8}(undef, 4npack)
           @inbounds @simd for i in 0:npack-1
               img[1+4i] = raw[2+3i] << 4
               img[2+4i] = raw[1+3i]
               img[3+4i] = raw[2+3i]
               img[4+4i] = raw[3+3i]
           end
           img = reinterpret(UInt16, img)
           @inbounds @simd for i in 1:2npack
               img[i] >>>= 4
           end
           reshape(img, width, height)'
       end
readraw! (generic function with 1 method)

julia> function readraw_turbo!(raw, file, width, height)
           npack = Int(length(raw)/3)
           img = Vector{UInt8}(undef, 4npack)
           @turbo for i in 0:npack-1
               img[1+4i] = raw[2+3i] << 4
               img[2+4i] = raw[1+3i]
               img[3+4i] = raw[2+3i]
               img[4+4i] = raw[3+3i]
           end
           img = reinterpret(UInt16, img)
           @turbo for i in 1:2npack
               img[i] >>>= 4
           end
           reshape(img, width, height)'
       end
readraw_turbo! (generic function with 1 method)

julia> function readraw_turbo_unroll1!(raw, file, width, height)
           npack = Int(length(raw)/3)
           img = Vector{UInt8}(undef, 4npack)
           @turbo unroll=1 for i in 0:npack-1
               img[1+4i] = raw[2+3i] << 4
               img[2+4i] = raw[1+3i]
               img[3+4i] = raw[2+3i]
               img[4+4i] = raw[3+3i]
           end
           img = reinterpret(UInt16, img)
           @turbo unroll=1 for i in 1:2npack
               img[i] >>>= 4
           end
           reshape(img, width, height)'
       end
readraw_turbo_unroll1! (generic function with 1 method)
```
6 Likes

Iβve test it again on Julia 1.6.1, LoopVectorization 0.12.54, and now result is correct and faster, from

5.303 ms (19 allocations: 14.44 MiB)  # @inbound @simd
to
3.426 ms (19 allocations: 14.44 MiB) # @turbo unroll=1 thread=true

Thanks for all your help, and @Elrod for the amazing package.

1 Like