Could this function be faster?

Hi,

I recently had to read raw images from a camera. The pixel format is monochrome 12-bit packed (Mono12Packed).

I first tried a MATLAB version, which managed to read & unpack a 2080×2080 image in ~30 ms.

Here is a Julia version that reaches ~5 ms:

using Mmap, BenchmarkTools

function readraw(file,width,height)
        # Mono12Packed: every 3 bytes hold two 12-bit pixels, so the file is width*height*1.5 bytes
        raw = Mmap.mmap(file,Vector{UInt8},Int(width*height*1.5),0)
        npack = length(raw) ÷ 3            # number of 3-byte groups (2 pixels each)
        img = Vector{UInt8}(undef,4npack)  # 4 output bytes (two UInt16s) per group
        @inbounds @simd for i in 0:npack-1
            img[1+4i] = raw[2+3i] << 4     # low 4 bits of pixel A (low nibble of middle byte)
            img[2+4i] = raw[1+3i]          # high 8 bits of pixel A
            img[3+4i] = raw[2+3i]          # high nibble carries low 4 bits of pixel B
            img[4+4i] = raw[3+3i]          # high 8 bits of pixel B
        end
        img = reinterpret(UInt16,img)
        @inbounds @simd for i in 1:2npack
            img[i] >>>= 4                  # shift out the 4 leftover packing bits
        end
        img = reshape(img,width,height)'
end

file = "C:\\Users\\fff00\\FullMono12Packed.Raw"
w = 2080
@btime img = readraw(file,$w,$w);
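For anyone else decoding this format: as I understand the packing (inferred from the loop above, not from a camera spec), each 3-byte group holds two 12-bit pixels, with the shared middle byte carrying both pixels' low nibbles. A scalar reference decoder for one group:

```julia
# Scalar reference for one 3-byte Mono12Packed group:
#   b1 = pixel A bits 11..4
#   b2 = pixel B bits 3..0 (high nibble) | pixel A bits 3..0 (low nibble)
#   b3 = pixel B bits 11..4
function unpack2(b1::UInt8, b2::UInt8, b3::UInt8)
    a = (UInt16(b1) << 4) | (b2 & 0x0F)  # pixel A
    b = (UInt16(b3) << 4) | (b2 >> 4)    # pixel B
    return a, b
end

# Example: pixel values 0xABC and 0x123 pack into the bytes 0xAB, 0x3C, 0x12
unpack2(0xAB, 0x3C, 0x12)  # → (0x0abc, 0x0123)
```

The vectorized loop above computes the same thing, but does it by writing 4 shifted bytes per group and then fixing everything up with one `>>> 4` pass over the reinterpreted UInt16s.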

Since I need to read & unpack a huge number of raw images, I'm looking for anyone who could help speed this function up further.

All tests were done in Julia 1.6.1, and here is a test raw image.

Thanks
Alex

Mmap is in general slow. Since you know exactly what the layout looks like, I suggest read and seek.
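A minimal sketch of that approach (function and variable names are mine; it assumes the frames sit back-to-back in one file):

```julia
# Read frame k from a raw file into a preallocated buffer.
# Each Mono12Packed frame occupies width*height*3÷2 bytes, so the
# buffer length doubles as the frame stride.
function read_frame!(buf::Vector{UInt8}, io::IO, k::Integer)
    seek(io, k * length(buf))  # jump directly to the start of frame k
    read!(io, buf)             # one bulk read, no per-pixel IO
    return buf
end

# Usage:
# open("frames.raw") do io
#     buf = Vector{UInt8}(undef, 2080 * 2080 * 3 ÷ 2)
#     read_frame!(buf, io, 0)
# end
```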


Agreed - reading chunks of UInt16 and splitting them up appropriately could yield more speedup. Since your data is aligned every 2 pixels, this should unroll and SIMD really nicely. It's also worth taking a look at LoopVectorization.jl, though I think you'll have to read the data into a buffer first (at least a small one?) to really take advantage of it.

I replaced the memory-mapping line with raw = read(file), and it improved a little, from

5.787 ms (23 allocations: 8.25 MiB)
to
5.239 ms (20 allocations: 14.44 MiB)

I am not sure about opening the file and seeking bytes inside the for loop; wouldn't that issue an IO request on every iteration and be slow?

Isn't that time just dominated by I/O? How long does read(file) take?

I tried benchmarking the file IO part:

@btime read($file)

and it shows

1.741 ms (17 allocations: 6.19 MiB)

I just blindly replaced the @inbounds @simd with @turbo, and it got worse, from

5.454 ms (20 allocations: 14.44 MiB)
to
8.473 ms (20 allocations: 14.44 MiB)

Paging @elrod - LoopVectorization should fall back to @inbounds @simd if it isn't able to work with an expression, but it appears to be slower than @inbounds @simd in this case. It may be getting tripped up by the irregular access pattern, but you can work around that to some degree by specifying @turbo unroll=1.

function readraw_turbo!(raw, file, width, height)
        read!(file, raw)
        npack = Int(length(raw)/3)
        img = Vector{UInt8}(undef,4npack)
        @turbo unroll=1 for i in 0:npack-1
            img[1+4i] = raw[2+3i] << 4
            img[2+4i] = raw[1+3i]
            img[3+4i] = raw[2+3i]
            img[4+4i] = raw[3+3i]
        end
        img = reinterpret(UInt16,img)
        @turbo unroll=1 for i in 1:2npack
            img[i] >>>= 4
        end
        reshape(img,width,height)'
end

This gives me

julia> @btime img = readraw!($raw, $file, $w, $w);
  5.416 ms (16 allocations: 8.25 MiB)

julia> @btime img = readraw_turbo!($raw, $file, $w, $w);
  4.037 ms (16 allocations: 8.25 MiB)

where readraw! is your original version, but with raw passed in as a preallocated buffer to avoid allocating it on every call.
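For reference, a usage sketch of the buffered call (the buffer size follows from 1.5 bytes per 12-bit pixel; `file` is your own path):

```julia
w = 2080
raw = Vector{UInt8}(undef, w * w * 3 ÷ 2)  # one packed frame: 1.5 bytes per pixel
# img = readraw!(raw, file, w, w)          # reuse `raw` across all frames
```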


By default, it is probably unrolling by 4 so that it can load a vector of 4 contiguous values from img and then shuffle them.
That might not work well for UInt8.

I’m currently getting crashes:

julia: /home/chriselrod/Documents/languages/juliarelease/src/ccall.cpp:879: jl_cgval_t emit_llvmcall(jl_codectx_t&, jl_value_t**, size_t): Assertion `*it == f->getFunctionType()->getParamType(i)' failed.

So I’ll have to look into what’s going on later.

You could also try @tturbo unroll=1 (equivalently, @turbo thread=true unroll=1).

@stillyslalom I liked the buffer improvement; it should help a lot when loading lots of images.

Using @turbo thread=true unroll=1 improves things further, to

3.478 ms (20 allocations: 14.44 MiB)

If the file IO part (~1.7 ms) is excluded, the unpacking now takes only ~1.7 ms.

This is great :clap:


I just found that the images from the @turbo version and the @inbounds version are slightly different, and the @turbo result is not the correct one.

By swapping the for loops between the two versions, I narrowed the incorrect result down to this loop:

@turbo for i in 0:npack-1
    img[1+4i] = raw[2+3i] << 4
    img[2+4i] = raw[1+3i]
    img[3+4i] = raw[2+3i]
    img[4+4i] = raw[3+3i]
end

Thanks. I filed an issue.

Should be fixed with LoopVectorization 0.12.51.
There’s still a performance issue I need to fix, but it’s already much faster than @inbounds @simd for me:

julia> @benchmark readraw!($raw, $file, $w, $w)
BenchmarkTools.Trial: 1658 samples with 1 evaluation.
 Range (min … max):  2.869 ms …   3.871 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.902 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.008 ms Β± 242.909 ΞΌs  β”Š GC (mean Β± Οƒ):  3.28% Β± 6.22%

   β–β–ˆβ–ƒ
  β–ƒβ–ˆβ–ˆβ–ˆβ–ƒβ–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–ƒβ–…β–ƒ β–‚
  2.87 ms         Histogram: frequency by time        3.57 ms <

 Memory estimate: 8.25 MiB, allocs estimate: 17.

julia> @benchmark readraw_turbo!($raw, $file, $w, $w)
BenchmarkTools.Trial: 2752 samples with 1 evaluation.
 Range (min … max):  1.640 ms …   4.106 ms  β”Š GC (min … max): 0.00% … 55.55%
 Time  (median):     1.673 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.809 ms Β± 314.167 ΞΌs  β”Š GC (mean Β± Οƒ):  7.38% Β± 11.92%

   β–†β–ˆ
  β–‡β–ˆβ–ˆβ–„β–‚β–‚β–‚β–β–β–‚β–β–‚β–‚β–β–β–β–β–β–β–β–β–β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–…β–„ β–‚
  1.64 ms         Histogram: frequency by time        2.52 ms <

 Memory estimate: 8.25 MiB, allocs estimate: 17.

julia> @benchmark readraw_turbo_unroll1!($raw, $file, $w, $w)
BenchmarkTools.Trial: 2763 samples with 1 evaluation.
 Range (min … max):  1.628 ms …   4.122 ms  β”Š GC (min … max): 0.00% … 55.91%
 Time  (median):     1.663 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.802 ms Β± 318.433 ΞΌs  β”Š GC (mean Β± Οƒ):  7.45% Β± 11.97%

   β–…β–ˆ
  β–…β–ˆβ–ˆβ–†β–‚β–‚β–‚β–β–β–‚β–β–β–‚β–β–‚β–β–β–β–β–‚β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–…β–„ β–‚
  1.63 ms         Histogram: frequency by time        2.52 ms <

 Memory estimate: 8.25 MiB, allocs estimate: 17.

Definitions:
julia> using LoopVectorization

julia> function readraw!(raw, file, width, height)
               read!(file, raw)
               npack = Int(length(raw)/3)
               img = Vector{UInt8}(undef,4npack)
               @inbounds @simd for i in 0:npack-1
                   img[1+4i] = raw[2+3i] << 4
                   img[2+4i] = raw[1+3i]
                   img[3+4i] = raw[2+3i]
                   img[4+4i] = raw[3+3i]
               end
               img = reinterpret(UInt16,img)
               @inbounds @simd for i in 1:2npack
                   img[i] >>>= 4
               end
               reshape(img,width,height)'
       end
readraw! (generic function with 1 method)

julia> function readraw_turbo!(raw, file, width, height)
               read!(file, raw)
               npack = Int(length(raw)/3)
               img = Vector{UInt8}(undef,4npack)
               @turbo for i in 0:npack-1
                   img[1+4i] = raw[2+3i] << 4
                   img[2+4i] = raw[1+3i]
                   img[3+4i] = raw[2+3i]
                   img[4+4i] = raw[3+3i]
               end
               img = reinterpret(UInt16,img)
               @turbo for i in 1:2npack
                   img[i] >>>= 4
               end
               reshape(img,width,height)'
       end
readraw_turbo! (generic function with 1 method)

julia> function readraw_turbo_unroll1!(raw, file, width, height)
               read!(file, raw)
               npack = Int(length(raw)/3)
               img = Vector{UInt8}(undef,4npack)
               @turbo unroll=1 for i in 0:npack-1
                   img[1+4i] = raw[2+3i] << 4
                   img[2+4i] = raw[1+3i]
                   img[3+4i] = raw[2+3i]
                   img[4+4i] = raw[3+3i]
               end
               img = reinterpret(UInt16,img)
               @turbo unroll=1 for i in 1:2npack
                   img[i] >>>= 4
               end
               reshape(img,width,height)'
       end
readraw_turbo_unroll1! (generic function with 1 method)

I've tested it again on Julia 1.6.1 with LoopVectorization 0.12.54, and now the result is correct and faster, from

5.303 ms (19 allocations: 14.44 MiB)  # @inbounds @simd
to
3.426 ms (19 allocations: 14.44 MiB)  # @turbo unroll=1 thread=true

Thanks for all your help, and @Elrod for the amazing package.
