Could this function be faster?

Hi,

I recently had to read raw images from a camera. The pixel format is monochrome 12-bit packed (Mono12Packed).

I first tried a MATLAB version, which managed to read & unpack a 2080×2080 image in ~30 ms.

Here is a Julia version that reaches ~5 ms:

using Mmap, BenchmarkTools

function readraw(file,width,height)
        # Mono12Packed: every 3 bytes hold two 12-bit pixels, so the file is width*height*1.5 bytes
        raw = Mmap.mmap(file,Vector{UInt8},Int(width*height*1.5),0)
        npack = length(raw) ÷ 3            # number of 3-byte groups (2 pixels each)
        img = Vector{UInt8}(undef,4npack)  # 4 output bytes (two UInt16s) per group
        @inbounds @simd for i in 0:npack-1
            img[1+4i] = raw[2+3i] << 4     # low 4 bits of pixel A (low nibble of middle byte)
            img[2+4i] = raw[1+3i]          # high 8 bits of pixel A
            img[3+4i] = raw[2+3i]          # high nibble carries low 4 bits of pixel B
            img[4+4i] = raw[3+3i]          # high 8 bits of pixel B
        end
        img = reinterpret(UInt16,img)
        @inbounds @simd for i in 1:2npack
            img[i] >>>= 4                  # shift out the 4 leftover packing bits
        end
        img = reshape(img,width,height)'
end

file = "C:\\Users\\fff00\\FullMono12Packed.Raw"
w = 2080
@btime img = readraw(file,$w,$w);
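For anyone else decoding this format: as I understand the packing (inferred from the loop above, not from a camera spec), each 3-byte group holds two 12-bit pixels, with the shared middle byte carrying both pixels' low nibbles. A scalar reference decoder for one group:

```julia
# Scalar reference for one 3-byte Mono12Packed group:
#   b1 = pixel A bits 11..4
#   b2 = pixel B bits 3..0 (high nibble) | pixel A bits 3..0 (low nibble)
#   b3 = pixel B bits 11..4
function unpack2(b1::UInt8, b2::UInt8, b3::UInt8)
    a = (UInt16(b1) << 4) | (b2 & 0x0F)  # pixel A
    b = (UInt16(b3) << 4) | (b2 >> 4)    # pixel B
    return a, b
end

# Example: pixel values 0xABC and 0x123 pack into the bytes 0xAB, 0x3C, 0x12
unpack2(0xAB, 0x3C, 0x12)  # → (0x0abc, 0x0123)
```

The vectorized loop above computes the same thing, but does it by writing 4 shifted bytes per group and then fixing everything up with one `>>> 4` pass over the reinterpreted UInt16s.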

Since I need to read & unpack a huge number of raw images, I'm looking for anyone who could help speed this function up further.

All tests were done in Julia 1.6.1, and here is a test raw image.

Thanks
Alex

Mmap is in general slow. Since you know exactly what the layout looks like, I suggest read and seek.
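A minimal sketch of that approach (function and variable names are mine; it assumes the frames sit back-to-back in one file):

```julia
# Read frame k from a raw file into a preallocated buffer.
# Each Mono12Packed frame occupies width*height*3÷2 bytes, so the
# buffer length doubles as the frame stride.
function read_frame!(buf::Vector{UInt8}, io::IO, k::Integer)
    seek(io, k * length(buf))  # jump directly to the start of frame k
    read!(io, buf)             # one bulk read, no per-pixel IO
    return buf
end

# Usage:
# open("frames.raw") do io
#     buf = Vector{UInt8}(undef, 2080 * 2080 * 3 ÷ 2)
#     read_frame!(buf, io, 0)
# end
```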


Agreed - reading chunks of UInt16 and splitting them up appropriately could yield more speedup. Since your data is aligned every 2 pixels, this should unroll and SIMD really nicely. It's also worth taking a look at LoopVectorization.jl, though I think you'll have to read the data into a buffer first (at least a small one?) to really take advantage of it.

I replaced the memory-mapping line with raw = read(file), and it improved a little, from

5.787 ms (23 allocations: 8.25 MiB)
to
5.239 ms (20 allocations: 14.44 MiB)

I am not sure about opening the file and seeking bytes inside the for loop; wouldn't that issue an IO request on every iteration and be slow?

Isn't that time just dominated by I/O? How long does read(file) take?

I tried benchmarking the file IO part:

@btime read($file)

and it shows

1.741 ms (17 allocations: 6.19 MiB)

I just blindly replaced the @inbounds @simd with @turbo, and it got worse, from

5.454 ms (20 allocations: 14.44 MiB)
to
8.473 ms (20 allocations: 14.44 MiB)

Paging @elrod - LoopVectorization should fall back to @inbounds @simd if it isn't able to work with an expression, but it appears to be slower than @inbounds @simd in this case. It may be getting tripped up by the irregular access pattern, but you can work around that to some degree by specifying @turbo unroll=1.

function readraw_turbo!(raw, file, width, height)
        read!(file, raw)
        npack = Int(length(raw)/3)
        img = Vector{UInt8}(undef,4npack)
        @turbo unroll=1 for i in 0:npack-1
            img[1+4i] = raw[2+3i] << 4
            img[2+4i] = raw[1+3i]
            img[3+4i] = raw[2+3i]
            img[4+4i] = raw[3+3i]
        end
        img = reinterpret(UInt16,img)
        @turbo unroll=1 for i in 1:2npack
            img[i] >>>= 4
        end
        reshape(img,width,height)'
end

This gives me

julia> @btime img = readraw!($raw, $file, $w, $w);
  5.416 ms (16 allocations: 8.25 MiB)

julia> @btime img = readraw_turbo!($raw, $file, $w, $w);
  4.037 ms (16 allocations: 8.25 MiB)

where readraw! is your original version, but with raw passed in as a preallocated buffer to avoid allocating it on every call.
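For reference, a usage sketch of the buffered call (the buffer size follows from 1.5 bytes per 12-bit pixel; `file` is your own path):

```julia
w = 2080
raw = Vector{UInt8}(undef, w * w * 3 ÷ 2)  # one packed frame: 1.5 bytes per pixel
# img = readraw!(raw, file, w, w)          # reuse `raw` across all frames
```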


By default, it is probably unrolling by 4 so that it can load a vector of 4 contiguous values from img and then shuffle them.
That might not work well for UInt8.

I’m currently getting crashes:

julia: /home/chriselrod/Documents/languages/juliarelease/src/ccall.cpp:879: jl_cgval_t emit_llvmcall(jl_codectx_t&, jl_value_t**, size_t): Assertion `*it == f->getFunctionType()->getParamType(i)' failed.

So I’ll have to look into what’s going on later.

You could also try @tturbo unroll=1 (equivalently, @turbo thread=true unroll=1).

@stillyslalom I liked the buffer improvement; it should help a lot when loading lots of images.

Using @turbo thread=true unroll=1 improves things further, to

3.478 ms (20 allocations: 14.44 MiB)

If the file IO part (~1.7 ms) is excluded, the unpacking now takes only ~1.7 ms.

This is great :clap:


I just found that the images from the @turbo version and the @inbounds version are slightly different, and the @turbo result is not the correct one.

By swapping the for loops between the two versions, I narrowed the incorrect result down to this loop:

@turbo for i in 0:npack-1
    img[1+4i] = raw[2+3i] << 4
    img[2+4i] = raw[1+3i]
    img[3+4i] = raw[2+3i]
    img[4+4i] = raw[3+3i]
end

Thanks. I filed an issue.

Should be fixed with LoopVectorization 0.12.51.
There’s still a performance issue I need to fix, but it’s already much faster than @inbounds @simd for me:

julia> @benchmark readraw!($raw, $file, $w, $w)
BenchmarkTools.Trial: 1658 samples with 1 evaluation.
 Range (min … max):  2.869 ms …   3.871 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.902 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.008 ms Β± 242.909 ΞΌs  β”Š GC (mean Β± Οƒ):  3.28% Β± 6.22%

   β–β–ˆβ–ƒ
  β–ƒβ–ˆβ–ˆβ–ˆβ–ƒβ–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–ƒβ–…β–ƒ β–‚
  2.87 ms         Histogram: frequency by time        3.57 ms <

 Memory estimate: 8.25 MiB, allocs estimate: 17.

julia> @benchmark readraw_turbo!($raw, $file, $w, $w)
BenchmarkTools.Trial: 2752 samples with 1 evaluation.
 Range (min … max):  1.640 ms …   4.106 ms  β”Š GC (min … max): 0.00% … 55.55%
 Time  (median):     1.673 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.809 ms Β± 314.167 ΞΌs  β”Š GC (mean Β± Οƒ):  7.38% Β± 11.92%

   β–†β–ˆ
  β–‡β–ˆβ–ˆβ–„β–‚β–‚β–‚β–β–β–‚β–β–‚β–‚β–β–β–β–β–β–β–β–β–β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–…β–„ β–‚
  1.64 ms         Histogram: frequency by time        2.52 ms <

 Memory estimate: 8.25 MiB, allocs estimate: 17.

julia> @benchmark readraw_turbo_unroll1!($raw, $file, $w, $w)
BenchmarkTools.Trial: 2763 samples with 1 evaluation.
 Range (min … max):  1.628 ms …   4.122 ms  β”Š GC (min … max): 0.00% … 55.91%
 Time  (median):     1.663 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.802 ms Β± 318.433 ΞΌs  β”Š GC (mean Β± Οƒ):  7.45% Β± 11.97%

   β–…β–ˆ
  β–…β–ˆβ–ˆβ–†β–‚β–‚β–‚β–β–β–‚β–β–β–‚β–β–‚β–β–β–β–β–‚β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–…β–„ β–‚
  1.63 ms         Histogram: frequency by time        2.52 ms <

 Memory estimate: 8.25 MiB, allocs estimate: 17.

Definitions:
julia> using LoopVectorization

julia> function readraw!(raw, file, width, height)
               read!(file, raw)
               npack = Int(length(raw)/3)
               img = Vector{UInt8}(undef,4npack)
               @inbounds @simd for i in 0:npack-1
                   img[1+4i] = raw[2+3i] << 4
                   img[2+4i] = raw[1+3i]
                   img[3+4i] = raw[2+3i]
                   img[4+4i] = raw[3+3i]
               end
               img = reinterpret(UInt16,img)
               @inbounds @simd for i in 1:2npack
                   img[i] >>>= 4
               end
               reshape(img,width,height)'
       end
readraw! (generic function with 1 method)

julia> function readraw_turbo!(raw, file, width, height)
               read!(file, raw)
               npack = Int(length(raw)/3)
               img = Vector{UInt8}(undef,4npack)
               @turbo for i in 0:npack-1
                   img[1+4i] = raw[2+3i] << 4
                   img[2+4i] = raw[1+3i]
                   img[3+4i] = raw[2+3i]
                   img[4+4i] = raw[3+3i]
               end
               img = reinterpret(UInt16,img)
               @turbo for i in 1:2npack
                   img[i] >>>= 4
               end
               reshape(img,width,height)'
       end
readraw_turbo! (generic function with 1 method)

julia> function readraw_turbo_unroll1!(raw, file, width, height)
               read!(file, raw)
               npack = Int(length(raw)/3)
               img = Vector{UInt8}(undef,4npack)
               @turbo unroll=1 for i in 0:npack-1
                   img[1+4i] = raw[2+3i] << 4
                   img[2+4i] = raw[1+3i]
                   img[3+4i] = raw[2+3i]
                   img[4+4i] = raw[3+3i]
               end
               img = reinterpret(UInt16,img)
               @turbo unroll=1 for i in 1:2npack
                   img[i] >>>= 4
               end
               reshape(img,width,height)'
       end
readraw_turbo_unroll1! (generic function with 1 method)

I've tested it again on Julia 1.6.1 with LoopVectorization 0.12.54, and now the result is correct and faster, from

5.303 ms (19 allocations: 14.44 MiB)  # @inbounds @simd
to
3.426 ms (19 allocations: 14.44 MiB)  # @turbo unroll=1 thread=true

Thanks for all your help, and @Elrod for the amazing package.
