How to implement the Julia 1.0 code Matrix{Vector{T}}(undef, M, N) in Julia 0.6?

I wish to rewrite the following Julia 1.0 code
patch = Matrix{Vector{T}}(undef, M, N)
into Julia 0.6, since I want to take advantage of the high-performance MKL build in JuliaPro, which only supports Julia 0.6.

However, I am having trouble doing that, because running the code directly in Julia 0.6 gives the error UndefVarError: undef not defined. I don't want to initialize patch as a 3D array, since indexing its elements is quite time consuming. Thus, the best option for me is to find the code's counterpart in the Julia 0.6 environment.

For more context, please check out @Gnimuc's answer in the thread "full Julia 1.0 code for image denoising".

You can just use patch = Matrix{Vector{T}}(M, N) in Julia 0.6.
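In Julia 0.6 the dims-only constructor already returns an array whose element slots are left uninitialized, so it is the direct counterpart of the 1.0 undef form. A minimal illustrative sketch of a 0.6 session:

julia> patch = Matrix{Vector{Float64}}(3, 4)   # Julia 0.6: no undef argument
3×4 Array{Array{Float64,1},2}:
 #undef  #undef  #undef  #undef
 #undef  #undef  #undef  #undef
 #undef  #undef  #undef  #undef

julia> patch[1, 1] = [1.0, 2.0];   # each cell must be assigned before it is read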

Alternatively, if you’re happy to use standalone Julia, rather than JuliaPro, you can build Julia 1.0 from source with MKL (https://github.com/JuliaLang/julia#intel-mkl).

The Compat.jl package exists precisely to allow you to write code that works in both Julia v0.6 and Julia v1.0:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.4 (2018-07-09 19:09 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Compat

julia> Matrix{Vector{Float64}}(undef, 5, 10)
5×10 Array{Array{Float64,1},2}:
 #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef
 #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef
 #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef
 #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef
 #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef  #undef

Dear all, I have tried the modified Julia 0.6 version following @rdeits's suggestion. Here are the two versions of the code, Julia 0.6 and Julia 1.0, respectively.

using Images
using Compat
using MAT            # provides matopen/read used below
# using Statistics   # not needed on 0.6 (mean is in Base)
# using LinearAlgebra # not needed on 0.6 (norm, BLAS and I are in Base)
function crop(img, l)
    (m, n) = size(img)
    im_crop = img[l+1:m-l, l+1:n-l]
    return im_crop
end

function extend(img, L)
    (M,N)=size(img)
    im_Extend=zeros(M+2*L,N+2*L)
    im_Extend[L+1:M+L,L+1:N+L]=img[1:M,1:N]
    im_Extend[1:M+2*L,    1:L    ]=flipdim(im_Extend[1:M+2*L,L+1:2*L], 2)
    im_Extend[1:M+2*L,N+L+1:N+L*2]=flipdim(im_Extend[1:M+2*L,N+1:N+L], 2)
    im_Extend[    1:L    ,1:N+2*L]=flipdim(im_Extend[L+1:2*L,1:N+2*L], 1)
    im_Extend[M+L+1:M+L*2,1:N+2*L]=flipdim(im_Extend[M+1:M+L,1:N+2*L], 1)
    return im_Extend
end

function im2col(img::AbstractMatrix{T}, m, n) where {T<:Real}
    M, N = size(img)
    patch           = Matrix{Vector{T}}(undef, M-m+1, N-n+1)
    patchNormalized = Matrix{Vector{T}}(undef, M-m+1, N-n+1)
    @inbounds for x = 1:M-m+1, y = 1:N-n+1
        p = vec(img[x:x+m-1,y:y+n-1])
        patch[x,y] = p
        patchNormalized[x,y] = p./norm(p)
    end
    return patch,patchNormalized
end

function denoise(yNoisy, n, W, K, λ)
    yNoisy = extend(yNoisy, n)
    MAX_PATCH_NUM = (2W + 1)^2
    M, N = size(yNoisy)
    patchM, patchN = M-n+1, N-n+1
    noisyPatch,noisyPatchNormalized = im2col(yNoisy, n, n)
    yHat = zeros(M, N)
    count = zeros(M, N)
    maxKVal = zeros(K)
    maxKidx = zeros(Int, 2, K)
    X = Matrix{Float64}(undef, n*n, K)
    # loop over each patch(there're totally patchM*patchN patches)
    for y = 1:patchN, x = 1:patchM   # x indexes rows (1:patchM), y indexes columns (1:patchN)
        # window bounds sanity check
        xMin = x - W ≤ 0 ? 1      : (x + W > patchM ? patchM - 2W : x - W)
        xMax = x - W ≤ 0 ? 1 + 2W : (x + W > patchM ? patchM      : x + W)
        yMin = y - W ≤ 0 ? 1      : (y + W > patchN ? patchN - 2W : y - W)
        yMax = y - W ≤ 0 ? 1 + 2W : (y + W > patchN ? patchN      : y + W)
        # loop over every coordinate(patch) in the window
        # this could find K most similar patches in one pass
        b = noisyPatch[x,y]
        # println(typeof(b));
        # println(typeof(noisyPatchNormalized[1,1]));
        minVal = 0.0
        minIdx = 1
        k = 0
        @inbounds for py = yMin:yMax, px = xMin:xMax
            temp=noisyPatchNormalized[px,py];
            corr = BLAS.dot(n*n,temp,1,b,1)
            if k < K
                k += 1
                maxKVal[k] = corr
                maxKidx[1,minIdx] = px
                maxKidx[2,minIdx] = py
                minVal, minIdx = findmin(maxKVal)
            end
            corr < minVal && continue
            maxKVal[minIdx] = corr
            maxKidx[1,minIdx] = px
            maxKidx[2,minIdx] = py
            minVal, minIdx = findmin(maxKVal)
        end
        # find the weights
        for i = 1:K
            px = maxKidx[1,i]
            py = maxKidx[2,i]
            X[:,i] = noisyPatch[px,py]
        end
        core = transpose(X) * X + λ * I
        weights = core \ (transpose(X) * b)
        # synthesize
        patch = X * weights |> x->reshape(x, n, n)
        yHat[x:x+n-1,y:y+n-1] .+= patch
        count[x:x+n-1,y:y+n-1] .+= 1
    end
    y_DeNoise = crop(yHat ./ count, n)
end
PSNR(x, y) = 20 * log10(255/sqrt(mean((x-y).^2)))
# lena = load(joinpath(@__DIR__, "lena.bmp")) |> channelview
# img = lena[1,:,:].*255.0 # select first channel
# println(norm(img))
file = matopen("y1.mat")
img=read(file, "y1")
close(file)

n = 8
W = 20
K = 12
λ = 300
imgNoisy = img + 3randn(512,512)
PSNR(imgNoisy, img)
imgdenoised = @time denoise(imgNoisy, n, W, K, λ);
println(PSNR(imgNoisy, img))
println(PSNR(imgdenoised, img))

using Images
using Statistics
using LinearAlgebra
using QuartzImageIO
using ImageMagick

crop(img, l) = crop(img, size(img), l)
crop(img, (m, n), l) = img[l+1:m-l,l+1:n-l]

function extend(img, L)
    M, N = size(img)
    extended = zeros(M+2*L,N+2*L)
    extended[L+1:M+L,L+1:N+L] = img
    extended[1:M+2*L,    1:L    ] = reverse(extended[1:M+2*L,L+1:2*L], dims=2)
    extended[1:M+2*L,N+L+1:N+L*2] = reverse(extended[1:M+2*L,N+1:N+L], dims=2)
    extended[    1:L    ,1:N+2*L] = reverse(extended[L+1:2*L,1:N+2*L], dims=1)
    extended[M+L+1:M+L*2,1:N+2*L] = reverse(extended[M+1:M+L,1:N+2*L], dims=1)
    extended
end

function im2col(img::AbstractMatrix{T}, m, n) where {T<:Real}
    M, N = size(img)
    patch           = Matrix{Vector{T}}(undef, M-m+1, N-n+1)
    patchNormalized = Matrix{Vector{T}}(undef, M-m+1, N-n+1)
    @inbounds for x = 1:M-m+1, y = 1:N-n+1
        p = vec(img[x:x+m-1,y:y+n-1])
        patch[x,y] = p
        patchNormalized[x,y] = p./norm(p)
    end
    return patch,patchNormalized
end

function denoise(yNoisy, n, W, K, λ)
    yNoisy = extend(yNoisy, n)
    MAX_PATCH_NUM = (2W + 1)^2
    M, N = size(yNoisy)
    patchM, patchN = M-n+1, N-n+1
    noisyPatch,noisyPatchNormalized = im2col(yNoisy, n, n)
    yHat = zeros(M, N)
    count = zeros(M, N)
    maxKVal = zeros(K)
    maxKidx = zeros(Int, 2, K)
    X = Matrix{Float64}(undef, n*n, K)
    # loop over each patch(there're totally patchM*patchN patches)
    for ii in CartesianIndices((patchM, patchN))
        x, y = Tuple(ii)
        # window bounds sanity check
        xMin = x - W ≤ 0 ? 1      : (x + W > patchM ? patchM - 2W : x - W)
        xMax = x - W ≤ 0 ? 1 + 2W : (x + W > patchM ? patchM      : x + W)
        yMin = y - W ≤ 0 ? 1      : (y + W > patchN ? patchN - 2W : y - W)
        yMax = y - W ≤ 0 ? 1 + 2W : (y + W > patchN ? patchN      : y + W)
        # loop over every coordinate(patch) in the window
        # this could find K most similar patches in one pass
        b = noisyPatch[x,y]
        # println(typeof(b));
        # println(typeof(noisyPatchNormalized[1,1]));
        minVal = 0.0
        minIdx = 1
        k = 0
        @inbounds for py = yMin:yMax, px = xMin:xMax
            temp=noisyPatchNormalized[px,py];
            corr = BLAS.dot(n*n,temp,1,b,1)
            if k < K
                k += 1
                maxKVal[k] = corr
                maxKidx[1,minIdx] = px
                maxKidx[2,minIdx] = py
                minVal, minIdx = findmin(maxKVal)
            end
            corr < minVal && continue
            maxKVal[minIdx] = corr
            maxKidx[1,minIdx] = px
            maxKidx[2,minIdx] = py
            minVal, minIdx = findmin(maxKVal)
        end
        # find the weights
        for i = 1:K
            px = maxKidx[1,i]
            py = maxKidx[2,i]
            X[:,i] = noisyPatch[px,py]
        end
        core = transpose(X) * X + λ * I
        weights = core \ (transpose(X) * b)
        # synthesize
        patch = X * weights |> x->reshape(x, n, n)
        yHat[x:x+n-1,y:y+n-1] .+= patch
        count[x:x+n-1,y:y+n-1] .+= 1
    end
    y_DeNoise = crop(yHat ./ count, n)
end
PSNR(x, y) = 20 * log10(255/sqrt(mean((x-y).^2)))
lena = load(joinpath(@__DIR__, "lena.bmp")) |> channelview
img = lena[1,:,:].*255.0 # select first channel
# println(norm(img))
n = 8
W = 20
K = 12
λ = 300
imgNoisy = img + 3randn(512,512)
PSNR(imgNoisy, img)
imgdenoised = @time denoise(imgNoisy, n, W, K, λ);
@info "PSNR Before Denoising:" PSNR(imgNoisy, img)
@info "PSNR After Denoising:" PSNR(imgdenoised, img)

It is worth noting that the Julia 0.6 version takes about the same amount of time as the Julia 1.0 version, even though it has the help of MKL. Does anyone have an idea why?

Julia 1.0: 21.252800 seconds (4.74 M allocations: 1.953 GiB, 5.83% gc time)
Julia 0.6: 20.306164 seconds (10.76 M allocations: 5.229 GiB, 9.76% gc time)

Thanks. Do you have any recommended tutorial on how to do that on macOS? I am really desperate for it.

Maybe a significant amount of the time is not spent in BLAS, or at least not in the parts of BLAS where MKL actually has a performance advantage.

The Julia README has build instructions (https://github.com/JuliaLang/julia#source-download-and-compilation) and some extra tips for building on macOS (https://github.com/JuliaLang/julia#os-x). I haven't built Julia on macOS myself, so I don't know how involved it is, but I find it pretty straightforward (albeit a bit time consuming) on Linux.

But I would say the broader point (echoing what Kristoffer said) is that you might not get much improvement from using MKL. Julia is not like R, where there are big gains to be had from using MKL; OpenBLAS usually does pretty well.

Have you tried profiling your code to see where the performance bottlenecks are? ProfileView.jl is very helpful for pinpointing where to focus your development time.
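For example, here is a minimal profiling sketch (assuming imgNoisy, n, W, K and λ are already defined as in your script, and that ProfileView.jl is installed; Profile ships with Julia):

using Profile, ProfileView

denoise(imgNoisy, n, W, K, λ)             # run once first so compilation is not profiled
Profile.clear()
@profile denoise(imgNoisy, n, W, K, λ)    # collect samples
ProfileView.view()                        # flame graph; the wide bars are the hot spots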

The bottleneck in both codes is the BLAS.dot call. I have tested it in Julia 0.6 linked against MKL, where it is about 4x as fast as in Julia 1.0 linked against OpenBLAS. But the weird thing is that the run times of the two codes above are roughly the same.
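To isolate that, here is a micro-benchmark sketch of just the inner dot product at the size of one patch comparison (assumes BenchmarkTools.jl is installed; the LinearAlgebra import is only needed on Julia 1.0). Running it under each BLAS lets you compare the dot timing directly:

using BenchmarkTools
using LinearAlgebra   # for BLAS and dot on Julia 1.0; both are in Base on 0.6

n = 8
a = rand(n * n); b = rand(n * n)
@btime BLAS.dot($(n * n), $a, 1, $b, 1)   # the call used inside denoise
@btime dot($a, $b)                        # generic dot for comparison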

As for building Julia 1.0 with MKL, I will try to do it myself.

Unfortunately, this is currently unsupported by Arpack.jl, and a large number of libraries include Arpack in their dependency tree.

I still stubbornly went this route on a couple of computers with Intel processors (although I have secondary installs with OpenBLAS), but that is a pain point that's unlikely to be worth it for many.

~~I think you are running into this problem: https://discourse.julialang.org/t/innefficient-paralellization-need-some-help-optimizing-a-simple-dot-product/9723/19~~

For whatever reason, OpenBLAS does not multi-thread its dot product. Luckily, you can write your own dot product which outperforms OpenBLAS. (This is also why dot is so much faster with MKL for such a basic operation: it's multi-threaded by default.)

EDIT: This no longer seems to be the case, OpenBLAS is threaded. Make sure you are using BLAS.set_num_threads(Sys.CPU_THREADS) or whatever is appropriate given your architecture.

using LinearAlgebra
using Distributed

function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    N = Threads.nthreads()
    v = zeros(T, N)                                # one partial sum per thread
    ranges = Distributed.splitrange(length(a), N)  # split 1:length(a) into N chunks
    Threads.@threads for i in 1:N
        x = zero(T)
        P = Threads.threadid()
        subrange = ranges[P]                       # each thread sums its own chunk
        @inbounds @simd for j in subrange
            x += a[j] * b[j]
        end
        v[P] = x
    end
    sum(v)                                         # combine the partial sums
end

Benchmarks for 2 threads (start Julia with the environment variable JULIA_NUM_THREADS=2 set):

using BenchmarkTools
a = randn(10000); b = randn(10000)

BLAS.set_num_threads(2)
@btime dot($a, $b)
#  1.570 μs (0 allocations: 0 bytes)

@btime pdot($a, $b)
#  1.270 μs (3 allocations: 272 bytes)

There is some performance left on the table, but this is an alternate way to get the speed you need. I only have 2 threads on this machine but I imagine with yours you may get a better speed-up.

EDIT: With some more testing, it looks like OpenBLAS is threading the dot product:

a = randn(100000); b = randn(100000)
@btime dot($a, $b)
#  12.857 μs (0 allocations: 0 bytes)
@btime pdot($a, $b)
#  12.651 μs (3 allocations: 272 bytes)

BLAS.set_num_threads(1)
@btime dot($a, $b)
#  24.300 μs (0 allocations: 0 bytes)

BLAS.set_num_threads(2)
@btime dot($a, $b)
#  12.842 μs (0 allocations: 0 bytes)

BLAS.set_num_threads(4)
@btime dot($a, $b)
#  12.157 μs (0 allocations: 0 bytes)

pdot remains within a few percent of the BLAS call, so I think BLAS is using the threaded version now. Make sure you call BLAS.set_num_threads.