Chunk an array

Hello to all,

I would like to know the most efficient way to chunk a multi-dimensional array in Julia.

arr = rand(Float64, (10, 100))

How can I split it into chunks of 30 columns so that I get three (10, 30) arrays and one (10, 10) array?

I have this function:

function chunk_array(arr, N)
    chunked = []
    n_columns = size(arr)[2]

    for i in 1:N:n_columns
        push!(chunked, arr[:, i:min(i+N-1, n_columns)])
    end
    return chunked
end

Is there something more performant?

Best Regards

You can replace size(arr)[2] with size(arr, 2) (this is more of a style improvement). You can also pre-allocate the vector for the chunks.

function chunk_array_2(arr::AbstractMatrix, N::Integer)
    n_cols = size(arr, 2)
    n_chunks = ceil(Int, n_cols / N)
    chunks = Vector{typeof(arr)}(undef, n_chunks)
    for i in 1:n_chunks
        from = N * (i - 1) + 1
        to = min(i * N, n_cols)
        chunks[i] = arr[:, from:to]
    end
    return chunks
end

If it suits your needs, you can use a view when slicing the matrix.

function chunk_array_3(arr::AbstractMatrix, N::Integer)
    n_cols = size(arr, 2)
    n_chunks = ceil(Int, n_cols / N)
    chunks = Vector{AbstractMatrix}(undef, n_chunks)
    for i in 1:n_chunks
        from = N * (i - 1) + 1
        to = min(i * N, n_cols)
        chunks[i] = @view arr[:, from:to]
    end
    return chunks
end

The timings are

arr = rand(Float64, (10, 100))
@btime chunk_array($arr, 30);
  2.271 μs (6 allocations: 8.50 KiB)
@btime chunk_array_2($arr, 30);
  2.168 μs (5 allocations: 8.45 KiB)
@btime chunk_array_3($arr, 30);
  94.155 ns (5 allocations: 336 bytes)

https://juliaml.github.io/MLUtils.jl/stable/api/#MLUtils.chunk

Loops in Julia are fine, and what you wrote is fine except for one or two things:

  1. chunked isa Vector{Any} and will lead to poor downstream performance every time it is accessed, due to type instability. Note that Vector{AbstractMatrix} is basically just as bad. You want the type to be declared or inferred concretely.
  2. You are making copies of the data when you write arr[:, i:min(i+N-1, n_columns)]. If you want copies, this is fine. If it’s okay to alias the input data, consider using @view arr[:, i:min(i+N-1, n_columns)] instead. When aliased, changes to arr will be reflected in chunked and vice-versa, as they share memory.
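To make both points concrete, here is a small sketch (the variable names untyped, copied and viewed are just for illustration). It shows that an untyped [] literal gives a Vector{Any}, that AbstractMatrix is not a concrete element type, and that a view shares memory with arr while a plain slice does not.

```julia
arr = rand(Float64, (10, 100))

# Point 1: an untyped `[]` literal is a Vector{Any}.
untyped = []
push!(untyped, arr[:, 1:30])
@show typeof(untyped)                   # Vector{Any}
@show isconcretetype(AbstractMatrix)    # false: abstract element type
@show isconcretetype(Matrix{Float64})   # true: concrete element type

# Point 2: a plain slice copies, a view aliases the parent array.
copied = arr[:, 1:30]
viewed = @view arr[:, 1:30]

arr[1, 1] = -1.0                # rand only yields values in [0, 1)
@show copied[1, 1] == -1.0      # false: the copy is unaffected
@show viewed[1, 1] == -1.0      # true: the view sees the change

viewed[2, 1] = -2.0             # writing through the view changes arr too
@show arr[2, 1]                 # -2.0
```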

I would probably write this function like this

function chunk_array(arr::AbstractVecOrMat, N)
    chunked = map(Iterators.partition(axes(arr,2),N)) do cols
        @view arr[:,cols] # remove @view if you want copies that do not alias `arr`
    end
    return chunked
end
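
In case Iterators.partition is unfamiliar: applied to the column axis, it yields consecutive index ranges of at most N columns, with a shorter last range if the division is not exact. A quick way to inspect this (col_ranges is just an illustrative name):

```julia
arr = rand(Float64, (10, 100))

# Split the column indices 1:100 into consecutive ranges of at most 30.
col_ranges = collect(Iterators.partition(axes(arr, 2), 30))

# Four ranges: 1:30, 31:60, 61:90 and the shorter remainder 91:100.
@show col_ranges
@show length.(col_ranges)   # [30, 30, 30, 10]
```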

I actually want a copy of the data.
Is it sufficient to declare the type of the variable for efficiency?

Yes, if you declared the type via chunked = THE_TYPE[] or chunked = Vector{THE_TYPE}(undef, num_chunks) it would be fine. The annoying part is that the type can be a little complicated at times. But in this case, since you want the data copied, it’s pretty easy and Matrix{eltype(arr)} in place of THE_TYPE will work.
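
As a sketch, this is your original loop with only the container type declared. The name chunk_array_typed is made up here, and it assumes arr is an ordinary dense Matrix, so that each slice already is a Matrix{eltype(arr)}.

```julia
# Original loop, but with a concretely typed container instead of `[]`.
# Assumes `arr` is a dense Matrix, so every slice is a Matrix{eltype(arr)}.
function chunk_array_typed(arr::AbstractMatrix, N::Integer)
    chunked = Matrix{eltype(arr)}[]
    n_columns = size(arr, 2)
    for i in 1:N:n_columns
        # Slicing without @view copies the selected columns.
        push!(chunked, arr[:, i:min(i + N - 1, n_columns)])
    end
    return chunked
end

arr = rand(Float64, (10, 100))
chunks = chunk_array_typed(arr, 30)
@show typeof(chunks)   # Vector{Matrix{Float64}}
@show size.(chunks)    # [(10, 30), (10, 30), (10, 30), (10, 10)]
```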

But I would still recommend you use my suggested solution with @view deleted. It will make the copies and let the compiler determine the type for you (thanks to map). Declaring types manually can be tedious (and sometimes very difficult) to do correctly.
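
Concretely, that would look roughly like this (chunk_array_copy is just a name picked here to avoid clashing with the versions above). No type is written out, yet the result has a concrete element type, and the chunks are independent copies.

```julia
# Copying variant of the map/partition solution: no @view, no manual types.
function chunk_array_copy(arr::AbstractVecOrMat, N)
    return map(Iterators.partition(axes(arr, 2), N)) do cols
        arr[:, cols]   # plain slicing copies the selected columns
    end
end

arr = rand(Float64, (10, 100))
chunks = chunk_array_copy(arr, 30)
@show typeof(chunks)   # Vector{Matrix{Float64}}, found by the compiler
@show size.(chunks)    # [(10, 30), (10, 30), (10, 30), (10, 10)]

# The chunks do not alias `arr`: changing `arr` leaves them untouched.
arr[1, 1] = -1.0
@show chunks[1][1, 1] == -1.0   # false
```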
